2022-06-22 14:49:55

by Petr Mladek

Subject: [PATCH] workqueue: Make create_worker() safe against spurious wakeups

A system crashed with the following BUG() report:

[115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
[115147.050488] #PF: supervisor write access in kernel mode
[115147.050489] #PF: error_code(0x0002) - not-present page
[115147.050490] PGD 0 P4D 0
[115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
[115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased)
[115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
[115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
[115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
[115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
[115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
[115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
[115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
[115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
[115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
[115147.050524] Call Trace:
[115147.050533] worker_thread+0xb4/0x3c0
[115147.050538] ? process_one_work+0x4a0/0x4a0
[115147.050540] kthread+0x152/0x170
[115147.050542] ? kthread_park+0xa0/0xa0
[115147.050544] ret_from_fork+0x35/0x40

Further debugging showed that the worker thread was woken
before worker_attach_to_pool() finished in create_worker().

Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
until it is explicitly woken. But a spurious wakeup might
break this expectation.

As a result, worker_thread() might read worker->pool before
it is set by worker_attach_to_pool() in create_worker().
Also, manage_workers() might want to create yet another worker
before worker->pool->nr_workers is updated. It is a kind of
chicken & egg problem.

Synchronize these operations using the completion API.

Note that worker->pool might then be read without wq_pool_attach_mutex.
A normal worker always belongs to the same pool.

Also note that rescuer_thread() does not need this because all
needed values are set before the kthread is created. A rescuer
is tied to a particular workqueue and is attached to different
pools only when needed.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 13 ++++++++++++-
kernel/workqueue_internal.h | 1 +
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..9586b0797145 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1939,6 +1939,7 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;

worker->id = id;
+ init_completion(&worker->ready_to_start);

if (pool->cpu >= 0)
snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
@@ -1961,6 +1962,7 @@ static struct worker *create_worker(struct worker_pool *pool)
raw_spin_lock_irq(&pool->lock);
worker->pool->nr_workers++;
worker_enter_idle(worker);
+ complete(&worker->ready_to_start);
wake_up_process(worker->task);
raw_spin_unlock_irq(&pool->lock);

@@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
- struct worker_pool *pool = worker->pool;
+ struct worker_pool *pool;

/* tell the scheduler that this is a workqueue worker */
set_pf_worker(true);
+
+ /*
+ * Wait until the worker is ready to get started. It must be attached
+ * to a pool first. This is needed because of spurious wakeups.
+ */
+ while (wait_for_completion_interruptible(&worker->ready_to_start))
+ ;
+ pool = worker->pool;
+
woke_up:
raw_spin_lock_irq(&pool->lock);

diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..ffca2783c8bf 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -41,6 +41,7 @@ struct worker {
/* L: for rescuers */
struct list_head node; /* A: anchored at pool->workers */
/* A: runs through worker->node */
+ struct completion ready_to_start; /* None: handle spurious wakeup */

unsigned long last_active; /* L: last active timestamp */
unsigned int flags; /* X: flags */
--
2.35.3


2022-06-23 07:30:29

by Petr Mladek

Subject: Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups

On Wed 2022-06-22 16:08:53, Petr Mladek wrote:
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [115147.050488] #PF: supervisor write access in kernel mode
> [115147.050489] #PF: error_code(0x0002) - not-present page
> [115147.050490] PGD 0 P4D 0
> [115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
> [115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased)
> [115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
> [115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
> [115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
> [115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
> [115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
> [115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
> [115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
> [115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
> [115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
> [115147.050524] Call Trace:
> [115147.050533] worker_thread+0xb4/0x3c0
> [115147.050538] ? process_one_work+0x4a0/0x4a0
> [115147.050540] kthread+0x152/0x170
> [115147.050542] ? kthread_park+0xa0/0xa0
> [115147.050544] ret_from_fork+0x35/0x40
>
> Further debugging shown that the worker thread was woken
> before worker_attach_to_pool() finished in create_worker().
>
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.
>
> As a result, worker_thread() might read worker->pool before
> it was set in worker create_worker() by worker_attach_to_pool().
> Also manage_workers() might want to create yet another worker
> before worker->pool->nr_workers is updated. It is a kind off
> a chicken & egg problem.
>
> Synchronize these operations using a completion API.
>
> Note that worker->pool might be then read without wq_pool_attach_mutex.
> Normal worker always belongs to the same pool.
>
> Also note that rescuer_thread() does not need this because all
> needed values are set before the kthreads is created. It is
> connected with a particular workqueue. It is attached to different
> pools when needed.
>
> Signed-off-by: Petr Mladek <[email protected]>
> ---
> kernel/workqueue.c | 13 ++++++++++++-
> kernel/workqueue_internal.h | 1 +
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..9586b0797145 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
> static int worker_thread(void *__worker)
> {
> struct worker *worker = __worker;
> - struct worker_pool *pool = worker->pool;
> + struct worker_pool *pool;
>
> /* tell the scheduler that this is a workqueue worker */
> set_pf_worker(true);
> +
> + /*
> + * Wait until the worker is ready to get started. It must be attached
> + * to a pool first. This is needed because of spurious wakeups.
> + */
> + while (wait_for_completion_interruptible(&worker->ready_to_start))

I have realized that this is wrong. I used the _interruptible() variant
because we do not know how long it might need to sleep, and a long
sleep in TASK_UNINTERRUPTIBLE might trigger the hung task warning.

But kthread_bind_mask() requires that the kthread be in
TASK_UNINTERRUPTIBLE state.

Note that even a freshly created kthread sleeps in
TASK_UNINTERRUPTIBLE, see kthread() in kernel/kthread.c. But
it does not trigger the hung task warning because there is
a special check for this case, see check_hung_task():

/*
* When a freshly created task is scheduled once, changes its state to
* TASK_UNINTERRUPTIBLE without having ever been switched out once, it
* musn't be checked.
*/
if (unlikely(!switch_count))
return;

The following seems to work well:

while (!wait_for_completion_timeout(&worker->ready_to_start, 5 * HZ))
;

That said, I am not super happy with this solution. IMHO, it is a flaw
in the kthread() API. But I do not see a reasonable way to fix
this except for adding a special API for waking up a freshly
created kthread for the first time. It would require updating all
users. I am not sure how common this problem is and whether it is
worth it.

And I used the completion API to avoid tricky code using the low-level
scheduling API.

> + ;
> + pool = worker->pool;
> +
> woke_up:
> raw_spin_lock_irq(&pool->lock);
>

I am going to wait with v2 until I get some feedback for this
approach.

Best Regards,
Petr

2022-06-23 07:31:16

by Michal Hocko

Subject: Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups

On Thu 23-06-22 09:00:12, Petr Mladek wrote:
> On Wed 2022-06-22 16:08:53, Petr Mladek wrote:
> > A system crashed with the following BUG() report:
> >
> > [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [115147.050488] #PF: supervisor write access in kernel mode
> > [115147.050489] #PF: error_code(0x0002) - not-present page
> > [115147.050490] PGD 0 P4D 0
> > [115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
> > [115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased)
> > [115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
> > [115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
> > [115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
> > [115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > [115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
> > [115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
> > [115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
> > [115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
> > [115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
> > [115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
> > [115147.050524] Call Trace:
> > [115147.050533] worker_thread+0xb4/0x3c0
> > [115147.050538] ? process_one_work+0x4a0/0x4a0
> > [115147.050540] kthread+0x152/0x170
> > [115147.050542] ? kthread_park+0xa0/0xa0
> > [115147.050544] ret_from_fork+0x35/0x40
> >
> > Further debugging shown that the worker thread was woken
> > before worker_attach_to_pool() finished in create_worker().
> >
> > Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> > until it is explicitly woken. But a spurious wakeup might
> > break this expectation.
> >
> > As a result, worker_thread() might read worker->pool before
> > it was set in worker create_worker() by worker_attach_to_pool().
> > Also manage_workers() might want to create yet another worker
> > before worker->pool->nr_workers is updated. It is a kind off
> > a chicken & egg problem.
> >
> > Synchronize these operations using a completion API.
> >
> > Note that worker->pool might be then read without wq_pool_attach_mutex.
> > Normal worker always belongs to the same pool.
> >
> > Also note that rescuer_thread() does not need this because all
> > needed values are set before the kthreads is created. It is
> > connected with a particular workqueue. It is attached to different
> > pools when needed.
> >
> > Signed-off-by: Petr Mladek <[email protected]>
> > ---
> > kernel/workqueue.c | 13 ++++++++++++-
> > kernel/workqueue_internal.h | 1 +
> > 2 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 1ea50f6be843..9586b0797145 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
> > static int worker_thread(void *__worker)
> > {
> > struct worker *worker = __worker;
> > - struct worker_pool *pool = worker->pool;
> > + struct worker_pool *pool;
> >
> > /* tell the scheduler that this is a workqueue worker */
> > set_pf_worker(true);
> > +
> > + /*
> > + * Wait until the worker is ready to get started. It must be attached
> > + * to a pool first. This is needed because of spurious wakeups.
> > + */
> > + while (wait_for_completion_interruptible(&worker->ready_to_start))
>
> I have realized that this is wrong. I used _interruptible() variant
> because we do not know how long it would need to sleep. And long
> sleeping in TASK_UNINTERRUPTIBLE might trigger hung task warning.
>
> But kthread_bind_mask() requires that the kthread will be in
> TASK_UNINTERRUPTIBLE state.
>
> Note that even the freshly created kthread is sleeping in
> TASK_UNTERRUPTIBLE, see kthread() in kernel/kthread.c. But
> it does not trigger the hung task warning because there is
> a special check for this case, see check_hung_task():
>
> /*
> * When a freshly created task is scheduled once, changes its state to
> * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
> * musn't be checked.
> */
> if (unlikely(!switch_count))
> return;
>
> The following seems to work well:
>
> while (!wait_for_completion_timeout(&worker->ready_to_start, 5 * HZ))
> ;

This is really ugly and I do not really see why this is needed. If the
system is so overloaded that create_worker cannot finish the
initialization before the hung task detector triggers, then we should
be notified. I do not see how papering this over would be helpful.

That being said, I really think that this should be a plain
wait_for_completion()

With that changed, feel free to add
Reviewed-by: Michal Hocko <[email protected]>

and thanks for helping to debug this and for preparing the fix. I was
not really aware of the spurious wakeup scenario. Btw. it would be
great to describe that in more detail as well for future reference.

Thanks!
--
Michal Hocko
SUSE Labs

2022-06-25 05:06:48

by Tejun Heo

Subject: re. Spurious wakeup on a newly created kthread

Hello,

cc'ing a random assortment of people who touched kernel/kthread.c and
others who would know better.

So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
on a newly created kthread. The abbreviated patch description follows.
The original message is at

http://lkml.kernel.org/r/[email protected]

On Wed, Jun 22, 2022 at 04:08:53PM +0200, Petr Mladek wrote:
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
> [115147.050524] Call Trace:
> [115147.050533] worker_thread+0xb4/0x3c0
> [115147.050540] kthread+0x152/0x170
> [115147.050544] ret_from_fork+0x35/0x40
>
> Further debugging shown that the worker thread was woken
> before worker_attach_to_pool() finished in create_worker().
>
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.
>
> As a result, worker_thread() might read worker->pool before
> it was set in worker create_worker() by worker_attach_to_pool().
> Also manage_workers() might want to create yet another worker
> before worker->pool->nr_workers is updated. It is a kind off
> a chicken & egg problem.

tl;dr is that the worker creation code expects a newly created worker
kthread to sit tight until the creator finishes setting up stuff and
sends the initial wakeup. However, something, which wasn't identified
in the report (Petr, it'd be great if you can find out who did the
wakeup), wakes up the new kthread before the creation path is done
with init which causes the new kthread to try to deref a NULL pointer.

Petr fixed the problem by adding an extra handshake step so that the
new kthread explicitly waits for the creation path, which is fine, but
the picture isn't making sense to me.

* Are spurious wakeups allowed? The way that we do set_current_state()
in every iteration in wait_event() seems to suggest that we expect
someone to spuriously flip task state to RUNNING.

* However, if we're to expect spurious wakeups for anybody anytime,
why does a newly created kthread bother with
schedule_preempt_disabled() in kernel/kthread.c::kthread() at all?
It can't guarantee anything and all it does is masking subtle bugs.

What am I missing here?

Thanks.

--
tejun

2022-06-25 17:04:23

by Linus Torvalds

Subject: Re: re. Spurious wakeup on a newly created kthread

On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <[email protected]> wrote:
>
> So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
> on a newly created kthread.

What? No. That patch can't be right for several reasons.

What we call "spurious wakeups" exist, but they are about wakeups that
happen from being on a _previous_ wait-queue, and having already been
removed from it.

They aren't "really" spurious, they are just asynchronous enough (and
thus unexpected) that you basically should never have a "sleep on
wait-queue" without then looping and re-testing the condition.

There is no _truly_ spurious wakeup. You were always woken up for a
reason, it's just that there are more reasons than the entirely
obvious ones.

For example, the reason that quoted patch cannot be right is that this
code pattern:

while (wait_for_completion_interruptible(&worker->ready_to_start))
;

is not valid kernel code. EVER. There is absolutely no way that can be correct.

Either that code can take a signal, or it cannot. If it can take a
signal, it had better react to said signal. If it cannot, it must not
use an interruptible sleep - since now that loop has turned into a
kernel-side busy-loop.

So NAK on this kind of crazy "I don't know what happened, so I'll just
add *more* bugs to the code" voodoo programming.

And no, we don't "fix" that by then adding a timeout.

Stop this "add random code" thing.

If you cannot be woken up before X happens, then you should:

- don't go to sleep before X happens

- don't add yourself to any wait-queues before X happens

- don't expose your process to others before X happens

The solution is *not* to add some completion with random "wait for
this before waking".

I think the problem here is much more fundamental: you expect a new
thread to not wake up until you've told it to.

We do have that infrastructure in the kernel: when you create a new
thread, you can do various setup, and the thread won't actually run
until you do "wake_up_new_task()" on it.

However, that's not how kernel_thread() (or kernel_clone()) works.
Those will call wake_up_new_task(p) for you, and as such a new kernel
thread will immediately start running.

So I think the expectations here are entirely wrong. I think
create_worker() is fundamentally buggy, in how it does that

/* start the newly created worker */
..
wake_up_process(worker->task);

because that wake_up_process() is already much too late. The process
got woken up already, because it was created by create_kthread() ->
kernel_thread() -> kernel_clone, which does that wake_up_new_task()
and it starts running.

If you want to do thread setup *before* the kthread is running, it
needs to be done before that wake_up_new_task().

That's very possible. Look at what create_io_thread() does, for
example: it never calls wake_up_new_task() at all, and leaves that
to the caller, which has done the extra setup.

Or the kernel_clone_args could have a "init" function that gets called
before doing the wake_up_new_task() is done. Or a number of other
solutions.

But no, we're not randomly adding some new completion because people
were confused and thought they were waking things up when it was
already awake from before.

Linus

2022-06-25 17:46:54

by Eric W. Biederman

Subject: Re: re. Spurious wakeup on a newly created kthread

Linus Torvalds <[email protected]> writes:

> On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <[email protected]> wrote:
>>
>> So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
>> on a newly created kthread.
>
> What? No. That patch can't be right for several reasons.
>
> What we call "spurious wakeups" exist, but they are about wakeups that
> happen from being on a _previous_ wait-queue, and having already been
> removed from it.
>
> They aren't "really" spurious, they are just asynchronous enough (and
> thus unexpected) that you basically should never have a "sleep on
> wait-queue" without then looping and re-testing the condition.
>
> There is no _truly_ spurious wakeup. You were always woken up for a
> reason, it's just that there are more reasons than the entirely
> obvious ones.
>
> For example, the reason that quoted patch cannot be right is that this
> code pattern:
>
> while (wait_for_completion_interruptible(&worker->ready_to_start))
> ;
>
> is not valid kernel code. EVER. There is absolutely no way that can be correct.
>
> Either that code can take a signal, or it cannot. If it can take a
> signal, it had better react to said signal. If it cannot, it must not
> use an interruptble sleep - since now that loop turned into a
> kernel-side busy-loop.
>
> So NAK on this kind of crazy "I don't know what happened, so I'll just
> add *more* bugs to the code" voodoo programming.
>
> And no, we don't "fix" that by then adding a timeout.
>
> Stop this "add random code" thing.
>
> If you cannot be woken up before X happens, then you should:
>
> - don't go to sleep before X happens
>
> - don't add yourself to any wait-queues before X happens
>
> - don't expose your process to others before X happens
>
> The solution is *not* to add some completion with random "wait for
> this before waking".
>
> I think the problem here is much more fundamental: you expect a new
> thread to not wake up until you've told it to.
>
> We do have that infrastructure in the kernel: when you create a new
> thread, you can do various setup, and the thread won't actually run
> until you do "wake_up_new_task()" on it.
>
> However, that's not how kernel_thread() (or kernel_clone()) works.
> Those will call wake_up_new_task(p) for you, and as such a new kernel
> thread will immediately start running.
>
> So I think the expectations here are entirely wrong. I think
> create_worker() is fundamentally buggy, in how it does that
>
> /* start the newly created worker */
> ..
> wake_up_process(worker->task);
>
> because that wake_up_process() is already much too late. The process
> got woken up already, because it was created by create_kthread() ->
> kernel_thread() -> kernel_clone, which does that wake_up_new_task()
> and it starts running.
>
> If you want to do thread setup *bnefore* the kernel is running, it
> needs to be done before that wake_up_new_task().
>
> That's very possible. Look at what create_io_thread() does, for
> example: it never calls wake_up_new_process() at all, and leaves that
> to the caller, which has done the extra setup.
>
> Or the kernel_clone_args could have a "init" function that gets called
> before doing the wake_up_new_task() is done. Or a number of other
> solutions.
>
> But no, we're not randomly adding some new completion because people
> were confused and thought they were waking things up when it was
> already awake from before.

Ugh. This use of kthread_bind_mask is buggy, and kthread_bind_mask and
kthread_bind look buggy by definition. As far as I know, the kind of
games that involve wait_task_inactive are only safe in special stop
states, which TASK_UNINTERRUPTIBLE is not.

I don't know if we have any current code in the kernel that does this
but I have seen code proposed that for suspend/resume that would wake up
everything not in a special stop state, and have every kernel task
figure out if it should be sleeping or not after such a resume.

This implies that kthread_park needs to be used for this logic to work
at all. So the only legitimate user of the __kthread_bind and
__kthread_bind_mask appears to be kthread_unpark.



Let me suggest someone create a new variant of kthread_create that takes
all of the parameters the workqueue code wants to set.

Hopefully that will be enough that we can use it to remove the handful
of buggy cases that call kthread_bind_mask and kthread_bind.

Eric

2022-06-25 18:50:46

by Linus Torvalds

Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 10:36 AM Eric W. Biederman
<[email protected]> wrote:
>
> Let me suggest someone create a new variant of kthread_create that takes
> all of the parameters the workqueue code wants to set.

I suspect the real issue is that the kthread code simply
shouldn't use the kernel_thread() helper at all.

That helper is literally designed for "start a thread, run this thing".

That's what it *does*.

And that's not at all what the kthread code wants. It wants to set
affinity masks, it wants to create a name for the thread, it wants to
do all those other things.

That code really wants to just do copy_process().

Or maybe it really should just use create_io_thread(), which has a
much better interface, except it has that one oddity in that it sets
the flag that does this:

if (args->io_thread) {
/*
* Mark us an IO worker, and block any signal that isn't
* fatal or STOP
*/
p->flags |= PF_IO_WORKER;
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}

that then has special semantics.

Now, those special semantics may actually be exactly what kthreads
wants, but they are important:

/*
* PF_IO_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/

which is about that "what happens for fatal signals" thing.

Linus

2022-06-25 19:10:14

by Linus Torvalds

Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds
<[email protected]> wrote:
>
> And that's not at all what the kthread code wants. It wants to set
> affinity masks, it wants to create a name for the thread, it wants to
> do all those other things.
>
> That code really wants to just do copy_process().

Honestly, I think kernel/kthread.c should be almost rewritten from scratch.

I do not understand why it does all those odd keventd games at all,
and why kthread_create_info exists in the first place.

Why does kthread_create() not just create the thread directly itself,
and instead does that odd queue it onto a work function?

Some of that goes back to before the git history, and very little of
it seems to make any sense. It's as if the code is meant to be able to
run from interrupt context, but that can't be it: it's literally doing
a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc.

So why is it calling kthreadd_task() to create the thread? Purely for
some crazy odd "make that the parent" reason?

I dunno. The code is odd, unexplained, looks buggy, and most of the
reasons are probably entirely historical.

I'm adding Christian to this thread too, since I get the feeling that
it really should be more tightly integrated with copy_process(), and
that Christian might have comments.

Christian, see some context in the thread here:

https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/

for some of this.

Linus

2022-06-25 23:59:42

by Eric W. Biederman

Subject: Re: re. Spurious wakeup on a newly created kthread

Linus Torvalds <[email protected]> writes:

> On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> And that's not at all what the kthread code wants. It wants to set
>> affinity masks, it wants to create a name for the thread, it wants to
>> do all those other things.
>>
>> That code really wants to just do copy_process().
>
> Honestly, I think kernel/kthread.c should be almost rewritten from scratch.
>
> I do not understand why it does all those odd keventd games at all,
> and why kthread_create_info exists in the first place.

I presume you mean kthreadd games?

> Why does kthread_create() not just create the thread directly itself,
> and instead does that odd queue it onto a work function?
>
> Some of that goes back to before the git history, and very little of
> it seems to make any sense. It's as if the code is meant to be able to
> run from interrupt context, but that can't be it: it's literally doing
> a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc.
>
> So why is it calling kthreadd_task() to create the thread? Purely for
> some crazy odd "make that the parent" reason?
>
> I dunno. The code is odd, unexplained, looks buggy, and most fo the
> reasons are probably entirely historical.

I can explain why kthreadd exists and why it creates the threads.

Very long ago in the context of random userspace processes people would
use kernel_thread to create threads and a helper function that I think
was called something like kernel_daemonize to scrub the userspace bits
off.

It was an unending sources of problems as the scrub was never complete
nor correct.

So with the introduction of kthreadd the kernel threads were moved
out of the userspace process tree, and userspace stopped being able to
influence the kernel threads.

AKA instead of doing the equivalent of a suid exec, the code started
doing the equivalent of sshing into the local box.

We *need* to preserve that kind of separation.

I want to say that all that is required is that copy_process copies
from kthreadd. Unfortunately that means it needs to be kthreadd
doing the work, as copy_process always copies from current. It
would take quite a bit of work to untangle that mess.

It does appear possible to write a parallel function to copy_process
that is used only for creating kernel threads, and can streamline itself
because it knows it is creating kernel threads.

Short of that the code needs to keep routing through kthreadd.

Using create_io_thread or a dedicated wrapper around copy_process
certainly looks like it could simplify some of kthread creation.

> I'm adding Christian to this thread too, since I get the feeling that
> it really should be more tightly integrated with copy_process(), and
> that Christian might have comments.
>
> Christian, see some context in the thread here:
>
> https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/
>
> for some of this.
>
> Linus

Eric

2022-06-26 00:00:39

by Linus Torvalds

Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 4:43 PM Linus Torvalds
<[email protected]> wrote:
>
> I wonder if we could basically get rid of every use of 'current' in
> kernel/fork.c with just a task pointer that is passed in, and then for
> kernel threads pass in 'init_task'.

That might even help code generation. Instead of doing the 'look up
current' all the time, just having the parent task as an argument
might actually simplify things.

We historically used to do those kinds of things exactly because it
helps generate better code (particularly with inline functions, and
things like 'exit' that calls many different things), but have mostly
avoided it because 'current' may generate some small asm snippet all
the time, but it clarifies the "it can't be a random thread" and
locking issues.

But if it's always 'current or init_task', we don't have many locking issues.

Of course, there might be some reason not to want to use init_task
because it is _so_ special, and so parenting to something silly like
kthreadd ends up being simpler. But..

Anyway, the whole "don't wake up thread until it's all done" is a
separate and independent issue from the "odd use of kthreadd" issue.

Linus

2022-06-26 00:02:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

"Eric W. Biederman" <[email protected]> writes:

> Linus Torvalds <[email protected]> writes:
>
>> On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> And that's not at all what the kthread code wants. It wants to set
>>> affinity masks, it wants to create a name for the thread, it wants to
>>> do all those other things.
>>>
>>> That code really wants to just do copy_process().
>>
>> Honestly, I think kernel/kthread.c should be almost rewritten from scratch.
>>
>> I do not understand why it does all those odd keventd games at all,
>> and why kthread_create_info exists in the first place.
>
> I presume you mean kthreadd games?
>
>> Why does kthread_create() not just create the thread directly itself,
>> and instead does that odd queue it onto a work function?
>>
>> Some of that goes back to before the git history, and very little of
>> it seems to make any sense. It's as if the code is meant to be able to
>> run from interrupt context, but that can't be it: it's literally doing
>> a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc.
>>
>> So why is it calling kthreadd_task() to create the thread? Purely for
>> some crazy odd "make that the parent" reason?
>>
> I dunno. The code is odd, unexplained, looks buggy, and most of the
> reasons are probably entirely historical.
>
> I can explain why kthreadd exists and why it creates the threads.
>
> Very long ago in the context of random userspace processes people would
> use kernel_thread to create threads and a helper function that I think
> was called something like kernel_daemonize to scrub the userspace bits
> off.
>
> It was an unending source of problems as the scrub was never complete
> nor correct.
>
> So with the introduction of kthreadd the kernel threads were moved
> out of the userspace process tree, and userspace stopped being able to
> influence the kernel threads.
>
> AKA instead of doing the equivalent of a suid exec the code started
> doing the equivalent of sshing into the local box.
>
> We *need* to preserve that kind of separation.
>
> I want to say that all that is required is that copy_process copies
> from kthreadd. Unfortunately that means that it needs to be kthreadd
> doing the work, as copy_process always copies from current. It
> would take quite a bit of work to untangle that mess.
>
> It does appear possible to write a parallel function to copy_process
> that is used only for creating kernel threads, and can streamline itself
> because it knows it is creating kernel threads.
>
> Short of that the code needs to keep routing through kthreadd.
>
> Using create_io_thread or a dedicated wrapper around copy_process
> certainly looks like it could simplify some of kthread creation.

Hmm. Looking at kthread() I completely agree that kernel_thread() has
the wrong set of semantics and we really could benefit from never waking
the fledgling kernel thread in the first place.

Eric

2022-06-26 00:23:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread


Linus Torvalds <[email protected]> writes:

> Anyway, the whole "don't wake up thread until it's all done" is a
> separate and independent issue from the "odd use of kthreadd" issue.

Yes. "don't wake up a kthread until it's all done" is a much easier
change that will simplify things tremendously.

Further it is necessary for Peter Zijlstra's rewrite of the kernel
freezer, as anything that isn't a special stop state (which
TASK_UNINTERRUPTIBLE is not) will receive a spurious wakeup when
thawed out.

Eric

2022-06-26 00:24:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 4:28 PM Eric W. Biederman <[email protected]> wrote:
>
> I presume you mean kthreadd games?

Yeah, sorry.

> So with the introduction of kthreadd the kernel threads were moved
> out of the userspace process tree, and userspace stopped being able to
> influence the kernel threads.

Ahh. So essentially it's indeed just basically the parenting issue.

> I want to say that all that is required is that copy_process copies
> from kthreadd. Unfortunately that means that it needs to be kthreadd
> doing the work, as copy_process does always copies from current. It
> would take quite a bit of work to untangle that mess.

Hmm. Yeah, it would probably be painful to replace 'current' with
something else, due to locking etc. Using "current" has some
fundamental advantages wrt using another thread pointer that may or
may not be active at the same time.

That said, there *is* another thread that has very similar "we don't
need locking, because it's always around and it doesn't change":
init_task.

I wonder if we could basically get rid of every use of 'current' in
kernel/fork.c with just a task pointer that is passed in, and then for
kernel threads pass in 'init_task'.

I'd hate to duplicate copy_process() entirely, when we probably could
just consider it just a parenting issue, and say that you can have
either 'current' or 'init_task' as the parent. And that 'init_task'
would basically be the 'clean slate for kthreads'.

Christian, comments?

Linus

2022-06-26 00:27:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

Linus Torvalds <[email protected]> writes:

> On Sat, Jun 25, 2022 at 4:28 PM Eric W. Biederman <[email protected]> wrote:
>>
>> I presume you mean kthreadd games?
>
> Yeah, sorry.
>
>> So with the introduction of kthreadd the kernel threads were moved
>> out of the userspace process tree, and userspace stopped being able to
>> influence the kernel threads.
>
> Ahh. So essentially it's indeed just basically the parenting issue.

That is one way to look at it. The way I described it at the time was:

> commit 73c279927f89561ecb45b2dfdf9314bafcfd9f67
> Author: Eric W. Biederman <[email protected]>
> Date: Wed May 9 02:34:32 2007 -0700
>
> kthread: don't depend on work queues
>
> Currently there is a circular reference between work queue initialization
> and kthread initialization. This prevents the kthread infrastructure from
> initializing until after work queues have been initialized.
>
> We want the properties of tasks created with kthread_create to be as close
> as possible to the init_task and to not be contaminated by user processes.
> The later we start our kthreadd that creates these tasks the harder it is
> to avoid contamination from user processes and the more of a mess we have
> to clean up because the defaults have changed on us.
>
> So this patch modifies the kthread support to not use work queues but to
> instead use a simple list of structures, and to have kthreadd start from
> init_task immediately after our kernel thread that execs /sbin/init.
>
> By being a true child of init_task we only have to change those process
> settings that we want to have different from init_task, such as our process
> name, the cpus that are allowed, blocking all signals and setting SIGCHLD
> to SIG_IGN so that all of our children are reaped automatically.
>
> By being a true child of init_task we also naturally get our ppid set to 0
> and do not wind up as a child of PID == 1. Ensuring that tasks generated
> by kthread_create will not slow down the functioning of the wait family of
> functions.
>
> [[email protected]: use interruptible sleeps]
> Signed-off-by: Eric W. Biederman <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>

Eric

2022-06-26 00:29:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

Linus Torvalds <[email protected]> writes:

> On Sat, Jun 25, 2022 at 10:36 AM Eric W. Biederman
> <[email protected]> wrote:
>>
>> Let me suggest someone create a new variant of kthread_create that takes
>> all of the parameters the workqueue code wants to set.
>
> I suspect the real issue is that that the kthread code simply
> shouldn't use the kernel_thread() helper at all.
>
> That helper is literally designed for "start a thread, run this thing".
>
> That's what it *does*.
>
> And that's not at all what the kthread code wants. It wants to set
> affinity masks, it wants to create a name for the thread, it wants to
> do all those other things.
>
> That code really wants to just do copy_process().
>
> Or maybe it really should just use create_io_thread(), which has a
> much better interface, except it has that one oddity in that it sets
> the flag that does this:
>
> if (args->io_thread) {
> /*
> * Mark us an IO worker, and block any signal that isn't
> * fatal or STOP
> */
> p->flags |= PF_IO_WORKER;
> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> }
>
> that then has special semantics.

It is worth pointing out kthreads also have special semantics for
signals and they are different. In particular kthreads ignore all
signals by default.

The io_threads are much more like userspace threads: the userspace
process handles signals, and it is just the io_thread that blocks the
signals so they will go to the real userspace processes.

Eric

2022-06-26 02:13:13

by Tejun Heo

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

Hello,

On Sat, Jun 25, 2022 at 10:01:35AM -0700, Linus Torvalds wrote:
> On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <[email protected]> wrote:
> >
> > So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
> > on a newly created kthread.
>
> What? No. That patch can't be right for several reasons.
>
> What we call "spurious wakeups" exist, but they are about wakeups that
> happen from being on a _previous_ wait-queue, and having already been
> removed from it.
>
> They aren't "really" spurious, they are just asynchronous enough (and
> thus unexpected) that you basically should never have a "sleep on
> wait-queue" without then looping and re-testing the condition.

Can you elaborate on this a bit? At least for the standard
wait_event-ish wait blocks, the waiter always does finish_wait()
before leaving a wait. finish_wait() does a lockless check on
wq_entry->entry and may or may not grab wq_head->lock. When it does,
it's fully synchronized against the waker. Even when it doesn't, while
the lack of memory ordering on the finish_wait() side may let things
slide a bit, I can't see how it can slide after the set_current_state
in the next wait block.

I'm probably missing something. Is it about bespoke wait mechanisms?
Can you give a concrete example of an async wakeup scenario?

> There is no _truly_ spurious wakeup. You were always woken up for a
> reason, it's just that there are more reasons than the entirely
> obvious ones.

So, the deferred wakeups from earlier waits are one. Can you give some
other examples? This is something which has always bothered me and I
couldn't find explanations which aren't hand-wavy on my own. It'd be
really great to have clarity.

> For example, the reason that quoted patch cannot be right is that this
> code pattern:
>
> while (wait_for_completion_interruptible(&worker->ready_to_start))
> ;
>
> is not valid kernel code. EVER. There is absolutely no way that can be correct.
>
> Either that code can take a signal, or it cannot. If it can take a
> signal, it had better react to said signal. If it cannot, it must not
> use an interruptble sleep - since now that loop turned into a
> kernel-side busy-loop.
>
> So NAK on this kind of crazy "I don't know what happened, so I'll just
> add *more* bugs to the code" voodoo programming.
>
> And no, we don't "fix" that by then adding a timeout.

Yeah, I should've been more explicit on this. Michal already pointed
out that it doesn't make sense to loop over interruptible timed sleeps
and it should use one plain uninterruptible sleep, so this part isn't
in question.

...
> I think the problem here is much more fundamental: you expect a new
> thread to not wake up until you've told it to.
>
> We do have that infrastructure in the kernel: when you create a new
> thread, you can do various setup, and the thread won't actually run
> until you do "wake_up_new_task()" on it.

That's because that's the only thing which ignores TASK_NEW, right?

> However, that's not how kernel_thread() (or kernel_clone()) works.
> Those will call wake_up_new_task(p) for you, and as such a new kernel
> thread will immediately start running.
>
> So I think the expectations here are entirely wrong. I think
> create_worker() is fundamentally buggy, in how it does that
>
> /* start the newly created worker */
> ..
> wake_up_process(worker->task);
>
> because that wake_up_process() is already much too late. The process
> got woken up already, because it was created by create_kthread() ->
> kernel_thread() -> kernel_clone, which does that wake_up_new_task()
> and it starts running.

A couple things still aren't clear for me.

* If there are no true spurious wakeups, where did the racing wakeup
come from? The task just got created w/ TASK_NEW and woken up once
with wake_up_new_task(). It hasn't been on any wait queue or
advertised itself to anything.

* If there are spurious wakeups, why is kthread() scheduling after
signaling creation completion in the first place? As I wrote before,
all it would do is mask these bugs. If we can't guarantee that the
kthread will stay blocked, shouldn't we just remove the
schedule_preempt_disabled() call in kthread()?

Thanks.

--
tejun

2022-06-26 03:02:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 6:58 PM Tejun Heo <[email protected]> wrote:
> >
> > They aren't "really" spurious, they are just asynchronous enough (and
> > thus unexpected) that you basically should never have a "sleep on
> > wait-queue" without then looping and re-testing the condition.
>
> Can you elaborate on this a bit? At least for the standard
> wait_event-ish wait blocks, the waiter always does finish_wait()
> before leaving a wait.

Correct.

As long as *all* you ever use is a standard wait_event thing, you are
always serialized by the spinlock on the wait queue.

However.

Very few processes only use standard wait_event things. There are a
number of places that use "wake_up_process()" which doesn't use a
wait-queue, but a wake-up directed right at a particular task.

Is that the _common_ pattern? It's not *uncommon*. It's maybe not the
strictly normal wait-queue one, but if you grep for
"wake_up_process()" you will find quite a lot of them.

And yes, many of those uses too will use strictly serialized locking:
in that both the waker and the sleeper (ie the target of the
wake_up_process()) will have serialized the target thread pointer with
some lock, exactly like the normal wait-queues serialize using the
wait-queue lock.

But several do *not* use locking, and instead rely on the
thread_struct being RCU-free'd.

In fact, I think kthread_create() itself is such an example, with that

wake_up_process(kthreadd_task);

just doing a blind directed wakeup with no locking what-so-ever, just
a straight wake_up.

And the important thing to notice is that if you have even just *one*
such user, that kthreadd_task will basically get randomly "spuriously"
woken up while it is waiting for something else.

Now, the kthreadd_task task is kind of specialized, and that
particular wake_up_process() doesn't affect anybody else, so think of
that just as an example.

But now imagine any of the other unlocked directed wake_up_process()
users. Imagine them targeting regular user processes that may be doing
regular system calls. Imagine that wake_up_process() being just done
under a RCU read lock, and maybe the process had already woken up
*AND* had already gone on to do entirely other things!

IOW, maybe you are now a thread that does a perfectly normal locked
wait_queue thing - but because the *previous* system call did
something that did less strictly locked things, there may be another
CPU that is still in the process of waking you up from that previous
system call. So now your "strictly locked" waitqueue usage sees a
"spurious" wakeup, because a previous not-so-strictly-locked usage had
been delayed by interrupts on another CPU.

And yes, those "wake up process under RCU read lock" really do exist.
There are literally things that take a reference to a task struct, add
it to some RCU-safe list, and then do the wakeup without locking.

> I'm probably missing sometihng. Is it about bespoke wait mechanisms?
> Can you give a concrete example of an async wakeup scenario?

Grep for wake_up_process() and just look for them.

Side note: signals end up doing effectively the same thing if you're
doing interruptible waits, of course, but people are probably more
*aware* of signals when they use TASK_INTERRUPTIBLE. But these
RCU-delayed wakeups can wake up even TASK_UNINTERRUPTIBLE calls.

Anyway, it's easy enough to deal with: use the "event" macros, and
you'll never have to worry about it.

But if you use explicit wait-queues and manual scheduling (rather than
wait_event() and friends), you need to be aware that when you go to
sleep, the fact that you woke up is *not* a guarantee that the wakeup
came from the wait queue.

So you need to always do that in a loop. The wait_event code will do
that loop for you, but if you do manual wait-queues you are required
to do the looping yourself.

> So, the deferred wakeups from earlier waits are one. Can you give some
> other examples? This is something which has always bothered me and I
> couldn't find explanations which aren't hand-wavy on my own. It'd be
> really great to have clarity.

There's a second class of "spurious" waits that aren't really spurious
at all, and aren't even deferred, and are simply due to "there were
multiple things you waited for, but you didn't even *think* about it".

The obvious one is for poll/select, but there people are aware of it.

The less obvious one is for taking a page fault or being put to sleep
and being woken up again while actively *inside* a wait loop, and
already with a wait-queue entry added.

For a historical example of *both* of those kinds of situations, take
a look at n_tty_read() in drivers/tty/n_tty.c, and in particular,
notice how the wait-loop looks something like this:

add_wait_queue(&tty->read_wait, &wait);
while (nr) {
..
}
...
remove_wait_queue(&tty->read_wait, &wait);

and *inside* the loop, you have both

(a) locking:

down_read(&tty->termios_rwsem);

(b) historically (but not any more) user space accesses

and both of those kinds mean that you actually have overlapping
lifetimes of wait-queues for the same process.

That's very much how wait-queues were always designed to work - it
solves a lot of race conditions (ie the traditional "sleep_on()" model
is broken garbage), but it means that each individual user doesn't
necessarily understand that other active wait-queues can wake up a
process *while* the code thinks about another wait-queue entirely.

IOW, when you are in the page fault code, and you are waiting for the
filesystem IO to complete, you may *also* be on that kind of
"tty->read_wait" waitqueue, and a character coming in on that tty will
wake you up.

The *filesystem* code doesn't care about the tty wakeup, so it's very
much an example of a "spurious" wake event as far as the filesystem
code is concerned.

Similarly, when you are locking a semaphore, the only wait-queue
activity you care about at the time of the lock is the ones coming
from the unlockers, but you may get "spurious" wakeups from other
things that the process is *also* interested in.

Again, none of these are *really* spurious. They are real wakeup
events. It's just that within the *local* code they look spurious,
because the locking code, the disk IO code, whatever the code is
doesn't know or care about all the other things that process is
involved in.

And once again, it's not that different from "hey, signals can wake
you up at pretty much any time", but people *think* that because they
are doing a non-interruptible wait it is somehow "exclusive", and
don't think about all the other things that that process has been
involved in.

> * If there are no true spurious wakeups, where did the racing wakeup
> come from? The task just got created w/ TASK_NEW and woken up once
> with wake_up_new_task(). It hasn't been on any wait queue or
> advertised itself to anything.

I don't think it was ever a spurious wakeup at all.

The create_worker() code does:

worker->task = kthread_create_on_node(..
..
worker_attach_to_pool(worker, pool);
..
wake_up_process(worker->task);

and thinks that the wake_up_process() happens after the worker_attach_to_pool().

But I don't see that at all.

The reality seems to be that the wake_up_process() is a complete
no-op, because the task was already woken up by
kthread_create_on_node().

But I dunno. I think the whole argument of

Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
until it is explicitly woken.

was completely broken to begin with, and then the belief that there's
some spurious wakeup came from that.

But hey, that "wake_up_process(worker->task)" itself does seem spurious.

Anyway, I think this whole "spurious wakeup" is a complete red
herring. They don't "really" exist, but any code that expects some
exclusive wakeup tends to be broken code and that kind of thinking is
dangerous.

So you should *ALWAYS* write your code as if things can get random
spurious wakeups at any time. Because with various CPU hotplug events
etc, it really might happen.

But I don't think that's what's going on here. I think the workqueue
code is just confused, and should have initialized "worker->pool" much
earlier. Because as things are now, when worker_thread() starts
running, and does that

static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
struct worker_pool *pool = worker->pool;

thing, that can happen *immediately* after that

kthread_create_on_node(worker_thread, worker,

happens. It just so happens that *normally* the create_worker() code
ends up finishing setup before the new worker has actually finished
scheduling..

No?

Linus

2022-06-26 06:15:18

by Tejun Heo

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

Hello,

On Sat, Jun 25, 2022 at 07:53:34PM -0700, Linus Torvalds wrote:
...
> Is that the _common_ pattern? It's not *uncommon*. It's maybe not the
> strictly normal wait-queue one, but if you grep for
> "wake_up_process()" you will find quite a lot of them.
...
> > I'm probably missing something. Is it about bespoke wait mechanisms?
> > Can you give a concrete example of an async wakeup scenario?
>
> Grep for wake_up_process() and just look for them.

Right, for kthreads, custom work list + directed wakeup at the known
handler task is a pattern seen across the code base and that does
affect all other waits the kthread would do.

...
> So you need to always do that in a loop. The wait_event code will do
> that loop for you, but if you do manual wait-queues you are required
> to do the looping yourself.

The reason why this bothered me sometimes is that w/ simple kthread
use cases, there are places where all the legitimate wakeup sources are
clearly known. In those cases, the fact that I can't think of a case
where the looping would be needed creates a subtle nagging feeling when
writing them.

...
> Again, none of these are *really* spurious. They are real wakeup
> events. It's just that within the *local* code they look spurious,
> because the locking code, the disk IO code, whatever the code is
> doesn't know or care about all the other things that process is
> involved in.

I see. Yeah, waiting on multiple sources which may not be known to
each wait logic makes sense.

...
> But I don't think that's what's going on here. I think the workqueue
> code is just confused, and should have initialized "worker->pool" much
> earlier. Because as things are now, when worker_thread() starts
> running, and does that
>
> static int worker_thread(void *__worker)
> {
> struct worker *worker = __worker;
> struct worker_pool *pool = worker->pool;
>
> thing, that can happen *immediately* after that
>
> kthread_create_on_node(worker_thread, worker,
>
> happens. It just so happens that *normally* the create_worker() code
> ends up finishing setup before the new worker has actually finished
> scheduling..
>
> No?

I'm not sure. Putting aside the always-loop-with-cond-check principle
for the time being, I can't yet think of a scenario where the
interlocking would break down unless there really is a wakeup which is
coming from an unrelated source.

Just experimented with commenting out that wakeup in create_worker().
Simply commenting it out doesn't break anything, but if I wait for
PF_WQ_WORKER to be set by the new worker thread, it doesn't happen.
i.e. the initial wakeup is spurious because there is a later wakeup
which comes when a work item is actually dispatched to the new worker.
But the newly created kworker won't start executing its callback
function without at least one extra wakeup, wherever that may be
coming from.

After the initial TASK_NEW handshake, the new kthread notifies
kthread_create_info->done and then goes into an UNINTERRUPTIBLE sleep
and won't start executing the callback function unless somebody wakes
it up. This being a brand new task, it's a pretty sterile wakeup
environment. So, there actually has to be an outside wakeup source
here.

If we say that anyone should expect external wakeups that it has no
direct control over, the kthread_bind interface is broken too cuz
that's making exactly the same assumption that the task is dormant at
that point till the owner sends a wakeup and thus its internal state
can be manipulated safely.

Petr, regardless of how this eventually gets resolved, I'm really
curious where the wakeup that you saw came from. Would it be possible
for you to find out?

Thanks.

--
tejun

2022-06-26 19:25:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE


Being silly I figured I would poke my nose in and see how much work it
is to never wake up kthreads until we are ready to use them.

This is my first draft at that and something that can hopefully shape
the conversation on how we want to fix things a little bit.

The big thing that needs to happen that I haven't implemented is that
kthread_run and kthread_run_on_cpu need to be uninlined and moved into
kthread.c. This will allow them to call wake_up_new_task even from
modular code.

The handful of drivers that are using kthread_create_on_node by
extension need to be modified to use kthread_run or kthread_run_on_cpu.

Eric W. Biederman (3):
kthread: Remove the flags argument from kernel_thread
kthread: Replace kernel_thread with new_kthread
kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

arch/arm/common/bL_switcher.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/firmware/psci/psci_checker.c | 2 +-
drivers/firmware/stratix10-svc.c | 4 +-
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
drivers/scsi/qedi/qedi_main.c | 2 +-
include/linux/kthread.h | 4 +-
include/linux/sched/task.h | 2 +-
init/main.c | 6 +-
kernel/bpf/cpumap.c | 2 +-
kernel/dma/map_benchmark.c | 2 +-
kernel/fork.c | 5 +-
kernel/kthread.c | 120 +++++++++++++++---------------
kernel/smpboot.c | 1 +
kernel/workqueue.c | 2 +-
net/core/pktgen.c | 2 +-
net/sunrpc/svc.c | 2 +-
19 files changed, 82 insertions(+), 86 deletions(-)

Eric

2022-06-26 19:50:31

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/3] kthread: Remove the flags argument from kernel_thread


There are only two callers of kernel_thread remaining: the caller in
init/main.c that creates kthreadd, and the caller in kernel/kthread.c.

Both callers pass CLONE_FS|CLONE_FILES. The SIGCHLD argument causes
the thread to exit with the ordinary process SIGCHLD semantics.

As kthreadd never exits it simply does not matter what kind of exit
it has. So for simplicity make it look like everything else and use
SIGCHLD.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/task.h | 2 +-
init/main.c | 2 +-
kernel/fork.c | 3 ++-
kernel/kthread.c | 2 +-
4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 505aaf9fe477..d95930e220da 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -91,7 +91,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs);
struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
struct task_struct *fork_idle(int);
struct mm_struct *copy_init_mm(void);
-extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t kernel_thread(int (*fn)(void *), void *arg);
extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags);
extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
int kernel_wait(pid_t pid, int *stat);
diff --git a/init/main.c b/init/main.c
index 0ee39cdcfcac..211d38db0d16 100644
--- a/init/main.c
+++ b/init/main.c
@@ -701,7 +701,7 @@ noinline void __ref rest_init(void)
rcu_read_unlock();

numa_default_policy();
- pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
+ pid = kernel_thread(kthreadd, NULL);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d44f2d46c69..65909ded0ea7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2694,8 +2694,9 @@ pid_t kernel_clone(struct kernel_clone_args *args)
/*
* Create a kernel thread.
*/
-pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
+pid_t kernel_thread(int (*fn)(void *), void *arg)
{
+ unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD;
struct kernel_clone_args args = {
.flags = ((lower_32_bits(flags) | CLONE_VM |
CLONE_UNTRACED) & ~CSIGNAL),
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 544fd4097406..c0505e6b7142 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -396,7 +396,7 @@ static void create_kthread(struct kthread_create_info *create)
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
- pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
+ pid = kernel_thread(kthread, create);
if (pid < 0) {
/* If user was SIGKILLed, I release the structure. */
struct completion *done = xchg(&create->done, NULL);
--
2.35.3

2022-06-26 19:56:09

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/3] kthread: Replace kernel_thread with new_kthread


It is desirable to be able to perform all of the kthread setup before
the kernel thread is woken for the first time.

To make that possible replace kernel_thread with new_kthread that does
all of the same work except it does not call wake_up_new_task.

Replace the two uses of kernel_thread with new_kthread and a call
to wake_up_new_task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/task.h | 2 +-
init/main.c | 6 ++----
kernel/fork.c | 4 ++--
kernel/kthread.c | 10 ++++++----
4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d95930e220da..c4c7a0118553 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -91,7 +91,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs);
struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
struct task_struct *fork_idle(int);
struct mm_struct *copy_init_mm(void);
-extern pid_t kernel_thread(int (*fn)(void *), void *arg);
+extern struct task_struct *new_kthread(int (*fn)(void *), void *arg, int node);
extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags);
extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
int kernel_wait(pid_t pid, int *stat);
diff --git a/init/main.c b/init/main.c
index 211d38db0d16..b437581f8001 100644
--- a/init/main.c
+++ b/init/main.c
@@ -701,10 +701,8 @@ noinline void __ref rest_init(void)
rcu_read_unlock();

numa_default_policy();
- pid = kernel_thread(kthreadd, NULL);
- rcu_read_lock();
- kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
- rcu_read_unlock();
+ kthreadd_task = new_kthread(kthreadd, NULL, NUMA_NO_NODE);
+ wake_up_new_task(kthreadd_task);

/*
* Enable might_sleep() and smp_processor_id() checks.
diff --git a/kernel/fork.c b/kernel/fork.c
index 65909ded0ea7..794d9f9c78bc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2694,7 +2694,7 @@ pid_t kernel_clone(struct kernel_clone_args *args)
/*
* Create a kernel thread.
*/
-pid_t kernel_thread(int (*fn)(void *), void *arg)
+struct task_struct *new_kthread(int (*fn)(void *), void *arg, int node)
{
unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD;
struct kernel_clone_args args = {
@@ -2706,7 +2706,7 @@ pid_t kernel_thread(int (*fn)(void *), void *arg)
.kthread = 1,
};

- return kernel_clone(&args);
+ return copy_process(NULL, 0, node, &args);
}

/*
diff --git a/kernel/kthread.c b/kernel/kthread.c
index c0505e6b7142..8529f6b1582b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -390,14 +390,14 @@ int tsk_fork_get_node(struct task_struct *tsk)

static void create_kthread(struct kthread_create_info *create)
{
- int pid;
+ struct task_struct *new;

#ifdef CONFIG_NUMA
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
- pid = kernel_thread(kthread, create);
- if (pid < 0) {
+ new = new_kthread(kthread, create, NUMA_NO_NODE);
+ if (IS_ERR(new)) {
/* If user was SIGKILLed, I release the structure. */
struct completion *done = xchg(&create->done, NULL);

@@ -405,8 +405,10 @@ static void create_kthread(struct kthread_create_info *create)
kfree(create);
return;
}
- create->result = ERR_PTR(pid);
+ create->result = ERR_CAST(new);
complete(done);
+ } else {
+ wake_up_new_task(new);
}
}

--
2.35.3

2022-06-26 19:57:05

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)


Instead leave the task as a new unscheduled task and require the
caller to call wake_up_new_task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/arm/common/bL_switcher.c | 2 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/firmware/psci/psci_checker.c | 2 +-
drivers/firmware/stratix10-svc.c | 4 +-
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
drivers/scsi/qedi/qedi_main.c | 2 +-
include/linux/kthread.h | 4 +-
kernel/bpf/cpumap.c | 2 +-
kernel/dma/map_benchmark.c | 2 +-
kernel/kthread.c | 114 ++++++++++------------
kernel/smpboot.c | 1 +
kernel/workqueue.c | 2 +-
net/core/pktgen.c | 2 +-
net/sunrpc/svc.c | 2 +-
16 files changed, 72 insertions(+), 77 deletions(-)

diff --git a/arch/arm/common/bL_switcher.c b/arch/arm/common/bL_switcher.c
index 9a9aa53547a6..15c9841c3c15 100644
--- a/arch/arm/common/bL_switcher.c
+++ b/arch/arm/common/bL_switcher.c
@@ -311,7 +311,7 @@ static struct task_struct *bL_switcher_thread_create(int cpu, void *arg)
cpu_to_node(cpu), "kswitcher_%d", cpu);
if (!IS_ERR(task)) {
kthread_bind(task, cpu);
- wake_up_process(task);
+ wake_up_new_task(task);
} else
pr_err("%s failed for CPU %d\n", __func__, cpu);
return task;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index db813f819ad6..ba09f5cf1773 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1206,7 +1206,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
goto out;
}
kthread_bind(thread, cpu);
- wake_up_process(thread);
+ wake_up_new_task(thread);

ret = wait_event_interruptible(plr->lock_thread_wq,
plr->thread_done == 1);
@@ -1304,7 +1304,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp)
}

kthread_bind(thread, plr->cpu);
- wake_up_process(thread);
+ wake_up_new_task(thread);

ret = wait_event_interruptible(plr->lock_thread_wq,
plr->thread_done == 1);
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 27386a572ba4..b2f5b30a27aa 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3649,7 +3649,7 @@ static int mtip_block_initialize(struct driver_data *dd)
rv = -EFAULT;
goto kthread_run_error;
}
- wake_up_process(dd->mtip_svc_handler);
+ wake_up_new_task(dd->mtip_svc_handler);
if (wait_for_rebuild == MTIP_FTL_REBUILD_MAGIC)
rv = wait_for_rebuild;

diff --git a/drivers/firmware/psci/psci_checker.c b/drivers/firmware/psci/psci_checker.c
index 116eb465cdb4..52fcd122e2b6 100644
--- a/drivers/firmware/psci/psci_checker.c
+++ b/drivers/firmware/psci/psci_checker.c
@@ -418,7 +418,7 @@ static int suspend_tests(void)
* wait for the completion of suspend_threads_started.
*/
for (i = 0; i < nb_threads; ++i)
- wake_up_process(threads[i]);
+ wake_up_new_task(threads[i]);
complete_all(&suspend_threads_started);

wait_for_completion(&suspend_threads_done);
diff --git a/drivers/firmware/stratix10-svc.c b/drivers/firmware/stratix10-svc.c
index 14663f671323..64d07aaa68bf 100644
--- a/drivers/firmware/stratix10-svc.c
+++ b/drivers/firmware/stratix10-svc.c
@@ -581,7 +581,7 @@ static int svc_get_sh_memory(struct platform_device *pdev,
return -EINVAL;
}

- wake_up_process(sh_memory_task);
+ wake_up_new_task(sh_memory_task);

if (!wait_for_completion_timeout(&sh_memory->sync_complete, 10 * HZ)) {
dev_err(dev,
@@ -834,7 +834,7 @@ int stratix10_svc_send(struct stratix10_svc_chan *chan, void *msg)
return -EINVAL;
}
kthread_bind(chan->ctrl->task, cpu);
- wake_up_process(chan->ctrl->task);
+ wake_up_new_task(chan->ctrl->task);
}

pr_debug("%s: sent P-va=%p, P-com=%x, P-size=%u\n", __func__,
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 05ddbb9bb7d8..1b3af0e01d67 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -2622,7 +2622,7 @@ static int bnx2fc_cpu_online(unsigned int cpu)
/* bind thread to the cpu */
kthread_bind(thread, cpu);
p->iothread = thread;
- wake_up_process(thread);
+ wake_up_new_task(thread);
return 0;
}

diff --git a/drivers/scsi/bnx2i/bnx2i_init.c b/drivers/scsi/bnx2i/bnx2i_init.c
index 2b3f0c10478e..cd4976bb86fc 100644
--- a/drivers/scsi/bnx2i/bnx2i_init.c
+++ b/drivers/scsi/bnx2i/bnx2i_init.c
@@ -424,7 +424,7 @@ static int bnx2i_cpu_online(unsigned int cpu)
/* bind thread to the cpu */
kthread_bind(thread, cpu);
p->iothread = thread;
- wake_up_process(thread);
+ wake_up_new_task(thread);
return 0;
}

diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
index 83ffba7f51da..97ced4f12d6e 100644
--- a/drivers/scsi/qedi/qedi_main.c
+++ b/drivers/scsi/qedi/qedi_main.c
@@ -1967,7 +1967,7 @@ static int qedi_cpu_online(unsigned int cpu)

kthread_bind(thread, cpu);
p->iothread = thread;
- wake_up_process(thread);
+ wake_up_new_task(thread);
return 0;
}

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 30e5bec81d2b..fa5e24df84bf 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -53,7 +53,7 @@ bool kthread_is_per_cpu(struct task_struct *k);
struct task_struct *__k \
= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
if (!IS_ERR(__k)) \
- wake_up_process(__k); \
+ wake_up_new_task(__k); \
__k; \
})

@@ -77,7 +77,7 @@ kthread_run_on_cpu(int (*threadfn)(void *data), void *data,

p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
if (!IS_ERR(p))
- wake_up_process(p);
+ wake_up_new_task(p);

return p;
}
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f4860ac756cd..e72bbb766f01 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -475,7 +475,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,

/* Make sure kthread runs on a single CPU */
kthread_bind(rcpu->kthread, cpu);
- wake_up_process(rcpu->kthread);
+ wake_up_new_task(rcpu->kthread);

return rcpu;

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 0520a8f4fb1d..b28e33c07c92 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -134,7 +134,7 @@ static int do_map_benchmark(struct map_benchmark_data *map)

for (i = 0; i < threads; i++) {
get_task_struct(tsk[i]);
- wake_up_process(tsk[i]);
+ wake_up_new_task(tsk[i]);
}

msleep_interruptible(map->bparam.seconds * 1000);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8529f6b1582b..1769b5118694 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -329,51 +329,12 @@ EXPORT_SYMBOL(kthread_complete_and_exit);

static int kthread(void *_create)
{
- static const struct sched_param param = { .sched_priority = 0 };
- /* Copy data: it's on kthread's stack */
- struct kthread_create_info *create = _create;
- int (*threadfn)(void *data) = create->threadfn;
- void *data = create->data;
- struct completion *done;
- struct kthread *self;
- int ret;
-
- self = to_kthread(current);
+ struct kthread *self = to_kthread(current);
+ int ret = -EINTR;

- /* If user was SIGKILLed, I release the structure. */
- done = xchg(&create->done, NULL);
- if (!done) {
- kfree(create);
- kthread_exit(-EINTR);
- }
-
- self->threadfn = threadfn;
- self->data = data;
-
- /*
- * The new thread inherited kthreadd's priority and CPU mask. Reset
- * back to default in case they have been changed.
- */
- sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
- set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
-
- /* OK, tell user we're spawned, wait for stop or wakeup */
- __set_current_state(TASK_UNINTERRUPTIBLE);
- create->result = current;
- /*
- * Thread is going to call schedule(), do not preempt it,
- * or the creator may spend more time in wait_task_inactive().
- */
- preempt_disable();
- complete(done);
- schedule_preempt_disabled();
- preempt_enable();
-
- ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
cgroup_kthread_ready();
- __kthread_parkme(self);
- ret = threadfn(data);
+ ret = self->threadfn(self->data);
}
kthread_exit(ret);
}
@@ -391,25 +352,41 @@ int tsk_fork_get_node(struct task_struct *tsk)
static void create_kthread(struct kthread_create_info *create)
{
struct task_struct *new;
+ struct completion *done;

#ifdef CONFIG_NUMA
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
new = new_kthread(kthread, create, NUMA_NO_NODE);
+ create->result = new;
+ /* Claim the completion */
+ done = xchg(&create->done, NULL);
if (IS_ERR(new)) {
- /* If user was SIGKILLed, I release the structure. */
- struct completion *done = xchg(&create->done, NULL);
+ if (done)
+ complete(done);
+ } else if (done) {
+ static const struct sched_param param = { .sched_priority = 0 };
+ struct kthread *kthread = to_kthread(new);

- if (!done) {
- kfree(create);
- return;
- }
- create->result = ERR_CAST(new);
- complete(done);
- } else {
- wake_up_new_task(new);
+ kthread->threadfn = create->threadfn;
+ kthread->data = create->data;
+
+ /*
+ * The new thread inherited kthreadd's priority and CPU mask. Reset
+ * back to default in case they have been changed.
+ */
+ sched_setscheduler_nocheck(new, SCHED_NORMAL, &param);
+ set_cpus_allowed_ptr(new, housekeeping_cpumask(HK_TYPE_KTHREAD));
+
+ /* OK, tell user we're spawned, wait for stop or wakeup */
+ //wake_up_new_task(new);
}
+ /* If user was SIGKILLed, release the structure. */
+ if (!done)
+ kfree(create);
+ else
+ complete(done);
}

static __printf(4, 0)
@@ -518,11 +495,11 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create_on_node);

-static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, unsigned int state)
+static void kthread_bind_mask_parked(struct task_struct *p, const struct cpumask *mask)
{
unsigned long flags;

- if (!wait_task_inactive(p, state)) {
+ if (!wait_task_inactive(p, TASK_PARKED)) {
WARN_ON(1);
return;
}
@@ -534,14 +511,31 @@ static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mas
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}

-static void __kthread_bind(struct task_struct *p, unsigned int cpu, unsigned int state)
+static void kthread_bind_mask_new(struct task_struct *new, const struct cpumask *mask)
+{
+ unsigned long flags;
+
+ /*
+ * FIXME: verify that p is a new task that
+ * has not yet been passed through
+ * wake_up_new_task
+ */
+
+ /* It's safe because new has never been scheduled. */
+ raw_spin_lock_irqsave(&new->pi_lock, flags);
+ do_set_cpus_allowed(new, mask);
+ new->flags |= PF_NO_SETAFFINITY;
+ raw_spin_unlock_irqrestore(&new->pi_lock, flags);
+}
+
+static void __kthread_bind(struct task_struct *p, unsigned int cpu)
{
- __kthread_bind_mask(p, cpumask_of(cpu), state);
+ kthread_bind_mask_new(p, cpumask_of(cpu));
}

void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
{
- __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
+ kthread_bind_mask_new(p, mask);
}

/**
@@ -555,7 +549,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
- __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
+ __kthread_bind(p, cpu);
}
EXPORT_SYMBOL(kthread_bind);

@@ -629,7 +623,7 @@ void kthread_unpark(struct task_struct *k)
* The binding was lost and we need to set it again.
*/
if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ kthread_bind_mask_parked(k, cpumask_of(kthread->cpu));

clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
/*
@@ -863,7 +857,7 @@ __kthread_create_worker(int cpu, unsigned int flags,

worker->flags = flags;
worker->task = task;
- wake_up_process(task);
+ wake_up_new_task(task);
return worker;

fail_task:
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index b9f54544e749..79b8d05164d6 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -192,6 +192,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
* Park the thread so that it could start right on the CPU
* when it is available.
*/
+ wake_up_new_task(tsk);
kthread_park(tsk);
get_task_struct(tsk);
*per_cpu_ptr(ht->store, cpu) = tsk;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..2d16a933f452 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1961,7 +1961,7 @@ static struct worker *create_worker(struct worker_pool *pool)
raw_spin_lock_irq(&pool->lock);
worker->pool->nr_workers++;
worker_enter_idle(worker);
- wake_up_process(worker->task);
+ wake_up_new_task(worker->task);
raw_spin_unlock_irq(&pool->lock);

return worker;
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 84b62cd7bc57..40efd9b678fa 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3864,7 +3864,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn)

t->net = pn;
get_task_struct(p);
- wake_up_process(p);
+ wake_up_new_task(p);
wait_for_completion(&t->start_done);

return 0;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 7c9a0d0b1230..addbba323b9c 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -769,7 +769,7 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
svc_pool_map_set_cpumask(task, chosen_pool->sp_id);

svc_sock_update_bufs(serv);
- wake_up_process(task);
+ wake_up_new_task(task);
} while (nrservs > 0);

return 0;
--
2.35.3

2022-06-26 20:19:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Sun, Jun 26, 2022 at 12:16 PM Eric W. Biederman
<[email protected]> wrote:
>
> Instead leave the task as a new unscheduled task and require the
> caller to call wake_up_new_task.

So I think this is somewhat error-prone, and we should probably
abstract things out a bit more.

Almost every single case that does this does this for one single
reason: it wants to run setup code before the new kthread is actually
activated.

And I think *that* should be the change - add a "setup()" function
pointer to the whole kthread infrastructure. Allow it to return an
error, which will then just kill the new thread again without ever
even starting it up.

I'd really prefer to avoid having random drivers and subsystems know
about the *very* magical "wake_up_new_task()" thing. Yes, it's a real
thing, but it's a thing that normal code should not ever use.

The whole "wake_up_process()" model for kthread creation was wrong.
But moving existing users of a bad interface to using the even more
special "wake_up_new_task()" thing is not the solution, I feel.

Also, since you're very much in this area - you also please look into
getting rid of that horrible 'done' completion pointer entirely? The
xchg games on that thing are horrendous and very non-intuitive.

Linus

2022-06-26 21:29:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Sun, Jun 26, 2022 at 1:23 PM Tejun Heo <[email protected]> wrote:
>
> Would it be a reasonable tradeoff to have a kthread wrapper -
> kthread_start() or whatever - which ensures that it is actually called
> once on a new task? That way, we can keep the init code inline and
> bugs on both sides (not starting and starting incorrectly) are
> obvious.

Yeah, that sounds reasonable to me.

Linus

2022-06-26 21:34:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

Hello,

On Sun, Jun 26, 2022 at 12:59:09PM -0700, Linus Torvalds wrote:
> And I think *that* should be the change - add a "setup()" function
> pointer to the whole kthread infrastructure. Allow it to return an
> error, which will then just kill the new thread again without ever
> even starting it up.
>
> I'd really prefer to avoid having random drivers and subsystems know
> about the *very* magical "wake_up_new_task()" thing. Yes, it's a real
> thing, but it's a thing that normal code should not ever use.
>
> The whole "wake_up_process()" model for kthread creation was wrong.
> But moving existing users of a bad interface to using the even more
> special "wake_up_new_task()" thing is not the solution, I feel.

This is a bit of bike-shedding but there are inherent downsides to a
callback-based interface in terms of write/readability. Now you have
to move the init code out of line, and if the context that needs to be
passed doesn't fit in a single pointer, you gotta define a struct to
carry it, adding to the boilerplate.

Having separate create and run steps makes sense as a pattern, and
wake_up_new_task() is less error-prone in one sense - if the caller
doesn't call it, the new kthread actually won't start running, making
the bug obvious. The fact that the function blindly trusts the caller
to be handing in an actual new task is scary and we likely don't wanna
proliferate its use across the code base.

Would it be a reasonable tradeoff to have a kthread wrapper -
kthread_start() or whatever - which ensures that it is actually called
once on a new task? That way, we can keep the init code inline and
bugs on both sides (not starting and starting incorrectly) are
obvious.

Thanks.

--
tejun

2022-06-26 21:52:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/3] kthread: Remove the flags argument from kernel_thread

On Sun, Jun 26, 2022 at 12:15 PM Eric W. Biederman
<[email protected]> wrote:
>
> As kthreadd never exists it simply does not matter what kind of exit
> it has. So for simplicity make it look like everything else and use
> SIGCHLD.

That "never exists" should be "never exits" ;)

But:

> +pid_t kernel_thread(int (*fn)(void *), void *arg)
> {
> + unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD;
> struct kernel_clone_args args = {
> .flags = ((lower_32_bits(flags) | CLONE_VM |
> CLONE_UNTRACED) & ~CSIGNAL),

Please just get rid of that 'flags' thing, and the lower_32_bits()
games and write this all as

pid_t kernel_thread(int (*fn)(void *), void *arg)
{
struct kernel_clone_args args = {
.flags = CLONE_FS | CLONE_FILES |
CLONE_VM | CLONE_UNTRACED,
.exit_signal = SIGCHLD,
...

which does that whole thing much more clearly.

Linus

2022-06-26 22:34:46

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

Hi "Eric,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mkp-scsi/for-next]
[also build test ERROR on jejb-scsi/for-next linus/master v5.19-rc3]
[cannot apply to linux/master next-20220624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557
base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
config: x86_64-randconfig-a013
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/950f4922fa2aef3d2e68b4b2ab728c6945830991
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557
git checkout 950f4922fa2aef3d2e68b4b2ab728c6945830991
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "wake_up_new_task" [kernel/rcu/rcutorture.ko] undefined!

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-06-26 22:58:33

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

Hi "Eric,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mkp-scsi/for-next]
[also build test ERROR on jejb-scsi/for-next linus/master v5.19-rc3]
[cannot apply to linux/master next-20220624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557
base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
config: x86_64-randconfig-a012
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project b0d6dd3905db145853c7c744ac92d49b00b1fa20)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/950f4922fa2aef3d2e68b4b2ab728c6945830991
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557
git checkout 950f4922fa2aef3d2e68b4b2ab728c6945830991
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>, old ones prefixed by <<):

ERROR: modpost: "wake_up_new_task" [kernel/rcu/rcutorture.ko] undefined!
ERROR: modpost: "wake_up_new_task" [fs/nfs/nfsv4.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/ext4/ext4.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/jbd2/jbd2.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/ecryptfs/ecryptfs.ko] undefined!
ERROR: modpost: "wake_up_new_task" [fs/cifs/cifs.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/nilfs2/nilfs2.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/gfs2/gfs2.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [fs/f2fs/f2fs.ko] undefined!
>> ERROR: modpost: "wake_up_new_task" [drivers/char/ipmi/ipmi_si.ko] undefined!
WARNING: modpost: suppressed 30 unresolved symbol warnings because there were too many)

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-06-27 00:18:58

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 24, 2022 at 07:19:27PM -0500, Eric W. Biederman wrote:
>
> Further it is necessary for Peter Zijlstra's rewrite of the kernel
> freezer. As anything that isn't a special stop state (which
> TASK_UNINTERRUPTIBLE is not) will receive a spurious wake up when
> thawed out.

Do you know if the current (i.e., prior to the rewrite) kernel freezer
also sends spurious wakeups when thawing tasks?

I'm trying to understand if this source of spurious wakeups will be a
new one or already exists (and might therefore explain what's described
in the original email).

2022-06-27 07:46:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Mon, Jun 27, 2022 at 12:01:04AM +0000, Wedson Almeida Filho wrote:
> On Sat, Jun 24, 2022 at 07:19:27PM -0500, Eric W. Biederman wrote:
> >
> > Further it is necessary for Peter Zijlstra's rewrite of the kernel
> > freezer. As anything that isn't a special stop state (which
> > TASK_UNINTERRUPTIBLE is not) will receive a spurious wake up when
> > thawed out.
>
> Do you know if the current (i.e., prior to the rewrite) kernel freezer
> also sends spurious wakeups when thawing tasks?

Current freezer can thaw at random points in time, even before SMP
bringup if you're unlucky. And yes, I think it can induce 'spurious'
wakeups as well.

But really; like Linus already said upthread, every wait loop *MUST*
already be able to deal with spurious wakeups. This is why pretty much
every wait primitive we have looks like:

	for (;;) {
		set_current_state(state);
		if (cond)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

Which is immune to random wake-ups since it needs @cond to make progress.
*NEVER* rely on just the wakeup itself for progress, that's buggy as
heck in lots of ways.

There are a few exceptions, but they all require special wait states and
much carefulness.

2022-06-27 07:59:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)


Bit off a tangent..

On Mon, Jun 27, 2022 at 05:23:00AM +0900, Tejun Heo wrote:

> This is a bit of bike-shedding but there are inherent downsides to a
> callback-based interface in terms of write/readability. Now you have
> to move the init code out of line, and if the context that needs to be
> passed doesn't fit in a single pointer, you gotta define a struct to
> carry it, adding to the boilerplate.

Yes, I so wish C had reasonable lambda expressions :/ But even if we
could come up with a semi sensible syntax and implementation, it would
still be many *many* years before we could actually use it in-kernel :-(

2022-06-27 08:34:32

by Michal Hocko

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat 25-06-22 14:00:10, Tejun Heo wrote:
[...]
> tl;dr is that the worker creation code expects a newly created worker
> kthread to sit tight until the creator finishes setting up stuff and
> sends the initial wakeup. However, something, which wasn't identified
> in the report (Petr, it'd be great if you can find out who did the
> wakeup), wakes up the new kthread before the creation path is done
> with init which causes the new kthread to try to deref a NULL pointer.

One thing that hasn't been really mentioned explicitly but it might be
really important. 5.3.18-2-rt is the RT patchset backported to 5.3
based SLES kernel. We do not know whether the upstream RT patchset is
affected the same way and due to nature of the customer we cannot really
have this tested on that kernel.

I have tried to dig that out from the crash dump but no luck
unfortunately. I was able to find the kthread pointer on few stacks:
crash> search -t ffff9ca77e9a8000
PID: 2 TASK: ffff9ca700381d80 CPU: 0 COMMAND: "kthreadd"
ffffb0f68001fd78: ffff9ca77e9a8000
ffffb0f68001fe38: ffff9ca77e9a8000

PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
ffffb0f6808e7c50: ffff9ca77e9a8000
ffffb0f6808e7da8: ffff9ca77e9a8000

PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
ffffb0f68300fba8: ffff9ca77e9a8000
ffffb0f68300fbf0: ffff9ca77e9a8000

PID: 16127 TASK: ffff9ca77e9abb00 CPU: 0 COMMAND: "kworker/0:0H"
ffffb0f688067dd0: ffff9ca77e9a8000

PID: 16213 TASK: ffff9ca77e9a8000 CPU: 1 COMMAND: "kworker/0:2H"
ffffb0f68822fba0: ffff9ca77e9a8000
ffffb0f68822fca0: ffff9ca77e9a8000
ffffb0f68822fd70: ffff9ca77e9a8000
ffffb0f68822fe38: ffff9ca77e9a8000
ffffb0f68822fef8: ffff9ca77e9a8000

kthreadd is not all that interesting because that one has created the
thread. sysmon is
crash> bt ffff9ca75e6349c0
PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
#0 [ffffb0f6808e7be8] __schedule at ffffffffaa94bef1
#1 [ffffb0f6808e7c78] preempt_schedule_common at ffffffffaa94c5d1
#2 [ffffb0f6808e7c80] ___preempt_schedule at ffffffffaa201bb6
#3 [ffffb0f6808e7cd8] rt_mutex_postunlock at ffffffffaa309f6b
#4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock at ffffffffaa94ea7c
#5 [ffffb0f6808e7d30] rt_spin_unlock at ffffffffaa950ace
#6 [ffffb0f6808e7d38] proc_pid_status at ffffffffaa4c9ff4
#7 [ffffb0f6808e7dd8] proc_single_show at ffffffffaa4c3801
#8 [ffffb0f6808e7e10] seq_read at ffffffffaa47c1b8
#9 [ffffb0f6808e7e70] vfs_read at ffffffffaa455fd9
#10 [ffffb0f6808e7ea0] ksys_read at ffffffffaa456364

so it is collecting /proc/<pid>/status data and releasing the sighand lock.

crash> bt ffff9ca71e252c40
PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
#0 [ffffb0f68300fbb0] __schedule at ffffffffaa94bef1
#1 [ffffb0f68300fc40] schedule at ffffffffaa94c4e6
#2 [ffffb0f68300fc50] futex_wait_queue_me at ffffffffaa33e2c0
#3 [ffffb0f68300fc88] futex_wait at ffffffffaa33f199
#4 [ffffb0f68300fd98] do_futex at ffffffffaa341062
#5 [ffffb0f68300fe70] __x64_sys_futex at ffffffffaa341a7e
#6 [ffffb0f68300fee0] do_syscall_64 at ffffffffaa2025f0

is scheduled into the kthread which is expected because the newly
created kernel thread should be running kthread() where it sleeps before
calling the thread function.

kworker/0:2H is our kthread worker.

crash> bt ffff9ca77e9abb00
#5 [ffffb0f688067c90] delay_tsc at ffffffffaa9479fc
#6 [ffffb0f688067c90] wait_for_xmitr at ffffffffaa6c1e45
#7 [ffffb0f688067cb0] serial8250_console_putchar at ffffffffaa6c1ee6
#8 [ffffb0f688067cc8] uart_console_write at ffffffffaa6bb095
#9 [ffffb0f688067cf0] serial8250_console_write at ffffffffaa6c5962
#10 [ffffb0f688067d70] console_unlock at ffffffffaa30d795
#11 [ffffb0f688067db0] vprintk_emit at ffffffffaa30fc11
#12 [ffffb0f688067e00] printk at ffffffffaa3103c0
#13 [ffffb0f688067e68] __kthread_bind_mask at ffffffffaa2d74df
#14 [ffffb0f688067e90] create_worker at ffffffffaa2cfc13
#15 [ffffb0f688067ed8] worker_thread at ffffffffaa2d1cbe
#16 [ffffb0f688067f10] kthread at ffffffffaa2d6da2

is the delayed thread which is creating the worker and currently stuck
somewhere in the printk (console) code.

So if somebody has woken up our thread from inside kthread() then it
doesn't have that pointer on the stack, and I couldn't find it elsewhere
either. Maybe somebody has an idea where to look.

Btw. I still haven't caught up with the rest of the email thread and
today will be quite busy for me. Anyway, if somebody has ideas about a
further post mortem debugging then I am more than happy to help out.
--
Michal Hocko
SUSE Labs

2022-06-27 08:49:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Mon, Jun 27, 2022 at 09:22:37AM +0200, Peter Zijlstra wrote:
> Yes, I so wish C had reasonable lambda expressions :/ But even if we
> could come up with a semi sensible syntax and implementation, it would
> still be many *many* years before we could actually use it in-kernel :-(

Yeah, I have a hard time imagining this happening in C but maybe we'll
get pretty good closure support through rust-in-kernel if that works
out. That'd be pretty sweet even though we might not be able to use it
everywhere.

Thanks.

--
tejun

2022-06-27 08:53:53

by Tejun Heo

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

Hello, Michal.

On Mon, Jun 27, 2022 at 10:07:22AM +0200, Michal Hocko wrote:
> So if somebody has woken up our thread from inside kthread() then it
> doesn't have that pointer on the stack, and I couldn't find it elsewhere
> either. Maybe somebody has an idea where to look.

One way could be bpftrace'ing or printking __wake_up_common() and
friends to dump backtrace if it's trying to wake a kthread whose comm
starts with kworker/ and doesn't have (struct worker
*)kthread_data(task)->pool set.

Thanks.

--
tejun

2022-06-27 10:49:09

by Michal Hocko

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Mon 27-06-22 17:21:12, Tejun Heo wrote:
> Hello, Michal.
>
> On Mon, Jun 27, 2022 at 10:07:22AM +0200, Michal Hocko wrote:
> > So if somebody has woken up our thread from inside kthread() then it
> > doesn't have that pointer on the stack, and I couldn't find it elsewhere
> > either. Maybe somebody has an idea where to look.
>
> One way could be bpftrace'ing or printking __wake_up_common() and
> friends to dump backtrace if it's trying to wake a kthread whose comm
> starts with kworker/ and doesn't have (struct worker
> *)kthread_data(task)->pool set.

I am afraid I won't have a chance for a runtime debugging. All I have at
this stage is a crash dump.
--
Michal Hocko
SUSE Labs

2022-06-27 12:35:32

by Michal Hocko

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat 25-06-22 19:53:34, Linus Torvalds wrote:
> On Sat, Jun 25, 2022 at 6:58 PM Tejun Heo <[email protected]> wrote:
[...]
> > * If there are no true spurious wakeups, where did the racing wakeup
> > come from? The task just got created w/ TASK_NEW and woken up once
> > with wake_up_new_task(). It hasn't been on any wait queue or
> > advertised itself to anything.
>
> I don't think it was ever a spurious wakeup at all.
>
> The create_worker() code does:
>
> worker->task = kthread_create_on_node(..
> ..
> worker_attach_to_pool(worker, pool);
> ..
> wake_up_process(worker->task);
>
> and thinks that the wake_up_process() happens after the worker_attach_to_pool().
>
> But I don't see that at all.
>
> The reality seems to be that the wake_up_process() is a complete
> no-op, because the task was already woken up by
> kthread_create_on_node().

Just for the record, the newly created thread is not running our thread
function at this stage. It is rather subtle and took me some time to
decipher, but kthread_create_on_node() will create and wake up a kernel
thread running the kthread() function:
[...]
	/*
	 * Thread is going to call schedule(), do not preempt it,
	 * or the creator may spend more time in wait_task_inactive().
	 */
	preempt_disable();
	complete(done);
	schedule_preempt_disabled();
	preempt_enable();

	ret = -EINTR;
	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
		cgroup_kthread_ready();
		__kthread_parkme(self);
		ret = threadfn(data);
	}

so the newly created thread goes to sleep before calling the threadfn
(worker_thread here). Somebody other than create_worker must have woken
it up. I couldn't find out who that was (see my other email with some
notes from the crash dump).

I do agree that a simple schedule() without checking for a condition is
dubious, fragile and wrong. If anything, wait_for_completion() would be
less confusing and provide targeted waiting.

Petr has added that completion into worker_thread to address this
specific case; a better fix would address all callers, because who knows
how many of them are similarly broken.

I also do agree that this whole scheme is rather convoluted and that an
init() callback executed before threadfn would be much easier to follow.
--
Michal Hocko
SUSE Labs

2022-06-27 18:36:17

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Mon, Jun 27, 2022 at 09:11:10AM +0200, Peter Zijlstra wrote:
> Current freezer can thaw at random points in time, even before SMP
> bringup if you're unlucky. And yes, I think it can induce 'spurious'
> wakeups as well.

Thanks, Peter, that's good to know!

> But really; like Linus already said upthread, every wait loop *MUST*
> already be able to deal with spurious wakeups. This is why pretty much
> every wait primitive we have looks like:
>
> for (;;) {
>         set_current_state(state);
>         if (cond)
>                 break;
>         schedule();
> }
> __set_current_state(RUNNING);
>
> Which is immune to random wake-ups since it needs @cond to make progress.
> *NEVER* rely on just the wakeup itself for progress, that's buggy as
> heck in lots of ways.

Yes, I wonder how many more instances of this kind of bug we have
lurking around given that this one in core kernel code appears to have
been around for at least 17 years.

2022-06-27 18:58:52

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Mon, Jun 27, 2022 at 05:11:36PM +0900, Tejun Heo wrote:
> Yeah, I have a hard time imagining this happening in C but maybe we'll
> get pretty good closure support through rust-in-kernel if that works
> out. That'd be pretty sweet even though we might not be able to use it
> everywhere.

While Rust does support closures and it would work just fine here, I
think in this case its type system allows for better ergonomics and
flexibility without them, for example:

// the pr_info! part is a closure for the body of the thread. Could
// also be replaced with a function.
let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;

// Do whatever initialisation one wants to do using new_thread. Only
// functions that _can_ be used on a new kthread would be available
// (e.g., wake_up_process() wouldn't).

new_thread.start();

// new_thread isn't accessible anymore. The compiler fails compilation
// if one attempts to use it again, for example, to call start()
// again.

The type returned by task::new_paused() wouldn't be copyable, so we can
guarantee that start() is called at most once.

It would have a Drop implementation (destructor) that puts the task,
which means that we could use the question-mark operator for error
handling between new_paused() & start() (or really any kind of
early-return technique) and all error paths would properly clean the new
task up without any goto mess. It also means that if one forgets to call
start(), not only will the thread never start, it will also be freed
(i.e., no leaks).

If the caller wants to keep a reference to the task, they would do
something like the following (instead of calling new_thread.start()):

let task = new_thread.start_and_get();

Then `task` could be used as any task. For example, wake_up() would be
available, but not wake_up_new_task(). It also has automatic handling of
refcounting such that we are guaranteed to never have a dangling pointer
to the task.

Lastly, all the checks I mentioned above happen at compile time, so
there is absolutely zero cost at runtime.

Anyway, sorry for the digression. I thought this would be a good
opportunity to talk about some of the possibilities in API design and
enforcement that Rust affords us since this kind of design was the topic
in discussion and Rust was brought up by someone else :)

Cheers,
-Wedson

2022-06-27 19:41:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Mon, Jun 27, 2022 at 11:23 AM Wedson Almeida Filho
<[email protected]> wrote:
>
> Yes, I wonder how many more instances of this kind of bug we have
> lurking around given that this one in core kernel code appears to have
> been around for at least 17 years.

The good news is that this "explicit wait loop" pattern is actually
starting to be pretty rare.

New code seldom uses it, because "wait_event()" and friends are just
*so* much easier to use, and will magically expand to that kind of
loop properly.

So the explicit loop with add_wait_queue and friends is very
traditional and our original wait model (well, I think I did have
"sleep_on()" *very* early, but I fixed that broken model quickly since
it fundamentally doesn't work for multiple events), but it's not
actually very common any more.

And the "just call schedule, wait for wakeup" should be very rare
indeed. It has never been a valid pattern, so that kthread code is
just plain strange, and I wouldn't expect to see it commonly
elsewhere. It's simply not how things have ever been supposed to be
done.

Linus

2022-06-27 22:35:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote:

> let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;

I'm still having a really hard time with this Rust stuff, the above
looks like a syntax error and random characters to me :/

2022-06-27 22:48:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Mon, Jun 27, 2022 at 3:07 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote:
>
> > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;
>
> I'm still having a really hard time with this Rust stuff, the above
> looks like a syntax error and random characters to me :/

Heh. The '!' for macros is probably my least favorite part of Rust
syntax, it just makes macros look so unintegrated.

Not at all the kind of "you can use a macro instead of a function"
thing, because macros always have that '!' thing.

And yeah, the pipe characters used by closures sure make that
particular line look extra magical.

The question mark is a "do if ok, return if error". Think of it like a
"try" thing for exception catching.

But yeah, all the special characters do make me think of perl.

I haven't really gotten the hang of reading rust without a google
window open to figure things out, but I think that's just a "you have
to get used to it".

Or, alternatively, you have to just ignore the rust parts.

As I mentioned at OSS NA last week - it's not like most people can
read our MM code either - even when you know C, some of that code is
pretty incomprehensible unless you know how it all works.

If people can be productive kernel developers without understanding
the MM layer, I'm sure people can be kernel developers without
understanding rust..

Linus

2022-06-27 23:15:50

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote:
>
> > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;
>
> I'm still having a really hard time with this Rust stuff, the above
> looks like a syntax error and random characters to me :/

Fair enough, there are a few things going on there. The code above could
be expanded to the following:

fn thread_func() {
    pr_info!("Hello world\n");
}

let retval = task::new_paused(thread_func);
if retval.is_err() {
    return retval.unwrap_err();
}

let new_thread = retval.unwrap();

I hope the above is clearer. The question-mark operator is just syntax
sugar for the error handling above, and the pipes introduce a closure;
the full syntax is: |args| -> ret_type { statement1; statement2; etc. }.
In the line you quote, the compiler infers the type (so it's omitted),
there are no args (so there's nothing between the pipes), and since
there's a single statement that is an expression whose type matches the
return type, the curly braces can be omitted too, so we end up with
"|| x" as the closure.

I'm confident people can get used to this. It's in the rare cases when
one has to think about, say, subtyping, invariance, covariance, and
contravariance that things may get hairy (or exciting for the more
math-inclined).

2022-06-28 00:35:05

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote:
>
> > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;
>
> I'm still having a really hard time with this Rust stuff, the above
> looks like a syntax error and random characters to me :/

Peter, I meant to ask in my previous email: setting aside the syntax for
a moment, do you have an opinion on the sort of things that Rust allows
us to enforce at compile time (as exemplified in the new_paused()
fragment)?

2022-06-28 08:29:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Tue, Jun 28, 2022 at 12:32:33AM +0000, Wedson Almeida Filho wrote:
> On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote:
> >
> > > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?;
> >
> > I'm still having a really hard time with this Rust stuff, the above
> > looks like a syntax error and random characters to me :/
>
> Peter, I meant to ask in my previous email: setting aside the syntax for
> a moment, do you have an opinion on the sort of things that Rust allows
> us to enforce at compile time (as exemplified in the new_paused()
> fragment)?

So I used to do quite a lot of C++ in a previous life; I think I'm more
or less familiar with a lot of the things Rust offers, except it is a
lot stricter. C++ allows you to do the right thing, but also allows you
to take your own foot off (a bit like C, except you can make an even
bigger mess of things), where Rust tries really hard to protect the
foot.

The one thing I dread is compile times, C++ is bad, but given Rust has
to do even more compile time enforcement it'll suck worse. And I'm
already not using clang because it's so much worse than gcc.

I've just not had *any* time to actually look at Rust in any detail :/

But given I'm the kind of idiot that does tree-wide cleanups just
because it's the right thing, I'm bound to run into it sooner rather
than later, and then I'll curse my way through having to learn it just
to get crap done I expect ...

Anyway; from what I understand Rust is a fair way away from core code.

2022-06-28 10:44:43

by Tejun Heo

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Tue, Jun 28, 2022 at 11:51:37AM +0200, Petr Mladek wrote:
> I guess that I felt overloaded and wished for a calm day
> or something like this. I agree that it does not make
> much sense.

Hey, we all get overwhelmed sometimes. I wish you a calm day!

--
tejun

2022-06-28 10:46:40

by Petr Mladek

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat 2022-06-25 10:01:35, Linus Torvalds wrote:
> On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <[email protected]> wrote:
> For example, the reason that quoted patch cannot be right is that this
> code pattern:
>
> while (wait_for_completion_interruptible(&worker->ready_to_start))
> ;
>
> is not valid kernel code. EVER. There is absolutely no way that can be correct.
>
> Either that code can take a signal, or it cannot. If it can take a
> signal, it had better react to said signal. If it cannot, it must not
> use an interruptble sleep - since now that loop turned into a
> kernel-side busy-loop.

I agree that the code is ugly and does not make sense.

JFYI, I did it this way because I wanted to calm down the hung task
detector. kthreads ignore signals, so I did not think about this aspect.
I thought I had seen this trick somewhere.

Do not ask me why I wanted to calm down the hung task detector.
I guess that I felt overloaded and wished for a calm day
or something like this. I agree that it does not make
much sense.

Best Regards,
Petr

2022-06-28 14:21:47

by Christian Brauner

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Sat, Jun 25, 2022 at 11:43:15AM -0700, Linus Torvalds wrote:
> On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > And that's not at all what the kthread code wants. It wants to set
> > affinity masks, it wants to create a name for the thread, it wants to
> > do all those other things.
> >
> > That code really wants to just do copy_process().
>
> Honestly, I think kernel/kthread.c should be almost rewritten from scratch.
>
> I do not understand why it does all those odd keventd games at all,
> and why kthread_create_info exists in the first place.
>
> Why does kthread_create() not just create the thread directly itself,
> and instead does that odd queue it onto a work function?
>
> Some of that goes back to before the git history, and very little of
> it seems to make any sense. It's as if the code is meant to be able to
> run from interrupt context, but that can't be it: it's literally doing
> a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc.
>
> So why is it calling kthreadd_task() to create the thread? Purely for
> some crazy odd "make that the parent" reason?
>
> I dunno. The code is odd, unexplained, looks buggy, and most of the
> reasons are probably entirely historical.
>
> I'm adding Christian to this thread too, since I get the feeling that
> it really should be more tightly integrated with copy_process(), and
> that Christian might have comments.
>
> Christian, see some context in the thread here:
>
> https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/
>
> for some of this.

Sorry, I was at LSS last week.

I honestly didn't touch the code back then because it seemed almost
entirely unrelated to regular task creation apart from kernel_thread()
that I added. I didn't feel comfortable changing a lot of stuff there.

Iirc, just a few months ago io_uring still made use of the kthread
infrastructure, and I think that made the limits of the interface more
obvious. Now we will soon have two users that create a version of
kernel-generated threads with properties of another process (io_uring
and [1]).

In my head, the kthread infra should be able to support the generation
of pure kernel threads as well as the creation of user workers, instead
of adding specialized interfaces to do this. The fact that it doesn't is
a limitation of the interface that imho shows it hasn't grown to adapt
to the new use-cases we have. And imho we'll see more of those.

In this context it's really worth looking at [1] because to some extent
it duplicates bits we have for the kthread infra, whereas I still think
the kthread infra should support both, possibly exposing two APIs: one
to return pure kernel threads and the other returning struct user_worker
or similar. Idk, it might just be a heat-stroke talking...

I don't feel comfortable making strong assertions about the original
implementation of kthreads. I wasn't around and there might be
historical context I'm missing.

One issue that Tejun also mentioned later in the thread and that we run
into is that we have a pattern where we create a kthread and then trust
the caller to handle/activate the new task. This is more problematic
once we start supporting something like [1] where that's exposed to a
driver. (Ideally creation of such a task would generate a unique
callback - I think Peter suggested something like this? - that could
only be used on that task...)

[1]: https://lore.kernel.org/lkml/[email protected]

2022-06-28 15:12:47

by Petr Mladek

[permalink] [raw]
Subject: Re: re. Spurious wakeup on a newly created kthread

On Mon 2022-06-27 10:07:22, Michal Hocko wrote:
> On Sat 25-06-22 14:00:10, Tejun Heo wrote:
> [...]
> > tl;dr is that the worker creation code expects a newly created worker
> > kthread to sit tight until the creator finishes setting up stuff and
> > sends the initial wakeup. However, something, which wasn't identified
> > in the report (Petr, it'd be great if you can find out who did the
> > wakeup), wakes up the new kthread before the creation path is done
> > with init which causes the new kthread to try to deref a NULL pointer.
>
> One thing that hasn't been really mentioned explicitly but it might be
> really important. 5.3.18-2-rt is the RT patchset backported to 5.3
> based SLES kernel. We do not know whether the upstream RT patchset is
> affected the same way and due to nature of the customer we cannot really
> have this tested on that kernel.
>
> I have tried to dig that out from the crash dump but no luck
> unfortunately. I was able to find the kthread pointer on few stacks:

I tried to dig more but I am not sure if it explains the problem.

The newly created worker was woken from the "sysmon" process because
it was on the wake_q_sleeper queue for its own &task->sighand->siglock.
It should not cause problems if the worker was really blocked
and waiting for this spin lock.

It looks to me that ptrace might in theory cause the problematic
wakeup.

Note that it is SUSE kernel 5.3 with backported RT patchset.


See below for more details.

Michal found this process by searching for ffff9ca77e9a8000
on process stacks.

> sysmon is
> crash> bt ffff9ca75e6349c0
> PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
> #0 [ffffb0f6808e7be8] __schedule at ffffffffaa94bef1
> #1 [ffffb0f6808e7c78] preempt_schedule_common at ffffffffaa94c5d1
> #2 [ffffb0f6808e7c80] ___preempt_schedule at ffffffffaa201bb6
> #3 [ffffb0f6808e7cd8] rt_mutex_postunlock at ffffffffaa309f6b
> #4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock at ffffffffaa94ea7c
> #5 [ffffb0f6808e7d30] rt_spin_unlock at ffffffffaa950ace
> #6 [ffffb0f6808e7d38] proc_pid_status at ffffffffaa4c9ff4
> #7 [ffffb0f6808e7dd8] proc_single_show at ffffffffaa4c3801
> #8 [ffffb0f6808e7e10] seq_read at ffffffffaa47c1b8
> #9 [ffffb0f6808e7e70] vfs_read at ffffffffaa455fd9
> #10 [ffffb0f6808e7ea0] ksys_read at ffffffffaa456364
>
> so it is collecting /proc/<pid>/status data and releasing the sighand lock.


I found the pointer to the task struct actually twice there:

1. Direct pointer to task struct:

The stack of the process "sysmon" is here:

#5 [ffffb0f6808e7d30] rt_spin_unlock+0x3e at ffffffffaa950ace
ffffb0f6808e7d38: proc_pid_status+1636
#6 [ffffb0f6808e7d38] proc_pid_status+0x664 at ffffffffaa4c9ff4
ffffb0f6808e7d40: init_user_ns init_pid_ns
ffffb0f6808e7d50: 0000000000000000 0000000000000000
ffffb0f6808e7d60: 00003f550019fca0 ffff9ca700000000
ffffb0f6808e7d70: 0000000000000000 0000000000000000
ffffb0f6808e7d80: 0000000000000000 0000000000000000
ffffb0f6808e7d90: ffffffffffffffff 0000000000000000
ffffb0f6808e7da0: 2c93096daf3bc600 [ffff9ca77e9a8000:task_struct]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ffffb0f6808e7db0: [ffff9ca7602f6408:proc_inode_cache] [ffff9ca77e934908:seq_file]
ffffb0f6808e7dc0: [ffff9ca71bd7ad80:pid] init_pid_ns
ffffb0f6808e7dd0: [ffff9ca77e934908:seq_file] proc_single_show+81

So, the pointer is on an actively used and valid stack.

It seems it is the *task_struct that was passed to proc_pid_status().
Therefore, proc_pid_status() is showing the status of the newly created
worker that gets woken prematurely.

According to the assembly, it is the code:

+ proc_pid_status()
+ task_sig()
+ unlock_task_sighand()
+ spin_unlock_irqrestore(&task->sighand->siglock)
+ rt_spin_unlock()


2. Indirect pointer to task struct:

The process "sysmon" was interrupted in rt_mutex_postunlock(),
which looks like this in this kernel:

void rt_mutex_postunlock(struct wake_q_head *wake_q,
			 struct wake_q_head *wake_sleeper_q)
{
	wake_up_q(wake_q);
	wake_up_q_sleeper(wake_sleeper_q);

	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
	preempt_enable();
}

Note that pointers to wake_q and wake_sleeper_q are passed as
parameters.

They point to the related fields in task_struct:

crash> struct -x -o task_struct | grep wake_q
[0x860] struct wake_q_node wake_q;
[0x868] struct wake_q_node wake_q_sleeper;

The relevant part of the stack of "sysmon" is:

crash> bt -FFxs 842
[...]
#2 [ffffb0f6808e7c80] ___preempt_schedule+0x16 at ffffffffaa201bb6
ffffb0f6808e7c88: [ffff9ca764ee90bd:kmalloc-4k] ffffb0f6808e7c20
ffffb0f6808e7c98: 0000000000027340 fair_sched_class
ffffb0f6808e7ca8: 0000000000000003 0000000000000000
ffffb0f6808e7cb8: 0000000000000000 0000000000000202
ffffb0f6808e7cc8: [ffff9ca77e9a8028:task_struct] ffffb0f6808e7d08
ffffb0f6808e7cd8: rt_mutex_postunlock+43
#3 [ffffb0f6808e7cd8] rt_mutex_postunlock+0x2b at ffffffffaa309f6b
ffffb0f6808e7ce0: [ffff9ca762400000:sighand_cache] rt_mutex_futex_unlock+172
#4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock+0xac at ffffffffaa94ea7c
ffffb0f6808e7cf0: 0000000000000246 0000000000000001
ffffb0f6808e7d00: ffffb0f6808e7cf8 [ffff9ca77e9a8868:task_struct]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ffffb0f6808e7d10: [ffff9ca77e9a8868:task_struct] 2c93096daf3bc600
ffffb0f6808e7d20: [ffff9ca77e934908:seq_file] 0000000000000000
ffffb0f6808e7d30: rt_spin_unlock+62

and if we look at the task struct via the offset to wake_q_sleeper:

crash> struct task_struct -l 0x868 ffff9ca77e9a8868 | grep -e pid
pid = 16213,

So, the newly created kthread, PID 16213, was on the wake_q_sleeper
queue related to its own &task->sighand->siglock.


Doubts:

Everything should be fine if the kthread was really blocked and waiting
for &task->sighand->siglock.

Is there any other reason why the kthread could end up on the
wake_q_sleeper queue even when it did not take &task->sighand->siglock
directly?

+ RT-specific hacks to avoid priority inversion of processes?
+ ptrace?

Or is it possible that the following code takes task->sighand->siglock
on an RT kernel?

static int kthread(void *_create)
{
	[...]
	/* OK, tell user we're spawned, wait for stop or wakeup */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	create->result = current;
	/*
	 * Thread is going to call schedule(), do not preempt it,
	 * or the creator may spend more time in wait_task_inactive().
	 */
	preempt_disable();
	complete(done);
	schedule_preempt_disabled();
	[...]
}

Some more info:

The last running times are:

crash> ps -l
[115147050548884] [RU] PID: 16127 TASK: ffff9ca77e9abb00 CPU: 0 COMMAND: "kworker/0:0H"
[115147050481426] [IN] PID: 1829 TASK: ffff9ca71f581d80 CPU: 0 COMMAND: "vReqJob"
[115147050467716] [IN] PID: 704 TASK: ffff9ca7624e0ec0 CPU: 0 COMMAND: "irq/127-eth0-Tx"
[115147050456263] [RU] PID: 16213 TASK: ffff9ca77e9a8000 CPU: 1 COMMAND: "kworker/0:2H"
[115147050394683] [IN] PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
[115147050370507] [IN] PID: 2146 TASK: ffff9ca71dcc6740 CPU: 1 COMMAND: "vReqJob"
[115147050363317] [RU] PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
[115147050334723] [IN] PID: 1875 TASK: ffff9ca71f6c1d80 CPU: 1 COMMAND: "vReqJob"
[115147050287774] [IN] PID: 705 TASK: ffff9ca7624e0000 CPU: 1 COMMAND: "irq/128-eth0-Tx"
[115147050279924] [RU] PID: 2 TASK: ffff9ca700381d80 CPU: 0 COMMAND: "kthreadd"

I am not sure what the vReqJob tasks do. The new worker was scheduled on
CPU1 right after vReqJob, PID 2035. In theory, that process might have
triggered the rescheduling as well. But I do not see it:

crash> bt 2035
PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
#0 [ffffb0f68300fbb0] __schedule at ffffffffaa94bef1
#1 [ffffb0f68300fc40] schedule at ffffffffaa94c4e6
#2 [ffffb0f68300fc50] futex_wait_queue_me at ffffffffaa33e2c0
#3 [ffffb0f68300fc88] futex_wait at ffffffffaa33f199
#4 [ffffb0f68300fd98] do_futex at ffffffffaa341062
#5 [ffffb0f68300fe70] __x64_sys_futex at ffffffffaa341a7e
#6 [ffffb0f68300fee0] do_syscall_64 at ffffffffaa2025f0
#7 [ffffb0f68300ff50] entry_SYSCALL_64_after_hwframe at ffffffffaaa0008c
RIP: 00007f7317e0487d RSP: 00007f7208777cd0 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f72f00209d0 RCX: 00007f7317e0487d
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f72f00209fc
RBP: 0000000000000000 R8: 0000000000000000 R9: 00007f72f800eef0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f72f0020998
R13: 000000001df92377 R14: 0000000000000000 R15: 00007f72f00209fc
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b

From the stack, I do not think that it is related:

#3 [ffffb0f68300fc88] futex_wait+0xd9 at ffffffffaa33f199
ffffb0f68300fc90: 00007f72f00209fc ffffffff00000000
ffffb0f68300fca0: ffff9ca77e4eec00 0000000000000001
ffffb0f68300fcb0: 4063c3da32ce9400 ffffb0f68300fe68
ffffb0f68300fcc0: ffffb0f68300fe40 ffffb0f68300fe68
ffffb0f68300fcd0: ffffb0f68300fe40 [ffff9ca762a0ec00:sock_inode_cache]
ffffb0f68300fce0: 0000000000004000 ___sys_sendmsg+153
ffffb0f68300fcf0: 0000000000000064 ffffb0f68300fcf8
ffffb0f68300fd00: ffffb0f68300fcf8 ffff9ca77e4eec38
ffffb0f68300fd10: ffffb0f682befd08 [ffff9ca71e252c40:task_struct]
ffffb0f68300fd20: ffff9ca77e4eec08 [ffff9ca764b2b400:mm_struct]
ffffb0f68300fd30: 00007f72f0020000 00000000000009fc
ffffb0f68300fd40: 0000000000000000 0000000000000000
ffffb0f68300fd50: 0000000000000000 00000000ffffffff
ffffb0f68300fd60: 4063c3da32ce9400 0000000000000000
ffffb0f68300fd70: 0000000000000000 0000000000000000
ffffb0f68300fd80: 0000000000000000 00007f72f800eef0
ffffb0f68300fd90: 00007f72f00209fc do_futex+322
#4 [ffffb0f68300fd98] do_futex+0x142 at ffffffffaa341062
ffffb0f68300fda0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdb0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdc0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdd0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fde0: audit_filter_rules.constprop.22+583 [ffff9ca7632b1e40:cred_jar]
ffffb0f68300fdf0: __switch_to_asm+52 [ffff9ca71e252c40:task_struct]
ffffb0f68300fe00: __switch_to_asm+52 0000000132ce9400
ffffb0f68300fe10: ffffb0f68300fe6c ffffb0f68300fe60
ffffb0f68300fe20: 0000000000004000 0000000000000000
ffffb0f68300fe30: auditd_test_task+56 4063c3da32ce9400
ffffb0f68300fe40: 0000000000000080 0000000000000000
ffffb0f68300fe50: 00007f72f00209fc 0000000000000000
ffffb0f68300fe60: 00007f72f800eef0 0000000000000000
ffffb0f68300fe70: __x64_sys_futex+94
#5 [ffffb0f68300fe70] __x64_sys_futex+0x5e at ffffffffaa341a7e
ffffb0f68300fe78: 00007f72ffffffff 0000000000000000
ffffb0f68300fe88: auditd_test_task+56 [ffff9ca71e246400:kmalloc-1k]
ffffb0f68300fe98: 00000000000000ca __audit_syscall_entry+235
ffffb0f68300fea8: 4063c3da32ce9400 00000000000000ca
ffffb0f68300feb8: ffffb0f68300ff58 00000000000000ca
ffffb0f68300fec8: 0000000000000000 0000000000000000
ffffb0f68300fed8: 0000000000000000 do_syscall_64+128


Best Regards,
Petr

2022-06-30 01:08:56

by Wedson Almeida Filho

[permalink] [raw]
Subject: Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE)

On Tue, Jun 28, 2022 at 09:58:43AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 28, 2022 at 12:32:33AM +0000, Wedson Almeida Filho wrote:
> > Peter, I meant to ask in my previous email: setting aside the syntax for
> > a moment, do you have an opinion on the sort of things that Rust allows
> > us to enforce at compile time (as exemplified in the new_paused()
> > fragment)?
>
> So I used to do quite a lot of C++ in a previous life; I think I'm more
> or less familiar with a lot of the things Rust offers, except it is a
> lot stricter. C++ allows you to do the right thing, but also allows you
> to take your own foot off (a bit like C, except you can make an even
> bigger mess of things), where Rust tries really hard to protect the
> foot.

That's a fair assessment. I'd just like to emphasise one aspect: if I
shoot myself in the foot with Rust, it's either a bug in the compiler or
because I violated a precondition of an unsafe block.

> The one thing I dread is compile times, C++ is bad, but given Rust has
> to do even more compile time enforcement it'll suck worse. And I'm
> already not using clang because it's so much worse than gcc.

Yes, it's definitely going to be slower, but I think this is a good
tradeoff: I'd rather have checks performed when compiling than when
running the kernel.

One thing that may speed things up is to disable the borrow checker when
compiling a known-good revision. I don't think rustc allows this but
rust-gcc doesn't implement borrow checking at all AFAIU, so we could
presumably ask them to add a flag to keep it disabled (when/if they
decide to implement it) in such cases.

> I've just not had *any* time to actually look at Rust in any detail :/
>
> But given I'm the kind of idiot that does tree-wide cleanups just
> because it's the right thing, I'm bound to run into it sooner rather
> than later, and then I'll curse my way through having to learn it just
> to get crap done I expect ...

Looking forward to this :)

Jokes aside, when the day comes, I'm happy to discuss anything that
doesn't seem to make sense.

> Anyway; from what I understand Rust is a fair way away from core code.

Indeed. However, we do want to expose core APIs (like paused thread
creation) to developers writing Rust code, so while the implementation
will remain in C, we'll implement the _API_ for a bunch of core code.

Cheers,
-Wedson

2022-08-04 09:48:04

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups

On Wed, Jun 22, 2022 at 10:10 PM Petr Mladek <[email protected]> wrote:
>
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [115147.050488] #PF: supervisor write access in kernel mode
> [115147.050489] #PF: error_code(0x0002) - not-present page
> [115147.050490] PGD 0 P4D 0
> [115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
> [115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased)
> [115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
> [115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
> [115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
> [115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
> [115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
> [115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
> [115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
> [115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
> [115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
> [115147.050524] Call Trace:
> [115147.050533] worker_thread+0xb4/0x3c0
> [115147.050538] ? process_one_work+0x4a0/0x4a0
> [115147.050540] kthread+0x152/0x170
> [115147.050542] ? kthread_park+0xa0/0xa0
> [115147.050544] ret_from_fork+0x35/0x40
>
> Further debugging showed that the worker thread was woken
> before worker_attach_to_pool() finished in create_worker().
>
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.
>
> As a result, worker_thread() might read worker->pool before
> it was set by worker_attach_to_pool() in create_worker().
> Also manage_workers() might want to create yet another worker
> before worker->pool->nr_workers is updated. It is a kind of
> a chicken & egg problem.
>
> Synchronize these operations using a completion API.
>
> Note that worker->pool might be then read without wq_pool_attach_mutex.
> Normal worker always belongs to the same pool.
>
> Also note that rescuer_thread() does not need this because all
> needed values are set before the kthread is created. It is
> connected with a particular workqueue. It is attached to different
> pools when needed.
>
> Signed-off-by: Petr Mladek <[email protected]>
> ---
> kernel/workqueue.c | 13 ++++++++++++-
> kernel/workqueue_internal.h | 1 +
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..9586b0797145 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1939,6 +1939,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> goto fail;
>
> worker->id = id;
> + init_completion(&worker->ready_to_start);
>
> if (pool->cpu >= 0)
> snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> @@ -1961,6 +1962,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> raw_spin_lock_irq(&pool->lock);
> worker->pool->nr_workers++;
> worker_enter_idle(worker);
> + complete(&worker->ready_to_start);
> wake_up_process(worker->task);
> raw_spin_unlock_irq(&pool->lock);
>
> @@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
> static int worker_thread(void *__worker)
> {
> struct worker *worker = __worker;
> - struct worker_pool *pool = worker->pool;
> + struct worker_pool *pool;
>
> /* tell the scheduler that this is a workqueue worker */
> set_pf_worker(true);
> +
> + /*
> + * Wait until the worker is ready to get started. It must be attached
> + * to a pool first. This is needed because of spurious wakeups.
> + */
> + while (wait_for_completion_interruptible(&worker->ready_to_start))
> + ;

There might be some wakeups from wake_up_worker() since the worker is
on the idle list. These wakeups will be "spurious wakeups" from the
point of view of the completion subsystem.

> + pool = worker->pool;
> +
> woke_up:
> raw_spin_lock_irq(&pool->lock);
>
> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
> index e00b1204a8e9..ffca2783c8bf 100644
> --- a/kernel/workqueue_internal.h
> +++ b/kernel/workqueue_internal.h
> @@ -41,6 +41,7 @@ struct worker {
> /* L: for rescuers */
> struct list_head node; /* A: anchored at pool->workers */
> /* A: runs through worker->node */
> + struct completion ready_to_start; /* None: handle spurious wakeup */
>
> unsigned long last_active; /* L: last active timestamp */
> unsigned int flags; /* X: flags */
> --
> 2.35.3
>

2022-08-04 10:49:12

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups

On Thu, Aug 4, 2022 at 4:57 PM Lai Jiangshan <[email protected]> wrote:
>
> There might be some wakeups from wake_up_worker() since it is in
> the idle list. These wakeups will be "spurious wakeups" in the view
> of the completion subsystem.
>


Oh, sorry, I was wrong. "complete(&worker->ready_to_start);" and
"worker_enter_idle(worker);" are in the same pool lock region.

There are no "spurious wakeups" from "wake_up_worker()" as I have
wrongly described.