From: Linus Torvalds
Date: Sat, 25 Jun 2022 19:53:34 -0700
Subject: Re: re. Spurious wakeup on a newly created kthread
To: Tejun Heo
Cc: Petr Mladek, Lai Jiangshan, Michal Hocko,
    Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner,
    Ingo Molnar, Andrew Morton, Oleg Nesterov, "Eric W. Biederman"

On Sat, Jun 25, 2022 at 6:58 PM Tejun Heo wrote:
>
> > They aren't "really" spurious, they are just asynchronous enough (and
> > thus unexpected) that you basically should never have a "sleep on
> > wait-queue" without then looping and re-testing the condition.
>
> Can you elaborate on this a bit? At least for the standard
> wait_event-ish wait blocks, the waiter always does finish_wait()
> before leaving a wait.

Correct. As long as *all* you ever use is a standard wait_event thing,
you are always serialized by the spinlock on the wait queue.
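That standard pattern looks roughly like this (a minimal sketch; "wq"
and "done" are made-up stand-ins, not any particular caller):

	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static int done;

	/* sleeper: wait_event() re-tests "done" around every wakeup */
	wait_event(wq, READ_ONCE(done));

	/* waker: both sides go through the wait-queue spinlock */
	WRITE_ONCE(done, 1);
	wake_up(&wq);

Because the condition is re-checked inside the wait_event() loop, a
wakeup that didn't come from this particular wake_up() is harmless.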
However. Very few processes only use standard wait_event things.

There are a number of places that use "wake_up_process()" which
doesn't use a wait-queue, but a wake-up directed right at a particular
task.

Is that the _common_ pattern? It's not *uncommon*. It's maybe not the
strictly normal wait-queue one, but if you grep for
"wake_up_process()" you will find quite a lot of them.

And yes, many of those uses too will use strictly serialized locking:
in that both the waker and the sleeper (ie the target of the
wake_up_process()) will have serialized the target thread pointer with
some lock, exactly like the normal wait-queues serialize using the
wait-queue lock.

But several do *not* use locking, and instead rely on the task_struct
being RCU-free'd. In fact, I think kthread_create() itself is such an
example, with that

	wake_up_process(kthreadd_task);

just doing a blind directed wakeup with no locking whatsoever, just a
straight wake_up.

And the important thing to notice is that if you have even just *one*
such user, that kthreadd_task will basically get randomly "spuriously"
woken up while it is waiting for something else.

Now, the kthreadd_task task is kind of specialized, and that
particular wake_up_process() doesn't affect anybody else, so think of
that just as an example.

But now imagine any of the other unlocked directed wake_up_process()
users. Imagine them targeting regular user processes that may be doing
regular system calls. Imagine that wake_up_process() being done under
just an RCU read lock, and maybe the process had already woken up
*AND* had already gone on to do other things entirely!

IOW, maybe you are now a thread that does a perfectly normal locked
wait_queue thing - but because the *previous* system call did
something that did less strictly locked things, there may be another
CPU that is still in the process of waking you up from that previous
system call.

So now your "strictly locked" waitqueue usage sees a "spurious"
wakeup, because a previous not-so-strictly-locked usage had been
delayed by interrupts on another CPU.

And yes, those "wake up process under RCU read lock" really do exist.
There are literally things that take a reference to a task struct, add
it to some RCU-safe list, and then do the wakeup without locking.
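The shape is roughly this (a sketch; "thing->waiter" is a made-up
name, and the RCU read lock is what keeps the task_struct from being
freed out from under the waker):

	struct task_struct *t;

	/* waker side: possibly much later, on another CPU */
	rcu_read_lock();
	t = rcu_dereference(thing->waiter);
	if (t)
		wake_up_process(t);	/* t may have moved on entirely */
	rcu_read_unlock();

Nothing here synchronizes with whatever t happens to be sleeping on
*now*, which is exactly how that wakeup can land in some later,
unrelated wait.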
> I'm probably missing something. Is it about bespoke wait mechanisms?
> Can you give a concrete example of an async wakeup scenario?

Grep for wake_up_process() and just look for them.

Side note: signals end up doing effectively the same thing if you're
doing interruptible waits, of course, but people are probably more
*aware* of signals when they use TASK_INTERRUPTIBLE. But these
RCU-delayed wakeups can wake up even TASK_UNINTERRUPTIBLE calls.

Anyway, it's easy enough to deal with: use the "event" macros, and
you'll never have to worry about it. But if you use explicit
wait-queues and manual scheduling (rather than wait_event() and
friends), you need to be aware that when you go to sleep, the fact
that you woke up is *not* a guarantee that the wakeup came from the
wait queue. So you need to always do that in a loop.

The wait_event code will do that loop for you, but if you do manual
wait-queues you are required to do the looping yourself.
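Concretely, the manual version has to look something like this (a
sketch; "wq" and "condition" again stand in for the real thing):

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		/* anything may have woken us: loop and re-test */
		schedule();
	}
	finish_wait(&wq, &wait);

The "break on condition, otherwise go around again" part is the whole
point - a single sleep-and-assume-done is exactly the bug.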
> So, the deferred wakeups from earlier waits are one. Can you give some
> other examples? This is something which has always bothered me and I
> couldn't find explanations which aren't hand-wavy on my own. It'd be
> really great to have clarity.

There's a second class of "spurious" waits that aren't really spurious
at all, and aren't even deferred, and are simply due to "there were
multiple things you waited for, but you didn't even *think* about it".

The obvious one is for poll/select, but there people are aware of it.

The less obvious one is for taking a page fault or being put to sleep
and being woken up again while actively *inside* a wait loop, and
already with a wait-queue entry added.

For a historical example of *both* of those kinds of situations, take
a look at n_tty_read() in drivers/tty/n_tty.c, and in particular,
notice how the wait-loop looks something like this:

	add_wait_queue(&tty->read_wait, &wait);
	while (nr) {
		..
	}
	...
	remove_wait_queue(&tty->read_wait, &wait);

and *inside* the loop, you have both

 (a) locking:

	down_read(&tty->termios_rwsem);

 (b) historically (but not any more) user space accesses

and both of those mean that you actually have overlapping lifetimes of
wait-queues for the same process.

That's very much how wait-queues were always designed to work - it
solves a lot of race conditions (ie the traditional "sleep_on()" model
is broken garbage), but it means that each individual user doesn't
necessarily understand that other active wait-queues can wake up a
process *while* the code thinks about another wait-queue entirely.

IOW, when you are in the page fault code, and you are waiting for the
filesystem IO to complete, you may *also* be on that kind of
"tty->read_wait" waitqueue, and a character coming in on that tty will
wake you up.

The *filesystem* code doesn't care about the tty wakeup, so it's very
much an example of a "spurious" wake event as far as the filesystem
code is concerned.

Similarly, when you are locking a semaphore, the only wakeups you care
about at the time of the lock are the ones coming from the unlockers,
but you may get "spurious" wakeups from other things that the process
is *also* interested in.

Again, none of these are *really* spurious. They are real wakeup
events. It's just that within the *local* code they look spurious,
because the locking code, the disk IO code, whatever the code is
doesn't know or care about all the other things that process is
involved in.

And once again, it's not that different from "hey, signals can wake
you up at pretty much any time", but people *think* that because they
are doing a non-interruptible wait it is somehow "exclusive", and
don't think about all the other things that that process has been
involved in.

> * If there are no true spurious wakeups, where did the racing wakeup
>   come from? The task just got created w/ TASK_NEW and woken up once
>   with wake_up_new_task(). It hasn't been on any wait queue or
>   advertised itself to anything.

I don't think it was ever a spurious wakeup at all.

The create_worker() code does:

	worker->task = kthread_create_on_node(..
	..
	worker_attach_to_pool(worker, pool);
	..
	wake_up_process(worker->task);

and thinks that the wake_up_process() happens after the
worker_attach_to_pool().

But I don't see that at all. The reality seems to be that the
wake_up_process() is a complete no-op, because the task was already
woken up by kthread_create_on_node().

But I dunno. I think the whole argument of

	Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
	until it is explicitly woken.

was completely broken to begin with, and then the belief that there's
some spurious wakeup came from that. But hey, that
"wake_up_process(worker->task)" itself does seem spurious.

Anyway, I think this whole "spurious wakeup" is a complete red
herring. They don't "really" exist, but any code that expects some
exclusive wakeup tends to be broken code, and that kind of thinking is
dangerous.

So you should *ALWAYS* write your code as if things can get random
spurious wakeups at any time. Because with various CPU hotplug events
etc, it really might happen.

But I don't think that's what's going on here. I think the workqueue
code is just confused, and should have initialized "worker->pool" much
earlier.

Because as things are now, when worker_thread() starts running, and
does that

	static int worker_thread(void *__worker)
	{
		struct worker *worker = __worker;
		struct worker_pool *pool = worker->pool;

thing, that can happen *immediately* after that

	kthread_create_on_node(worker_thread, worker,

happens. It just so happens that *normally* the create_worker() code
ends up finishing setup before the new worker has actually finished
scheduling..

No?

               Linus