2021-04-27 21:19:02

by Jörn Engel

[permalink] [raw]
Subject: sched: wakeup setting TIF_NEED_RESCHED too frequently

I came across something that looks wrong and is certainly inefficient.
Here is a trace of what happened. Kernel was 5.4, so not too ancient.

+ nanoseconds before present
| + 0 for voluntary (going to sleep), 1 for involuntary (TASK_RUNNING)
| | + PID of prev task
| | | + PID of next task
| | | | + RT priority
| | | | | + comm
v v v v v v
8729751 0 6893 6894 0 one
8723802 0 6894 106051 0 one
8718671 1 106051 6899 0 other
8711411 0 6899 6900 0 one
8705643 0 6900 6902 0 one
8699444 0 6902 6903 0 one
8693987 0 6903 6906 0 one
8687832 0 6906 6904 0 one
8681188 0 6904 6901 0 one
8674083 0 6901 106051 0 one
8639249 1 106051 6893 0 other
8631961 0 6893 6894 0 one
8626684 0 6894 106051 0 one
8623375 1 106051 6899 0 other
8617434 0 6899 6900 0 one
8612341 0 6900 6901 0 one
8607215 0 6901 6902 0 one
8600824 0 6902 6903 0 one
8595214 0 6903 6906 0 one
8590161 0 6906 6904 0 one
8584825 0 6904 106051 0 one
8465395 1 106051 6893 0 other
8358696 0 6893 6903 0 one

The "one" process seems to frequently wake a bunch of threads that
do very little work before going back to sleep. Clearly inefficient,
but we are not interested in userspace problems here. The "other"
process has a thread that would like to continue running, but only gets
to run for very short periods, 3.3?s in one case. That also seems
inefficient and looks like either a scheduler or debatable policy.

My rule of thumb for task switch overhead is 1-3?s. Probably 1?s for
the raw switch, plus another 2?s for cold cache effects. If we want to
amortize that overhead a bit, time slices should be in the region of
1ms, give or take. I would also be happy if the "one" process would
have to wait 1ms before the woken threads actually get to run. So in my
book the observed behaviour is wrong, but opinions may differ.

Anyway, trying to find a cause, I noticed the following call chain:
set_nr_if_polling()
ttwu_queue_remote()
ttwu_queue()
try_to_wake_up()
default_wake_function()
curr->func()
__wake_up_common()
__wake_up_common_lock()
__wake_up()
wake_up()

Call chain above is manually created from source code. Closest sample I
caught with instrumentation is missing the leaf calls after
try_to_wake_up():
_raw_spin_unlock_irqrestore+0x1f/0x40
try_to_wake_up+0x425/0x5e0
wake_up_q+0x3f/0x80
futex_wake+0x159/0x180
do_futex+0xcd/0xba0

Afaics, the result is us setting TIF_NEED_RESCHED on any wakeup, unless
wake_list is already populated. Is that actually intentional? And is
that useful for performance of latency? I think it isn't, but I am
probably missing something here.

J?rn

--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster


2021-04-28 08:34:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: sched: wakeup setting TIF_NEED_RESCHED too frequently

On Tue, Apr 27, 2021 at 02:16:50PM -0700, J?rn Engel wrote:
> Anyway, trying to find a cause, I noticed the following call chain:
> set_nr_if_polling()
> ttwu_queue_remote()
> ttwu_queue()
> try_to_wake_up()
> default_wake_function()
> curr->func()
> __wake_up_common()
> __wake_up_common_lock()
> __wake_up()
> wake_up()
>
> Call chain above is manually created from source code. Closest sample I
> caught with instrumentation is missing the leaf calls after
> try_to_wake_up():
> _raw_spin_unlock_irqrestore+0x1f/0x40
> try_to_wake_up+0x425/0x5e0
> wake_up_q+0x3f/0x80
> futex_wake+0x159/0x180
> do_futex+0xcd/0xba0
>
> Afaics, the result is us setting TIF_NEED_RESCHED on any wakeup, unless
> wake_list is already populated. Is that actually intentional? And is
> that useful for performance of latency? I think it isn't, but I am
> probably missing something here.

set_nr_if_polling() only sets TIF_NEED_RESCHED IFF TIF_POLLING_NRFLAG is
set. TIF_POLLING_NRFLAG is only set when idle and is then sufficient to
wake that cpu and reschedule.

If not TIF_POLLING_NRFLAG, then we send an IPI to deal with the
non-empty wake_list.

2021-04-28 21:08:44

by Jörn Engel

[permalink] [raw]
Subject: Re: sched: wakeup setting TIF_NEED_RESCHED too frequently

On Wed, Apr 28, 2021 at 10:30:05AM +0200, Peter Zijlstra wrote:
>
> set_nr_if_polling() only sets TIF_NEED_RESCHED IFF TIF_POLLING_NRFLAG is
> set. TIF_POLLING_NRFLAG is only set when idle and is then sufficient to
> wake that cpu and reschedule.
>
> If not TIF_POLLING_NRFLAG, then we send an IPI to deal with the
> non-empty wake_list.

Yeah, saw that myself. My explanation of the behaviour is wrong. The
observed behaviour remains, though.

Guess I have to add more instrumentation if I cannot find the cause by
reading the code.

Thank you!

J?rn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall