2023-09-12 18:26:27

by Peter Zijlstra

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
> >
> >> I was hoping that we'd have some generic way to deal with this where
> >> we could just say "this thing is reschedulable", and get rid of - or
> >> at least not increasingly add to - the cond_resched() mess.
> >
> > Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> > required to know when/if we can preempt.
> >
> > The whole voluntary preempt model is basically the traditional
> > co-operative preemption model and that fully relies on manual yields.
>
> Yeah, but as Linus says, this means a lot of code is just full of
> cond_resched(). For instance a loop the process_huge_page() uses
> this pattern:
>
> for (...) {
> cond_resched();
> clear_page(i);
>
> cond_resched();
> clear_page(j);
> }

Yeah, that's what co-operative preemption gets you.

> > The problem with the REP prefix (and Xen hypercalls) is that
> > they're long running instructions and it becomes fundamentally
> > impossible to put a cond_resched() in.
> >
> >> Yes. I'm starting to think that the only sane solution is to
> >> limit cases that can do this a lot, and the "instruction pointer
> >> region" approach would certainly work.
> >
> > From a code locality / I-cache POV, I think a sorted list of
> > (non overlapping) ranges might be best.
>
> Yeah, agreed. There are a few problems with doing that though.
>
> I was thinking of using a check of this kind to schedule out when
> it is executing in this "reschedulable" section:
> !preempt_count() && in_resched_function(regs->rip);
>
> For preemption=full, this should mostly work.
> For preemption=voluntary, though this'll only work with out-of-line
> locks, not if the lock is inlined.
>
> (Both, should have problems with __this_cpu_* and the like, but
> maybe we can handwave that away with sparse/objtool etc.)

So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
thing, and then only search the range when TIF flag is set.

And I'm thinking it might be a good idea to have objtool validate the
range only contains simple instructions, the moment it contains control
flow I'm thinking it's too complicated.
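
For illustration, a combined check along those lines could look roughly
like the sketch below; the range table and helper names are placeholders
for this discussion, not part of the actual patch:

        /*
         * Sketch only: a sorted, non-overlapping table of "reschedulable"
         * text ranges, consulted on interrupt return once the task has
         * opted in via TIF_ALLOW_RESCHED.
         */
        struct resched_range {
                unsigned long start;
                unsigned long end;
        };

        extern const struct resched_range resched_ranges[];
        extern const unsigned int nr_resched_ranges;

        static bool in_resched_range(unsigned long ip)
        {
                unsigned int lo = 0, hi = nr_resched_ranges;

                /* binary search over the sorted, non-overlapping ranges */
                while (lo < hi) {
                        unsigned int mid = lo + (hi - lo) / 2;

                        if (ip < resched_ranges[mid].start)
                                hi = mid;
                        else if (ip >= resched_ranges[mid].end)
                                lo = mid + 1;
                        else
                                return true;
                }
                return false;
        }

        /* Interrupt return to kernel mode: only search when the flag is set. */
        static bool irq_preempt_allowed(struct pt_regs *regs)
        {
                return test_thread_flag(TIF_ALLOW_RESCHED) &&
                       !preempt_count() &&
                       in_resched_range(instruction_pointer(regs));
        }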

> How expensive would be always having PREEMPT_COUNT=y?

Effectively I think that is true today. At the very least Debian and
SuSE (I can't find a RHEL .config in a hurry but I would think they too)
ship with PREEMPT_DYNAMIC=y.

Mel, I'm sure you ran numbers at the time (you always do): what, if any,
was the measured overhead from PREEMPT_DYNAMIC vs 'regular' voluntary
preemption?


2023-09-13 05:47:15

by Matthew Wilcox

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Tue, Sep 12, 2023 at 10:26:06AM +0200, Peter Zijlstra wrote:
> > How expensive would be always having PREEMPT_COUNT=y?
>
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.

$ grep PREEMPT uek-rpm/ol9/config-x86_64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

$ grep PREEMPT uek-rpm/ol9/config-aarch64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

2023-09-19 02:00:07

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>> > The problem with the REP prefix (and Xen hypercalls) is that
>> > they're long running instructions and it becomes fundamentally
>> > impossible to put a cond_resched() in.
>> >
>> >> Yes. I'm starting to think that the only sane solution is to
>> >> limit cases that can do this a lot, and the "instruction pointer
>> >> region" approach would certainly work.
>> >
>> > From a code locality / I-cache POV, I think a sorted list of
>> > (non overlapping) ranges might be best.
>>
>> Yeah, agreed. There are a few problems with doing that though.
>>
>> I was thinking of using a check of this kind to schedule out when
>> it is executing in this "reschedulable" section:
>> !preempt_count() && in_resched_function(regs->rip);
>>
>> For preemption=full, this should mostly work.
>> For preemption=voluntary, though this'll only work with out-of-line
>> locks, not if the lock is inlined.
>>
>> (Both, should have problems with __this_cpu_* and the like, but
>> maybe we can handwave that away with sparse/objtool etc.)
>
> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
> thing, and then only search the range when TIF flag is set.
>
> And I'm thinking it might be a good idea to have objtool validate the
> range only contains simple instructions, the moment it contains control
> flow I'm thinking it's too complicated.

Can we take a step back and look at the problem from a scheduling
perspective?

The basic operation of a non-preemptible kernel is time slice
scheduling, which means that a task can run more or less undisturbed for
a full time slice once it gets on the CPU unless it schedules away
voluntary via a blocking operation.

This works pretty well as long as everything runs in userspace as the
preemption points in the return to user space path are independent of
the preemption model.

These preemption points handle both time slice exhaustion and priority
based preemption.

With PREEMPT=NONE these are the only available preemption points.

That means that kernel code can run more or less indefinitely until it
schedules out or returns to user space, which is obviously not possible
for kernel threads.

To prevent starvation the kernel gained voluntary preemption points,
i.e. cond_resched(), which has to be added manually to code as a
developer sees fit.

Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
additional preemption points. might_resched() utilizes the existing
might_sleep() debug points, which are in code paths which might block on
a contended resource. These debug points are mostly in core and
infrastructure code and are in code paths which can block anyway. The
only difference is that they allow preemption even when the resource is
uncontended.

Additionally we have PREEMPT=FULL which utilizes every zero transition
of preempt_count as a potential preemption point.

Now we have the situation of long running data copies or data clear
operations which run fully in hardware, but can be interrupted. As the
interrupt return to kernel mode does not preempt in the NONE and
VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
chunk size and adding cond_resched() again.

That's ugly and does not work for long lasting hardware operations so we
ended up with the suggestion of TIF_ALLOW_RESCHED to work around
that. But again this needs to be manually annotated in the same way as an
IP range based preemption scheme requires annotation.

TBH. I detest all of this.

Both cond_resched() and might_sleep/sched() are completely random
mechanisms as seen from time slice operation and the data chunk based
mechanism is just a heuristic which works as well as heuristics tend to
work. allow_resched() is not any different and IP based preemption
mechanisms are not going to be any better.

The approach here is: prevent the scheduler from making decisions and then
mitigate the fallout with heuristics.

That's just backwards as it moves resource control out of the scheduler
into random code which has absolutely no business to do resource
control.

We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
held sections became preemptible caused even more preemption activity
than on a PREEMPT=FULL kernel. The worst side effect of that was
extensive lock contention.

The way how we addressed that was to add a lazy preemption mode, which
tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
preempt tasks which all belong to the SCHED_OTHER scheduling class. This
works pretty well and gains back a massive amount of performance for the
non-realtime throughput oriented tasks without affecting the
schedulability of real-time tasks at all. IOW, it does not take control
away from the scheduler. It cooperates with the scheduler and leaves the
ultimate decisions to it.

I think we can do something similar for the problem at hand, which
avoids most of these heuristic horrors and control boundary violations.

The main issue is that long running operations do not honour the time
slice and we work around that with cond_resched() and now have ideas
with this new TIF bit and IP ranges.

None of that is really well defined in respect to time slices. In fact
its not defined at all versus any aspect of scheduling behaviour.

What about the following:

1) Keep preemption count and the real preemption points enabled
unconditionally. That's not more overhead than the current
DYNAMIC_PREEMPT mechanism as long as the preemption count does not
go to zero, i.e. the folded NEED_RESCHED bit stays set.

From earlier experiments I know that the overhead of preempt_count
is minimal and only really observable with micro benchmarks.
Otherwise it ends up in the noise as long as the slow path is not
taken.

I did a quick check comparing a plain inc/dec pair vs. the
DYNAMIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
in the non-conclusive noise.

20 years ago this was a real issue because we did not have:

- the folding of NEED_RESCHED into the preempt count

- the cacheline optimizations which make the preempt count cache
pretty much always cache hot

- the hardware was way less capable

I'm not saying that preempt_count is completely free today as it
obviously adds more text and affects branch predictors, but as the
major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
acceptable and tolerable tradeoff.

2) When the scheduler wants to set NEED_RESCHED, it sets
NEED_RESCHED_LAZY instead which is only evaluated in the return to
user space preemption points.

As NEED_RESCHED_LAZY is not folded into the preemption count the
preemption count won't become zero, so the task can continue until
it hits return to user space.

That preserves the existing behaviour.

3) When the scheduler tick observes that the time slice is exhausted,
then it folds the NEED_RESCHED bit into the preempt count which
causes the real preemption points to actually preempt including
the return from interrupt to kernel path.

That even allows the scheduler to enforce preemption for e.g. RT
class tasks without changing anything else.
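
As an illustration of #2 and #3 (the flag and helper names below are
assumptions for this sketch, not the actual implementation):

        /* #2: default path, only noticed on return to user space */
        static void resched_curr_lazy(struct rq *rq)
        {
                /* preempt_count is left alone, no in-kernel preemption */
                set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
        }

        /* #3: scheduler tick observes that the time slice is exhausted */
        static void tick_check_slice(struct rq *rq)
        {
                if (curr_slice_exhausted(rq))   /* assumed helper */
                        resched_curr(rq);       /* sets TIF_NEED_RESCHED and
                                                 * folds it into preempt_count,
                                                 * so the real preemption
                                                 * points fire */
        }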

I'm pretty sure that this gets rid of cond_resched(), which is an
impressive list of instances:

./drivers 392
./fs 318
./mm 189
./kernel 184
./arch 95
./net 83
./include 46
./lib 36
./crypto 16
./sound 16
./block 11
./io_uring 13
./security 11
./ipc 3

That list clearly documents that the majority of these
cond_resched() invocations is in code which neither should care
nor should have any influence on the core scheduling decision
machinery.

I think it's worth a try as it just fits into the existing preemption
scheme, solves the issue of long running kernel functions, prevents
invalid preemption and can utilize the existing instrumentation and
debug infrastructure.

Most importantly it gives control back to the scheduler and does not
make it depend on the mercy of cond_resched(), allow_resched() or
whatever heuristics sprinkled all over the kernel.

To me this makes a lot of sense, but I might be on the completely wrong
track. So feel free to tell me that I'm completely nuts and/or just not
seeing the obvious.

Thanks,

tglx

2023-09-19 09:48:28

by Linus Torvalds

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
> unconditionally.

Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.

> 20 years ago this was a real issue because we did not have:
>
> - the folding of NEED_RESCHED into the preempt count
>
> - the cacheline optimizations which make the preempt count cache
> pretty much always cache hot
>
> - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.

Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.

> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.

Is this just to try to emulate the existing PREEMPT_NONE behavior?

If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.

And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.

But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.
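
The end state would then roughly be the debug check alone (a sketch,
assuming the current __might_sleep() signature):

        /* Today: debug check plus a voluntary preemption point:
         *   do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
         *
         * After: purely DEBUG_ATOMIC_SLEEP, no scheduling side effect.
         */
        # define might_sleep() \
                do { __might_sleep(__FILE__, __LINE__); } while (0)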

That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.

Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").

However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).

So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in others points.

But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.

So what I think you are saying is that we'd have the following situation:

- scheduling at "return to user space" is presumably always a good thing.

A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
whatever) would cover that, and would give us basically the existing
CONFIG_PREEMPT_NONE behavior.

So a config variable (either compile-time with PREEMPT_NONE or a
dynamic one with DYNAMIC_PREEMPT set to none) would make any external
wakeup only set that bit.

And then a "fully preemptible low-latency desktop" would set the
preempt-count bit too.

- but the "timeslice over" case would always set the
preempt-count-bit, regardless of any config, and would guarantee that
we have reasonable latencies.

This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.

Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".

I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it). But I thought I'd
mention it.

Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.

Linus

2023-09-19 12:19:08

by Ingo Molnar

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


* Linus Torvalds <[email protected]> wrote:

> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
> >
> > What about the following:
> >
> > 1) Keep preemption count and the real preemption points enabled
> > unconditionally.
>
> Well, it's certainly the simplest solution, and gets rid of not just
> the 'rep string' issue, but gets rid of all the cond_resched() hackery
> entirely.
>
> > 20 years ago this was a real issue because we did not have:
> >
> > - the folding of NEED_RESCHED into the preempt count
> >
> > - the cacheline optimizations which make the preempt count cache
> > pretty much always cache hot
> >
> > - the hardware was way less capable
> >
> > I'm not saying that preempt_count is completely free today as it
> > obviously adds more text and affects branch predictors, but as the
> > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> > acceptable and tolerable tradeoff.
>
> Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most
> distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY
> approach isn't actually used, and is only causing pain.

The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server
distros - and that's the behavior that enterprise distros truly cared
about.

Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise'
category for all major distros I'd say.

And that's what Thomas's proposal achieves: keep the nicely execution-batched
NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the
latency advantages of fully-preemptible kernel code for RT and critical
tasks.

So I'm fully on board with this. It would reduce the number of preemption
variants to just two: regular kernel and PREEMPT_RT. Yummie!

> > 2) When the scheduler wants to set NEED_RESCHED, it sets
> > NEED_RESCHED_LAZY instead which is only evaluated in the return to
> > user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

Yes: I'd guesstimate that the batching caused by timeslice-laziness that is
naturally part of NONE/VOLUNTARY resolves ~90%+ of observable
macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT.

> If the new world order is that the time slice is always honored, then the
> "this might be a latency issue" goes away. Good.
>
> And we'd also get better coverage for the *debug* aim of "might_sleep()"
> and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always
> existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that might_sleep()
> would *only* be that DEBUG_ATOMIC_SLEEP thing.

Correct. And that's even a minor code generation advantage, as we wouldn't
have these additional hundreds of random/statistical preemption checks.

> That argues for your suggestion too, since we had a performance issue due
> to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").

4542057e18ca is arguably kind of a workaround though - and with the
preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency
advantages *and* the execution-batching performance advantages of
NONE/VOLUNTARY that 4542057e18ca exposed.

> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the preempt
> count is zero), but there are also some places that are *better* than
> other places to schedule (for example, when we don't hold any other
> locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where preemption
> might be *better* than in others points.

So in the broadest sense we have 3 stages of pending preemption:

NEED_RESCHED_LAZY
NEED_RESCHED_SOON
NEED_RESCHED_NOW

And we'd transition:

- from 0 -> SOON when an eligible task is woken up,
- from LAZY -> SOON when current timeslice is exhausted,
- from SOON -> NOW when no locks/resources are held.

[ With a fast-track for RT or other urgent tasks to enter NOW immediately. ]
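
Expressed as states, purely to illustrate the transitions above (these
names exist only in this discussion):

        enum resched_pending {
                RESCHED_PENDING_NONE,
                RESCHED_PENDING_LAZY,   /* act at return to user space only   */
                RESCHED_PENDING_SOON,   /* eligible wakeup or slice exhausted */
                RESCHED_PENDING_NOW,    /* no locks/resources held: act now   */
        };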

On the regular kernels it's probably not worth modeling the SOON/NOW split,
as we'd have to track the depth of sleeping locks as well, which we don't
do right now.

On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we
are aware of locking depth already and it would be relatively cheap to
check for it on natural 0-preempt_count boundaries.


> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from a
> different background.

Correct, they come from two sources:

- There are hundreds of points that we know are 'technically correct'
preemption points, and they break up ~90% of long latencies by brute
force & chance.

- Explicitly identified problem points that added a cond_resched() or its
equivalent. These are rare and also tend to bitrot, because *removing*
them is always more risky than adding them, so they tend to accumulate.

> So what I think you are saying is that we'd have the following
> situation:
>
> - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

I'd even argue that we only need two preemption modes, and that 'fully
preemptible low-latency desktop' is an artifact of poor latencies on
PREEMPT_NONE.

Ie. in the long run - after a careful period of observing performance
regressions and other dragons - we'd only have *two* preemption modes left:

!PREEMPT_RT # regular kernel. Single default behavior.
PREEMPT_RT=y # -rt kernel, because rockets, satellites & cars matter.

Any other application level preemption preferences can be expressed via
scheduling policies & priorities.

Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or
PREEMPT_NONE in any of their variants, probably not even as runtime knobs.

People who want shorter timeslices can set shorter timeslices, and people
who want immediate preemption of certain tasks can manage priorities.

> - but the "timeslice over" case would always set the preempt-count-bit,
> regardless of any config, and would guarantee that we have reasonable
> latencies.

Yes. Probably a higher nice-priority task becoming runnable would cause
immediate preemption too, in addition to RT tasks.

Ie. the execution batching would be for same-priority groups of SCHED_OTHER
tasks.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

Yep.

> Then the question becomes whether we'd want to introduce a *new* concept,
> which is a "if you are going to schedule, do it now rather than later,
> because I'm taking a lock, and while it's a preemptible lock, I'd rather
> not sleep while holding this resource".

Something close to this concept is naturally available on PREEMPT_RT
kernels, which only use a single central lock primitive (rt_mutex), but it
would have to be added explicitly for regular kernels.

We could do the following intermediate step:

- Remove all the random cond_resched() points such as might_sleep()
- Turn all explicit cond_resched() points into 'ideal point to reschedule'.

- Maybe even rename it from cond_resched() to resched_point(), to signal
the somewhat different role.

While cond_resched() and resched_point() are not 100% matches, they are
close enough, as most existing cond_resched() points were added to places
that cause the least amount of disruption with held resources.

But I think it would be better to add resched_point() as a new API, and add
it to places where there's a performance benefit. Clean slate,
documentation, and all that.

> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem with
> might_sleep() was that it actively *moved* the scheduling point to a bad
> place, not that scheduling could happen there, so instead of optimizing
> scheduling, it actively pessimized it). But I thought I'd mention it.
>
> Anyway, I'm definitely not opposed. We'd get rid of a config option that
> is presumably not very widely used, and we'd simplify a lot of issues,
> and get rid of all these badly defined "cond_preempt()" things.

I think we can get rid of *all* the preemption model Kconfig knobs, except
PREEMPT_RT. :-)

Thanks,

Ingo

2023-09-19 12:34:31

by Ingo Molnar

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


* Ingo Molnar <[email protected]> wrote:

> > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most
> > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY
> > approach isn't actually used, and is only causing pain.
>
> The macro-behavior of NONE/VOLUNTARY is still used & relied upon in
> server distros - and that's the behavior that enterprise distros truly
> cared about.
>
> Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the
> 'noise' category for all major distros I'd say.
>
> And that's what Thomas's proposal achieves: keep the nicely
> execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER
> tasks, while having the latency advantages of fully-preemptible kernel
> code for RT and critical tasks.
>
> So I'm fully on board with this. It would reduce the number of preemption
> variants to just two: regular kernel and PREEMPT_RT. Yummie!

As an additional side note: with various changes such as EEVDF the
scheduler is a lot less preemption-happy these days, without wrecking
latencies & timeslice distribution.

So in principle we might not even need the NEED_RESCHED_LAZY extra bit,
which -rt uses as a kind of additional layer to make sure they don't change
scheduling policy.

Ie. a modern scheduler might have mooted much of this change:

4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")

... because now we'll only reschedule on timeslice exhaustion, or if a task
comes in with a big deadline deficit.

And even the deadline-deficit wakeup preemption can be turned off further
with:

$ echo NO_WAKEUP_PREEMPTION > /debug/sched/features

And we are considering making that the default behavior for same-prio tasks
- basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which
should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

Thanks,

Ingo

2023-09-19 14:02:03

by Ingo Molnar

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


* Thomas Gleixner <[email protected]> wrote:

> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.

Just to complete this nice new entry to Documentation/sched/: in
PREEMPT=FULL there's also IRQ-return driven preemption of kernel-mode code,
at almost any instruction boundary the hardware allows, in addition to the
preemption driven by regular zero transition of preempt_count in
syscall/kthread code.
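
For reference, that IRQ-return driven preemption is roughly the following
check in kernel/entry/common.c (slightly simplified here):

        void raw_irqentry_exit_cond_resched(void)
        {
                if (!preempt_count()) {
                        /* Sanity check RCU and thread stack */
                        rcu_irq_exit_check_preempt();
                        if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
                                WARN_ON_ONCE(!on_thread_stack());
                        if (need_resched())
                                preempt_schedule_irq();
                }
        }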

Thanks,

Ingo

2023-09-19 14:29:25

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Ingo!

On Tue, Sep 19 2023 at 10:03, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>> Then the question becomes whether we'd want to introduce a *new* concept,
>> which is a "if you are going to schedule, do it now rather than later,
>> because I'm taking a lock, and while it's a preemptible lock, I'd rather
>> not sleep while holding this resource".
>
> Something close to this concept is naturally available on PREEMPT_RT
> kernels, which only use a single central lock primitive (rt_mutex), but it
> would have to be added explicitly for regular kernels.
>
> We could do the following intermediate step:
>
> - Remove all the random cond_resched() points such as might_sleep()
> - Turn all explicit cond_resched() points into 'ideal point to reschedule'.
>
> - Maybe even rename it from cond_resched() to resched_point(), to signal
> the somewhat different role.
>
> While cond_resched() and resched_point() are not 100% matches, they are
> close enough, as most existing cond_resched() points were added to places
> that cause the least amount of disruption with held resources.
>
> But I think it would be better to add resched_point() as a new API, and add
> it to places where there's a performance benefit. Clean slate,
> documentation, and all that.

Let's not go there. You just replace one magic mushroom with a different
flavour. We want to get rid of them completely.

The whole point is to let the scheduler decide and give it enough
information to make informed decisions.

So with the LAZY scheme in effect, there is no real reason to have these
extra points and I'd rather add task::sleepable_locks_held and do that
accounting in the relevant lock/unlock paths. Based on that the
scheduler can decide whether it grants a time slice expansion or just
says no.

That's extremely cheap and well defined.

You can document the hell out of resched_point(), but it won't be any
different from the existing ones; it will always be subject to personal
preference and goals, and it's going to be sprinkled all over the place
just like the existing ones. So where is the gain?

Thanks,

tglx




2023-09-19 17:12:30

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Linus!

On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
>> 2) When the scheduler wants to set NEED_RESCHED, it sets
>> NEED_RESCHED_LAZY instead which is only evaluated in the return to
>> user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

To some extent yes.

> If the new world order is that the time slice is always honored, then
> the "this might be a latency issue" goes away. Good.

That's the point.

> And we'd also get better coverage for the *debug* aim of
> "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
> PREEMPT_COUNT always existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that
> might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

True. And this gives the scheduler the flexibility to enforce preemption
under certain conditions, e.g. when a task with RT scheduling class or a
task with a sporadic event handler is woken up. That's what VOLUNTARY
tries to achieve with all the might_sleep()/might_resched() magic.

> That argues for your suggestion too, since we had a performance issue
> due to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").

Awesome.

> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the
> preempt count is zero), but there are also some places that are
> *better* than other places to schedule (for example, when we don't
> hold any other locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where
> preemption might be *better* than in others points.
>
> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from
> a different background.

They are subject to subsystem/driver specific preferences and therefore
biased towards a certain usage scenario, which is not necessarily to the
benefit of everyone else.

> So what I think you are saying is that we'd have the following situation:
>
> - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

Correct.

> - but the "timeslice over" case would always set the
> preempt-count-bit, regardless of any config, and would guarantee that
> we have reasonable latencies.

Yes. That's the reasoning.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

:)

So the decision matrix would be:

              Ret2user    Ret2kernel    PreemptCnt=0

NEED_RESCHED     Y            Y              Y
LAZY_RESCHED     Y            N              N

That is completely independent of the preemption model and the
differentiation of the preemption models happens solely at the scheduler
level:

PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
slice, in which case it sets NEED_RESCHED.

PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
tasks or sporadic event tasks sets NEED_RESCHED too.

PREEMPT_FULL always sets NEED_RESCHED like today.
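
Sketched as a single decision helper in the core scheduler; the
preempt_model_*() helpers exist today, the other names are made up for
illustration:

        static void resched_curr_policy(struct rq *rq, bool slice_exhausted,
                                        bool urgent_wakeup)
        {
                if (preempt_model_full() || slice_exhausted ||
                    (preempt_model_voluntary() && urgent_wakeup)) {
                        /* NEED_RESCHED: Ret2user, Ret2kernel and PreemptCnt=0 */
                        resched_curr(rq);
                } else {
                        /* LAZY_RESCHED: only the return to user space path */
                        resched_curr_lazy(rq);
                }
        }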

We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
only end up with two variants or even subsume PREEMPT_FULL into that
model because that's what is closer to the RT LAZY preempt behaviour,
which has two goals:

1) Make low latency guarantees for RT workloads

2) Preserve the throughput for non-RT workloads

But in any case this decision happens solely in the core scheduler code
and nothing outside of it needs to be changed.

So we not only get rid of the cond/might_resched() muck, we also get rid
of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
The only place which still needs that runtime tweaking is the scheduler
itself.

Though it just occurred to me that there are dragons lurking:

arch/alpha/Kconfig: select ARCH_NO_PREEMPT
arch/hexagon/Kconfig: select ARCH_NO_PREEMPT
arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE
arch/um/Kconfig: select ARCH_NO_PREEMPT

So we have four architectures which refuse to enable preemption points,
i.e. the only model they allow is NONE and they rely on cond_resched()
for breaking large computations.

But they support PREEMPT_COUNT, so we might get away with a reduced
preemption point coverage:

              Ret2user    Ret2kernel    PreemptCnt=0

NEED_RESCHED     Y            N              Y
LAZY_RESCHED     Y            N              N

i.e. the only difference is that Ret2kernel is not a preemption
point. That's where the scheduler tick enforcement of the time slice
happens.

It still might work out good enough and if not then it should not be
real rocket science to add that Ret2kernel preemption point to cure it.

> Then the question becomes whether we'd want to introduce a *new*
> concept, which is a "if you are going to schedule, do it now rather
> than later, because I'm taking a lock, and while it's a preemptible
> lock, I'd rather not sleep while holding this resource".
>
> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem
> with might_sleep() was that it actively *moved* the scheduling point
> to a bad place, not that scheduling could happen there, so instead of
> optimizing scheduling, it actively pessimized it). But I thought I'd
> mention it.

I think we want to avoid that completely and if this becomes an issue,
we rather be smart about it at the core level.

It's trivial enough to have a per task counter which tells whether a
preemptible lock is held (or about to be acquired) or not. Then the
scheduler can take that hint into account and decide to grant a
timeslice extension once in the expectation that the task leaves the
lock held section soonish and either returns to user space or schedules
out. It still can enforce it later on.
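
A minimal sketch of that hint; the field and helpers below are
illustrative assumptions, not an existing interface:

        /* Maintained by the sleepable lock slow paths */
        static inline void sleepable_lock_acquired(void)
        {
                current->sleepable_locks_held++;
        }

        static inline void sleepable_lock_released(void)
        {
                current->sleepable_locks_held--;
        }

        /* Scheduler tick, time slice exhausted */
        static void tick_enforce_slice(struct rq *rq, struct task_struct *curr)
        {
                if (curr->sleepable_locks_held && !curr->slice_extended) {
                        /* grant one extension, enforce on a later tick */
                        curr->slice_extended = 1;
                        return;
                }
                resched_curr(rq);
        }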

We really want to let the scheduler decide and rather give it proper
hints at the conceptual level instead of letting developers make random
decisions which might work well for a particular use case and completely
suck for the rest. I think we wasted enough time already on those.

> Anyway, I'm definitely not opposed. We'd get rid of a config option
> that is presumably not very widely used, and we'd simplify a lot of
> issues, and get rid of all these badly defined "cond_preempt()"
> things.

Hmm. Didn't I promise a year ago that I wouldn't do further large scale
cleanups and simplifications beyond printk?

Maybe I get away this time with just suggesting it. :)

Thanks,

tglx

2023-09-19 19:40:40

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Tue, Sep 19 2023 at 10:43, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
> Ie. a modern scheduler might have mooted much of this change:
>
> 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")
>
> ... because now we'll only reschedule on timeslice exhaustion, or if a task
> comes in with a big deadline deficit.
>
> And even the deadline-deficit wakeup preemption can be turned off further
> with:
>
> $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> And we are considering making that the default behavior for same-prio tasks
> - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which
> should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

I don't think that you can get rid of NEED_RESCHED_LAZY for !RT because
there is a clear advantage of having the return to user preemption
point.

It spares the kernel/user transition just to get the task back
via the timeslice interrupt. I experimented with that on RT and the
result was definitely worse.

We surely can revisit that, but I'd really start with the
straightforward mappable LAZY bit approach and if experimentation turns out to
provide good enough results by not setting that bit at all, then we
still can do so without changing anything except the core scheduler
decision logic.

It's again a cheap thing due to the way how the return to user TIF
handling works:

        ti_work = read_thread_flags();
        if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
                ti_work = exit_to_user_mode_loop(regs, ti_work);

TIF_LAZY_RESCHED is part of EXIT_TO_USER_MODE_WORK, so the non-work case
does not become more expensive than today. If any of the bits is set,
then the slowpath won't get measurably different performance whether the bit
is evaluated or not in exit_to_user_mode_loop().

As we really want TIF_LAZY_RESCHED for RT, we just keep all of this
consistent in terms of code and make it purely a scheduler decision whether it
utilizes it or not. As a consequence PREEMPT_RT is no longer special in
that regard and the main RT difference becomes the lock substitution and
forced interrupt threading.

For the magic 'spare me the extra conditional' optimization of
exit_to_user_mode_loop() if LAZY can be optimized out for !RT because
the scheduler is sooo clever (which I doubt), we can just use the same
approach as for other TIF bits and define them to 0 :)
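
That trick, sketched with placeholder names (the work mask is reduced to
a few bits for illustration):

        #ifdef CONFIG_PREEMPT_LAZY
        # define TIF_NEED_RESCHED_LAZY  9
        # define _TIF_NEED_RESCHED_LAZY (1UL << TIF_NEED_RESCHED_LAZY)
        #else
        # define _TIF_NEED_RESCHED_LAZY 0UL
        #endif

        /*
         * With the bit defined to 0 it simply vanishes from the work mask
         * and from every check in exit_to_user_mode_loop() at compile time.
         */
        #define EXIT_TO_USER_MODE_WORK                          \
                (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |         \
                 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)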

So let's start consistent and optimize on top if really required.

Thanks,

tglx

2023-09-20 02:45:57

by Ankur Arora

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


Thomas Gleixner <[email protected]> writes:

> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruction pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>> !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both, should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate the
>> range only contains simple instructions, the moment it contains control
>> flow I'm thinking it's too complicated.
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed for
> a full time slice once it gets on the CPU unless it schedules away
> voluntary via a blocking operation.
>
> This works pretty well as long as everything runs in userspace as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths which might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code and are in code paths which can block anyway. The
> only difference is that they allow preemption even when the resource is
> uncontended.
>
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
>
> Now we have the situation of long running data copies or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
>
> That's ugly and does not work for long lasting hardware operations so we
> ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated in the same way as an
> IP range based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation and the data chunk based
> mechanism is just a heuristic which works as well as heuristics tend to
> work. allow_resched() is not any different and IP based preemption
> mechanisms are not going to be any better.

Agreed. I was looking at how to add resched sections etc, and in
addition to the randomness the choice of where exactly to add it seemed
to be quite fuzzy. A recipe for future cruft.

> The approach here is: prevent the scheduler from making decisions and then
> mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business to do resource
> control.
>
> We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
> held sections became preemptible caused even more preemption activity
> than on a PREEMPT=FULL kernel. The worst side effect of that was
> extensive lock contention.
>
> The way how we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
> preempt tasks which all belong to the SCHED_OTHER scheduling class. This
> works pretty well and gains back a massive amount of performance for the
> non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves the
> ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined in respect to time slices. In fact
> its not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
> unconditionally. That's not more overhead than the current
> DYNAMIC_PREEMPT mechanism as long as the preemption count does not
> go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
> From earlier experiments I know that the overhead of preempt_count
> is minimal and only really observable with micro benchmarks.
> Otherwise it ends up in the noise as long as the slow path is not
> taken.
>
> I did a quick check comparing a plain inc/dec pair vs. the
> DYNAMIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
> in the non-conclusive noise.
>
> 20 years ago this was a real issue because we did not have:
>
> - the folding of NEED_RESCHED into the preempt count
>
> - the cacheline optimizations which make the preempt count cache
> pretty much always cache hot
>
> - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.
>
> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.
>
> 3) When the scheduler tick observes that the time slice is exhausted,
> then it folds the NEED_RESCHED bit into the preempt count which
> causes the real preemption points to actually preempt including
> the return from interrupt to kernel path.

Right, and currently we check cond_resched() all the time in expectation
that something might need a resched.

Folding it in with the scheduler determining when the next preemption happens
seems to make a lot of sense to me.


Thanks
Ankur

> That even allows the scheduler to enforce preemption for e.g. RT
> class tasks without changing anything else.
>
> I'm pretty sure that this gets rid of cond_resched(), which is an
> impressive list of instances:
>
> ./drivers 392
> ./fs 318
> ./mm 189
> ./kernel 184
> ./arch 95
> ./net 83
> ./include 46
> ./lib 36
> ./crypto 16
> ./sound 16
> ./block 11
> ./io_uring 13
> ./security 11
> ./ipc 3
>
> That list clearly documents that the majority of these
> cond_resched() invocations is in code which neither should care
> nor should have any influence on the core scheduling decision
> machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics sprinkled all over the kernel.

> To me this makes a lot of sense, but I might be on the completely wrong
> track. So feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
> tglx


--
ankur

2023-09-20 22:13:06

by Ankur Arora

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


Thomas Gleixner <[email protected]> writes:

> So the decision matrix would be:
>
>               Ret2user    Ret2kernel    PreemptCnt=0
>
> NEED_RESCHED     Y            Y              Y
> LAZY_RESCHED     Y            N              N
>
> That is completely independent of the preemption model and the
> differentiation of the preemption models happens solely at the scheduler
> level:

This is relatively minor, but do we need two flags? Seems to me we
can get to the same decision matrix by letting the scheduler fold
NEED_RESCHED into the preempt-count based on the current preemption model.

> PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
> slice, in which case it sets NEED_RESCHED.

PREEMPT_NONE sets up TIF_NEED_RESCHED. For the time-slice expiry case,
also fold it into the preempt-count.

> PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
> tasks or sporadic event tasks sets NEED_RESCHED too.

PREEMPT_VOLUNTARY sets up TIF_NEED_RESCHED and also folds it for the
RT/sporadic tasks.

> PREEMPT_FULL always sets NEED_RESCHED like today.

Always fold the TIF_NEED_RESCHED into the preempt-count.

>> We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
> only end up with two variants or even subsume PREEMPT_FULL into that
> model because that's what is closer to the RT LAZY preempt behaviour,
> which has two goals:
>
> 1) Make low latency guarantees for RT workloads
>
> 2) Preserve the throughput for non-RT workloads
>
> But in any case this decision happens solely in the core scheduler code
> and nothing outside of it needs to be changed.
>
> So we not only get rid of the cond/might_resched() muck, we also get rid
> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
> The only place which still needs that runtime tweaking is the scheduler
> itself.

True. The dynamic preemption could just become a scheduler tunable.

> Though it just occurred to me that there are dragons lurking:
>
> arch/alpha/Kconfig: select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig: select ARCH_NO_PREEMPT
> arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig: select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.
>
> But they support PREEMPT_COUNT, so we might get away with a reduced
> preemption point coverage:
>
>               Ret2user    Ret2kernel    PreemptCnt=0
>
> NEED_RESCHED     Y            N              Y
> LAZY_RESCHED     Y            N              N

So from the discussion in the other thread, for the ARCH_NO_PREEMPT
configs that don't support preemption, we probably need a fourth
preemption model, say PREEMPT_UNSAFE.

These could use only the Ret2user preemption points and just fallback
to the !PREEMPT_COUNT primitives.

Thanks

--
ankur

2023-09-20 23:18:04

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>
>> So the decision matrix would be:
>>
>>               Ret2user    Ret2kernel    PreemptCnt=0
>>
>> NEED_RESCHED     Y            Y              Y
>> LAZY_RESCHED     Y            N              N
>>
>> That is completely independent of the preemption model and the
>> differentiation of the preemption models happens solely at the scheduler
>> level:
>
> This is relatively minor, but do we need two flags? Seems to me we
> can get to the same decision matrix by letting the scheduler fold
> into the preempt-count based on current preemption model.

You still need the TIF flags because there is no way to do remote
modification of preempt count.

The preempt count folding is an optimization which simplifies the
preempt_enable() logic:

        if (!--preempt_count && need_resched())
                schedule();
to
        if (!--preempt_count)
                schedule();

i.e. a single conditional instead of two.
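
The folding itself relies on NEED_RESCHED being kept as an inverted high
bit of the preempt count (see PREEMPT_NEED_RESCHED and
arch/x86/include/asm/preempt.h); a simplified sketch:

        /*
         * Simplified layout: bit 31 is PREEMPT_NEED_RESCHED, stored
         * *inverted* (1 == no resched pending), the low bits are the
         * preempt/softirq/hardirq nesting depth. The count therefore
         * reads 0 only when the depth is 0 AND a resched is pending.
         */
        #define PREEMPT_NEED_RESCHED    0x80000000U

        static inline void fold_need_resched(unsigned int *pc)
        {
                *pc &= ~PREEMPT_NEED_RESCHED;   /* cleared == resched needed */
        }

        static inline bool preempt_dec_and_test(unsigned int *pc)
        {
                return --(*pc) == 0;            /* the single conditional above */
        }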

The lazy bit is only evaluated in:

1) The return to user path

2) need_resched()

In neither case is preempt_count involved.

So it does not buy us anything. We might revisit that later, but for
simplicity sake the extra TIF bit is way simpler.

Premature optimization is the enemy of correctness.

>> We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
>> only end up with two variants or even subsume PREEMPT_FULL into that
>> model because that's what is closer to the RT LAZY preempt behaviour,
>> which has two goals:
>>
>> 1) Make low latency guarantees for RT workloads
>>
>> 2) Preserve the throughput for non-RT workloads
>>
>> But in any case this decision happens solely in the core scheduler code
>> and nothing outside of it needs to be changed.
>>
>> So we not only get rid of the cond/might_resched() muck, we also get rid
>> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
>> The only place which still needs that runtime tweaking is the scheduler
>> itself.
>
> True. The dynamic preemption could just become a scheduler tunable.

That's the point.

>> But they support PREEMPT_COUNT, so we might get away with a reduced
>> preemption point coverage:
>>
>>               Ret2user    Ret2kernel    PreemptCnt=0
>>
>> NEED_RESCHED     Y            N              Y
>> LAZY_RESCHED     Y            N              N
>
> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
> configs that don't support preemption, we probably need a fourth
> preemption model, say PREEMPT_UNSAFE.

As discussed they won't really notice the latency issues because the
museum pieces are not used for anything crucial and for UM that's the
least of the correctness worries.

So no, we don't need yet another knob. We keep them chugging along and
if they really want they can adapt to the new world order. :)

Thanks,

tglx

2023-09-21 02:02:43

by Thomas Gleixner

Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>> that is presumably not very widely used, and we'd simplify a lot of
>> issues, and get rid of all these badly defined "cond_preempt()"
>> things.
>
> Hmm. Didn't I promise a year ago that I wouldn't do further large scale
> cleanups and simplifications beyond printk?
>
> Maybe I get away this time with just suggesting it. :)

Maybe not. As I'm inveterately curious, I sat down and figured out what
that might look like.

To some extent I really curse my curiosity as the amount of macro maze,
config options and convoluted mess behind all these preempt mechanisms
is beyond disgusting.

Find below a PoC which implements that scheme. It's not even close to
correct, but it builds, boots and survives lightweight testing.

I did not even try to look into time-slice enforcement, but I really want
to share this for illustration and for others to experiment.

This keeps all the existing mechanisms in place and introduces a new
config knob in the preemption model Kconfig switch: PREEMPT_AUTO

If selected it builds a CONFIG_PREEMPT kernel, which disables the
cond_resched() machinery and switches the fair scheduler class to use
the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to
the preempt NONE model except that cond_resched() is a NOOP and I did
not validate the time-slice enforcement. The latter should be a
no-brainer to figure out and fix if required.

For run-time switching to the FULL preemption model, which always
uses TIF_NEED_RESCHED, you need to enable CONFIG_SCHED_DEBUG and then
you can enable "FULL" via:

echo FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features

and switch back to some sort of "NONE" via

echo NO_FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features

It seems to work as expected for a simple hackbench -l 10000 run:

                       NO_FORCE_NEED_RESCHED   FORCE_NEED_RESCHED
schedule() [1]               3646163                2701641
preemption                     12554                 927856
total                        3658717                3629497

[1] is voluntary schedule() _AND_ schedule() from return to user space. I
did not get around to accounting them separately yet, but for a quick
check this clearly shows that this "works" as advertised.

Of course this needs way more analysis than this quick PoC+check, but
you get the idea.

Contrary to other hot-off-the-press hacks, I'm pretty sure it won't
destroy your hard-disk, but I don't recommend that you deploy it on your
alarm-clock as it might make you miss the bus.

If this concept holds, which I'm pretty convinced of by now, then this
is an opportunity to trade ~3000 lines of unholy hacks for about 100-200
lines of understandable code :)

Thanks,

tglx
---
arch/x86/Kconfig | 1
arch/x86/include/asm/thread_info.h | 2 +
drivers/acpi/processor_idle.c | 2 -
include/linux/entry-common.h | 2 -
include/linux/entry-kvm.h | 2 -
include/linux/sched.h | 18 +++++++++++-----
include/linux/sched/idle.h | 8 +++----
include/linux/thread_info.h | 19 +++++++++++++++++
kernel/Kconfig.preempt | 12 +++++++++-
kernel/entry/common.c | 2 -
kernel/sched/core.c | 41 ++++++++++++++++++++++++-------------
kernel/sched/fair.c | 10 ++++-----
kernel/sched/features.h | 2 +
kernel/sched/idle.c | 3 --
kernel/sched/sched.h | 1
kernel/trace/trace.c | 2 -
16 files changed, 91 insertions(+), 36 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,14 @@ static inline void hrtick_rq_init(struct

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, 1 << nr_bit) & _TIF_POLLING_NRFLAG);
}

/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct tas
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
{
- set_tsk_need_resched(p);
+ set_tsk_thread_flag(p, nr_bit);
return true;
}

@@ -1038,28 +1038,42 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int nr_bit)
{
struct task_struct *curr = rq->curr;
int cpu;

lockdep_assert_rq_held(rq);

- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched_type(curr, nr_bit))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_tsk_thread_flag(curr, nr_bit);
+ if (nr_bit == TIF_NEED_RESCHED)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, nr_bit)) {
+ if (nr_bit == TIF_NEED_RESCHED)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ __resched_curr(rq, sched_feat(FORCE_NEED_RESCHED) ?
+ TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY);
}

void resched_cpu(int cpu)
@@ -1132,7 +1146,7 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id())
return;

- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8886,6 @@ static void __init preempt_dynamic_init(
WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
return preempt_dynamic_mode == preempt_dynamic_##mode; \
} \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)

PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,11 @@ enum syscall_work_bit {

#include <asm/thread_info.h>

+#ifndef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
@@ -185,6 +190,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}

+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ arch_test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#else

static __always_inline bool tif_need_resched(void)
@@ -193,6 +205,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}

+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,6 +70,13 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.

+config PREEMPT_AUTO
+ bool "Automagic preemption mode with runtime tweaking support"
+ depends on HAVE_PREEMPT_AUTO
+ select PREEMPT_BUILD
+ help
+ Add some sensible blurb here
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
@@ -95,7 +105,7 @@ config PREEMPTION

config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@

#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}

-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -2069,13 +2069,21 @@ static inline void set_tsk_need_resched(
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
}

-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}

+static inline bool test_tsk_need_resched_type(struct task_struct *tsk,
+ int nr_bit)
+{
+ return unlikely(test_tsk_thread_flag(tsk, nr_bit));
+}
+
/*
* cond_resched() and cond_resched_lock(): latency reduction via
* explicit rescheduling in places that are safe. The return
@@ -2252,7 +2260,7 @@ static inline int rwlock_needbreak(rwloc

static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(tif_need_resched_lazy() || tif_need_resched());
}

/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -985,7 +985,7 @@ static void update_deadline(struct cfs_r
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
}
@@ -5267,7 +5267,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
* validating it and just reschedule.
*/
if (queued) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
return;
}
/*
@@ -5413,7 +5413,7 @@ static void __account_cfs_rq_runtime(str
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
}

static __always_inline
@@ -5673,7 +5673,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

#ifdef CONFIG_SMP
@@ -8073,7 +8073,7 @@ static void check_preempt_wakeup(struct
return;

preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

#ifdef CONFIG_SMP
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
*/
static void __cpuidle acpi_safe_halt(void)
{
- if (!tif_need_resched()) {
+ if (!need_resched()) {
raw_safe_halt();
raw_local_irq_disable();
}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();

- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}

static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();

- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}

#else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling

static inline bool __must_check current_set_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static inline bool __must_check current_clr_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#endif

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
ct_cpuidle_enter();

raw_local_irq_enable();
- while (!tif_need_resched() &&
- (cpu_idle_force_poll || tick_check_broadcast_expired()))
+ while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();

--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2720,7 +2720,7 @@ unsigned int tracing_gen_ctx_irq_test(un
if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
trace_flags |= TRACE_FLAG_BH_OFF;

- if (tif_need_resched())
+ if (need_resched())
trace_flags |= TRACE_FLAG_NEED_RESCHED;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 6 /* Lazy rescheduling */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -106,6 +107,7 @@ struct thread_info {
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)

2023-09-21 03:56:01

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


Thomas Gleixner <[email protected]> writes:

> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>> Thomas Gleixner <[email protected]> writes:
>>
>>> So the decision matrix would be:
>>>
>>>                   Ret2user   Ret2kernel   PreemptCnt=0
>>>
>>>    NEED_RESCHED      Y           Y             Y
>>>    LAZY_RESCHED      Y           N             N
>>>
>>> That is completely independent of the preemption model and the
>>> differentiation of the preemption models happens solely at the scheduler
>>> level:
>>
>> This is relatively minor, but do we need two flags? Seems to me we
>> can get to the same decision matrix by letting the scheduler fold
>> into the preempt-count based on current preemption model.
>
> You still need the TIF flags because there is no way to do remote
> modification of preempt count.

Yes, agreed. In my version, I was envisaging that the remote cpu always
only sets up TIF_NEED_RESCHED and then we decide which one we want at
the preemption point.

Anyway, I see what you meant in your PoC.

>>> But they support PREEMPT_COUNT, so we might get away with a reduced
>>> preemption point coverage:
>>>
>>>                   Ret2user   Ret2kernel   PreemptCnt=0
>>>
>>>    NEED_RESCHED      Y           N             Y
>>>    LAZY_RESCHED      Y           N             N
>>
>> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
>> configs that don't support preemption, we probably need a fourth
>> preemption model, say PREEMPT_UNSAFE.
>
> As discussed they won't really notice the latency issues because the
> museum pieces are not used for anything crucial and for UM that's the
> least of the correctness worries.
>
> So no, we don't need yet another knob. We keep them chucking along and
> if they really want they can adapt to the new world order. :)

Will they chuckle along, or die trying ;)?

I grepped for "preempt_enable|preempt_disable" for all the archs and
hexagon and m68k don't seem to do any explicit accounting at all.
(Though, neither do nios2 and openrisc, and both csky and microblaze
only do it in the tlbflush path.)

arch/hexagon       0
arch/m68k          0
arch/nios2         0
arch/openrisc      0
arch/csky          3
arch/microblaze    3
arch/um            4
arch/riscv         8
arch/arc          14
arch/parisc       15
arch/arm          16
arch/sparc        16
arch/xtensa       19
arch/sh           21
arch/alpha        23
arch/ia64         27
arch/loongarch    53
arch/arm64        54
arch/s390         91
arch/mips        115
arch/x86         146
arch/powerpc     201

My concern is that, given we preempt on timeslice expiration for all
three preemption models, we could end up preempting at an unsafe
location.

Still, not the most pressing of problems.


Thanks
--
ankur

2023-09-21 04:16:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Wed, Sep 20 2023 at 17:58, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>> So no, we don't need yet another knob. We keep them chucking along and
>> if they really want they can adapt to the new world order. :)
>
> Will they chuckle along, or die trying ;)?

Either way is fine :)

> I grepped for "preempt_enable|preempt_disable" for all the archs and
> hexagon and m68k don't seem to do any explicit accounting at all.
> (Though, neither do nios2 and openrisc, and both csky and microblaze
> only do it in the tlbflush path.)
>
> arch/hexagon       0
> arch/m68k          0
...
> arch/s390         91
> arch/mips        115
> arch/x86         146
> arch/powerpc     201
>
> My concern is that, given we preempt on timeslice expiration for all
> three preemption models, we could end up preempting at an unsafe
> location.

As I said in my reply to Linus, that count is not really conclusive.

arch/m68k has a count of 0 and supports PREEMPT for the COLDFIRE
sub-architecture and I know for sure that at some point in the past
PREEMPT_RT was supported on COLDFIRE with minimal changes to the
architecture code.

That said, I'm pretty sure that quite some of these
preempt_disable/enable pairs in arch/* are subject to voodoo
programming, but that's a different problem to analyze.

> Still, not the most pressing of problems.

Exactly :)

Thanks,

tglx

2023-09-21 05:12:29

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


Thomas Gleixner <[email protected]> writes:

> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
>> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>>> that is presumably not very widely used, and we'd simplify a lot of
>>> issues, and get rid of all these badly defined "cond_preempt()"
>>> things.
>>
>> Hmm. Didn't I promise a year ago that I won't do further large scale
>> cleanups and simplifications beyond printk.
>>
>> Maybe I get away this time with just suggesting it. :)
>
> Maybe not. As I'm inveterately curious, I sat down and figured out what
> that might look like.
>
> To some extent I really curse my curiosity as the amount of macro maze,
> config options and convoluted mess behind all these preempt mechanisms
> is beyond disgusting.
>
> Find below a PoC which implements that scheme. It's not even close to
> correct, but it builds, boots and survives lightweight testing.

Whew, that was electric. I had barely managed to sort through some of
the config maze.
From a quick look this is pretty much how you described it.

> I did not even try to look into time-slice enforcement, but I really want
> to share this for illustration and for others to experiment.
>
> This keeps all the existing mechanisms in place and introduces a new
> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>
> If selected it builds a CONFIG_PREEMPT kernel, which disables the
> cond_resched() machinery and switches the fair scheduler class to use
> the NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
> the preempt NONE model except that cond_resched() is a NOOP and I did
> not validate the time-slice enforcement. The latter should be a
> no-brainer to figure out and fix if required.

Yeah, let me try this out.

Thanks
Ankur

2023-09-21 06:50:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Wed, Sep 20 2023 at 22:51, Thomas Gleixner wrote:
> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>
> The preempt count folding is an optimization which simplifies the
> preempt_enable logic:
>
> if (--preempt_count && need_resched())
> schedule()
> to
> if (--preempt_count)
> schedule()

That should be (!(--preempt_count... in both cases of course :)

2023-09-22 00:59:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Ok, I like this.

That said, this part of it:

On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <[email protected]> wrote:
>
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int nr_bit)
> [...]
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_tsk_thread_flag(curr, nr_bit);
> + if (nr_bit == TIF_NEED_RESCHED)
> + set_preempt_need_resched();

feels really hacky.

I think that instead of passing a random TIF bit around, it should
just pass a "lazy or not" value around.

Then you make the TIF bit be some easily computable thing (eg something like

#define TIF_RESCHED(lazy) (TIF_NEED_RESCHED + (lazy))

or whatever), and write the above conditional as

if (!lazy)
set_preempt_need_resched();

so that it all *does* the same thing, but the code makes it clear
about what the logic is.

Because honestly, without having been part of this thread, I would look at that

if (nr_bit == TIF_NEED_RESCHED)
set_preempt_need_resched();

and I'd be completely lost. It doesn't make conceptual sense, I feel.

So I'd really like the source code to be more directly expressing the
*intent* of the code, not be so centered around the implementation
detail.

Put another way: I think we can make the compiler turn the intent into
the implementation, and I'd rather *not* have us humans have to infer
the intent from the implementation.

That said - I think as a proof of concept and "look, with this we get
the expected scheduling event counts", that patch is perfect. I think
you more than proved the concept.

Linus

2023-09-22 04:14:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Linus!

On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> Ok, I like this.

Thanks!

> That said, this part of it:

> On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <[email protected]> wrote:

> Because honestly, without having been part of this thread, I would look at that
>
> if (nr_bit == TIF_NEED_RESCHED)
> set_preempt_need_resched();
>
> and I'd be completely lost. It doesn't make conceptual sense, I feel.
>
> So I'd really like the source code to be more directly expressing the
> *intent* of the code, not be so centered around the implementation
> detail.
>
> Put another way: I think we can make the compiler turn the intent into
> the implementation, and I'd rather *not* have us humans have to infer
> the intent from the implementation.

No argument about that. I didn't like it either, but at 10PM ...

> That said - I think as a proof of concept and "look, with this we get
> the expected scheduling event counts", that patch is perfect. I think
> you more than proved the concept.

There is certainly quite some analysis work to do to make this a one to
one replacement.

With a handful of benchmarks the PoC (tweaked with some obvious fixes)
is pretty much on par with the current mainline variants (NONE/FULL),
but the memtier benchmark makes a massive dent.

It sports a whopping 10% regression with the LAZY mode versus the mainline
NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.

That benchmark is really sensitive to the preemption model. With current
mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has a ~20%
performance drop versus preempt=NONE.

I have no clue what's going on there yet, but that shows that there is
obviously quite some work ahead to get this sorted.

Though I'm pretty convinced by now that this is the right direction and
well worth the effort which needs to be put into that.

Thanks,

tglx

2023-09-22 21:12:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>> Find below a PoC which implements that scheme. It's not even close to
>> correct, but it builds, boots and survives lightweight testing.
>
> Whew, that was electric. I had barely managed to sort through some of
> the config maze.
> From a quick look this is pretty much how you described it.

Unsurprisingly I spent at least 10x the time describing it than hacking
it up.

IOW, I had done the analysis before I offered the idea and before I
changed a single line of code. The tools I used for that are git-grep,
tags, paper, pencil, accrued knowledge and patience, i.e. nothing even
close to rocket science.

Converting the analysis into code was mostly a matter of brain dumping
the analysis and adherence to accrued methodology.

What's electric about that?

I might be missing some meaning of 'electric' which is not covered by my
mostly Webster-restricted old-school understanding of the English language :)

>> I did not even try to look into time-slice enforcement, but I really want
>> to share this for illustration and for others to experiment.
>>
>> This keeps all the existing mechanisms in place and introduces a new
>> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>>
>> If selected it builds a CONFIG_PREEMPT kernel, which disables the
>> cond_resched() machinery and switches the fair scheduler class to use
>> the NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
>> the preempt NONE model except that cond_resched() is a NOOP and I did
>> not validate the time-slice enforcement. The latter should be a
>> no-brainer to figure out and fix if required.
>
> Yeah, let me try this out.

That's what I hoped for :)

Thanks,

tglx

2023-09-23 09:51:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> That said - I think as a proof of concept and "look, with this we get
>> the expected scheduling event counts", that patch is perfect. I think
>> you more than proved the concept.
>
> There is certainly quite some analysis work to do to make this a one to
> one replacement.
>
> With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> is pretty much on par with the current mainline variants (NONE/FULL),
> but the memtier benchmark makes a massive dent.
>
> It sports a whopping 10% regression with the LAZY mode versus the mainline
> NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>
> That benchmark is really sensitive to the preemption model. With current
> mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has a ~20%
> performance drop versus preempt=NONE.

That 20% was a tired pilot error. The real number is in the 5% ballpark.

> I have no clue what's going on there yet, but that shows that there is
> obviously quite some work ahead to get this sorted.

It took some head scratching to figure that out. The initial fix broke
the handling of the hog issue, i.e. the problem that Ankur tried to
solve, but I hacked up a "solution" for that too.

With that the memtier benchmark is roughly back to the mainline numbers,
but my throughput benchmark know-how is pretty close to zero, so that
should be looked at by people who actually understand these things.

Likewise the hog prevention is just at the PoC level and clearly beyond
my knowledge of scheduler details: It unconditionally forces a
reschedule when the looping task is not responding to a lazy reschedule
request before the next tick. IOW it forces a reschedule on the second
tick, which is obviously different from the cond_resched()/might_sleep()
behaviour.
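
Modeled in user space, that escalation looks roughly like this
(illustrative only; the names and the loop are invented stand-ins for
the scheduler tick and the two TIF bits, not anything in the patch):

	#include <stdbool.h>
	#include <stdio.h>

	static bool lazy_pending;	/* stand-in for TIF_NEED_RESCHED_LAZY */
	static bool resched_forced;	/* stand-in for TIF_NEED_RESCHED      */

	static void scheduler_tick(void)
	{
		if (!lazy_pending)
			lazy_pending = true;	/* first tick: lazy request only */
		else
			resched_forced = true;	/* second tick: the hog ignored
						   it, force the reschedule     */
	}

	int main(void)
	{
		/* A hog never reaches a preemption point, only the tick acts */
		for (int tick = 1; !resched_forced; tick++) {
			scheduler_tick();
			printf("tick %d: lazy=%d forced=%d\n",
			       tick, lazy_pending, resched_forced);
		}
		return 0;
	}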

The changes vs. the original PoC, aside from the bug and thinko fixes:

1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
lazy preempt bit as the trace_entry::flags field is full already.

That obviously breaks the tracer ABI, but if we go there then
this needs to be fixed. Steven?

2) debugfs file to validate that loops can be force preempted w/o
cond_resched()

The usage is:

# taskset -c 1 bash
# echo 1 > /sys/kernel/debug/sched/hog &
# echo 1 > /sys/kernel/debug/sched/hog &
# echo 1 > /sys/kernel/debug/sched/hog &

top shows ~33% CPU for each of the hogs and tracing confirms that
the crude hack in the scheduler tick works:

bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr
bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr
bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr
bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr
bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr

The 'l' instead of the usual 'N' reflects that the lazy resched
bit is set. That makes __update_curr() invoke resched_curr()
instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
and folds it into preempt_count so that preemption happens at the
next possible point, i.e. either in return from interrupt or at
the next preempt_enable().

That's as much as I wanted to demonstrate and I'm not going to spend
more cycles on it as I already have too many other things in flight and
the resulting scheduler woes are clearly outside of my expertise.

Though I'm definitely putting a permanent NAK in place for any attempts
to duct tape the preempt=NONE model any further by sprinkling more
cond*() and whatever warts around.

Thanks,

tglx
---
arch/x86/Kconfig | 1
arch/x86/include/asm/thread_info.h | 6 ++--
drivers/acpi/processor_idle.c | 2 -
include/linux/entry-common.h | 2 -
include/linux/entry-kvm.h | 2 -
include/linux/sched.h | 12 +++++---
include/linux/sched/idle.h | 8 ++---
include/linux/thread_info.h | 24 +++++++++++++++++
include/linux/trace_events.h | 8 ++---
kernel/Kconfig.preempt | 17 +++++++++++-
kernel/entry/common.c | 4 +-
kernel/entry/kvm.c | 2 -
kernel/sched/core.c | 51 +++++++++++++++++++++++++------------
kernel/sched/debug.c | 19 +++++++++++++
kernel/sched/fair.c | 46 ++++++++++++++++++++++-----------
kernel/sched/features.h | 2 +
kernel/sched/idle.c | 3 --
kernel/sched/sched.h | 1
kernel/trace/trace.c | 2 +
kernel/trace/trace_output.c | 16 ++++++++++-
20 files changed, 171 insertions(+), 57 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+
+ return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
}

/*
@@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
for (;;) {
if (!(val & _TIF_POLLING_NRFLAG))
return false;
- if (val & _TIF_NEED_RESCHED)
+ if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
return true;
if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
break;
@@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
{
- set_tsk_need_resched(p);
+ set_tsk_thread_flag(p, tif_bit);
return true;
}

@@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int lazy)
{
+ int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
struct task_struct *curr = rq->curr;
- int cpu;

lockdep_assert_rq_held(rq);

- if (test_tsk_need_resched(curr))
+ if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_tsk_thread_flag(curr, tif_bit);
+ if (!lazy)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, tif_bit)) {
+ if (!lazy)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, 0);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
+ TIF_NEED_RESCHED_LAZY_OFFSET : 0;
+
+ if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
+ return;
+
+ __resched_curr(rq, lazy);
}

void resched_cpu(int cpu)
@@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id())
return;

- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
return preempt_dynamic_mode == preempt_dynamic_##mode; \
} \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)

PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,16 @@ enum syscall_work_bit {

#include <asm/thread_info.h>

+#ifdef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY
+# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY
+# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
+#else
+# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+# define TIF_NEED_RESCHED_LAZY_OFFSET 0
+#endif
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
@@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}

+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ arch_test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#else

static __always_inline bool tif_need_resched(void)
@@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}

+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,13 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

+config PREEMPT_BUILD_AUTO
+ bool
+ select PREEMPT_BUILD
+
+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,9 +74,17 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.

+config PREEMPT_AUTO
+ bool "Automagic preemption mode with runtime tweaking support"
+ depends on HAVE_PREEMPT_AUTO
+ select PREEMPT_BUILD_AUTO
+ help
+ Add some sensible blurb here
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
+ select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
select PREEMPTION
help
This option turns the kernel into a real-time kernel by replacing
@@ -95,7 +110,7 @@ config PREEMPTION

config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@

#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
@@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
rcu_irq_exit_check_preempt();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
WARN_ON_ONCE(!on_thread_stack());
- if (need_resched())
+ if (test_tsk_need_resched(current))
preempt_schedule_irq();
}
}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}

-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
}

-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
@@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc

static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(tif_need_resched_lazy() || tif_need_resched());
}

/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
{
+ struct rq *rq = rq_of(cfs_rq);
+
if ((s64)(se->vruntime - se->deadline) < 0)
return;

@@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
/*
* The task has consumed its request, reschedule.
*/
- if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
- clear_buddies(cfs_rq, se);
+ if (cfs_rq->nr_running < 2)
+ return;
+
+ if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
+ resched_curr(rq);
+ } else {
+ /* Did the task ignore the lazy reschedule request? */
+ if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+ resched_curr(rq);
+ else
+ resched_curr_lazy(rq);
}
+ clear_buddies(cfs_rq, se);
}

#include "pelt.h"
@@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
@@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
schedstat_add(cfs_rq->exec_clock, delta_exec);

curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ update_deadline(cfs_rq, curr, tick);
update_min_vruntime(cfs_rq);

if (entity_is_task(curr)) {
@@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
account_cfs_rq_runtime(cfs_rq, delta_exec);
}

+static inline void update_curr(struct cfs_rq *cfs_rq)
+{
+ __update_curr(cfs_rq, false);
+}
+
static void update_curr_fair(struct rq *rq)
{
update_curr(cfs_rq_of(&rq->curr->se));
@@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ __update_curr(cfs_rq, true);

/*
* Ensure that runnable average is periodically updated.
@@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
* validating it and just reschedule.
*/
if (queued) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
return;
}
/*
@@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
}

static __always_inline
@@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

#ifdef CONFIG_SMP
@@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq

if (delta < 0) {
if (task_current(rq, p))
- resched_curr(rq);
+ resched_curr_lazy(rq);
return;
}
hrtick_start(rq, delta);
@@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (need_resched())
return;

/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
return;

preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

#ifdef CONFIG_SMP
@@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
*/
if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
__entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

/*
@@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
*/
if (task_current(rq, p)) {
if (p->prio > oldprio)
- resched_curr(rq);
+ resched_curr_lazy(rq);
} else
check_preempt_curr(rq, p, 0);
}
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
*/
static void __cpuidle acpi_safe_halt(void)
{
- if (!tif_need_resched()) {
+ if (!need_resched()) {
raw_safe_halt();
raw_local_irq_disable();
}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();

- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}

static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();

- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}

#else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling

static inline bool __must_check current_set_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static inline bool __must_check current_clr_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#endif

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
ct_cpuidle_enter();

raw_local_irq_enable();
- while (!tif_need_resched() &&
- (cpu_idle_force_poll || tick_check_broadcast_expired()))
+ while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();

--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un

if (tif_need_resched())
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (tif_need_resched_lazy())
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,8 +81,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -104,6 +105,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
return -EINTR;
}

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un

enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
- TRACE_FLAG_NEED_RESCHED = 0x04,
+ TRACE_FLAG_NEED_RESCHED = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
TRACE_FLAG_PREEMPT_RESCHED = 0x20,
@@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c

static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif

--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';

- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,6 +333,23 @@ static const struct file_operations sche
.release = seq_release,
};

+static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ unsigned long end = jiffies + 60 * HZ;
+
+ for (; time_before(jiffies, end) && !signal_pending(current);)
+ cpu_relax();
+
+ return cnt;
+}
+
+static const struct file_operations sched_hog_fops = {
+ .write = sched_hog_write,
+ .open = simple_open,
+ .llseek = default_llseek,
+};
+
static struct dentry *debugfs_sched;

static __init int sched_init_debug(void)
@@ -374,6 +391,8 @@ static __init int sched_init_debug(void)

debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+ debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
+
return 0;
}
late_initcall(sched_init_debug);

2023-09-25 18:10:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Then the question becomes whether we'd want to introduce a *new*
>> concept, which is a "if you are going to schedule, do it now rather
>> than later, because I'm taking a lock, and while it's a preemptible
>> lock, I'd rather not sleep while holding this resource".
>>
>> I suspect we want to avoid that for now, on the assumption that it's
>> hopefully not a problem in practice (the recently addressed problem
>> with might_sleep() was that it actively *moved* the scheduling point
>> to a bad place, not that scheduling could happen there, so instead of
>> optimizing scheduling, it actively pessimized it). But I thought I'd
>> mention it.
>
> I think we want to avoid that completely and if this becomes an issue,
> we rather be smart about it at the core level.
>
> It's trivial enough to have a per task counter which tells whether a
> preemtible lock is held (or about to be acquired) or not. Then the
> scheduler can take that hint into account and decide to grant a
> timeslice extension once in the expectation that the task leaves the
> lock held section soonish and either returns to user space or schedules
> out. It still can enforce it later on.
>
> We really want to let the scheduler decide and rather give it proper
> hints at the conceptual level instead of letting developers make random
> decisions which might work well for a particular use case and completely
> suck for the rest. I think we wasted enough time already on those.

Finally I realized why cond_resched() et al. are so disgusting. They are
scope-less and just random spots which someone decided were good places
to reschedule.

But in fact the really relevant measure is scope. Full preemption is
scope based:

preempt_disable();
do_stuff();
preempt_enable();

which also nests properly:

preempt_disable();
do_stuff();
        preempt_disable();
        do_other_stuff();
        preempt_enable();
preempt_enable();

cond_resched() cannot nest and is obviously scope-less.

The TIF_ALLOW_RESCHED mechanism, which sparked this discussion, only
pretends to be scoped.

As Peter pointed out, it does not properly nest with other mechanisms and
it cannot even nest within itself because it is boolean.

The worst thing about it is that it is semantically the reverse of the
established model of preempt_disable()/enable(),
i.e. allow_resched()/disallow_resched().

So instead of giving the scheduler a hint about 'this might be a good
place to preempt', providing proper scope would make way more sense:

preempt_lazy_disable();
do_stuff();
preempt_lazy_enable();

That would be the obvious and semantically consistent counterpart to the
existing preemption control primitives with proper nesting support.

might_sleep(), which is in all the lock acquire functions or your
variant of hint (resched better now before I take the lock) are the
wrong place.

hint();
lock();
do_stuff();
unlock();

hint() might schedule, and when the task comes back it might schedule
immediately again because the lock is contended. hint() again has no
scope and might be meaningless or even counterproductive if called in a
deeper callchain.

Proper scope based hints avoid that.

preempt_lazy_disable();
lock();
do_stuff();
unlock();
preempt_lazy_enable();

That's way better because it describes the scope and the task will
either schedule out in lock() on contention or provide a sensible lazy
preemption point in preempt_lazy_enable(). It also nests properly:

preempt_lazy_disable();
lock(A);
do_stuff();
        preempt_lazy_disable();
        lock(B);
        do_other_stuff();
        unlock(B);
        preempt_lazy_enable();
unlock(A);
preempt_lazy_enable();

So in this case it does not matter whether do_stuff() is invoked from a
lock-held section or not. The scope which defines the throughput
relevant hint to the scheduler is correct in any case.

Contrary to preempt_disable(), the lazy variant neither prevents
scheduling nor preemption, but it's an understandable, properly nestable
mechanism.
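
A minimal user-space sketch of those nesting semantics, with a per-task
counter instead of a boolean flag (purely illustrative, none of this
exists anywhere; the lazy bit and schedule() are stubbed so it compiles
stand-alone):

	#include <stdbool.h>
	#include <stdio.h>

	static int  lazy_nest;			 /* would live in task_struct  */
	static bool lazy_resched_pending = true; /* pretend the lazy bit is set */

	static void my_schedule(void)
	{
		printf("schedule()\n");
		lazy_resched_pending = false;
	}

	static void preempt_lazy_disable(void)
	{
		lazy_nest++;			/* nests like preempt_disable() */
	}

	static void preempt_lazy_enable(void)
	{
		/* Only the outermost enable is a lazy preemption point */
		if (!--lazy_nest && lazy_resched_pending)
			my_schedule();
	}

	int main(void)
	{
		preempt_lazy_disable();		/* around lock(A)                  */
		preempt_lazy_disable();		/* around lock(B), nested          */
		preempt_lazy_enable();		/* still nested: no schedule()     */
		preempt_lazy_enable();		/* outermost: schedule() runs here */
		return 0;
	}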

I seriously hope to avoid it altogether :)

Thanks,

tglx