On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
> >
> >> I was hoping that we'd have some generic way to deal with this where
> >> we could just say "this thing is reschedulable", and get rid of - or
> >> at least not increasingly add to - the cond_resched() mess.
> >
> > Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> > required to know when/if we can preempt.
> >
> > The whole voluntary preempt model is basically the traditional
> > co-operative preemption model and that fully relies on manual yields.
>
> Yeah, but as Linus says, this means a lot of code is just full of
> cond_resched(). For instance a loop the process_huge_page() uses
> this pattern:
>
> for (...) {
> cond_resched();
> clear_page(i);
>
> cond_resched();
> clear_page(j);
> }
Yeah, that's what co-operative preemption gets you.
> > The problem with the REP prefix (and Xen hypercalls) is that
> > they're long running instructions and it becomes fundamentally
> > impossible to put a cond_resched() in.
> >
> >> Yes. I'm starting to think that the only sane solution is to
> >> limit cases that can do this a lot, and the "instruction pointer
> >> region" approach would certainly work.
> >
> > From a code locality / I-cache POV, I think a sorted list of
> > (non overlapping) ranges might be best.
>
> Yeah, agreed. There are a few problems with doing that though.
>
> I was thinking of using a check of this kind to schedule out when
> it is executing in this "reschedulable" section:
> !preempt_count() && in_resched_function(regs->rip);
>
> For preemption=full, this should mostly work.
> For preemption=voluntary, though this'll only work with out-of-line
> locks, not if the lock is inlined.
>
> (Both, should have problems with __this_cpu_* and the like, but
> maybe we can handwave that away with sparse/objtool etc.)
So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
thing, and then only search the range when TIF flag is set.
And I'm thinking it might be a good idea to have objtool validate the
range only contains simple instructions, the moment it contains control
flow I'm thinking it's too complicated.
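A minimal sketch of how that combined check could look, assuming a sorted,
non-overlapping range table; the table, in_resched_range() and
irq_resched_allowed() are illustrative names, only TIF_ALLOW_RESCHED comes
from the discussion:

    /* Hypothetical: sorted, non-overlapping kernel text ranges in which an
     * interrupt is allowed to reschedule the interrupted task. */
    struct resched_range {
            unsigned long start;
            unsigned long end;
    };

    static struct resched_range resched_ranges[16];     /* sorted by start */
    static unsigned int nr_resched_ranges;

    static bool in_resched_range(unsigned long ip)
    {
            unsigned int lo = 0, hi = nr_resched_ranges;

            while (lo < hi) {                            /* binary search */
                    unsigned int mid = (lo + hi) / 2;

                    if (ip < resched_ranges[mid].start)
                            hi = mid;
                    else if (ip >= resched_ranges[mid].end)
                            lo = mid + 1;
                    else
                            return true;
            }
            return false;
    }

    /* Interrupt return: consult the table only when the task opted in. */
    static bool irq_resched_allowed(struct pt_regs *regs)
    {
            return test_thread_flag(TIF_ALLOW_RESCHED) &&
                   !preempt_count() &&
                   in_resched_range(instruction_pointer(regs));
    }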
> How expensive would be always having PREEMPT_COUNT=y?
Effectively I think that is true today. At the very least Debian and
SuSE (I can't find a RHEL .config in a hurry but I would think they too)
ship with PREEMPT_DYNAMIC=y.
Mel, I'm sure you ran numbers at the time (you always do), what if any
was the measured overhead from PREEMPT_DYNAMIC vs 'regular' voluntary
preemption?
On Tue, Sep 12, 2023 at 10:26:06AM +0200, Peter Zijlstra wrote:
> > How expensive would be always having PREEMPT_COUNT=y?
>
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.
$ grep PREEMPT uek-rpm/ol9/config-x86_64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
$ grep PREEMPT uek-rpm/ol9/config-aarch64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>> > The problem with the REP prefix (and Xen hypercalls) is that
>> > they're long running instructions and it becomes fundamentally
>> > impossible to put a cond_resched() in.
>> >
>> >> Yes. I'm starting to think that the only sane solution is to
>> >> limit cases that can do this a lot, and the "instruction pointer
>> >> region" approach would certainly work.
>> >
>> > From a code locality / I-cache POV, I think a sorted list of
>> > (non overlapping) ranges might be best.
>>
>> Yeah, agreed. There are a few problems with doing that though.
>>
>> I was thinking of using a check of this kind to schedule out when
>> it is executing in this "reschedulable" section:
>> !preempt_count() && in_resched_function(regs->rip);
>>
>> For preemption=full, this should mostly work.
>> For preemption=voluntary, though this'll only work with out-of-line
>> locks, not if the lock is inlined.
>>
>> (Both, should have problems with __this_cpu_* and the like, but
>> maybe we can handwave that away with sparse/objtool etc.)
>
> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
> thing, and then only search the range when TIF flag is set.
>
> And I'm thinking it might be a good idea to have objtool validate the
> range only contains simple instructions, the moment it contains control
> flow I'm thinking it's too complicated.
Can we take a step back and look at the problem from a scheduling
perspective?
The basic operation of a non-preemptible kernel is time slice
scheduling, which means that a task can run more or less undisturbed for
a full time slice once it gets on the CPU unless it schedules away
voluntarily via a blocking operation.
This works pretty well as long as everything runs in userspace as the
preemption points in the return to user space path are independent of
the preemption model.
These preemption points handle both time slice exhaustion and priority
based preemption.
With PREEMPT=NONE these are the only available preemption points.
That means that kernel code can run more or less indefinitely until it
schedules out or returns to user space, which is obviously not possible
for kernel threads.
To prevent starvation the kernel gained voluntary preemption points,
i.e. cond_resched(), which has to be added manually to code as a
developer sees fit.
Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
additional preemption points. might_resched() utilizes the existing
might_sleep() debug points, which are in code paths which might block on
a contended resource. These debug points are mostly in core and
infrastructure code and are in code paths which can block anyway. The
only difference is that they allow preemption even when the resource is
uncontended.
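For reference, the wiring looks roughly like this (simplified; the real
macros take extra arguments and grew a PREEMPT_DYNAMIC variant):

    #ifdef CONFIG_PREEMPT_VOLUNTARY
    # define might_resched()        _cond_resched()         /* preemption point */
    #else
    # define might_resched()        do { } while (0)        /* no-op */
    #endif

    /* might_sleep(): "may block" debug check plus, for VOLUNTARY, a
     * voluntary preemption point even when the resource is uncontended. */
    #define might_sleep()                                   \
            do {                                            \
                    __might_sleep(__FILE__, __LINE__);      \
                    might_resched();                        \
            } while (0)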
Additionally we have PREEMPT=FULL which utilizes every zero transition
of preempt_count as a potential preemption point.
Now we have the situation of long running data copies or data clear
operations which run fully in hardware, but can be interrupted. As the
interrupt return to kernel mode does not preempt in the NONE and
VOLUNTARY cases, new workarounds emerged, mostly by defining a data
chunk size and adding cond_resched() again.
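The workaround pattern is roughly the following; purely illustrative, the
chunk size and function name are not taken from any particular driver:

    static void clear_region_chunked(void *addr, size_t len)
    {
            while (len) {
                    size_t chunk = min_t(size_t, len, SZ_64K);  /* arbitrary chunk */

                    memset(addr, 0, chunk);
                    addr += chunk;
                    len -= chunk;
                    cond_resched();         /* manual preemption point per chunk */
            }
    }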
That's ugly and does not work for long lasting hardware operations so we
ended up with the suggestion of TIF_ALLOW_RESCHED to work around
that. But again this needs to be manually annotated in the same way as an
IP range based preemption scheme requires annotation.
TBH. I detest all of this.
Both cond_resched() and might_sleep/sched() are completely random
mechanisms as seen from time slice operation and the data chunk based
mechanism is just heuristics which works as well as heuristics tend to
work. allow_resched() is not any different and IP based preemption
mechanisms are not going to be any better.
The approach here is: Prevent the scheduler from making decisions and then
mitigate the fallout with heuristics.
That's just backwards as it moves resource control out of the scheduler
into random code which has absolutely no business doing resource
control.
We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
held sections became preemptible caused even more preemption activity
than on a PREEMPT=FULL kernel. The worst side effect of that was
extensive lock contention.
The way we addressed that was to add a lazy preemption mode, which
tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
preempt tasks which all belong to the SCHED_OTHER scheduling class. This
works pretty well and gains back a massive amount of performance for the
non-realtime throughput oriented tasks without affecting the
schedulability of real-time tasks at all. IOW, it does not take control
away from the scheduler. It cooperates with the scheduler and leaves the
ultimate decisions to it.
I think we can do something similar for the problem at hand, which
avoids most of these heuristic horrors and control boundary violations.
The main issue is that long running operations do not honour the time
slice and we work around that with cond_resched() and now have ideas
with this new TIF bit and IP ranges.
None of that is really well defined with respect to time slices. In fact
it's not defined at all versus any aspect of scheduling behaviour.
What about the following:
1) Keep preemption count and the real preemption points enabled
unconditionally. That's not more overhead than the current
PREEMPT_DYNAMIC mechanism as long as the preemption count does not
go to zero, i.e. the folded NEED_RESCHED bit stays set.
From earlier experiments I know that the overhead of preempt_count
is minimal and only really observable with micro benchmarks.
Otherwise it ends up in the noise as long as the slow path is not
taken.
I did a quick check comparing a plain inc/dec pair vs. the
PREEMPT_DYNAMIC inc/dec_and_test+NOOP mechanism and the delta is
in the non-conclusive noise.
20 years ago this was a real issue because we did not have:
- the folding of NEED_RESCHED into the preempt count
- the cacheline optimizations which make the preempt count cache
pretty much always cache hot
- the hardware was way less capable
I'm not saying that preempt_count is completely free today as it
obviously adds more text and affects branch predictors, but as the
major distros ship with PREEMPT_DYNAMIC enabled it is obviously an
acceptable and tolerable tradeoff.
2) When the scheduler wants to set NEED_RESCHED, it sets
NEED_RESCHED_LAZY instead, which is only evaluated in the return to
user space preemption points.
As NEED_RESCHED_LAZY is not folded into the preemption count the
preemption count won't become zero, so the task can continue until
it hits return to user space.
That preserves the existing behaviour.
3) When the scheduler tick observes that the time slice is exhausted,
then it folds the NEED_RESCHED bit into the preempt count which
causes the real preemption points to actually preempt including
the return from interrupt to kernel path.
That even allows the scheduler to enforce preemption for e.g. RT
class tasks without changing anything else.
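A minimal sketch of what 3) amounts to on the tick side;
time_slice_exhausted() is an invented placeholder, the two setters are the
existing primitives:

    static void tick_enforce_time_slice(struct rq *rq, struct task_struct *curr)
    {
            if (!time_slice_exhausted(rq, curr))    /* placeholder check */
                    return;

            /* Upgrade the lazy request: fold NEED_RESCHED into the preempt
             * count so the next preempt_enable() or return from interrupt
             * to kernel mode really preempts. */
            set_tsk_need_resched(curr);
            set_preempt_need_resched();
    }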
I'm pretty sure that this gets rid of cond_resched(), which is an
impressive list of instances:
./drivers 392
./fs 318
./mm 189
./kernel 184
./arch 95
./net 83
./include 46
./lib 36
./crypto 16
./sound 16
./block 11
./io_uring 13
./security 11
./ipc 3
That list clearly documents that the majority of these
cond_resched() invocations is in code which neither should care
nor should have any influence on the core scheduling decision
machinery.
I think it's worth a try as it just fits into the existing preemption
scheme, solves the issue of long running kernel functions, prevents
invalid preemption and can utilize the existing instrumentation and
debug infrastructure.
Most importantly it gives control back to the scheduler and does not
make it depend on the mercy of cond_resched(), allow_resched() or
whatever heuristics sprinkled all over the kernel.
To me this makes a lot of sense, but I might be on the completely wrong
track. So feel free to tell me that I'm completely nuts and/or just not
seeing the obvious.
Thanks,
tglx
On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
> unconditionally.
Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.
> 20 years ago this was a real issue because we did not have:
>
> - the folding of NEED_RESCHED into the preempt count
>
> - the cacheline optimizations which make the preempt count cache
> pretty much always cache hot
>
> - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.
Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.
> > 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.
Is this just to try to emulate the existing PREEMPT_NONE behavior?
If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.
And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.
But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.
That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.
Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").
However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).
So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in other points.
But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.
So what I think you are saying is that we'd have the following situation:
- scheduling at "return to user space" is presumably always a good thing.
A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
whatever) would cover that, and would give us basically the existing
CONFIG_PREEMPT_NONE behavior.
So a config variable (either compile-time with PREEMPT_NONE or a
dynamic one with DYNAMIC_PREEMPT set to none) would make any external
wakeup only set that bit.
And then a "fully preemptible low-latency desktop" would set the
preempt-count bit too.
- but the "timeslice over" case would always set the
preempt-count-bit, regardless of any config, and would guarantee that
we have reasonable latencies.
This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.
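On the wakeup side that split could look roughly like this;
resched_curr_lazy() is the hypothetical new helper (compare the PoC patch
further down), the rest exists today:

    static void resched_curr_on_wakeup(struct rq *rq)
    {
            if (preempt_model_full())
                    resched_curr(rq);       /* TIF_NEED_RESCHED, folded, preempt ASAP */
            else
                    resched_curr_lazy(rq);  /* only honoured at return to user space */
    }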
Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".
I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it). But I thought I'd
mention it.
Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.
Linus
* Linus Torvalds <[email protected]> wrote:
> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
> >
> > What about the following:
> >
> > 1) Keep preemption count and the real preemption points enabled
> > unconditionally.
>
> Well, it's certainly the simplest solution, and gets rid of not just
> the 'rep string' issue, but gets rid of all the cond_resched() hackery
> entirely.
>
> > 20 years ago this was a real issue because we did not have:
> >
> > - the folding of NEED_RESCHED into the preempt count
> >
> > - the cacheline optimizations which make the preempt count cache
> > pretty much always cache hot
> >
> > - the hardware was way less capable
> >
> > I'm not saying that preempt_count is completely free today as it
> > obviously adds more text and affects branch predictors, but as the
> > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> > acceptable and tolerable tradeoff.
>
> Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most
> distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY
> approach isn't actually used, and is only causing pain.
The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server
distros - and that's the behavior that enterprise distros truly cared
about.
Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise'
category for all major distros I'd say.
And that's what Thomas's proposal achieves: keep the nicely execution-batched
NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the
latency advantages of fully-preemptible kernel code for RT and critical
tasks.
So I'm fully on board with this. It would reduce the number of preemption
variants to just two: regular kernel and PREEMPT_RT. Yummie!
> > 2) When the scheduler wants to set NEED_RESCHED, it sets
> > NEED_RESCHED_LAZY instead which is only evaluated in the return to
> > user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?
Yes: I'd guesstimate that the batching caused by timeslice-laziness that is
naturally part of NONE/VOLUNTARY resolves ~90%+ of observable
macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT.
> If the new world order is that the time slice is always honored, then the
> "this might be a latency issue" goes away. Good.
>
> And we'd also get better coverage for the *debug* aim of "might_sleep()"
> and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always
> existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that might_sleep()
> would *only* be that DEBUG_ATOMIC_SLEEP thing.
Correct. And that's even a minor code generation advantage, as we wouldn't
have these additional hundreds of random/statistical preemption checks.
> That argues for your suggestion too, since we had a performance issue due
> to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").
4542057e18ca is arguably kind of a workaround though - and with the
preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency
advantages *and* the execution-batching performance advantages of
NONE/VOLUNTARY that 4542057e18ca exposed.
> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the preempt
> count is zero), but there are also some places that are *better* than
> other places to schedule (for example, when we don't hold any other
> locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where preemption
> might be *better* than in others points.
So in the broadest sense we have 3 stages of pending preemption:
NEED_RESCHED_LAZY
NEED_RESCHED_SOON
NEED_RESCHED_NOW
And we'd transition:
- from 0 -> SOON when an eligible task is woken up,
- from LAZY -> SOON when current timeslice is exhausted,
- from SOON -> NOW when no locks/resources are held.
[ With a fast-track for RT or other urgent tasks to enter NOW immediately. ]
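Purely illustrative, none of these flags exist today; the names and
transitions just mirror the list above:

    /* Transitions, as listed above:
     *   0    -> SOON  when an eligible task is woken up
     *   LAZY -> SOON  when the current time slice is exhausted
     *   SOON -> NOW   when no locks/resources are held
     * (RT and other urgent tasks may jump straight to NOW.) */
    enum resched_stage {
            RESCHED_NONE,
            RESCHED_LAZY,
            RESCHED_SOON,
            RESCHED_NOW,
    };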
On the regular kernels it's probably not worth modeling the SOON/NOW split,
as we'd have to track the depth of sleeping locks as well, which we don't
do right now.
On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we
are aware of locking depth already and it would be relatively cheap to
check for it on natural 0-preempt_count boundaries.
> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from a
> different background.
Correct, they come from two sources:
- There are hundreds of points that we know are 'technically correct'
preemption points, and they break up ~90% of long latencies by brute
force & chance.
- Explicitly identified problem points that added a cond_resched() or its
equivalent. These are rare and also tend to bitrot, because *removing*
them is always more risky than adding them, so they tend to accumulate.
> So what I think what you are saying is that we'd have the following
> situation:
>
> - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.
I'd even argue that we only need two preemption modes, and that 'fully
preemptible low-latency desktop' is an artifact of poor latencies on
PREEMPT_NONE.
Ie. in the long run - after a careful period of observing performance
regressions and other dragons - we'd only have *two* preemption modes left:
!PREEMPT_RT # regular kernel. Single default behavior.
PREEMPT_RT=y # -rt kernel, because rockets, satellites & cars matter.
Any other application level preemption preferences can be expressed via
scheduling policies & priorities.
Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or
PREEMPT_NONE in any of their variants, probably not even as runtime knobs.
People who want shorter timeslices can set shorter timeslices, and people
who want immediate preemption of certain tasks can manage priorities.
> - but the "timeslice over" case would always set the preempt-count-bit,
> regardless of any config, and would guarantee that we have reasonable
> latencies.
Yes. Probably a higher nice-priority task becoming runnable would cause
immediate preemption too, in addition to RT tasks.
Ie. the execution batching would be for same-priority groups of SCHED_OTHER
tasks.
> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.
Yep.
> Then the question becomes whether we'd want to introduce a *new* concept,
> which is a "if you are going to schedule, do it now rather than later,
> because I'm taking a lock, and while it's a preemptible lock, I'd rather
> not sleep while holding this resource".
Something close to this concept is naturally available on PREEMPT_RT
kernels, which only use a single central lock primitive (rt_mutex), but it
would have to be added explicitly for regular kernels.
We could do the following intermediate step:
- Remove all the random cond_resched() points such as might_sleep()
- Turn all explicit cond_resched() points into 'ideal point to reschedule'.
- Maybe even rename it from cond_resched() to resched_point(), to signal
the somewhat different role.
While cond_resched() and resched_point() are not 100% matches, they are
close enough, as most existing cond_resched() points were added to places
that cause the least amount of disruption with held resources.
But I think it would be better to add resched_point() as a new API, and add
it to places where there's a performance benefit. Clean slate,
documentation, and all that.
> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem with
> might_sleep() was that it actively *moved* the scheduling point to a bad
> place, not that scheduling could happen there, so instead of optimizing
> scheduling, it actively pessimized it). But I thought I'd mention it.
>
> Anyway, I'm definitely not opposed. We'd get rid of a config option that
> is presumably not very widely used, and we'd simplify a lot of issues,
> and get rid of all these badly defined "cond_preempt()" things.
I think we can get rid of *all* the preemption model Kconfig knobs, except
PREEMPT_RT. :-)
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most
> > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY
> > approach isn't actually used, and is only causing pain.
>
> The macro-behavior of NONE/VOLUNTARY is still used & relied upon in
> server distros - and that's the behavior that enterprise distros truly
> cared about.
>
> Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the
> 'noise' category for all major distros I'd say.
>
> And that's what Thomas's proposal achieves: keep the nicely
> execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER
> tasks, while having the latency advantages of fully-preemptible kernel
> code for RT and critical tasks.
>
> So I'm fully on board with this. It would reduce the number of preemption
> variants to just two: regular kernel and PREEMPT_RT. Yummie!
As an additional side note: with various changes such as EEVDF the
scheduler is a lot less preemption-happy these days, without wrecking
latencies & timeslice distribution.
So in principle we might not even need the NEED_RESCHED_LAZY extra bit,
which -rt uses as a kind of additional layer to make sure they don't change
scheduling policy.
Ie. a modern scheduler might have mooted much of this change:
4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")
... because now we'll only reschedule on timeslice exhaustion, or if a task
comes in with a big deadline deficit.
And even the deadline-deficit wakeup preemption can be turned off further
with:
$ echo NO_WAKEUP_PREEMPTION > /debug/sched/features
And we are considering making that the default behavior for same-prio tasks
- basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which
should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.
Thanks,
Ingo
* Thomas Gleixner <[email protected]> wrote:
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
Just to complete this nice new entry to Documentation/sched/: in
PREEMPT=FULL there's also IRQ-return driven preemption of kernel-mode code,
at almost any instruction boundary the hardware allows, in addition to the
preemption driven by regular zero transition of preempt_count in
syscall/kthread code.
Thanks,
Ingo
Ingo!
On Tue, Sep 19 2023 at 10:03, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>> Then the question becomes whether we'd want to introduce a *new* concept,
>> which is a "if you are going to schedule, do it now rather than later,
>> because I'm taking a lock, and while it's a preemptible lock, I'd rather
>> not sleep while holding this resource".
>
> Something close to this concept is naturally available on PREEMPT_RT
> kernels, which only use a single central lock primitive (rt_mutex), but it
> would have be added explicitly for regular kernels.
>
> We could do the following intermediate step:
>
> - Remove all the random cond_resched() points such as might_sleep()
> - Turn all explicit cond_resched() points into 'ideal point to reschedule'.
>
> - Maybe even rename it from cond_resched() to resched_point(), to signal
> the somewhat different role.
>
> While cond_resched() and resched_point() are not 100% matches, they are
> close enough, as most existing cond_resched() points were added to places
> that cause the least amount of disruption with held resources.
>
> But I think it would be better to add resched_point() as a new API, and add
> it to places where there's a performance benefit. Clean slate,
> documentation, and all that.
Let's not go there. You just replace one magic mushroom with a different
flavour. We want to get rid of them completely.
The whole point is to let the scheduler decide and give it enough
information to make informed decisions.
So with the LAZY scheme in effect, there is no real reason to have these
extra points and I'd rather add task::sleepable_locks_held and do that
accounting in the relevant lock/unlock paths. Based on that the
scheduler can decide whether it grants a time slice expansion or just
says no.
That's extremely cheap and well defined.
You can document the hell out of resched_point(), but it won't be any
different from the existing ones and always subject to personal
preference and goals and it's going to be sprinkled all over the place
just like the existing ones. So where is the gain?
Thanks,
tglx
Linus!
On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <[email protected]> wrote:
>> 2) When the scheduler wants to set NEED_RESCHED, it sets
>> NEED_RESCHED_LAZY instead which is only evaluated in the return to
>> user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?
To some extent yes.
> If the new world order is that the time slice is always honored, then
> the "this might be a latency issue" goes away. Good.
That's the point.
> And we'd also get better coverage for the *debug* aim of
> "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
> PREEMPT_COUNT always existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that
> might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.
True. And this gives the scheduler the flexibility to enforce preemption
under certain conditions, e.g. when a task with RT scheduling class or a
task with a sporadic event handler is woken up. That's what VOLUNTARY
tries to achieve with all the might_sleep()/might_resched() magic.
> That argues for your suggestion too, since we had a performance issue
> due to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").
Awesome.
> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the
> preempt count is zero), but there are also some places that are
> *better* than other places to schedule (for example, when we don't
> hold any other locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where
> preemption might be *better* than in others points.
>
> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from
> a different background.
They are subject to subsystem/driver specific preferences and therefore
biased towards a certain usage scenario, which is not necessarily to the
benefit of everyone else.
> So what I think what you are saying is that we'd have the following situation:
>
> - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.
Correct.
> - but the "timeslice over" case would always set the
> preempt-count-bit, regardless of any config, and would guarantee that
> we have reasonable latencies.
Yes. That's the reasoning.
> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.
:)
So the decision matrix would be:
                 Ret2user   Ret2kernel   PreemptCnt=0
   NEED_RESCHED      Y           Y             Y
   LAZY_RESCHED      Y           N             N
That is completely independent of the preemption model and the
differentiation of the preemption models happens solely at the scheduler
level:
PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
slice where it sets NEED_RESCHED.
PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
tasks or sporadic event tasks sets NEED_RESCHED too.
PREEMPT_FULL always sets NEED_RESCHED like today.
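In code terms the matrix boils down to something like the two helpers
below; a simplified sketch, where _TIF_NEED_RESCHED_LAZY is the new bit:

    /* Ret2user column: both bits lead to schedule() in the exit-to-user
     * work loop. */
    static bool ret2user_wants_resched(unsigned long ti_work)
    {
            return ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
    }

    /* Ret2kernel and PreemptCnt=0 columns: only NEED_RESCHED counts, and as
     * it is folded into the preempt count a single test suffices; the LAZY
     * bit is deliberately ignored in these paths. */
    static bool kernel_preempt_wanted(void)
    {
            return !preempt_count() && need_resched();
    }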
We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
only end up with two variants or even subsume PREEMPT_FULL into that
model because that's what is closer to the RT LAZY preempt behaviour,
which has two goals:
1) Make low latency guarantees for RT workloads
2) Preserve the throughput for non-RT workloads
But in any case this decision happens solely in the core scheduler code
and nothing outside of it needs to be changed.
So we not only get rid of the cond/might_resched() muck, we also get rid
of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
The only place which still needs that runtime tweaking is the scheduler
itself.
Though it just occurred to me that there are dragons lurking:
arch/alpha/Kconfig: select ARCH_NO_PREEMPT
arch/hexagon/Kconfig: select ARCH_NO_PREEMPT
arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE
arch/um/Kconfig: select ARCH_NO_PREEMPT
So we have four architectures which refuse to enable preemption points,
i.e. the only model they allow is NONE and they rely on cond_resched()
for breaking large computations.
But they support PREEMPT_COUNT, so we might get away with a reduced
preemption point coverage:
                 Ret2user   Ret2kernel   PreemptCnt=0
   NEED_RESCHED      Y           N             Y
   LAZY_RESCHED      Y           N             N
i.e. the only difference is that Ret2kernel is not a preemption
point. That's where the scheduler tick enforcement of the time slice
happens.
It still might work out good enough and if not then it should not be
real rocket science to add that Ret2kernel preemption point to cure it.
> Then the question becomes whether we'd want to introduce a *new*
> concept, which is a "if you are going to schedule, do it now rather
> than later, because I'm taking a lock, and while it's a preemptible
> lock, I'd rather not sleep while holding this resource".
>
> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem
> with might_sleep() was that it actively *moved* the scheduling point
> to a bad place, not that scheduling could happen there, so instead of
> optimizing scheduling, it actively pessimized it). But I thought I'd
> mention it.
I think we want to avoid that completely and if this becomes an issue,
we'd rather be smart about it at the core level.
It's trivial enough to have a per task counter which tells whether a
preemptible lock is held (or about to be acquired) or not. Then the
scheduler can take that hint into account and decide to grant a
timeslice extension once in the expectation that the task leaves the
lock held section soonish and either returns to user space or schedules
out. It still can enforce it later on.
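A rough sketch of that hint; the task_struct field and the helpers are
invented for illustration:

    /* Called from the sleeping-lock acquire/release slow paths. */
    static inline void sleepable_lock_acquired(void)
    {
            current->sleepable_locks_held++;
    }

    static inline void sleepable_lock_released(void)
    {
            current->sleepable_locks_held--;
    }

    /* Scheduler side: grant one short time slice extension while the task
     * sits in a lock-held section; enforcement can still happen later. */
    static bool may_extend_time_slice(struct task_struct *p)
    {
            return p->sleepable_locks_held > 0 && !p->slice_extension_used;
    }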
We really want to let the scheduler decide and rather give it proper
hints at the conceptual level instead of letting developers make random
decisions which might work well for a particular use case and completely
suck for the rest. I think we wasted enough time already on those.
> Anyway, I'm definitely not opposed. We'd get rid of a config option
> that is presumably not very widely used, and we'd simplify a lot of
> issues, and get rid of all these badly defined "cond_preempt()"
> things.
Hmm. Didn't I promise a year ago that I won't do further large scale
cleanups and simplifications beyond printk.
Maybe I get away this time with just suggesting it. :)
Thanks,
tglx
On Tue, Sep 19 2023 at 10:43, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
> Ie. a modern scheduler might have mooted much of this change:
>
> 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")
>
> ... because now we'll only reschedule on timeslice exhaustion, or if a task
> comes in with a big deadline deficit.
>
> And even the deadline-deficit wakeup preemption can be turned off further
> with:
>
> $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> And we are considering making that the default behavior for same-prio tasks
> - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which
> should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.
I don't think that you can get rid of NEED_RESCHED_LAZY for !RT because
there is a clear advantage of having the return to user preemption
point.
It spares having the kernel/user transition just to get the task back
via the timeslice interrupt. I experimented with that on RT and the
result was definitely worse.
We surely can revisit that, but I'd really start with the straight
forward mappable LAZY bit approach and if experimentation turns out to
provide good enough results by not setting that bit at all, then we
still can do so without changing anything except the core scheduler
decision logic.
It's again a cheap thing due to the way the return to user TIF
handling works:
ti_work = read_thread_flags();
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
TIF_LAZY_RESCHED is part of EXIT_TO_USER_MODE_WORK, so the non-work case
does not become more expensive than today. If any of the bits is set,
then the slowpath won't get measurably different performance whether the bit
is evaluated or not in exit_to_user_mode_loop().
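A sketch of that wiring, loosely following kernel/entry/common.c; the LAZY
bit and its inclusion in EXIT_TO_USER_MODE_WORK are the hypothetical part:

    static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                                unsigned long ti_work)
    {
            while (ti_work & EXIT_TO_USER_MODE_WORK) {
                    local_irq_enable_exit_to_user(ti_work);

                    /* Both the immediate and the lazy bit end up here -
                     * the Ret2user column of the decision matrix. */
                    if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                            schedule();

                    /* signal delivery, notify-resume etc. handled here */

                    local_irq_disable_exit_to_user();
                    ti_work = read_thread_flags();
            }
            return ti_work;
    }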
As we really want TIF_LAZY_RESCHED for RT, we just keep all of this
consistent in terms of code and make it purely a scheduler decision whether it
utilizes it or not. As a consequence PREEMPT_RT is no longer special in
that regard and the main RT difference becomes the lock substitution and
forced interrupt threading.
For the magic 'spare me the extra conditional' optimization of
exit_to_user_mode_loop() if LAZY can be optimized out for !RT because
the scheduler is sooo clever (which I doubt), we can just use the same
approach as for other TIF bits and define them to 0 :)
So lets start consistent and optimize on top if really required.
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruction pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>> !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both, should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate the
>> range only contains simple instructions, the moment it contains control
>> flow I'm thinking it's too complicated.
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed for
> a full time slice once it gets on the CPU unless it schedules away
> voluntary via a blocking operation.
>
> This works pretty well as long as everything runs in userspace as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths which might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code and are in code paths which can block anyway. The
> only difference is that they allow preemption even when the resource is
> uncontended.
>
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
>
> Now we have the situation of long running data copies or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
>
> That's ugly and does not work for long lasting hardware operations so we
> ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated in the same way as a
> IP range based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation and the data chunk based
> mechanism is just heuristics which works as good as heuristics tend to
> work. allow_resched() is not any different and IP based preemption
> mechanism are not going to be any better.
Agreed. I was looking at how to add resched sections etc, and in
addition to the randomness the choice of where exactly to add it seemed
to be quite fuzzy. A recipe for future cruft.
> The approach here is: Prevent the scheduler to make decisions and then
> mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business to do resource
> control.
>
> We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
> held sections became preemptible caused even more preemption activity
> than on a PREEMPT=FULL kernel. The worst side effect of that was
> extensive lock contention.
>
> The way how we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
> preempt tasks which all belong to the SCHED_OTHER scheduling class. This
> works pretty well and gains back a massive amount of performance for the
> non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves the
> ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined in respect to time slices. In fact
> its not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
> unconditionally. That's not more overhead than the current
> DYNAMIC_PREEMPT mechanism as long as the preemption count does not
> go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
> From earlier experiments I know that the overhead of preempt_count
> is minimal and only really observable with micro benchmarks.
> Otherwise it ends up in the noise as long as the slow path is not
> taken.
>
> I did a quick check comparing a plain inc/dec pair vs. the
> PREEMPT_DYNAMIC inc/dec_and_test+NOOP mechanism and the delta is
> in the non-conclusive noise.
>
> 20 years ago this was a real issue because we did not have:
>
> - the folding of NEED_RESCHED into the preempt count
>
> - the cacheline optimizations which make the preempt count cache
> pretty much always cache hot
>
> - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.
>
> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.
>
> 3) When the scheduler tick observes that the time slice is exhausted,
> then it folds the NEED_RESCHED bit into the preempt count which
> causes the real preemption points to actually preempt including
> the return from interrupt to kernel path.
Right, and currently we check cond_resched() all the time in expectation
that something might need a resched.
Folding it in with the scheduler determining when the next preemption happens
seems to make a lot of sense to me.
Thanks
Ankur
> That even allows the scheduler to enforce preemption for e.g. RT
> class tasks without changing anything else.
>
> I'm pretty sure that this gets rid of cond_resched(), which is an
> impressive list of instances:
>
> ./drivers 392
> ./fs 318
> ./mm 189
> ./kernel 184
> ./arch 95
> ./net 83
> ./include 46
> ./lib 36
> ./crypto 16
> ./sound 16
> ./block 11
> ./io_uring 13
> ./security 11
> ./ipc 3
>
> That list clearly documents that the majority of these
> cond_resched() invocations is in code which neither should care
> nor should have any influence on the core scheduling decision
> machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics sprinkled all over the kernel.
> To me this makes a lot of sense, but I might be on the completely wrong
> track. So feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
> tglx
--
ankur
Thomas Gleixner <[email protected]> writes:
> So the decision matrix would be:
>
> Ret2user Ret2kernel PreemptCnt=0
>
> NEED_RESCHED Y Y Y
> LAZY_RESCHED Y N N
>
> That is completely independent of the preemption model and the
> differentiation of the preemption models happens solely at the scheduler
> level:
This is relatively minor, but do we need two flags? Seems to me we
can get to the same decision matrix by letting the scheduler fold
TIF_NEED_RESCHED into the preempt-count based on the current preemption model.
> PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
> slice where it sets NEED_RESCHED.
PREEMPT_NONE sets up TIF_NEED_RESCHED. For the time-slice expiry case,
also fold into preempt-count.
> PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
> tasks or sporadic event tasks sets NEED_RESCHED too.
PREEMPT_VOLUNTARY sets up TIF_NEED_RESCHED and also folds it for the
RT/sporadic tasks.
> PREEMPT_FULL always sets NEED_RESCHED like today.
Always fold the TIF_NEED_RESCHED into the preempt-count.
> We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
> only end up with two variants or even subsume PREEMPT_FULL into that
> model because that's what is closer to the RT LAZY preempt behaviour,
> which has two goals:
>
> 1) Make low latency guarantees for RT workloads
>
> 2) Preserve the throughput for non-RT workloads
>
> But in any case this decision happens solely in the core scheduler code
> and nothing outside of it needs to be changed.
>
> So we not only get rid of the cond/might_resched() muck, we also get rid
> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
> The only place which still needs that runtime tweaking is the scheduler
> itself.
True. The dynamic preemption could just become a scheduler tunable.
> Though it just occurred to me that there are dragons lurking:
>
> arch/alpha/Kconfig: select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig: select ARCH_NO_PREEMPT
> arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig: select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.
>
> But they support PREEMPT_COUNT, so we might get away with a reduced
> preemption point coverage:
>
> Ret2user Ret2kernel PreemptCnt=0
>
> NEED_RESCHED Y N Y
> LAZY_RESCHED Y N N
So from the discussion in the other thread, for the ARCH_NO_PREEMPT
configs that don't support preemption, we probably need a fourth
preemption model, say PREEMPT_UNSAFE.
These could use only the Ret2user preemption points and just fallback
to the !PREEMPT_COUNT primitives.
Thanks
--
ankur
On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>
>> So the decision matrix would be:
>>
>> Ret2user Ret2kernel PreemptCnt=0
>>
>> NEED_RESCHED Y Y Y
>> LAZY_RESCHED Y N N
>>
>> That is completely independent of the preemption model and the
>> differentiation of the preemption models happens solely at the scheduler
>> level:
>
> This is relatively minor, but do we need two flags? Seems to me we
> can get to the same decision matrix by letting the scheduler fold
> into the preempt-count based on current preemption model.
You still need the TIF flags because there is no way to do remote
modification of preempt count.
The preempt count folding is an optimization which simplifies the
preempt_enable logic:
	if (!--preempt_count && need_resched())
		schedule()

to

	if (!--preempt_count)
		schedule()
i.e. a single conditional instead of two.
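For reference, a simplified model of the folding trick (roughly what the
x86 implementation in arch/x86/include/asm/preempt.h does): NEED_RESCHED is
kept as an inverted bit inside the preempt count, so one zero test covers
both conditions.

    #define PREEMPT_NEED_RESCHED    0x80000000      /* stored inverted */

    static inline void set_need_resched_model(unsigned int *pc)
    {
            *pc &= ~PREEMPT_NEED_RESCHED;   /* bit cleared == resched needed */
    }

    static inline void clear_need_resched_model(unsigned int *pc)
    {
            *pc |= PREEMPT_NEED_RESCHED;
    }

    static inline bool dec_and_test_model(unsigned int *pc)
    {
            /* Zero only if the disable nesting is gone AND resched is needed. */
            return --(*pc) == 0;
    }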
The lazy bit is only evaluated in:
1) The return to user path
2) need_resched()
In neither case preempt_count is involved.
So it does not buy us anything. We might revisit that later, but for
simplicity's sake the extra TIF bit is way simpler.
Premature optimization is the enemy of correctness.
>> We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
>> only end up with two variants or even subsume PREEMPT_FULL into that
>> model because that's what is closer to the RT LAZY preempt behaviour,
>> which has two goals:
>>
>> 1) Make low latency guarantees for RT workloads
>>
>> 2) Preserve the throughput for non-RT workloads
>>
>> But in any case this decision happens solely in the core scheduler code
>> and nothing outside of it needs to be changed.
>>
>> So we not only get rid of the cond/might_resched() muck, we also get rid
>> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
>> The only place which still needs that runtime tweaking is the scheduler
>> itself.
>
> True. The dynamic preemption could just become a scheduler tunable.
That's the point.
>> But they support PREEMPT_COUNT, so we might get away with a reduced
>> preemption point coverage:
>>
>> Ret2user Ret2kernel PreemptCnt=0
>>
>> NEED_RESCHED Y N Y
>> LAZY_RESCHED Y N N
>
> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
> configs that don't support preemption, we probably need a fourth
> preemption model, say PREEMPT_UNSAFE.
As discussed they won't really notice the latency issues because the
museum pieces are not used for anything crucial and for UM that's the
least of the correctness worries.
So no, we don't need yet another knob. We keep them chugging along and
if they really want they can adapt to the new world order. :)
Thanks,
tglx
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>> that is presumably not very widely used, and we'd simplify a lot of
>> issues, and get rid of all these badly defined "cond_preempt()"
>> things.
>
> Hmm. Didn't I promise a year ago that I won't do further large scale
> cleanups and simplifications beyond printk.
>
> Maybe I get away this time with just suggesting it. :)
Maybe not. As I'm inveterately curious, I sat down and figured out how
that might look.
To some extent I really curse my curiosity as the amount of macro maze,
config options and convoluted mess behind all these preempt mechanisms
is beyond disgusting.
Find below a PoC which implements that scheme. It's not even close to
correct, but it builds, boots and survives lightweight testing.
I did not even try to look into time-slice enforcement, but I really want
to share this for illustration and for others to experiment.
This keeps all the existing mechanisms in place and introduces a new
config knob in the preemption model Kconfig switch: PREEMPT_AUTO
If selected it builds a CONFIG_PREEMPT kernel, which disables the
cond_resched() machinery and switches the fair scheduler class to use
the TIF_NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
the preempt NONE model except that cond_resched() is a NOOP and I did
not validate the time-slice enforcement. The latter should be a
no-brainer to figure out and fix if required.
For run-time switching to the FULL preemption model, which always
uses TIF_NEED_RESCHED, you need to enable CONFIG_SCHED_DEBUG and then
you can enable "FULL" via:
echo FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features
and switch back to some sort of "NONE" via
echo NO_FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features
It seems to work as expected for a simple hackbench -l 10000 run:
                      NO_FORCE_NEED_RESCHED      FORCE_NEED_RESCHED
   schedule() [1]                    3646163                 2701641
   preemption                          12554                  927856
   total                              3658717                 3629497
[1] is voluntary schedule() _and_ schedule() from return to user space. I
did not get around to accounting them separately yet, but for a quick
check this clearly shows that this "works" as advertised.
Of course this needs way more analysis than this quick PoC+check, but
you get the idea.
Contrary to other hot-off-the-press hacks, I'm pretty sure it won't
destroy your hard-disk, but I won't recommend that you deploy it on your
alarm-clock as it might make you miss the bus.
If this concept holds, which I'm pretty convinced of by now, then this
is an opportunity to trade ~3000 lines of unholy hacks for about 100-200
lines of understandable code :)
Thanks,
tglx
---
arch/x86/Kconfig | 1
arch/x86/include/asm/thread_info.h | 2 +
drivers/acpi/processor_idle.c | 2 -
include/linux/entry-common.h | 2 -
include/linux/entry-kvm.h | 2 -
include/linux/sched.h | 18 +++++++++++-----
include/linux/sched/idle.h | 8 +++----
include/linux/thread_info.h | 19 +++++++++++++++++
kernel/Kconfig.preempt | 12 +++++++++-
kernel/entry/common.c | 2 -
kernel/sched/core.c | 41 ++++++++++++++++++++++++-------------
kernel/sched/fair.c | 10 ++++-----
kernel/sched/features.h | 2 +
kernel/sched/idle.c | 3 --
kernel/sched/sched.h | 1
kernel/trace/trace.c | 2 -
16 files changed, 91 insertions(+), 36 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,14 @@ static inline void hrtick_rq_init(struct
#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, 1 << nr_bit) & _TIF_POLLING_NRFLAG);
}
/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct tas
}
#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
{
- set_tsk_need_resched(p);
+ set_tsk_thread_flag(p, nr_bit);
return true;
}
@@ -1038,28 +1038,42 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int nr_bit)
{
struct task_struct *curr = rq->curr;
int cpu;
lockdep_assert_rq_held(rq);
- if (test_tsk_need_resched(curr))
+ if (test_tsk_need_resched_type(curr, nr_bit))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_tsk_thread_flag(curr, nr_bit);
+ if (nr_bit == TIF_NEED_RESCHED)
+ set_preempt_need_resched();
return;
}
- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, nr_bit)) {
+ if (nr_bit == TIF_NEED_RESCHED)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ __resched_curr(rq, sched_feat(FORCE_NEED_RESCHED) ?
+ TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY);
}
void resched_cpu(int cpu)
@@ -1132,7 +1146,7 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id())
return;
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8886,6 @@ static void __init preempt_dynamic_init(
WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
return preempt_dynamic_mode == preempt_dynamic_##mode; \
} \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)
PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,11 @@ enum syscall_work_bit {
#include <asm/thread_info.h>
+#ifndef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
@@ -185,6 +190,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ arch_test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#else
static __always_inline bool tif_need_resched(void)
@@ -193,6 +205,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,6 +70,13 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.
+config PREEMPT_AUTO
+ bool "Automagic preemption mode with runtime tweaking support"
+ depends on HAVE_PREEMPT_AUTO
+ select PREEMPT_BUILD
+ help
+ Add some sensible blurb here
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
@@ -95,7 +105,7 @@ config PREEMPTION
config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
struct kvm_vcpu;
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_UPROBE)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);
extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -2069,13 +2069,21 @@ static inline void set_tsk_need_resched(
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
}
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
+static inline bool test_tsk_need_resched_type(struct task_struct *tsk,
+ int nr_bit)
+{
+ return unlikely(test_tsk_thread_flag(tsk, 1 << nr_bit));
+}
+
/*
* cond_resched() and cond_resched_lock(): latency reduction via
* explicit rescheduling in places that are safe. The return
@@ -2252,7 +2260,7 @@ static inline int rwlock_needbreak(rwloc
static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(tif_need_resched_lazy() || tif_need_resched());
}
/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -985,7 +985,7 @@ static void update_deadline(struct cfs_r
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
}
@@ -5267,7 +5267,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
* validating it and just reschedule.
*/
if (queued) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
return;
}
/*
@@ -5413,7 +5413,7 @@ static void __account_cfs_rq_runtime(str
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
}
static __always_inline
@@ -5673,7 +5673,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
#ifdef CONFIG_SMP
@@ -8073,7 +8073,7 @@ static void check_preempt_wakeup(struct
return;
preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
#ifdef CONFIG_SMP
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
*/
static void __cpuidle acpi_safe_halt(void)
{
- if (!tif_need_resched()) {
+ if (!need_resched()) {
raw_safe_halt();
raw_local_irq_disable();
}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling
static inline bool __must_check current_set_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static inline bool __must_check current_clr_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#endif
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
ct_cpuidle_enter();
raw_local_irq_enable();
- while (!tif_need_resched() &&
- (cpu_idle_force_poll || tick_check_broadcast_expired()))
+ while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2720,7 +2720,7 @@ unsigned int tracing_gen_ctx_irq_test(un
if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
trace_flags |= TRACE_FLAG_BH_OFF;
- if (tif_need_resched())
+ if (need_resched())
trace_flags |= TRACE_FLAG_NEED_RESCHED;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 6 /* Lazy rescheduling */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -106,6 +107,7 @@ struct thread_info {
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
Thomas Gleixner <[email protected]> writes:
> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>> Thomas Gleixner <[email protected]> writes:
>>
>>> So the decision matrix would be:
>>>
>>> Ret2user Ret2kernel PreemptCnt=0
>>>
>>> NEED_RESCHED Y Y Y
>>> LAZY_RESCHED Y N N
>>>
>>> That is completely independent of the preemption model and the
>>> differentiation of the preemption models happens solely at the scheduler
>>> level:
>>
>> This is relatively minor, but do we need two flags? Seems to me we
>> can get to the same decision matrix by letting the scheduler fold
>> into the preempt-count based on current preemption model.
>
> You still need the TIF flags because there is no way to do remote
> modification of preempt count.
Yes, agreed. In my version, I was envisaging that the remote cpu always
sets only TIF_NEED_RESCHED and then we decide which one we want at
the preemption point.
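Roughly, and only as an illustration (hypothetical sketch, not code from
either PoC; preemption_point() is a made-up name, while test_thread_flag(),
preempt_model_full() and preempt_schedule() are existing helpers):
  static void preemption_point(void)
  {
          /* Remote CPUs only ever set TIF_NEED_RESCHED in this variant */
          if (!test_thread_flag(TIF_NEED_RESCHED))
                  return;

          /* The local preemption point decides what the flag means */
          if (preempt_model_full())
                  preempt_schedule();     /* act on it immediately */
          /* else: leave it set, exit to user space will schedule() */
  }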
Anyway, I see what you meant in your PoC.
>>> But they support PREEMPT_COUNT, so we might get away with a reduced
>>> preemption point coverage:
>>>
>>> Ret2user Ret2kernel PreemptCnt=0
>>>
>>> NEED_RESCHED Y N Y
>>> LAZY_RESCHED Y N N
>>
>> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
>> configs that don't support preemption, we probably need a fourth
>> preemption model, say PREEMPT_UNSAFE.
>
> As discussed they wont really notice the latency issues because the
> museum pieces are not used for anything crucial and for UM that's the
> least of the correctness worries.
>
> So no, we don't need yet another knob. We keep them chucking along and
> if they really want they can adapt to the new world order. :)
Will they chuckle along, or die trying ;)?
I grepped for "preempt_enable|preempt_disable" for all the archs and
hexagon and m68k don't seem to do any explicit accounting at all.
(Though, neither do nios2 and openrisc, and both csky and microblaze
only do it in the tlbflush path.)
arch/hexagon 0
arch/m68k 0
arch/nios2 0
arch/openrisc 0
arch/csky 3
arch/microblaze 3
arch/um 4
arch/riscv 8
arch/arc 14
arch/parisc 15
arch/arm 16
arch/sparc 16
arch/xtensa 19
arch/sh 21
arch/alpha 23
arch/ia64 27
arch/loongarch 53
arch/arm64 54
arch/s390 91
arch/mips 115
arch/x86 146
arch/powerpc 201
My concern is given that we preempt on timeslice expiration for all
three preemption models, we could end up preempting at an unsafe
location.
Still, not the most pressing of problems.
Thanks
--
ankur
On Wed, Sep 20 2023 at 17:58, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>> So no, we don't need yet another knob. We keep them chucking along and
>> if they really want they can adapt to the new world order. :)
>
> Will they chuckle along, or die trying ;)?
Either way is fine :)
> I grepped for "preempt_enable|preempt_disable" for all the archs and
> hexagon and m68k don't seem to do any explicit accounting at all.
> (Though, neither do nios2 and openrisc, and both csky and microblaze
> only do it in the tlbflush path.)
>
> arch/hexagon 0
> arch/m68k 0
...
> arch/s390 91
> arch/mips 115
> arch/x86 146
> arch/powerpc 201
>
> My concern is given that we preempt on timeslice expiration for all
> three preemption models, we could end up preempting at an unsafe
> location.
As I said in my reply to Linus, that count is not really conclusive.
arch/m68k has a count of 0 and supports PREEMPT for the COLDFIRE
sub-architecture and I know for sure that at some point in the past
PREEMPT_RT was supported on COLDFIRE with minimal changes to the
architecture code.
That said, I'm pretty sure that quite some of these
preempt_disable/enable pairs in arch/* are subject to voodoo
programming, but that's a different problem to analyze.
> Still, not the most pressing of problems.
Exactly :)
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
>> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>>> that is presumably not very widely used, and we'd simplify a lot of
>>> issues, and get rid of all these badly defined "cond_preempt()"
>>> things.
>>
>> Hmm. Didn't I promise a year ago that I won't do further large scale
>> cleanups and simplifications beyond printk.
>>
>> Maybe I get away this time with just suggesting it. :)
>
> Maybe not. As I'm inveterately curious, I sat down and figured out how
> that might look.
>
> To some extent I really curse my curiosity as the amount of macro maze,
> config options and convoluted mess behind all these preempt mechanisms
> is beyond disgusting.
>
> Find below a PoC which implements that scheme. It's not even close to
> correct, but it builds, boots and survives lightweight testing.
Whew, that was electric. I had barely managed to sort through some of
the config maze.
From a quick look this is pretty much how you described it.
> I did not even try to look into time-slice enforcement, but I really want
> to share this for illustration and for others to experiment.
>
> This keeps all the existing mechanisms in place and introduces a new
> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>
> If selected it builds a CONFIG_PREEMPT kernel, which disables the
> cond_resched() machinery and switches the fair scheduler class to use
> the TIF_NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
> the preempt NONE model except that cond_resched() is a NOOP and I did
> not validate the time-slice enforcement. The latter should be a
> no-brainer to figure out and fix if required.
Yeah, let me try this out.
Thanks
Ankur
On Wed, Sep 20 2023 at 22:51, Thomas Gleixner wrote:
> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>
> The preempt count folding is an optimization which simplifies the
> preempt_enable logic:
>
> if (--preempt_count && need_resched())
> schedule()
> to
> if (--preempt_count)
> schedule()
That should be (!(--preempt_count... in both cases of course :)
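IOW, with the correction applied, the point of the fold is that a single
dec-and-test covers both conditions. Conceptual sketch only, not the
actual implementation (IIRC the real x86 code keeps the NEED_RESCHED bit
inverted inside preempt_count, but the idea is the same):
  /* Unfolded: two separate checks on the preempt_enable() path */
  if (!(--preempt_count) && need_resched())
          schedule();

  /*
   * Folded: NEED_RESCHED lives as a bit inside preempt_count, so
   * "count reached zero" already implies both "no preempt_disable()
   * nesting left" and "a reschedule was requested".
   */
  if (!(--preempt_count))
          schedule();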
Ok, I like this.
That said, this part of it:
On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <[email protected]> wrote:
>
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int nr_bit)
> [...]
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_tsk_thread_flag(curr, nr_bit);
> + if (nr_bit == TIF_NEED_RESCHED)
> + set_preempt_need_resched();
feels really hacky.
I think that instead of passing a random TIF bit around, it should
just pass a "lazy or not" value around.
Then you make the TIF bit be some easily computable thing (eg something like
#define TIF_RESCHED(lazy) (TIF_NEED_RESCHED + (lazy))
or whatever), and write the above conditional as
if (!lazy)
set_preempt_need_resched();
so that it all *does* the same thing, but the code makes it clear
about what the logic is.
Because honestly, without having been part of this thread, I would look at that
if (nr_bit == TIF_NEED_RESCHED)
set_preempt_need_resched();
and I'd be completely lost. It doesn't make conceptual sense, I feel.
So I'd really like the source code to be more directly expressing the
*intent* of the code, not be so centered around the implementation
detail.
Put another way: I think we can make the compiler turn the intent into
the implementation, and I'd rather *not* have us humans have to infer
the intent from the implementation.
That said - I think as a proof of concept and "look, with this we get
the expected scheduling event counts", that patch is perfect. I think
you more than proved the concept.
Linus
Linus!
On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> Ok, I like this.
Thanks!
> That said, this part of it:
> On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <[email protected]> wrote:
> Because honestly, without having been part of this thread, I would look at that
>
> if (nr_bit == TIF_NEED_RESCHED)
> set_preempt_need_resched();
>
> and I'd be completely lost. It doesn't make conceptual sense, I feel.
>
> So I'd really like the source code to be more directly expressing the
> *intent* of the code, not be so centered around the implementation
> detail.
>
> Put another way: I think we can make the compiler turn the intent into
> the implementation, and I'd rather *not* have us humans have to infer
> the intent from the implementation.
No argument about that. I didn't like it either, but at 10PM ...
> That said - I think as a proof of concept and "look, with this we get
> the expected scheduling event counts", that patch is perfect. I think
> you more than proved the concept.
There is certainly quite some analysis work to do to make this a one to
one replacement.
With a handful of benchmarks the PoC (tweaked with some obvious fixes)
is pretty much on par with the current mainline variants (NONE/FULL),
but the memtier benchmark makes a massive dent.
It sports a whopping 10% regression with the LAZY mode versus the mainline
NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
That benchmark is really sensitive to the preemption model. With current
mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
performance drop versus preempt=NONE.
I have no clue what's going on there yet, but that shows that there is
obviously quite some work ahead to get this sorted.
Though I'm pretty convinced by now that this is the right direction and
well worth the effort which needs to be put into that.
Thanks,
tglx
On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote:
> Thomas Gleixner <[email protected]> writes:
>> Find below a PoC which implements that scheme. It's not even close to
>> correct, but it builds, boots and survives lightweight testing.
>
> Whew, that was electric. I had barely managed to sort through some of
> the config maze.
> From a quick look this is pretty much how you described it.
Unsurprisingly I spent at least 10x more time describing it than hacking
it up.
IOW, I had done the analysis before I offered the idea and before I
changed a single line of code. The tools I used for that are git-grep,
tags, paper, pencil, accrued knowledge and patience, i.e. nothing even
close to rocket science.
Converting the analysis into code was mostly a matter of brain dumping
the analysis and adherence to accrued methodology.
What's electric about that?
I might be missing some meaning of 'electric' which is not covered by my
mostly Webster-restricted old-school understanding of the English language :)
>> I did not even try to look into time-slice enforcement, but I really want
>> to share this for illustration and for others to experiment.
>>
>> This keeps all the existing mechanisms in place and introduces a new
>> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>>
>> If selected it builds a CONFIG_PREEMPT kernel, which disables the
>> cond_resched() machinery and switches the fair scheduler class to use
>> the TIF_NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
>> the preempt NONE model except that cond_resched() is a NOOP and I did
>> not validate the time-slice enforcement. The latter should be a
>> no-brainer to figure out and fix if required.
>
> Yeah, let me try this out.
That's what I hoped for :)
Thanks,
tglx
On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> That said - I think as a proof of concept and "look, with this we get
>> the expected scheduling event counts", that patch is perfect. I think
>> you more than proved the concept.
>
> There is certainly quite some analysis work to do to make this a one to
> one replacement.
>
> With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> is pretty much on par with the current mainline variants (NONE/FULL),
> but the memtier benchmark makes a massive dent.
>
> It sports a whopping 10% regression with the LAZY mode versus the mainline
> NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>
> That benchmark is really sensitive to the preemption model. With current
> mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> performance drop versus preempt=NONE.
That 20% was a tired pilot error. The real number is in the 5% ballpark.
> I have no clue what's going on there yet, but that shows that there is
> obviously quite some work ahead to get this sorted.
It took some head scratching to figure that out. The initial fix broke
the handling of the hog issue, i.e. the problem that Ankur tried to
solve, but I hacked up a "solution" for that too.
With that the memtier benchmark is roughly back to the mainline numbers,
but my throughput benchmark know-how is pretty close to zero, so that
should be looked at by people who actually understand these things.
Likewise the hog prevention is just at the PoC level and clearly beyond
my knowledge of scheduler details: It unconditionally forces a
reschedule when the looping task is not responding to a lazy reschedule
request before the next tick. IOW it forces a reschedule on the second
tick, which is obviously different from the cond_resched()/might_sleep()
behaviour.
The changes vs. the original PoC aside of the bug and thinko fixes:
1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
lazy preempt bit as the trace_entry::flags field is full already.
That obviously breaks the tracer ABI, but if we go there then
this needs to be fixed. Steven?
2) debugfs file to validate that loops can be force preempted w/o
cond_resched()
The usage is:
# taskset -c 1 bash
# echo 1 > /sys/kernel/debug/sched/hog &
# echo 1 > /sys/kernel/debug/sched/hog &
# echo 1 > /sys/kernel/debug/sched/hog &
top shows ~33% CPU for each of the hogs and tracing confirms that
the crude hack in the scheduler tick works:
bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr
bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr
bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr
bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr
bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr
bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr
The 'l' instead of the usual 'N' reflects that the lazy resched
bit is set. That makes __update_curr() invoke resched_curr()
instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
and folds it into preempt_count so that preemption happens at the
next possible point, i.e. either in return from interrupt or at
the next preempt_enable().
That's as much as I wanted to demonstrate and I'm not going to spend
more cycles on it as I have already too many other things on flight and
the resulting scheduler woes are clearly outside of my expertice.
Though definitely I'm putting a permanent NAK in place for any attempts
to duct tape the preempt=NONE model any further by sprinkling more
cond*() and whatever warts around.
Thanks,
tglx
---
arch/x86/Kconfig | 1
arch/x86/include/asm/thread_info.h | 6 ++--
drivers/acpi/processor_idle.c | 2 -
include/linux/entry-common.h | 2 -
include/linux/entry-kvm.h | 2 -
include/linux/sched.h | 12 +++++---
include/linux/sched/idle.h | 8 ++---
include/linux/thread_info.h | 24 +++++++++++++++++
include/linux/trace_events.h | 8 ++---
kernel/Kconfig.preempt | 17 +++++++++++-
kernel/entry/common.c | 4 +-
kernel/entry/kvm.c | 2 -
kernel/sched/core.c | 51 +++++++++++++++++++++++++------------
kernel/sched/debug.c | 19 +++++++++++++
kernel/sched/fair.c | 46 ++++++++++++++++++++++-----------
kernel/sched/features.h | 2 +
kernel/sched/idle.c | 3 --
kernel/sched/sched.h | 1
kernel/trace/trace.c | 2 +
kernel/trace/trace_output.c | 16 ++++++++++-
20 files changed, 171 insertions(+), 57 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+
+ return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
}
/*
@@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
for (;;) {
if (!(val & _TIF_POLLING_NRFLAG))
return false;
- if (val & _TIF_NEED_RESCHED)
+ if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
return true;
if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
break;
@@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
}
#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
{
- set_tsk_need_resched(p);
+ set_tsk_thread_flag(p, tif_bit);
return true;
}
@@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int lazy)
{
+ int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
struct task_struct *curr = rq->curr;
- int cpu;
lockdep_assert_rq_held(rq);
- if (test_tsk_need_resched(curr))
+ if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_tsk_thread_flag(curr, tif_bit);
+ if (!lazy)
+ set_preempt_need_resched();
return;
}
- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, tif_bit)) {
+ if (!lazy)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, 0);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
+ TIF_NEED_RESCHED_LAZY_OFFSET : 0;
+
+ if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
+ return;
+
+ __resched_curr(rq, lazy);
}
void resched_cpu(int cpu)
@@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id())
return;
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
return preempt_dynamic_mode == preempt_dynamic_##mode; \
} \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)
PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,16 @@ enum syscall_work_bit {
#include <asm/thread_info.h>
+#ifdef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY
+# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY
+# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
+#else
+# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+# define TIF_NEED_RESCHED_LAZY_OFFSET 0
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
@@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ arch_test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#else
static __always_inline bool tif_need_resched(void)
@@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
(unsigned long *)(&current_thread_info()->flags));
}
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+ test_bit(TIF_NEED_RESCHED_LAZY,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,13 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+config PREEMPT_BUILD_AUTO
+ bool
+ select PREEMPT_BUILD
+
+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,9 +74,17 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.
+config PREEMPT_AUTO
+ bool "Automagic preemption mode with runtime tweaking support"
+ depends on HAVE_PREEMPT_AUTO
+ select PREEMPT_BUILD_AUTO
+ help
+ Add some sensible blurb here
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
+ select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
select PREEMPTION
help
This option turns the kernel into a real-time kernel by replacing
@@ -95,7 +110,7 @@ config PREEMPTION
config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
struct kvm_vcpu;
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_UPROBE)
@@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
rcu_irq_exit_check_preempt();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
WARN_ON_ONCE(!on_thread_stack());
- if (need_resched())
+ if (test_tsk_need_resched(current))
preempt_schedule_irq();
}
}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);
extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
}
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
@@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(tif_need_resched_lazy() || tif_need_resched());
}
/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
{
+ struct rq *rq = rq_of(cfs_rq);
+
if ((s64)(se->vruntime - se->deadline) < 0)
return;
@@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
/*
* The task has consumed its request, reschedule.
*/
- if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
- clear_buddies(cfs_rq, se);
+ if (cfs_rq->nr_running < 2)
+ return;
+
+ if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
+ resched_curr(rq);
+ } else {
+ /* Did the task ignore the lazy reschedule request? */
+ if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+ resched_curr(rq);
+ else
+ resched_curr_lazy(rq);
}
+ clear_buddies(cfs_rq, se);
}
#include "pelt.h"
@@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
@@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
schedstat_add(cfs_rq->exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ update_deadline(cfs_rq, curr, tick);
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
@@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
account_cfs_rq_runtime(cfs_rq, delta_exec);
}
+static inline void update_curr(struct cfs_rq *cfs_rq)
+{
+ __update_curr(cfs_rq, false);
+}
+
static void update_curr_fair(struct rq *rq)
{
update_curr(cfs_rq_of(&rq->curr->se));
@@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ __update_curr(cfs_rq, true);
/*
* Ensure that runnable average is periodically updated.
@@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
* validating it and just reschedule.
*/
if (queued) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
return;
}
/*
@@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
}
static __always_inline
@@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
#ifdef CONFIG_SMP
@@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
if (delta < 0) {
if (task_current(rq, p))
- resched_curr(rq);
+ resched_curr_lazy(rq);
return;
}
hrtick_start(rq, delta);
@@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (need_resched())
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
return;
preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
#ifdef CONFIG_SMP
@@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
*/
if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
__entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
/*
@@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
*/
if (task_current(rq, p)) {
if (p->prio > oldprio)
- resched_curr(rq);
+ resched_curr_lazy(rq);
} else
check_preempt_curr(rq, p, 0);
}
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
*/
static void __cpuidle acpi_safe_halt(void)
{
- if (!tif_need_resched()) {
+ if (!need_resched()) {
raw_safe_halt();
raw_local_irq_disable();
}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
*/
smp_mb__after_atomic();
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling
static inline bool __must_check current_set_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
static inline bool __must_check current_clr_polling_and_test(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(need_resched());
}
#endif
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
ct_cpuidle_enter();
raw_local_irq_enable();
- while (!tif_need_resched() &&
- (cpu_idle_force_poll || tick_check_broadcast_expired()))
+ while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
if (tif_need_resched())
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (tif_need_resched_lazy())
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,8 +81,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -104,6 +105,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
return -EINTR;
}
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_NOTIFY_RESUME)
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
- TRACE_FLAG_NEED_RESCHED = 0x04,
+ TRACE_FLAG_NEED_RESCHED = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
TRACE_FLAG_PREEMPT_RESCHED = 0x20,
@@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';
- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,6 +333,23 @@ static const struct file_operations sche
.release = seq_release,
};
+static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ unsigned long end = jiffies + 60 * HZ;
+
+ for (; time_before(jiffies, end) && !signal_pending(current);)
+ cpu_relax();
+
+ return cnt;
+}
+
+static const struct file_operations sched_hog_fops = {
+ .write = sched_hog_write,
+ .open = simple_open,
+ .llseek = default_llseek,
+};
+
static struct dentry *debugfs_sched;
static __init int sched_init_debug(void)
@@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
+ debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
+
return 0;
}
late_initcall(sched_init_debug);
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Then the question becomes whether we'd want to introduce a *new*
>> concept, which is a "if you are going to schedule, do it now rather
>> than later, because I'm taking a lock, and while it's a preemptible
>> lock, I'd rather not sleep while holding this resource".
>>
>> I suspect we want to avoid that for now, on the assumption that it's
>> hopefully not a problem in practice (the recently addressed problem
>> with might_sleep() was that it actively *moved* the scheduling point
>> to a bad place, not that scheduling could happen there, so instead of
>> optimizing scheduling, it actively pessimized it). But I thought I'd
>> mention it.
>
> I think we want to avoid that completely and if this becomes an issue,
> we rather be smart about it at the core level.
>
> It's trivial enough to have a per task counter which tells whether a
> preemtible lock is held (or about to be acquired) or not. Then the
> scheduler can take that hint into account and decide to grant a
> timeslice extension once in the expectation that the task leaves the
> lock held section soonish and either returns to user space or schedules
> out. It still can enforce it later on.
>
> We really want to let the scheduler decide and rather give it proper
> hints at the conceptual level instead of letting developers make random
> decisions which might work well for a particular use case and completely
> suck for the rest. I think we wasted enough time already on those.
Finally I realized why cond_resched() et al. are so disgusting. They
are scope-less and just a random spot which someone decided to be a good
place to reschedule.
But in fact the really relevant measure is scope. Full preemption is
scope based:
preempt_disable();
do_stuff();
preempt_enable();
which also nests properly:
preempt_disable();
do_stuff()
preempt_disable();
do_other_stuff();
preempt_enable();
preempt_enable();
cond_resched() cannot nest and is obviously scope-less.
The TIF_ALLOW_RESCHED mechanism, which sparked this discussion, only
pretends to be scoped.
As Peter pointed out it does not properly nest with other mechanisms and
it cannot even nest in itself because it is boolean.
The worst thing about it is that it is semantically the reverse of the
established model of preempt_disable()/enable(),
i.e. allow_resched()/disallow_resched().
So instead of giving the scheduler a hint about 'this might be a good
place to preempt', providing proper scope would make way more sense:
preempt_lazy_disable();
do_stuff();
preempt_lazy_enable();
That would be the obvious and semantically consistent counterpart to the
existing preemption control primitives with proper nesting support.
might_sleep(), which is in all the lock acquire functions or your
variant of hint (resched better now before I take the lock) are the
wrong place.
hint();
lock();
do_stuff();
unlock();
hint() might schedule, and when the task comes back it might schedule
immediately again because the lock is contended. hint() again does not
have scope and might be meaningless or even counterproductive if called
in a deeper callchain.
Proper scope based hints avoid that.
preempt_lazy_disable();
lock();
do_stuff();
unlock();
preempt_lazy_enable();
That's way better because it describes the scope and the task will
either schedule out in lock() on contention or provide a sensible lazy
preemption point in preempt_lazy_enable(). It also nests properly:
preempt_lazy_disable();
lock(A);
do_stuff()
preempt_lazy_disable();
lock(B);
do_other_stuff();
unlock(B);
preempt_lazy_enable();
unlock(A);
preempt_lazy_enable();
So in this case it does not matter whether do_stuff() is invoked from a
lock-held section or not. The scope which defines the throughput-relevant
hint to the scheduler is correct in any case.
Contrary to preempt_disable(), the lazy variant prevents neither
scheduling nor preemption, but it is an understandable, properly
nestable mechanism.
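For concreteness, a minimal sketch of what such nestable primitives could
look like. Purely illustrative: neither the functions nor the
preempt_lazy_count field exist today, TIF_NEED_RESCHED_LAZY is borrowed
from the PoC, and the scheduler side is omitted entirely.
  static inline void preempt_lazy_disable(void)
  {
          /* Hypothetical per-task nesting counter, i.e. the scope depth */
          current->preempt_lazy_count++;
  }

  static inline void preempt_lazy_enable(void)
  {
          if (--current->preempt_lazy_count)
                  return;

          /*
           * Leaving the outermost scope is the sensible lazy preemption
           * point: if the scheduler asked for it in the meantime, honour
           * it now.
           */
          if (test_thread_flag(TIF_NEED_RESCHED_LAZY))
                  schedule();
  }
Unlike preempt_disable() this would not prevent preemption at all; it is
merely a hint which the scheduler remains free to override.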
I seriously hope to avoid it altogether :)
Thanks,
tglx
On Sat, 23 Sep 2023 03:11:05 +0200
Thomas Gleixner <[email protected]> wrote:
> Though definitely I'm putting a permanent NAK in place for any attempts
> to duct tape the preempt=NONE model any further by sprinkling more
> cond*() and whatever warts around.
Well, until we have this fix in, we will still need to sprinkle those
around when they are triggering watchdog timeouts. I just had to add one
recently due to a timeout report :-(
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>
> if (tif_need_resched())
> trace_flags |= TRACE_FLAG_NEED_RESCHED;
> + if (tif_need_resched_lazy())
> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
> if (test_preempt_need_resched())
> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>
> enum trace_flag_type {
> TRACE_FLAG_IRQS_OFF = 0x01,
> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
I never cared for that NOSUPPORT flag. It's from 2008 and only used by
archs that do not support irq tracing (aka lockdep). I'm fine with dropping
it and just updating the user space libraries (which will no longer see it
not supported, but that's fine with me).
> - TRACE_FLAG_NEED_RESCHED = 0x04,
> + TRACE_FLAG_NEED_RESCHED = 0x02,
> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT?
Because, NEED_RESCHED is known, and moving that to bit 2 will break user
space. Having LAZY replace the IRQS_NOSUPPORT will cause the least
"breakage".
-- Steve
> TRACE_FLAG_HARDIRQ = 0x08,
> TRACE_FLAG_SOFTIRQ = 0x10,
> TRACE_FLAG_PREEMPT_RESCHED = 0x20,
> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>
> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
> {
> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> + return tracing_gen_ctx_irq_test(0);
> }
> static inline unsigned int tracing_gen_ctx(void)
> {
> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> + return tracing_gen_ctx_irq_test(0);
> }
> #endif
>
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
> bh_off ? 'b' :
> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
> '.';
>
> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
> TRACE_FLAG_PREEMPT_RESCHED)) {
> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> + need_resched = 'B';
> + break;
> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
> need_resched = 'N';
> break;
> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> + need_resched = 'L';
> + break;
> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> + need_resched = 'b';
> + break;
> case TRACE_FLAG_NEED_RESCHED:
> need_resched = 'n';
> break;
> + case TRACE_FLAG_NEED_RESCHED_LAZY:
> + need_resched = 'l';
> + break;
> case TRACE_FLAG_PREEMPT_RESCHED:
> need_resched = 'p';
> break;
On Mon, Oct 02 2023 at 10:15, Steven Rostedt wrote:
> On Sat, 23 Sep 2023 03:11:05 +0200
> Thomas Gleixner <[email protected]> wrote:
>
>> Though definitely I'm putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>
> Well, until we have this fix in, we will still need to sprinkle those
> around when they are triggering watchdog timeouts. I just had to add one
> recently due to a timeout report :-(
cond_resched() sure. But not new flavours of it, like the
[dis]allow_resched() which sparked this discussion.
>> - TRACE_FLAG_NEED_RESCHED = 0x04,
>> + TRACE_FLAG_NEED_RESCHED = 0x02,
>> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
>
> Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT?
> Because, NEED_RESCHED is known, and moving that to bit 2 will break user
> space. Having LAZY replace the IRQS_NOSUPPORT will cause the least
> "breakage".
Either way works for me.
Thanks,
tglx
Hi Thomas,
On Tue, Sep 19, 2023 at 9:57 PM Thomas Gleixner <[email protected]> wrote:
> Though it just occured to me that there are dragons lurking:
>
> arch/alpha/Kconfig: select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig: select ARCH_NO_PREEMPT
> arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig: select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.
Looks like there is a fifth one hidden: although openrisc does not
select ARCH_NO_PREEMPT, it does not call preempt_schedule_irq() or
select GENERIC_ENTRY?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> >> That said - I think as a proof of concept and "look, with this we get
> >> the expected scheduling event counts", that patch is perfect. I think
> >> you more than proved the concept.
> >
> > There is certainly quite some analysis work to do to make this a one to
> > one replacement.
> >
> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> > is pretty much on par with the current mainline variants (NONE/FULL),
> > but the memtier benchmark makes a massive dent.
> >
> > It sports a whopping 10% regression with the LAZY mode versus the mainline
> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
> >
> > That benchmark is really sensitive to the preemption model. With current
> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
> > performance drop versus preempt=NONE.
>
> That 20% was a tired pilot error. The real number is in the 5% ballpark.
>
> > I have no clue what's going on there yet, but that shows that there is
> > obviously quite some work ahead to get this sorted.
>
> It took some head scratching to figure that out. The initial fix broke
> the handling of the hog issue, i.e. the problem that Ankur tried to
> solve, but I hacked up a "solution" for that too.
>
> With that the memtier benchmark is roughly back to the mainline numbers,
> > but my throughput benchmark know-how is pretty close to zero, so that
> should be looked at by people who actually understand these things.
>
> Likewise the hog prevention is just at the PoC level and clearly beyond
> my knowledge of scheduler details: It unconditionally forces a
> reschedule when the looping task is not responding to a lazy reschedule
> request before the next tick. IOW it forces a reschedule on the second
> tick, which is obviously different from the cond_resched()/might_sleep()
> behaviour.
>
> The changes vs. the original PoC aside of the bug and thinko fixes:
>
> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
> lazy preempt bit as the trace_entry::flags field is full already.
>
> That obviously breaks the tracer ABI, but if we go there then
> this needs to be fixed. Steven?
>
> 2) debugfs file to validate that loops can be force preempted w/o
> cond_resched()
>
> The usage is:
>
> # taskset -c 1 bash
> # echo 1 > /sys/kernel/debug/sched/hog &
> # echo 1 > /sys/kernel/debug/sched/hog &
> # echo 1 > /sys/kernel/debug/sched/hog &
>
> top shows ~33% CPU for each of the hogs and tracing confirms that
> the crude hack in the scheduler tick works:
>
> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr
> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr
> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr
> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr
> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr
> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr
> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr
> bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr
>
> The 'l' instead of the usual 'N' reflects that the lazy resched
> bit is set. That makes __update_curr() invoke resched_curr()
> instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
> and folds it into preempt_count so that preemption happens at the
> next possible point, i.e. either in return from interrupt or at
> the next preempt_enable().
Belatedly calling out some RCU issues. Nothing fatal, just a
(surprisingly) few adjustments that will need to be made. The key thing
to note is that from RCU's viewpoint, with this change, all kernels
are preemptible, though rcu_read_lock() readers remain non-preemptible.
With that:
1. As an optimization, given that preempt_count() would always give
good information, the scheduling-clock interrupt could sense RCU
readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
IPI handlers for expedited grace periods. A nice optimization.
Except that...
2. The quiescent-state-forcing code currently relies on the presence
of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
would be to do resched_cpu() more quickly, but some workloads
might not love the additional IPIs. Another approach is to do #1
above to replace the quiescent states from cond_resched() with
scheduler-tick-interrupt-sensed quiescent states.
Plus...
3. For nohz_full CPUs that run for a long time in the kernel,
there are no scheduling-clock interrupts. RCU reaches for
the resched_cpu() hammer a few jiffies into the grace period.
And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
interrupt-entry code will re-enable its scheduling-clock interrupt
upon receiving the resched_cpu() IPI.
So nohz_full CPUs should be OK as far as RCU is concerned.
Other subsystems might have other opinions.
4. As another optimization, kvfree_rcu() could unconditionally
check preempt_count() to sense a clean environment suitable for
memory allocation.
5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
instead say "select TASKS_RCU". This means that the #else
in include/linux/rcupdate.h that defines TASKS_RCU in terms of
vanilla RCU must go. There might be some fallout if something
fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
rcu_tasks_classic_qs() to do something useful.
6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
or RCU Tasks Rude) would need those pesky cond_resched() calls
to stick around. The reason is that RCU Tasks readers are ended
only by voluntary context switches. This means that although a
preemptible infinite loop in the kernel won't inconvenience a
real-time task (nor a non-real-time task for all that long),
and won't delay grace periods for the other flavors of RCU,
it would indefinitely delay an RCU Tasks grace period.
However, RCU Tasks grace periods seem to be finite in preemptible
kernels today, so they should remain finite in limited-preemptible
kernels tomorrow. Famous last words...
7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
any algorithmic difference from this change.
8. As has been noted elsewhere, in this new limited-preemption
mode of operation, rcu_read_lock() readers remain preemptible.
This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
9. The rcu_preempt_depth() macro could do something useful in
limited-preemption kernels. Its current lack of ability in
CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
10. The cond_resched_rcu() function must remain because we still
have non-preemptible rcu_read_lock() readers.
11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
unchanged, but I must defer to the include/net/ip_vs.h people.
12. I need to check with the BPF folks on the BPF verifier's
definition of BTF_ID(func, rcu_read_unlock_strict).
13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
function might have some redundancy across the board instead
of just on CONFIG_PREEMPT_RCU=y. Or might not.
14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
might need to do something for non-preemptible RCU to make
up for the lack of cond_resched() calls. Maybe just drop the
"IS_ENABLED()" and execute the body of the current "if" statement
unconditionally.
15. I must defer to others on the mm/pgtable-generic.c file's
#ifdef that depends on CONFIG_PREEMPT_RCU.
While in the area, I noted that KLP seems to depend on cond_resched(),
but on this I must defer to the KLP people.
I am sure that I am missing something, but I have not yet seen any
show-stoppers. Just some needed adjustments.
Thoughts?
Thanx, Paul
> That's as much as I wanted to demonstrate and I'm not going to spend
> more cycles on it as I have already too many other things in flight and
> the resulting scheduler woes are clearly outside of my expertise.
>
> Though definitely I'm putting a permanent NAK in place for any attempts
> to duct tape the preempt=NONE model any further by sprinkling more
> cond*() and whatever warts around.
>
> Thanks,
>
> tglx
> ---
> arch/x86/Kconfig | 1
> arch/x86/include/asm/thread_info.h | 6 ++--
> drivers/acpi/processor_idle.c | 2 -
> include/linux/entry-common.h | 2 -
> include/linux/entry-kvm.h | 2 -
> include/linux/sched.h | 12 +++++---
> include/linux/sched/idle.h | 8 ++---
> include/linux/thread_info.h | 24 +++++++++++++++++
> include/linux/trace_events.h | 8 ++---
> kernel/Kconfig.preempt | 17 +++++++++++-
> kernel/entry/common.c | 4 +-
> kernel/entry/kvm.c | 2 -
> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------
> kernel/sched/debug.c | 19 +++++++++++++
> kernel/sched/fair.c | 46 ++++++++++++++++++++++-----------
> kernel/sched/features.h | 2 +
> kernel/sched/idle.c | 3 --
> kernel/sched/sched.h | 1
> kernel/trace/trace.c | 2 +
> kernel/trace/trace_output.c | 16 ++++++++++-
> 20 files changed, 171 insertions(+), 57 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
>
> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
> /*
> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> {
> struct thread_info *ti = task_thread_info(p);
> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> +
> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
> for (;;) {
> if (!(val & _TIF_POLLING_NRFLAG))
> return false;
> - if (val & _TIF_NEED_RESCHED)
> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> return true;
> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
> break;
> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> {
> - set_tsk_need_resched(p);
> + set_tsk_thread_flag(p, tif_bit);
> return true;
> }
>
> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
> * might also involve a cross-CPU call to trigger the scheduler on
> * the target CPU.
> */
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int lazy)
> {
> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
> struct task_struct *curr = rq->curr;
> - int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (test_tsk_need_resched(curr))
> + if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_tsk_thread_flag(curr, tif_bit);
> + if (!lazy)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(curr, tif_bit)) {
> + if (!lazy)
> + smp_send_reschedule(cpu);
> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> +}
> +
> +void resched_curr(struct rq *rq)
> +{
> + __resched_curr(rq, 0);
> +}
> +
> +void resched_curr_lazy(struct rq *rq)
> +{
> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
> + TIF_NEED_RESCHED_LAZY_OFFSET : 0;
> +
> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
> + return;
> +
> + __resched_curr(rq, lazy);
> }
>
> void resched_cpu(int cpu)
> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
> if (cpu == smp_processor_id())
> return;
>
> - if (set_nr_and_not_polling(rq->idle))
> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
> smp_send_reschedule(cpu);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> return preempt_dynamic_mode == preempt_dynamic_##mode; \
> } \
> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
>
> PREEMPT_MODEL_ACCESSOR(none);
> PREEMPT_MODEL_ACCESSOR(voluntary);
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,16 @@ enum syscall_work_bit {
>
> #include <asm/thread_info.h>
>
> +#ifdef CONFIG_PREEMPT_AUTO
> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY
> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY
> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
> +#else
> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0
> +#endif
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
> (unsigned long *)(¤t_thread_info()->flags));
> }
>
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> + arch_test_bit(TIF_NEED_RESCHED_LAZY,
> + (unsigned long *)(¤t_thread_info()->flags));
> +}
> +
> #else
>
> static __always_inline bool tif_need_resched(void)
> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
> (unsigned long *)(¤t_thread_info()->flags));
> }
>
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> + test_bit(TIF_NEED_RESCHED_LAZY,
> + (unsigned long *)(¤t_thread_info()->flags));
> +}
> +
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
> select PREEMPTION
> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>
> +config PREEMPT_BUILD_AUTO
> + bool
> + select PREEMPT_BUILD
> +
> +config HAVE_PREEMPT_AUTO
> + bool
> +
> choice
> prompt "Preemption Model"
> default PREEMPT_NONE
> @@ -67,9 +74,17 @@ config PREEMPT
> embedded system with latency requirements in the milliseconds
> range.
>
> +config PREEMPT_AUTO
> + bool "Automagic preemption mode with runtime tweaking support"
> + depends on HAVE_PREEMPT_AUTO
> + select PREEMPT_BUILD_AUTO
> + help
> + Add some sensible blurb here
> +
> config PREEMPT_RT
> bool "Fully Preemptible Kernel (Real-Time)"
> depends on EXPERT && ARCH_SUPPORTS_RT
> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
> select PREEMPTION
> help
> This option turns the kernel into a real-time kernel by replacing
> @@ -95,7 +110,7 @@ config PREEMPTION
>
> config PREEMPT_DYNAMIC
> bool "Preemption behaviour defined on boot"
> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
> select PREEMPT_BUILD
> default y if HAVE_PREEMPT_DYNAMIC_CALL
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -60,7 +60,7 @@
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> - ARCH_EXIT_TO_USER_MODE_WORK)
> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
>
> /**
> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>
> #define XFER_TO_GUEST_MODE_WORK \
> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>
> struct kvm_vcpu;
>
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
>
> local_irq_enable_exit_to_user(ti_work);
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_UPROBE)
> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
> rcu_irq_exit_check_preempt();
> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> WARN_ON_ONCE(!on_thread_stack());
> - if (need_resched())
> + if (test_tsk_need_resched(current))
> preempt_schedule_irq();
> }
> }
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
> SCHED_FEAT(LATENCY_WARN, false)
>
> SCHED_FEAT(HZ_BW, true)
> +
> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
> extern void reweight_task(struct task_struct *p, int prio);
>
> extern void resched_curr(struct rq *rq);
> +extern void resched_curr_lazy(struct rq *rq);
> extern void resched_cpu(int cpu);
>
> extern struct rt_bandwidth def_rt_bandwidth;
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
> update_ti_thread_flag(task_thread_info(tsk), flag, value);
> }
>
> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> {
> return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
> }
>
> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> {
> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
> }
>
> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
> {
> return test_ti_thread_flag(task_thread_info(tsk), flag);
> }
> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
> }
>
> -static inline int test_tsk_need_resched(struct task_struct *tsk)
> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
> {
> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
> }
> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
>
> static __always_inline bool need_resched(void)
> {
> - return unlikely(tif_need_resched());
> + return unlikely(tif_need_resched_lazy() || tif_need_resched());
> }
>
> /*
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> * this is probably good enough.
> */
> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
> {
> + struct rq *rq = rq_of(cfs_rq);
> +
> if ((s64)(se->vruntime - se->deadline) < 0)
> return;
>
> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
> /*
> * The task has consumed its request, reschedule.
> */
> - if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> - clear_buddies(cfs_rq, se);
> + if (cfs_rq->nr_running < 2)
> + return;
> +
> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
> + resched_curr(rq);
> + } else {
> + /* Did the task ignore the lazy reschedule request? */
> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
> + resched_curr(rq);
> + else
> + resched_curr_lazy(rq);
> }
> + clear_buddies(cfs_rq, se);
> }
>
> #include "pelt.h"
> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
> /*
> * Update the current task's runtime statistics.
> */
> -static void update_curr(struct cfs_rq *cfs_rq)
> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
> {
> struct sched_entity *curr = cfs_rq->curr;
> u64 now = rq_clock_task(rq_of(cfs_rq));
> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
> schedstat_add(cfs_rq->exec_clock, delta_exec);
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
> - update_deadline(cfs_rq, curr);
> + update_deadline(cfs_rq, curr, tick);
> update_min_vruntime(cfs_rq);
>
> if (entity_is_task(curr)) {
> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> }
>
> +static inline void update_curr(struct cfs_rq *cfs_rq)
> +{
> + __update_curr(cfs_rq, false);
> +}
> +
> static void update_curr_fair(struct rq *rq)
> {
> update_curr(cfs_rq_of(&rq->curr->se));
> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> /*
> * Update run-time statistics of the 'current'.
> */
> - update_curr(cfs_rq);
> + __update_curr(cfs_rq, true);
>
> /*
> * Ensure that runnable average is periodically updated.
> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> * validating it and just reschedule.
> */
> if (queued) {
> - resched_curr(rq_of(cfs_rq));
> + resched_curr_lazy(rq_of(cfs_rq));
> return;
> }
> /*
> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
> * hierarchy can be throttled
> */
> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> - resched_curr(rq_of(cfs_rq));
> + resched_curr_lazy(rq_of(cfs_rq));
> }
>
> static __always_inline
> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> #ifdef CONFIG_SMP
> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
>
> if (delta < 0) {
> if (task_current(rq, p))
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> return;
> }
> hrtick_start(rq, delta);
> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
> * prevents us from potentially nominating it as a false LAST_BUDDY
> * below.
> */
> - if (test_tsk_need_resched(curr))
> + if (need_resched())
> return;
>
> /* Idle tasks are by definition preempted by non-idle tasks. */
> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
> return;
>
> preempt:
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> #ifdef CONFIG_SMP
> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
> */
> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> /*
> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
> */
> if (task_current(rq, p)) {
> if (p->prio > oldprio)
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> } else
> check_preempt_curr(rq, p, 0);
> }
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
> */
> static void __cpuidle acpi_safe_halt(void)
> {
> - if (!tif_need_resched()) {
> + if (!need_resched()) {
> raw_safe_halt();
> raw_local_irq_disable();
> }
> --- a/include/linux/sched/idle.h
> +++ b/include/linux/sched/idle.h
> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
> */
> smp_mb__after_atomic();
>
> - return unlikely(tif_need_resched());
> + return unlikely(need_resched());
> }
>
> static __always_inline bool __must_check current_clr_polling_and_test(void)
> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
> */
> smp_mb__after_atomic();
>
> - return unlikely(tif_need_resched());
> + return unlikely(need_resched());
> }
>
> #else
> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
>
> static inline bool __must_check current_set_polling_and_test(void)
> {
> - return unlikely(tif_need_resched());
> + return unlikely(need_resched());
> }
> static inline bool __must_check current_clr_polling_and_test(void)
> {
> - return unlikely(tif_need_resched());
> + return unlikely(need_resched());
> }
> #endif
>
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
> ct_cpuidle_enter();
>
> raw_local_irq_enable();
> - while (!tif_need_resched() &&
> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
> cpu_relax();
> raw_local_irq_disable();
>
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>
> if (tif_need_resched())
> trace_flags |= TRACE_FLAG_NEED_RESCHED;
> + if (tif_need_resched_lazy())
> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
> if (test_preempt_need_resched())
> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -271,6 +271,7 @@ config X86
> select HAVE_STATIC_CALL
> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
> select HAVE_PREEMPT_DYNAMIC_CALL
> + select HAVE_PREEMPT_AUTO
> select HAVE_RSEQ
> select HAVE_RUST if X86_64
> select HAVE_SYSCALL_TRACEPOINTS
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -81,8 +81,9 @@ struct thread_info {
> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> #define TIF_SIGPENDING 2 /* signal pending */
> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */
> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> @@ -104,6 +105,7 @@ struct thread_info {
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY)
> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> #define _TIF_SSBD (1 << TIF_SSBD)
> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
> return -EINTR;
> }
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_NOTIFY_RESUME)
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>
> enum trace_flag_type {
> TRACE_FLAG_IRQS_OFF = 0x01,
> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
> - TRACE_FLAG_NEED_RESCHED = 0x04,
> + TRACE_FLAG_NEED_RESCHED = 0x02,
> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
> TRACE_FLAG_HARDIRQ = 0x08,
> TRACE_FLAG_SOFTIRQ = 0x10,
> TRACE_FLAG_PREEMPT_RESCHED = 0x20,
> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>
> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
> {
> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> + return tracing_gen_ctx_irq_test(0);
> }
> static inline unsigned int tracing_gen_ctx(void)
> {
> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> + return tracing_gen_ctx_irq_test(0);
> }
> #endif
>
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
> bh_off ? 'b' :
> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
> '.';
>
> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
> TRACE_FLAG_PREEMPT_RESCHED)) {
> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> + need_resched = 'B';
> + break;
> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
> need_resched = 'N';
> break;
> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> + need_resched = 'L';
> + break;
> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> + need_resched = 'b';
> + break;
> case TRACE_FLAG_NEED_RESCHED:
> need_resched = 'n';
> break;
> + case TRACE_FLAG_NEED_RESCHED_LAZY:
> + need_resched = 'l';
> + break;
> case TRACE_FLAG_PREEMPT_RESCHED:
> need_resched = 'p';
> break;
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -333,6 +333,23 @@ static const struct file_operations sche
> .release = seq_release,
> };
>
> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + unsigned long end = jiffies + 60 * HZ;
> +
> + for (; time_before(jiffies, end) && !signal_pending(current);)
> + cpu_relax();
> +
> + return cnt;
> +}
> +
> +static const struct file_operations sched_hog_fops = {
> + .write = sched_hog_write,
> + .open = simple_open,
> + .llseek = default_llseek,
> +};
> +
> static struct dentry *debugfs_sched;
>
> static __init int sched_init_debug(void)
> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
>
> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>
> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
> +
> return 0;
> }
> late_initcall(sched_init_debug);
>
Paul E. McKenney <[email protected]> writes:
> On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
>> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
>> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> >> That said - I think as a proof of concept and "look, with this we get
>> >> the expected scheduling event counts", that patch is perfect. I think
>> >> you more than proved the concept.
>> >
>> > There is certainly quite some analysis work to do to make this a one to
>> > one replacement.
>> >
>> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
>> > is pretty much on par with the current mainline variants (NONE/FULL),
>> > but the memtier benchmark makes a massive dent.
>> >
>> > It sports a whopping 10% regression with the LAZY mode versus the mainline
>> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>> >
>> > That benchmark is really sensitive to the preemption model. With current
>> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
>> > performance drop versus preempt=NONE.
>>
>> That 20% was a tired pilot error. The real number is in the 5% ballpark.
>>
>> > I have no clue what's going on there yet, but that shows that there is
>> > obviously quite some work ahead to get this sorted.
>>
>> It took some head scratching to figure that out. The initial fix broke
>> the handling of the hog issue, i.e. the problem that Ankur tried to
>> solve, but I hacked up a "solution" for that too.
>>
>> With that the memtier benchmark is roughly back to the mainline numbers,
>> but my throughput benchmark know-how is pretty close to zero, so that
>> should be looked at by people who actually understand these things.
>>
>> Likewise the hog prevention is just at the PoC level and clearly beyond
>> my knowledge of scheduler details: It unconditionally forces a
>> reschedule when the looping task is not responding to a lazy reschedule
>> request before the next tick. IOW it forces a reschedule on the second
>> tick, which is obviously different from the cond_resched()/might_sleep()
>> behaviour.
>>
>> The changes vs. the original PoC aside of the bug and thinko fixes:
>>
>> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
>> lazy preempt bit as the trace_entry::flags field is full already.
>>
>> That obviously breaks the tracer ABI, but if we go there then
>> this needs to be fixed. Steven?
>>
>> 2) debugfs file to validate that loops can be force preempted w/o
>> cond_resched()
>>
>> The usage is:
>>
>> # taskset -c 1 bash
>> # echo 1 > /sys/kernel/debug/sched/hog &
>> # echo 1 > /sys/kernel/debug/sched/hog &
>> # echo 1 > /sys/kernel/debug/sched/hog &
>>
>> top shows ~33% CPU for each of the hogs and tracing confirms that
>> the crude hack in the scheduler tick works:
>>
>> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr
>> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr
>> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr
>> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr
>> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr
>> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr
>> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr
>> bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr
>>
>> The 'l' instead of the usual 'N' reflects that the lazy resched
>> bit is set. That makes __update_curr() invoke resched_curr()
>> instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
>> and folds it into preempt_count so that preemption happens at the
>> next possible point, i.e. either in return from interrupt or at
>> the next preempt_enable().
>
> Belatedly calling out some RCU issues. Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made. The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain non-preemptible.
Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models
none/voluntary/full are just scheduler tweaks on top of that. And so
this would always have PREEMPT_RCU=y. So shouldn't rcu_read_lock()
readers be preemptible?
(An alternate configuration might be:
config PREEMPT_NONE
	select PREEMPT_COUNT

config PREEMPT_FULL
	select PREEMPTION
This probably allows for more configuration flexibility across archs?
Would allow for TREE_RCU=y, for instance. That said, so far I've only
been working with PREEMPT_RCU=y.)
> With that:
>
> 1. As an optimization, given that preempt_count() would always give
> good information, the scheduling-clock interrupt could sense RCU
> readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
> IPI handlers for expedited grace periods. A nice optimization.
> Except that...
>
> 2. The quiescent-state-forcing code currently relies on the presence
> of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
> would be to do resched_cpu() more quickly, but some workloads
> might not love the additional IPIs. Another approach is to do #1
> above to replace the quiescent states from cond_resched() with
> scheduler-tick-interrupt-sensed quiescent states.
Right, the call to rcu_all_qs(). Just to see if I have it straight,
something like this for PREEMPT_RCU=n kernels?
if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0)
	rcu_all_qs();
(Masked because PREEMPT_NONE might not do any folding for
NEED_RESCHED_LAZY in the tick.)
Though the comment around rcu_all_qs() mentions that rcu_all_qs()
reports a quiescent state only if urgently needed. Given that the tick
executes less frequently than calls to cond_resched(), could we just
always report instead? Or am I completely on the wrong track?
if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) {
	preempt_disable();
	rcu_qs();
	preempt_enable();
}
On your point about the preempt_count() being dependable, there's a
wrinkle. As Linus mentions in
https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/,
that might not be true for architectures that define ARCH_NO_PREEMPT.
My plan was to limit those archs to preempting only at the user-space
boundary, but there are almost certainly RCU implications that I missed.
> Plus...
>
> 3. For nohz_full CPUs that run for a long time in the kernel,
> there are no scheduling-clock interrupts. RCU reaches for
> the resched_cpu() hammer a few jiffies into the grace period.
> And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> interrupt-entry code will re-enable its scheduling-clock interrupt
> upon receiving the resched_cpu() IPI.
>
> So nohz_full CPUs should be OK as far as RCU is concerned.
> Other subsystems might have other opinions.
Ah, that's what I thought from my reading of the RCU comments. Good to
have that confirmed. Thanks.
> 4. As another optimization, kvfree_rcu() could unconditionally
> check preempt_count() to sense a clean environment suitable for
> memory allocation.
Had missed this completely. Could you elaborate?
> 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
> instead say "select TASKS_RCU". This means that the #else
> in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> vanilla RCU must go. There might be some fallout if something
> fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> rcu_tasks_classic_qs() to do something useful.
Ack.
> 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> or RCU Tasks Rude) would need those pesky cond_resched() calls
> to stick around. The reason is that RCU Tasks readers are ended
> only by voluntary context switches. This means that although a
> preemptible infinite loop in the kernel won't inconvenience a
> real-time task (nor a non-real-time task for all that long),
> and won't delay grace periods for the other flavors of RCU,
> it would indefinitely delay an RCU Tasks grace period.
>
> However, RCU Tasks grace periods seem to be finite in preemptible
> kernels today, so they should remain finite in limited-preemptible
> kernels tomorrow. Famous last words...
>
> 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> any algorithmic difference from this change.
So, essentially, as long as RCU tasks eventually, in the fullness of
time, call schedule(), removing cond_resched() shouldn't have any
effect :).
> 8. As has been noted elsewhere, in this new limited-preemption
> mode of operation, rcu_read_lock() readers remain preemptible.
> This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
Ack.
> 9. The rcu_preempt_depth() macro could do something useful in
> limited-preemption kernels. Its current lack of ability in
> CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
>
> 10. The cond_resched_rcu() function must remain because we still
> have non-preemptible rcu_read_lock() readers.
For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need
only be this, right?:
static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
	rcu_read_lock();
#endif
}
> 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> unchanged, but I must defer to the include/net/ip_vs.h people.
>
> 12. I need to check with the BPF folks on the BPF verifier's
> definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> function might have some redundancy across the board instead
> of just on CONFIG_PREEMPT_RCU=y. Or might not.
I don't think I understand any of these well enough to comment. Will
Cc the relevant folks when I send out the RFC.
> 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> might need to do something for non-preemptible RCU to make
> up for the lack of cond_resched() calls. Maybe just drop the
> "IS_ENABLED()" and execute the body of the current "if" statement
> unconditionally.
Aah, yes this is a good idea. Thanks.
> 15. I must defer to others on the mm/pgtable-generic.c file's
> #ifdef that depends on CONFIG_PREEMPT_RCU.
>
> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.
Yeah, as part of this work, I ended up unhooking most of the KLP
hooks in cond_resched() and of course, cond_resched() itself.
Will poke the livepatching people.
> I am sure that I am missing something, but I have not yet seen any
> show-stoppers. Just some needed adjustments.
Appreciate this detailed list. Makes me think that everything might
not go up in smoke after all!
Thanks
Ankur
> Thoughts?
>
> Thanx, Paul
>
>> That's as much as I wanted to demonstrate and I'm not going to spend
>> more cycles on it as I have already too many other things in flight and
>> the resulting scheduler woes are clearly outside of my expertise.
>>
>> Though definitely I'm putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>>
>> Thanks,
>>
>> tglx
>> ---
>> arch/x86/Kconfig | 1
>> arch/x86/include/asm/thread_info.h | 6 ++--
>> drivers/acpi/processor_idle.c | 2 -
>> include/linux/entry-common.h | 2 -
>> include/linux/entry-kvm.h | 2 -
>> include/linux/sched.h | 12 +++++---
>> include/linux/sched/idle.h | 8 ++---
>> include/linux/thread_info.h | 24 +++++++++++++++++
>> include/linux/trace_events.h | 8 ++---
>> kernel/Kconfig.preempt | 17 +++++++++++-
>> kernel/entry/common.c | 4 +-
>> kernel/entry/kvm.c | 2 -
>> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------
>> kernel/sched/debug.c | 19 +++++++++++++
>> kernel/sched/fair.c | 46 ++++++++++++++++++++++-----------
>> kernel/sched/features.h | 2 +
>> kernel/sched/idle.c | 3 --
>> kernel/sched/sched.h | 1
>> kernel/trace/trace.c | 2 +
>> kernel/trace/trace_output.c | 16 ++++++++++-
>> 20 files changed, 171 insertions(+), 57 deletions(-)
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
>>
>> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
>> /*
>> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
>> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
>> * this avoids any races wrt polling state changes and thereby avoids
>> * spurious IPIs.
>> */
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>> {
>> struct thread_info *ti = task_thread_info(p);
>> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
>> +
>> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
>> }
>>
>> /*
>> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
>> for (;;) {
>> if (!(val & _TIF_POLLING_NRFLAG))
>> return false;
>> - if (val & _TIF_NEED_RESCHED)
>> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> return true;
>> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
>> break;
>> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
>> }
>>
>> #else
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>> {
>> - set_tsk_need_resched(p);
>> + set_tsk_thread_flag(p, tif_bit);
>> return true;
>> }
>>
>> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
>> * might also involve a cross-CPU call to trigger the scheduler on
>> * the target CPU.
>> */
>> -void resched_curr(struct rq *rq)
>> +static void __resched_curr(struct rq *rq, int lazy)
>> {
>> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
>> struct task_struct *curr = rq->curr;
>> - int cpu;
>>
>> lockdep_assert_rq_held(rq);
>>
>> - if (test_tsk_need_resched(curr))
>> + if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
>> return;
>>
>> cpu = cpu_of(rq);
>>
>> if (cpu == smp_processor_id()) {
>> - set_tsk_need_resched(curr);
>> - set_preempt_need_resched();
>> + set_tsk_thread_flag(curr, tif_bit);
>> + if (!lazy)
>> + set_preempt_need_resched();
>> return;
>> }
>>
>> - if (set_nr_and_not_polling(curr))
>> - smp_send_reschedule(cpu);
>> - else
>> + if (set_nr_and_not_polling(curr, tif_bit)) {
>> + if (!lazy)
>> + smp_send_reschedule(cpu);
>> + } else {
>> trace_sched_wake_idle_without_ipi(cpu);
>> + }
>> +}
>> +
>> +void resched_curr(struct rq *rq)
>> +{
>> + __resched_curr(rq, 0);
>> +}
>> +
>> +void resched_curr_lazy(struct rq *rq)
>> +{
>> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
>> + TIF_NEED_RESCHED_LAZY_OFFSET : 0;
>> +
>> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
>> + return;
>> +
>> + __resched_curr(rq, lazy);
>> }
>>
>> void resched_cpu(int cpu)
>> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
>> if (cpu == smp_processor_id())
>> return;
>>
>> - if (set_nr_and_not_polling(rq->idle))
>> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
>> smp_send_reschedule(cpu);
>> else
>> trace_sched_wake_idle_without_ipi(cpu);
>> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
>> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> } \
>> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
>>
>> PREEMPT_MODEL_ACCESSOR(none);
>> PREEMPT_MODEL_ACCESSOR(voluntary);
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -59,6 +59,16 @@ enum syscall_work_bit {
>>
>> #include <asm/thread_info.h>
>>
>> +#ifdef CONFIG_PREEMPT_AUTO
>> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY
>> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
>> +#else
>> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
>> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0
>> +#endif
>> +
>> #ifdef __KERNEL__
>>
>> #ifndef arch_set_restart_data
>> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
>> (unsigned long *)(¤t_thread_info()->flags));
>> }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> + arch_test_bit(TIF_NEED_RESCHED_LAZY,
>> + (unsigned long *)(¤t_thread_info()->flags));
>> +}
>> +
>> #else
>>
>> static __always_inline bool tif_need_resched(void)
>> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
>> (unsigned long *)(¤t_thread_info()->flags));
>> }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> + test_bit(TIF_NEED_RESCHED_LAZY,
>> + (unsigned long *)(¤t_thread_info()->flags));
>> +}
>> +
>> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>>
>> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
>> select PREEMPTION
>> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>>
>> +config PREEMPT_BUILD_AUTO
>> + bool
>> + select PREEMPT_BUILD
>> +
>> +config HAVE_PREEMPT_AUTO
>> + bool
>> +
>> choice
>> prompt "Preemption Model"
>> default PREEMPT_NONE
>> @@ -67,9 +74,17 @@ config PREEMPT
>> embedded system with latency requirements in the milliseconds
>> range.
>>
>> +config PREEMPT_AUTO
>> + bool "Automagic preemption mode with runtime tweaking support"
>> + depends on HAVE_PREEMPT_AUTO
>> + select PREEMPT_BUILD_AUTO
>> + help
>> + Add some sensible blurb here
>> +
>> config PREEMPT_RT
>> bool "Fully Preemptible Kernel (Real-Time)"
>> depends on EXPERT && ARCH_SUPPORTS_RT
>> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
>> select PREEMPTION
>> help
>> This option turns the kernel into a real-time kernel by replacing
>> @@ -95,7 +110,7 @@ config PREEMPTION
>>
>> config PREEMPT_DYNAMIC
>> bool "Preemption behaviour defined on boot"
>> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
>> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
>> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
>> select PREEMPT_BUILD
>> default y if HAVE_PREEMPT_DYNAMIC_CALL
>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -60,7 +60,7 @@
>> #define EXIT_TO_USER_MODE_WORK \
>> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
>> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
>> - ARCH_EXIT_TO_USER_MODE_WORK)
>> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
>>
>> /**
>> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>> #define XFER_TO_GUEST_MODE_WORK \
>> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
>> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>>
>> struct kvm_vcpu;
>>
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
>>
>> local_irq_enable_exit_to_user(ti_work);
>>
>> - if (ti_work & _TIF_NEED_RESCHED)
>> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> schedule();
>>
>> if (ti_work & _TIF_UPROBE)
>> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
>> rcu_irq_exit_check_preempt();
>> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>> WARN_ON_ONCE(!on_thread_stack());
>> - if (need_resched())
>> + if (test_tsk_need_resched(current))
>> preempt_schedule_irq();
>> }
>> }
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
>> SCHED_FEAT(LATENCY_WARN, false)
>>
>> SCHED_FEAT(HZ_BW, true)
>> +
>> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
>> extern void reweight_task(struct task_struct *p, int prio);
>>
>> extern void resched_curr(struct rq *rq);
>> +extern void resched_curr_lazy(struct rq *rq);
>> extern void resched_cpu(int cpu);
>>
>> extern struct rt_bandwidth def_rt_bandwidth;
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
>> update_ti_thread_flag(task_thread_info(tsk), flag, value);
>> }
>>
>> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>> {
>> return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
>> }
>>
>> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>> {
>> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
>> }
>>
>> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>> {
>> return test_ti_thread_flag(task_thread_info(tsk), flag);
>> }
>> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
>> static inline void clear_tsk_need_resched(struct task_struct *tsk)
>> {
>> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
>> }
>>
>> -static inline int test_tsk_need_resched(struct task_struct *tsk)
>> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
>> {
>> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>> }
>> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
>>
>> static __always_inline bool need_resched(void)
>> {
>> - return unlikely(tif_need_resched());
>> + return unlikely(tif_need_resched_lazy() || tif_need_resched());
>> }
>>
>> /*
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
>> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>> * this is probably good enough.
>> */
>> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
>> {
>> + struct rq *rq = rq_of(cfs_rq);
>> +
>> if ((s64)(se->vruntime - se->deadline) < 0)
>> return;
>>
>> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
>> /*
>> * The task has consumed its request, reschedule.
>> */
>> - if (cfs_rq->nr_running > 1) {
>> - resched_curr(rq_of(cfs_rq));
>> - clear_buddies(cfs_rq, se);
>> + if (cfs_rq->nr_running < 2)
>> + return;
>> +
>> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
>> + resched_curr(rq);
>> + } else {
>> + /* Did the task ignore the lazy reschedule request? */
>> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
>> + resched_curr(rq);
>> + else
>> + resched_curr_lazy(rq);
>> }
>> + clear_buddies(cfs_rq, se);
>> }
>>
>> #include "pelt.h"
>> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
>> /*
>> * Update the current task's runtime statistics.
>> */
>> -static void update_curr(struct cfs_rq *cfs_rq)
>> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>> {
>> struct sched_entity *curr = cfs_rq->curr;
>> u64 now = rq_clock_task(rq_of(cfs_rq));
>> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
>> schedstat_add(cfs_rq->exec_clock, delta_exec);
>>
>> curr->vruntime += calc_delta_fair(delta_exec, curr);
>> - update_deadline(cfs_rq, curr);
>> + update_deadline(cfs_rq, curr, tick);
>> update_min_vruntime(cfs_rq);
>>
>> if (entity_is_task(curr)) {
>> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
>> account_cfs_rq_runtime(cfs_rq, delta_exec);
>> }
>>
>> +static inline void update_curr(struct cfs_rq *cfs_rq)
>> +{
>> + __update_curr(cfs_rq, false);
>> +}
>> +
>> static void update_curr_fair(struct rq *rq)
>> {
>> update_curr(cfs_rq_of(&rq->curr->se));
>> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>> /*
>> * Update run-time statistics of the 'current'.
>> */
>> - update_curr(cfs_rq);
>> + __update_curr(cfs_rq, true);
>>
>> /*
>> * Ensure that runnable average is periodically updated.
>> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>> * validating it and just reschedule.
>> */
>> if (queued) {
>> - resched_curr(rq_of(cfs_rq));
>> + resched_curr_lazy(rq_of(cfs_rq));
>> return;
>> }
>> /*
>> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
>> * hierarchy can be throttled
>> */
>> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
>> - resched_curr(rq_of(cfs_rq));
>> + resched_curr_lazy(rq_of(cfs_rq));
>> }
>>
>> static __always_inline
>> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>>
>> /* Determine whether we need to wake up potentially idle CPU: */
>> if (rq->curr == rq->idle && rq->cfs.nr_running)
>> - resched_curr(rq);
>> + resched_curr_lazy(rq);
>> }
>>
>> #ifdef CONFIG_SMP
>> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
>>
>> if (delta < 0) {
>> if (task_current(rq, p))
>> - resched_curr(rq);
>> + resched_curr_lazy(rq);
>> return;
>> }
>> hrtick_start(rq, delta);
>> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
>> * prevents us from potentially nominating it as a false LAST_BUDDY
>> * below.
>> */
>> - if (test_tsk_need_resched(curr))
>> + if (need_resched())
>> return;
>>
>> /* Idle tasks are by definition preempted by non-idle tasks. */
>> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
>> return;
>>
>> preempt:
>> - resched_curr(rq);
>> + resched_curr_lazy(rq);
>> }
>>
>> #ifdef CONFIG_SMP
>> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
>> */
>> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
>> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>> - resched_curr(rq);
>> + resched_curr_lazy(rq);
>> }
>>
>> /*
>> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
>> */
>> if (task_current(rq, p)) {
>> if (p->prio > oldprio)
>> - resched_curr(rq);
>> + resched_curr_lazy(rq);
>> } else
>> check_preempt_curr(rq, p, 0);
>> }
>> --- a/drivers/acpi/processor_idle.c
>> +++ b/drivers/acpi/processor_idle.c
>> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
>> */
>> static void __cpuidle acpi_safe_halt(void)
>> {
>> - if (!tif_need_resched()) {
>> + if (!need_resched()) {
>> raw_safe_halt();
>> raw_local_irq_disable();
>> }
>> --- a/include/linux/sched/idle.h
>> +++ b/include/linux/sched/idle.h
>> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
>> */
>> smp_mb__after_atomic();
>>
>> - return unlikely(tif_need_resched());
>> + return unlikely(need_resched());
>> }
>>
>> static __always_inline bool __must_check current_clr_polling_and_test(void)
>> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
>> */
>> smp_mb__after_atomic();
>>
>> - return unlikely(tif_need_resched());
>> + return unlikely(need_resched());
>> }
>>
>> #else
>> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
>>
>> static inline bool __must_check current_set_polling_and_test(void)
>> {
>> - return unlikely(tif_need_resched());
>> + return unlikely(need_resched());
>> }
>> static inline bool __must_check current_clr_polling_and_test(void)
>> {
>> - return unlikely(tif_need_resched());
>> + return unlikely(need_resched());
>> }
>> #endif
>>
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
>> ct_cpuidle_enter();
>>
>> raw_local_irq_enable();
>> - while (!tif_need_resched() &&
>> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> cpu_relax();
>> raw_local_irq_disable();
>>
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>> if (tif_need_resched())
>> trace_flags |= TRACE_FLAG_NEED_RESCHED;
>> + if (tif_need_resched_lazy())
>> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>> if (test_preempt_need_resched())
>> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -271,6 +271,7 @@ config X86
>> select HAVE_STATIC_CALL
>> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
>> select HAVE_PREEMPT_DYNAMIC_CALL
>> + select HAVE_PREEMPT_AUTO
>> select HAVE_RSEQ
>> select HAVE_RUST if X86_64
>> select HAVE_SYSCALL_TRACEPOINTS
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -81,8 +81,9 @@ struct thread_info {
>> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
>> #define TIF_SIGPENDING 2 /* signal pending */
>> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
>> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
>> -#define TIF_SSBD 5 /* Speculative store bypass disable */
>> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */
>> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
>> +#define TIF_SSBD 6 /* Speculative store bypass disable */
>> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
>> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
>> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
>> @@ -104,6 +105,7 @@ struct thread_info {
>> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
>> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
>> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
>> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY)
>> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
>> #define _TIF_SSBD (1 << TIF_SSBD)
>> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
>> --- a/kernel/entry/kvm.c
>> +++ b/kernel/entry/kvm.c
>> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
>> return -EINTR;
>> }
>>
>> - if (ti_work & _TIF_NEED_RESCHED)
>> +	if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> schedule();
>>
>> if (ti_work & _TIF_NOTIFY_RESUME)
>> --- a/include/linux/trace_events.h
>> +++ b/include/linux/trace_events.h
>> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>> enum trace_flag_type {
>> TRACE_FLAG_IRQS_OFF = 0x01,
>> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
>> - TRACE_FLAG_NEED_RESCHED = 0x04,
>> + TRACE_FLAG_NEED_RESCHED = 0x02,
>> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
>> TRACE_FLAG_HARDIRQ = 0x08,
>> TRACE_FLAG_SOFTIRQ = 0x10,
>> TRACE_FLAG_PREEMPT_RESCHED = 0x20,
>> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>>
>> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>> {
>> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> + return tracing_gen_ctx_irq_test(0);
>> }
>> static inline unsigned int tracing_gen_ctx(void)
>> {
>> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> + return tracing_gen_ctx_irq_test(0);
>> }
>> #endif
>>
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
>> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>> bh_off ? 'b' :
>> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
>> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>> '.';
>>
>> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
>> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>> TRACE_FLAG_PREEMPT_RESCHED)) {
>> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> + need_resched = 'B';
>> + break;
>> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>> need_resched = 'N';
>> break;
>> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> + need_resched = 'L';
>> + break;
>> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
>> + need_resched = 'b';
>> + break;
>> case TRACE_FLAG_NEED_RESCHED:
>> need_resched = 'n';
>> break;
>> + case TRACE_FLAG_NEED_RESCHED_LAZY:
>> + need_resched = 'l';
>> + break;
>> case TRACE_FLAG_PREEMPT_RESCHED:
>> need_resched = 'p';
>> break;
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -333,6 +333,23 @@ static const struct file_operations sche
>> .release = seq_release,
>> };
>>
>> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + unsigned long end = jiffies + 60 * HZ;
>> +
>> + for (; time_before(jiffies, end) && !signal_pending(current);)
>> + cpu_relax();
>> +
>> + return cnt;
>> +}
>> +
>> +static const struct file_operations sched_hog_fops = {
>> + .write = sched_hog_write,
>> + .open = simple_open,
>> + .llseek = default_llseek,
>> +};
>> +
>> static struct dentry *debugfs_sched;
>>
>> static __init int sched_init_debug(void)
>> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
>>
>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>
>> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
>> +
>> return 0;
>> }
>> late_initcall(sched_init_debug);
>>
--
ankur
Paul!
On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> Belatedly calling out some RCU issues. Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made. The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain
> non-preemptible.
Why? Either I'm confused or you or both of us :)
With this approach the kernel is by definition fully preemptible, which
means rcu_read_lock() is preemptible too. That's pretty much the
same situation as with PREEMPT_DYNAMIC.
For throughput sake this fully preemptible kernel provides a mechanism
to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
That means the preemption points in preempt_enable() and return from
interrupt to kernel will not see NEED_RESCHED and the tasks can run to
completion either to the point where they call schedule() or when they
return to user space. That's pretty much what PREEMPT_NONE does today.
The difference to NONE/VOLUNTARY is that the explicit cond_resched()
points are no longer required because the scheduler can preempt the
long running task by setting NEED_RESCHED instead.
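To make that concrete, here is a minimal sketch of where each bit is honored, mirroring the entry/exit hunks in the PoC quoted further down (function names here are illustrative only):
/* Sketch only -- not the actual entry code. */
static void exit_to_user_sketch(unsigned long ti_work)
{
	/* Return to user space: both bits force a schedule(). */
	if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
		schedule();
}
static void irqentry_exit_to_kernel_sketch(void)
{
	/*
	 * Return to kernel (and preempt_enable()): only the non-lazy
	 * bit triggers preemption, so a task which only has
	 * NEED_RESCHED_LAZY set keeps running.
	 */
	if (test_tsk_need_resched(current))
		preempt_schedule_irq();
}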
That preemption might be suboptimal in some cases compared to
cond_resched(), but from my initial experimentation that's not really an
issue.
> With that:
>
> 1. As an optimization, given that preempt_count() would always give
> good information, the scheduling-clock interrupt could sense RCU
> readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
> IPI handlers for expedited grace periods. A nice optimization.
> Except that...
>
> 2. The quiescent-state-forcing code currently relies on the presence
> of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
> would be to do resched_cpu() more quickly, but some workloads
> might not love the additional IPIs. Another approach to do #1
> above to replace the quiescent states from cond_resched() with
> scheduler-tick-interrupt-sensed quiescent states.
Right. The tick can see either the lazy resched bit "ignored" or some
magic "RCU needs a quiescent state" and force a reschedule.
> Plus...
>
> 3. For nohz_full CPUs that run for a long time in the kernel,
> there are no scheduling-clock interrupts. RCU reaches for
> the resched_cpu() hammer a few jiffies into the grace period.
> And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> interrupt-entry code will re-enable its scheduling-clock interrupt
> upon receiving the resched_cpu() IPI.
You can spare the IPI by setting NEED_RESCHED on the remote CPU which
will cause it to preempt.
> So nohz_full CPUs should be OK as far as RCU is concerned.
> Other subsystems might have other opinions.
>
> 4. As another optimization, kvfree_rcu() could unconditionally
> check preempt_count() to sense a clean environment suitable for
> memory allocation.
Correct. All the limitations of preempt count being useless are gone.
> 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
> instead say "select TASKS_RCU". This means that the #else
> in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> vanilla RCU must go. There might be some fallout if something
> fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> rcu_tasks_classic_qs() do something useful.
In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
remaining would be CONFIG_PREEMPT_RT, which should be renamed to
CONFIG_RT or such as it does not really change the preemption
model itself. RT just reduces the preemption disabled sections with the
lock conversions, forced interrupt threading and some more.
> 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> or RCU Tasks Rude) would need those pesky cond_resched() calls
> to stick around. The reason is that RCU Tasks readers are ended
> only by voluntary context switches. This means that although a
> preemptible infinite loop in the kernel won't inconvenience a
> real-time task (nor a non-real-time task for all that long),
> and won't delay grace periods for the other flavors of RCU,
> it would indefinitely delay an RCU Tasks grace period.
>
> However, RCU Tasks grace periods seem to be finite in preemptible
> kernels today, so they should remain finite in limited-preemptible
> kernels tomorrow. Famous last words...
That's an issue which you have today with preempt FULL, right? So if it
turns out to be a problem then it's not a problem of the new model.
> 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> any algorithmic difference from this change.
>
> 8. As has been noted elsewhere, in this new limited-preemption
> mode of operation, rcu_read_lock() readers remain preemptible.
> This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
> 9. The rcu_preempt_depth() macro could do something useful in
> limited-preemption kernels. Its current lack of ability in
> CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
Correct.
> 10. The cond_resched_rcu() function must remain because we still
> have non-preemptible rcu_read_lock() readers.
Where?
> 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> unchanged, but I must defer to the include/net/ip_vs.h people.
*blink*
> 12. I need to check with the BPF folks on the BPF verifier's
> definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> function might have some redundancy across the board instead
> of just on CONFIG_PREEMPT_RCU=y. Or might not.
>
> 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> might need to do something for non-preemptible RCU to make
> up for the lack of cond_resched() calls. Maybe just drop the
> "IS_ENABLED()" and execute the body of the current "if" statement
> unconditionally.
Again. There is no non-preemptible RCU with this model, unless I'm
missing something important here.
> 15. I must defer to others on the mm/pgtable-generic.c file's
> #ifdef that depends on CONFIG_PREEMPT_RCU.
All those ifdefs should die :)
> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.
Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO.
> I am sure that I am missing something, but I have not yet seen any
> show-stoppers. Just some needed adjustments.
Right. If it works out as I think it can work out the main adjustments
are to remove a large amount of #ifdef maze and related gunk :)
Thanks,
tglx
On Wed, 18 Oct 2023 15:16:12 +0200
Thomas Gleixner <[email protected]> wrote:
> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > might need to do something for non-preemptible RCU to make
> > up for the lack of cond_resched() calls. Maybe just drop the
> > "IS_ENABLED()" and execute the body of the current "if" statement
> > unconditionally.
Right.
I'm guessing you are talking about this code:
/*
* In some cases, notably when running on a nohz_full CPU with
* a stopped tick PREEMPT_RCU has no way to account for QSs.
* This will eventually cause unwarranted noise as PREEMPT_RCU
* will force preemption as the means of ending the current
* grace period. We avoid this problem by calling
* rcu_momentary_dyntick_idle(), which performs a zero duration
* EQS allowing PREEMPT_RCU to end the current grace period.
* This call shouldn't be wrapped inside an RCU critical
* section.
*
* Note that in non PREEMPT_RCU kernels QSs are handled through
* cond_resched()
*/
if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
if (!disable_irq)
local_irq_disable();
rcu_momentary_dyntick_idle();
if (!disable_irq)
local_irq_enable();
}
/*
* For the non-preemptive kernel config: let threads runs, if
* they so wish, unless set not do to so.
*/
if (!disable_irq && !disable_preemption)
cond_resched();
If everything becomes PREEMPT_RCU, then the above should be able to be
turned into just:
if (!disable_irq)
local_irq_disable();
rcu_momentary_dyntick_idle();
if (!disable_irq)
local_irq_enable();
And no cond_resched() is needed.
>
> Again. There is no non-preemptible RCU with this model, unless I'm
> missing something important here.
Daniel?
-- Steve
On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> Paul!
>
> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> > Belatedly calling out some RCU issues. Nothing fatal, just a
> > (surprisingly) few adjustments that will need to be made. The key thing
> > to note is that from RCU's viewpoint, with this change, all kernels
> > are preemptible, though rcu_read_lock() readers remain
> > non-preemptible.
>
> Why? Either I'm confused or you or both of us :)
Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
as preempt_enable() in this approach? I certainly hope so, as RCU
priority boosting would be a most unwelcome addition to many datacenter
workloads.
> With this approach the kernel is by definition fully preemptible, which
> means rcu_read_lock() is preemptible too. That's pretty much the
> same situation as with PREEMPT_DYNAMIC.
Please, just no!!!
Please note that the current use of PREEMPT_DYNAMIC with preempt=none
avoids preempting RCU read-side critical sections. This means that the
distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
of RCU readers in environments expecting no preemption.
> For throughput sake this fully preemptible kernel provides a mechanism
> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
>
> That means the preemption points in preempt_enable() and return from
> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> completion either to the point where they call schedule() or when they
> return to user space. That's pretty much what PREEMPT_NONE does today.
>
> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> points are no longer required because the scheduler can preempt the
> long running task by setting NEED_RESCHED instead.
>
> That preemption might be suboptimal in some cases compared to
> cond_resched(), but from my initial experimentation that's not really an
> issue.
I am not (repeat NOT) arguing for keeping cond_resched(). I am instead
arguing that the less-preemptible variants of the kernel should continue
to avoid preempting RCU read-side critical sections.
> > With that:
> >
> > 1. As an optimization, given that preempt_count() would always give
> > good information, the scheduling-clock interrupt could sense RCU
> > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
> > IPI handlers for expedited grace periods. A nice optimization.
> > Except that...
> >
> > 2. The quiescent-state-forcing code currently relies on the presence
> > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
> > would be to do resched_cpu() more quickly, but some workloads
> > might not love the additional IPIs. Another approach to do #1
> > above to replace the quiescent states from cond_resched() with
> > scheduler-tick-interrupt-sensed quiescent states.
>
> Right. The tick can see either the lazy resched bit "ignored" or some
> magic "RCU needs a quiescent state" and force a reschedule.
Good, thank you for confirming.
> > Plus...
> >
> > 3. For nohz_full CPUs that run for a long time in the kernel,
> > there are no scheduling-clock interrupts. RCU reaches for
> > the resched_cpu() hammer a few jiffies into the grace period.
> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> > interrupt-entry code will re-enable its scheduling-clock interrupt
> > upon receiving the resched_cpu() IPI.
>
> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
> will cause it to preempt.
That is not sufficient for nohz_full CPUs executing in userspace, which
won't see that NEED_RESCHED until they either take an interrupt or do
a system call. And applications often work hard to prevent nohz_full
CPUs from doing either.
Please note that if the holdout CPU really is a nohz_full CPU executing
in userspace, RCU will see this courtesy of context tracking and will
therefore avoid ever IPIing it. The IPIs only happen if a nohz_full
CPU ends up executing for a long time in the kernel, which is an error
condition for the nohz_full use cases that I am aware of.
> > So nohz_full CPUs should be OK as far as RCU is concerned.
> > Other subsystems might have other opinions.
> >
> > 4. As another optimization, kvfree_rcu() could unconditionally
> > check preempt_count() to sense a clean environment suitable for
> > memory allocation.
>
> Correct. All the limitations of preempt count being useless are gone.
Woo-hoo!!! And that is of course a very attractive property of this.
> > 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
> > instead say "select TASKS_RCU". This means that the #else
> > in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> > vanilla RCU must go. There might be some fallout if something
> > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> > rcu_tasks_classic_qs() do something useful.
>
> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> CONFIG_RT or such as it does not really change the preemption
> model itself. RT just reduces the preemption disabled sections with the
> lock conversions, forced interrupt threading and some more.
Again, please, no.
There are situations where we still need rcu_read_lock() and
rcu_read_unlock() to be preempt_disable() and preempt_enable(),
repectively. Those can be cases selected only by Kconfig option, not
available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > or RCU Tasks Rude) would need those pesky cond_resched() calls
> > to stick around. The reason is that RCU Tasks readers are ended
> > only by voluntary context switches. This means that although a
> > preemptible infinite loop in the kernel won't inconvenience a
> > real-time task (nor a non-real-time task for all that long),
> > and won't delay grace periods for the other flavors of RCU,
> > it would indefinitely delay an RCU Tasks grace period.
> >
> > However, RCU Tasks grace periods seem to be finite in preemptible
> > kernels today, so they should remain finite in limited-preemptible
> > kernels tomorrow. Famous last words...
>
> That's an issue which you have today with preempt FULL, right? So if it
> turns out to be a problem then it's not a problem of the new model.
Agreed, and hence my last three lines of text above. Plus the guy who
requested RCU Tasks said that it was OK for its grace periods to take
a long time, and I am holding Steven Rostedt to that. ;-)
> > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> > any algorithmic difference from this change.
> >
> > 8. As has been noted elsewhere, in this new limited-preemption
> > mode of operation, rcu_read_lock() readers remain preemptible.
> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
>
> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
That is in fact the problem. Preemption can be good, but it is possible
to have too much of a good thing, and preemptible RCU read-side critical
sections definitely are in that category for some important workloads. ;-)
> > 9. The rcu_preempt_depth() macro could do something useful in
> > limited-preemption kernels. Its current lack of ability in
> > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
>
> Correct.
>
> > 10. The cond_resched_rcu() function must remain because we still
> > have non-preemptible rcu_read_lock() readers.
>
> Where?
In datacenters.
> > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> > unchanged, but I must defer to the include/net/ip_vs.h people.
>
> *blink*
No argument here. ;-)
> > 12. I need to check with the BPF folks on the BPF verifier's
> > definition of BTF_ID(func, rcu_read_unlock_strict).
> >
> > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> > function might have some redundancy across the board instead
> > of just on CONFIG_PREEMPT_RCU=y. Or might not.
> >
> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > might need to do something for non-preemptible RCU to make
> > up for the lack of cond_resched() calls. Maybe just drop the
> > "IS_ENABLED()" and execute the body of the current "if" statement
> > unconditionally.
>
> Again. There is no non-preemptible RCU with this model, unless I'm
> missing something important here.
And again, there needs to be non-preemptible RCU with this model.
> > 15. I must defer to others on the mm/pgtable-generic.c file's
> > #ifdef that depends on CONFIG_PREEMPT_RCU.
>
> All those ifdefs should die :)
Like all things, they will eventually. ;-)
> > While in the area, I noted that KLP seems to depend on cond_resched(),
> > but on this I must defer to the KLP people.
>
> Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO.
Not rocket science, just KLP science, which I am happy to defer to the
KLP people.
> > I am sure that I am missing something, but I have not yet seen any
> > show-stoppers. Just some needed adjustments.
>
> Right. If it works out as I think it can work out the main adjustments
> are to remove a large amount of #ifdef maze and related gunk :)
Just please don't remove the #ifdef gunk that is still needed!
Thanx, Paul
On Wed, 18 Oct 2023 10:19:53 -0700
"Paul E. McKenney" <[email protected]> wrote:
>
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach? I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.
>
> > With this approach the kernel is by definition fully preemptible, which
> > means rcu_read_lock() is preemptible too. That's pretty much the
> > same situation as with PREEMPT_DYNAMIC.
>
> Please, just no!!!
Note, when I first read Thomas's proposal, I figured that Paul would no
longer get to brag that:
"In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are simply
nops!"
But instead, they would be:
static void rcu_read_lock(void)
{
preempt_disable();
}
static void rcu_read_unlock(void)
{
preempt_enable();
}
as it was mentioned that today's preempt_disable() is fast and not an issue
like it was in older kernels.
That would mean that there will still be a "non preempt" version of RCU.
The preempt version of RCU adds a lot more logic when scheduling out in
an RCU critical section, which I can envision not all workloads wanting
around. Adding "preempt_disable()" is now low overhead, but adding the RCU
logic to handle preemption isn't as lightweight as that.
Not to mention the logic to boost those threads that were preempted and
being starved for some time.
> > > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > > or RCU Tasks Rude) would need those pesky cond_resched() calls
> > > to stick around. The reason is that RCU Tasks readers are ended
> > > only by voluntary context switches. This means that although a
> > > preemptible infinite loop in the kernel won't inconvenience a
> > > real-time task (nor a non-real-time task for all that long),
> > > and won't delay grace periods for the other flavors of RCU,
> > > it would indefinitely delay an RCU Tasks grace period.
> > >
> > > However, RCU Tasks grace periods seem to be finite in preemptible
> > > kernels today, so they should remain finite in limited-preemptible
> > > kernels tomorrow. Famous last words...
> >
> > That's an issue which you have today with preempt FULL, right? So if it
> > turns out to be a problem then it's not a problem of the new model.
>
> Agreed, and hence my last three lines of text above. Plus the guy who
> requested RCU Tasks said that it was OK for its grace periods to take
> a long time, and I am holding Steven Rostedt to that. ;-)
Matters what your definition of "long time" is ;-)
-- Steve
On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
> >> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> >> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> >> >> That said - I think as a proof of concept and "look, with this we get
> >> >> the expected scheduling event counts", that patch is perfect. I think
> >> >> you more than proved the concept.
> >> >
> >> > There is certainly quite some analyis work to do to make this a one to
> >> > one replacement.
> >> >
> >> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> >> > is pretty much on par with the current mainline variants (NONE/FULL),
> >> > but the memtier benchmark makes a massive dent.
> >> >
> >> > It sports a whopping 10% regression with the LAZY mode versus the mainline
> >> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
> >> >
> >> > That benchmark is really sensitive to the preemption model. With current
> >> > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> >> > performance drop versus preempt=NONE.
> >>
> >> That 20% was a tired pilot error. The real number is in the 5% ballpark.
> >>
> >> > I have no clue what's going on there yet, but that shows that there is
> >> > obviously quite some work ahead to get this sorted.
> >>
> >> It took some head scratching to figure that out. The initial fix broke
> >> the handling of the hog issue, i.e. the problem that Ankur tried to
> >> solve, but I hacked up a "solution" for that too.
> >>
> >> With that the memtier benchmark is roughly back to the mainline numbers,
> >> but my throughput benchmark know-how is pretty close to zero, so that
> >> should be looked at by people who actually understand these things.
> >>
> >> Likewise the hog prevention is just at the PoC level and clearly beyond
> >> my knowledge of scheduler details: It unconditionally forces a
> >> reschedule when the looping task is not responding to a lazy reschedule
> >> request before the next tick. IOW it forces a reschedule on the second
> >> tick, which is obviously different from the cond_resched()/might_sleep()
> >> behaviour.
> >>
> >> The changes vs. the original PoC aside of the bug and thinko fixes:
> >>
> >> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
> >> lazy preempt bit as the trace_entry::flags field is full already.
> >>
> >> That obviously breaks the tracer ABI, but if we go there then
> >> this needs to be fixed. Steven?
> >>
> >> 2) debugfs file to validate that loops can be force preempted w/o
> >> cond_resched()
> >>
> >> The usage is:
> >>
> >> # taskset -c 1 bash
> >> # echo 1 > /sys/kernel/debug/sched/hog &
> >> # echo 1 > /sys/kernel/debug/sched/hog &
> >> # echo 1 > /sys/kernel/debug/sched/hog &
> >>
> >> top shows ~33% CPU for each of the hogs and tracing confirms that
> >> the crude hack in the scheduler tick works:
> >>
> >> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr
> >> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr
> >> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr
> >> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr
> >> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr
> >> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr
> >> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr
> >> bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr
> >>
> >> The 'l' instead of the usual 'N' reflects that the lazy resched
> >> bit is set. That makes __update_curr() invoke resched_curr()
> >> instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
> >> and folds it into preempt_count so that preemption happens at the
> >> next possible point, i.e. either in return from interrupt or at
> >> the next preempt_enable().
> >
> > Belatedly calling out some RCU issues. Nothing fatal, just a
> > (surprisingly) few adjustments that will need to be made. The key thing
> > to note is that from RCU's viewpoint, with this change, all kernels
> > are preemptible, though rcu_read_lock() readers remain non-preemptible.
>
> Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models
> none/voluntary/full are just scheduler tweaks on top of that. And, so
> this would always have PREEMPT_RCU=y. So, shouldn't rcu_read_lock()
> readers be preemptible?
>
> (An alternate configuration might be:
> config PREEMPT_NONE
> select PREEMPT_COUNT
>
> config PREEMPT_FULL
> select PREEMPTION
>
> This probably allows for more configuration flexibility across archs?
> Would allow for TREE_RCU=y, for instance. That said, so far I've only
> been working with PREEMPT_RCU=y.)
Then this is a bug that needs to be fixed. We need a way to make
RCU readers non-preemptible.
> > With that:
> >
> > 1. As an optimization, given that preempt_count() would always give
> > good information, the scheduling-clock interrupt could sense RCU
> > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
> > IPI handlers for expedited grace periods. A nice optimization.
> > Except that...
> >
> > 2. The quiescent-state-forcing code currently relies on the presence
> > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
> > would be to do resched_cpu() more quickly, but some workloads
> > might not love the additional IPIs. Another approach to do #1
> > above to replace the quiescent states from cond_resched() with
> > scheduler-tick-interrupt-sensed quiescent states.
>
> Right, the call to rcu_all_qs(). Just to see if I have it straight,
> something like this for PREEMPT_RCU=n kernels?
>
> if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0)
> rcu_all_qs();
>
> (Masked because PREEMPT_NONE might not do any folding for
> NEED_RESCHED_LAZY in the tick.)
>
> Though the comment around rcu_all_qs() mentions that rcu_all_qs()
> reports a quiescent state only if urgently needed. Given that the tick
> executes less frequently than calls to cond_resched(), could we just
> always report instead? Or am I completely on the wrong track?
>
> if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) {
> preempt_disable();
> rcu_qs();
> preempt_enable();
> }
>
> On your point about the preempt_count() being dependable, there's a
> wrinkle. As Linus mentions in
> https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/,
> that might not be true for architectures that define ARCH_NO_PREEMPT.
>
> My plan was to limit those archs to do preemption only at the user space boundary
> but there are almost certainly RCU implications that I missed.
Just add this to the "if" condition of the CONFIG_PREEMPT_RCU=n version
of rcu_flavor_sched_clock_irq():
|| !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))
Resulting in something like this:
------------------------------------------------------------------------
static void rcu_flavor_sched_clock_irq(int user)
{
if (user || rcu_is_cpu_rrupt_from_idle() ||
!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
/*
* Get here if this CPU took its interrupt from user
* mode or from the idle loop, and if this is not a nested
* interrupt, or if the interrupt is from a preemptible
* region of the kernel. In this case, the CPU is in a
* quiescent state, so note it.
*
* No memory barrier is required here because rcu_qs()
* references only CPU-local variables that other CPUs
* neither access nor modify, at least not while the
* corresponding CPU is online.
*/
rcu_qs();
}
}
------------------------------------------------------------------------
> > Plus...
> >
> > 3. For nohz_full CPUs that run for a long time in the kernel,
> > there are no scheduling-clock interrupts. RCU reaches for
> > the resched_cpu() hammer a few jiffies into the grace period.
> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> > interrupt-entry code will re-enable its scheduling-clock interrupt
> > upon receiving the resched_cpu() IPI.
> >
> > So nohz_full CPUs should be OK as far as RCU is concerned.
> > Other subsystems might have other opinions.
>
> Ah, that's what I thought from my reading of the RCU comments. Good to
> have that confirmed. Thanks.
>
> > 4. As another optimization, kvfree_rcu() could unconditionally
> > check preempt_count() to sense a clean environment suitable for
> > memory allocation.
>
> Had missed this completely. Could you elaborate?
It is just an optimization. But the idea is to use less restrictive
GFP_ flags in add_ptr_to_bulk_krc_lock() when the caller's context
allows it. Add Uladzislau on CC for his thoughts.
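Something along these lines, purely as a sketch of the flag selection and not the actual add_ptr_to_bulk_krc_lock() code:
/*
 * Illustration only: with preempt_count() dependable everywhere,
 * a sleepable caller can get GFP_KERNEL, everyone else stays atomic.
 */
static gfp_t kvfree_gfp_sketch(void)
{
	return preemptible() ? (GFP_KERNEL | __GFP_NOWARN)
			     : (GFP_NOWAIT | __GFP_NOWARN);
}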
> > 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
> > instead say "select TASKS_RCU". This means that the #else
> > in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> > vanilla RCU must go. There might be some fallout if something
> > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> > rcu_tasks_classic_qs() do something useful.
>
> Ack.
>
> > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > or RCU Tasks Rude) would need those pesky cond_resched() calls
> > to stick around. The reason is that RCU Tasks readers are ended
> > only by voluntary context switches. This means that although a
> > preemptible infinite loop in the kernel won't inconvenience a
> > real-time task (nor a non-real-time task for all that long),
> > and won't delay grace periods for the other flavors of RCU,
> > it would indefinitely delay an RCU Tasks grace period.
> >
> > However, RCU Tasks grace periods seem to be finite in preemptible
> > kernels today, so they should remain finite in limited-preemptible
> > kernels tomorrow. Famous last words...
> >
> > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> > any algorithmic difference from this change.
>
> So, essentially, as long as RCU tasks eventually, in the fullness of
> time, call schedule(), removing cond_resched() shouldn't have any
> effect :).
Almost.
SRCU and RCU Tasks Trace have explicit read-side state changes that
the corresponding grace-period code can detect, one way or another,
and thus is not dependent on reschedules. RCU Tasks Rude does explicit
reschedules on all CPUs (hence "Rude"), and thus doesn't have to care
about whether or not other things do reschedules.
> > 8. As has been noted elsewhere, in this new limited-preemption
> > mode of operation, rcu_read_lock() readers remain preemptible.
> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
>
> Ack.
>
> > 9. The rcu_preempt_depth() macro could do something useful in
> > limited-preemption kernels. Its current lack of ability in
> > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
> >
> > 10. The cond_resched_rcu() function must remain because we still
> > have non-preemptible rcu_read_lock() readers.
>
> For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need
> only be this, right?:
>
> static inline void cond_resched_rcu(void)
> {
> #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> rcu_read_unlock();
>
> rcu_read_lock();
> #endif
> }
There is a good chance that it will also need to do an explicit
rcu_all_qs(). The problem is that there is an extremely low probability
that the scheduling clock interrupt will hit that space between the
rcu_read_unlock() and rcu_read_lock().
But either way, not a showstopper.
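So perhaps something like this untested sketch, which just adds the explicit report to the variant quoted above:
static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
#ifndef CONFIG_PREEMPT_RCU
	rcu_all_qs();	/* explicit QS report; the tick rarely lands in this window */
#endif
	rcu_read_lock();
#endif
}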
> > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> > unchanged, but I must defer to the include/net/ip_vs.h people.
> >
> > 12. I need to check with the BPF folks on the BPF verifier's
> > definition of BTF_ID(func, rcu_read_unlock_strict).
> >
> > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> > function might have some redundancy across the board instead
> > of just on CONFIG_PREEMPT_RCU=y. Or might not.
>
> I don't think I understand any of these well enough to comment. Will
> Cc the relevant folks when I send out the RFC.
Sounds like a plan to me! ;-)
> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > might need to do something for non-preemptible RCU to make
> > up for the lack of cond_resched() calls. Maybe just drop the
> > "IS_ENABLED()" and execute the body of the current "if" statement
> > unconditionally.
>
> Aah, yes this is a good idea. Thanks.
>
> > 15. I must defer to others on the mm/pgtable-generic.c file's
> > #ifdef that depends on CONFIG_PREEMPT_RCU.
> >
> > While in the area, I noted that KLP seems to depend on cond_resched(),
> > but on this I must defer to the KLP people.
>
> Yeah, as part of this work, I ended up unhooking most of the KLP
> hooks in cond_resched() and of course, cond_resched() itself.
> Will poke the livepatching people.
Again, sounds like a plan to me!
> > I am sure that I am missing something, but I have not yet seen any
> > show-stoppers. Just some needed adjustments.
>
> Appreciate this detailed list. Makes me think that everything might
> not go up in smoke after all!
C'mon, Ankur, if it doesn't go up in smoke at some point, you just
aren't trying hard enough! ;-)
Thanx, Paul
> Thanks
> Ankur
>
> > Thoughts?
> >
> > Thanx, Paul
> >
> >> That's as much as I wanted to demonstrate and I'm not going to spend
> >> more cycles on it as I have already too many other things on flight and
> >> the resulting scheduler woes are clearly outside of my expertise.
> >>
> >> Though definitely I'm putting a permanent NAK in place for any attempts
> >> to duct tape the preempt=NONE model any further by sprinkling more
> >> cond*() and whatever warts around.
> >>
> >> Thanks,
> >>
> >> tglx
> >> ---
> >> arch/x86/Kconfig | 1
> >> arch/x86/include/asm/thread_info.h | 6 ++--
> >> drivers/acpi/processor_idle.c | 2 -
> >> include/linux/entry-common.h | 2 -
> >> include/linux/entry-kvm.h | 2 -
> >> include/linux/sched.h | 12 +++++---
> >> include/linux/sched/idle.h | 8 ++---
> >> include/linux/thread_info.h | 24 +++++++++++++++++
> >> include/linux/trace_events.h | 8 ++---
> >> kernel/Kconfig.preempt | 17 +++++++++++-
> >> kernel/entry/common.c | 4 +-
> >> kernel/entry/kvm.c | 2 -
> >> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------
> >> kernel/sched/debug.c | 19 +++++++++++++
> >> kernel/sched/fair.c | 46 ++++++++++++++++++++++-----------
> >> kernel/sched/features.h | 2 +
> >> kernel/sched/idle.c | 3 --
> >> kernel/sched/sched.h | 1
> >> kernel/trace/trace.c | 2 +
> >> kernel/trace/trace_output.c | 16 ++++++++++-
> >> 20 files changed, 171 insertions(+), 57 deletions(-)
> >>
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
> >>
> >> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
> >> /*
> >> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> >> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
> >> * this avoids any races wrt polling state changes and thereby avoids
> >> * spurious IPIs.
> >> */
> >> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> >> {
> >> struct thread_info *ti = task_thread_info(p);
> >> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> >> +
> >> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
> >> }
> >>
> >> /*
> >> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
> >> for (;;) {
> >> if (!(val & _TIF_POLLING_NRFLAG))
> >> return false;
> >> - if (val & _TIF_NEED_RESCHED)
> >> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >> return true;
> >> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
> >> break;
> >> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
> >> }
> >>
> >> #else
> >> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> >> {
> >> - set_tsk_need_resched(p);
> >> + set_tsk_thread_flag(p, tif_bit);
> >> return true;
> >> }
> >>
> >> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
> >> * might also involve a cross-CPU call to trigger the scheduler on
> >> * the target CPU.
> >> */
> >> -void resched_curr(struct rq *rq)
> >> +static void __resched_curr(struct rq *rq, int lazy)
> >> {
> >> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
> >> struct task_struct *curr = rq->curr;
> >> - int cpu;
> >>
> >> lockdep_assert_rq_held(rq);
> >>
> >> - if (test_tsk_need_resched(curr))
> >> + if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
> >> return;
> >>
> >> cpu = cpu_of(rq);
> >>
> >> if (cpu == smp_processor_id()) {
> >> - set_tsk_need_resched(curr);
> >> - set_preempt_need_resched();
> >> + set_tsk_thread_flag(curr, tif_bit);
> >> + if (!lazy)
> >> + set_preempt_need_resched();
> >> return;
> >> }
> >>
> >> - if (set_nr_and_not_polling(curr))
> >> - smp_send_reschedule(cpu);
> >> - else
> >> + if (set_nr_and_not_polling(curr, tif_bit)) {
> >> + if (!lazy)
> >> + smp_send_reschedule(cpu);
> >> + } else {
> >> trace_sched_wake_idle_without_ipi(cpu);
> >> + }
> >> +}
> >> +
> >> +void resched_curr(struct rq *rq)
> >> +{
> >> + __resched_curr(rq, 0);
> >> +}
> >> +
> >> +void resched_curr_lazy(struct rq *rq)
> >> +{
> >> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
> >> + TIF_NEED_RESCHED_LAZY_OFFSET : 0;
> >> +
> >> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
> >> + return;
> >> +
> >> + __resched_curr(rq, lazy);
> >> }
> >>
> >> void resched_cpu(int cpu)
> >> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
> >> if (cpu == smp_processor_id())
> >> return;
> >>
> >> - if (set_nr_and_not_polling(rq->idle))
> >> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
> >> smp_send_reschedule(cpu);
> >> else
> >> trace_sched_wake_idle_without_ipi(cpu);
> >> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
> >> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> >> return preempt_dynamic_mode == preempt_dynamic_##mode; \
> >> } \
> >> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
> >>
> >> PREEMPT_MODEL_ACCESSOR(none);
> >> PREEMPT_MODEL_ACCESSOR(voluntary);
> >> --- a/include/linux/thread_info.h
> >> +++ b/include/linux/thread_info.h
> >> @@ -59,6 +59,16 @@ enum syscall_work_bit {
> >>
> >> #include <asm/thread_info.h>
> >>
> >> +#ifdef CONFIG_PREEMPT_AUTO
> >> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY
> >> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY
> >> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
> >> +#else
> >> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> >> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> >> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0
> >> +#endif
> >> +
> >> #ifdef __KERNEL__
> >>
> >> #ifndef arch_set_restart_data
> >> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
> >> 		(unsigned long *)(&current_thread_info()->flags));
> >> }
> >>
> >> +static __always_inline bool tif_need_resched_lazy(void)
> >> +{
> >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> >> + arch_test_bit(TIF_NEED_RESCHED_LAZY,
> >> +			(unsigned long *)(&current_thread_info()->flags));
> >> +}
> >> +
> >> #else
> >>
> >> static __always_inline bool tif_need_resched(void)
> >> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
> >> 		(unsigned long *)(&current_thread_info()->flags));
> >> }
> >>
> >> +static __always_inline bool tif_need_resched_lazy(void)
> >> +{
> >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> >> + test_bit(TIF_NEED_RESCHED_LAZY,
> >> +			(unsigned long *)(&current_thread_info()->flags));
> >> +}
> >> +
> >> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
> >>
> >> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> >> --- a/kernel/Kconfig.preempt
> >> +++ b/kernel/Kconfig.preempt
> >> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
> >> select PREEMPTION
> >> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
> >>
> >> +config PREEMPT_BUILD_AUTO
> >> + bool
> >> + select PREEMPT_BUILD
> >> +
> >> +config HAVE_PREEMPT_AUTO
> >> + bool
> >> +
> >> choice
> >> prompt "Preemption Model"
> >> default PREEMPT_NONE
> >> @@ -67,9 +74,17 @@ config PREEMPT
> >> embedded system with latency requirements in the milliseconds
> >> range.
> >>
> >> +config PREEMPT_AUTO
> >> + bool "Automagic preemption mode with runtime tweaking support"
> >> + depends on HAVE_PREEMPT_AUTO
> >> + select PREEMPT_BUILD_AUTO
> >> + help
> >> + Add some sensible blurb here
> >> +
> >> config PREEMPT_RT
> >> bool "Fully Preemptible Kernel (Real-Time)"
> >> depends on EXPERT && ARCH_SUPPORTS_RT
> >> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
> >> select PREEMPTION
> >> help
> >> This option turns the kernel into a real-time kernel by replacing
> >> @@ -95,7 +110,7 @@ config PREEMPTION
> >>
> >> config PREEMPT_DYNAMIC
> >> bool "Preemption behaviour defined on boot"
> >> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
> >> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
> >> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
> >> select PREEMPT_BUILD
> >> default y if HAVE_PREEMPT_DYNAMIC_CALL
> >> --- a/include/linux/entry-common.h
> >> +++ b/include/linux/entry-common.h
> >> @@ -60,7 +60,7 @@
> >> #define EXIT_TO_USER_MODE_WORK \
> >> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> >> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> >> - ARCH_EXIT_TO_USER_MODE_WORK)
> >> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
> >>
> >> /**
> >> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
> >> --- a/include/linux/entry-kvm.h
> >> +++ b/include/linux/entry-kvm.h
> >> @@ -18,7 +18,7 @@
> >>
> >> #define XFER_TO_GUEST_MODE_WORK \
> >> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> >> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> >> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
> >>
> >> struct kvm_vcpu;
> >>
> >> --- a/kernel/entry/common.c
> >> +++ b/kernel/entry/common.c
> >> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
> >>
> >> local_irq_enable_exit_to_user(ti_work);
> >>
> >> - if (ti_work & _TIF_NEED_RESCHED)
> >> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >> schedule();
> >>
> >> if (ti_work & _TIF_UPROBE)
> >> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
> >> rcu_irq_exit_check_preempt();
> >> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> >> WARN_ON_ONCE(!on_thread_stack());
> >> - if (need_resched())
> >> + if (test_tsk_need_resched(current))
> >> preempt_schedule_irq();
> >> }
> >> }
> >> --- a/kernel/sched/features.h
> >> +++ b/kernel/sched/features.h
> >> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
> >> SCHED_FEAT(LATENCY_WARN, false)
> >>
> >> SCHED_FEAT(HZ_BW, true)
> >> +
> >> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
> >> extern void reweight_task(struct task_struct *p, int prio);
> >>
> >> extern void resched_curr(struct rq *rq);
> >> +extern void resched_curr_lazy(struct rq *rq);
> >> extern void resched_cpu(int cpu);
> >>
> >> extern struct rt_bandwidth def_rt_bandwidth;
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
> >> update_ti_thread_flag(task_thread_info(tsk), flag, value);
> >> }
> >>
> >> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> {
> >> return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
> >> }
> >>
> >> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> {
> >> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
> >> }
> >>
> >> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> {
> >> return test_ti_thread_flag(task_thread_info(tsk), flag);
> >> }
> >> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
> >> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> >> {
> >> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> >> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> >> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
> >> }
> >>
> >> -static inline int test_tsk_need_resched(struct task_struct *tsk)
> >> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
> >> {
> >> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
> >> }
> >> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
> >>
> >> static __always_inline bool need_resched(void)
> >> {
> >> - return unlikely(tif_need_resched());
> >> + return unlikely(tif_need_resched_lazy() || tif_need_resched());
> >> }
> >>
> >> /*
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
> >> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> >> * this is probably good enough.
> >> */
> >> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
> >> {
> >> + struct rq *rq = rq_of(cfs_rq);
> >> +
> >> if ((s64)(se->vruntime - se->deadline) < 0)
> >> return;
> >>
> >> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
> >> /*
> >> * The task has consumed its request, reschedule.
> >> */
> >> - if (cfs_rq->nr_running > 1) {
> >> - resched_curr(rq_of(cfs_rq));
> >> - clear_buddies(cfs_rq, se);
> >> + if (cfs_rq->nr_running < 2)
> >> + return;
> >> +
> >> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
> >> + resched_curr(rq);
> >> + } else {
> >> + /* Did the task ignore the lazy reschedule request? */
> >> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
> >> + resched_curr(rq);
> >> + else
> >> + resched_curr_lazy(rq);
> >> }
> >> + clear_buddies(cfs_rq, se);
> >> }
> >>
> >> #include "pelt.h"
> >> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
> >> /*
> >> * Update the current task's runtime statistics.
> >> */
> >> -static void update_curr(struct cfs_rq *cfs_rq)
> >> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
> >> {
> >> struct sched_entity *curr = cfs_rq->curr;
> >> u64 now = rq_clock_task(rq_of(cfs_rq));
> >> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
> >> schedstat_add(cfs_rq->exec_clock, delta_exec);
> >>
> >> curr->vruntime += calc_delta_fair(delta_exec, curr);
> >> - update_deadline(cfs_rq, curr);
> >> + update_deadline(cfs_rq, curr, tick);
> >> update_min_vruntime(cfs_rq);
> >>
> >> if (entity_is_task(curr)) {
> >> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
> >> account_cfs_rq_runtime(cfs_rq, delta_exec);
> >> }
> >>
> >> +static inline void update_curr(struct cfs_rq *cfs_rq)
> >> +{
> >> + __update_curr(cfs_rq, false);
> >> +}
> >> +
> >> static void update_curr_fair(struct rq *rq)
> >> {
> >> update_curr(cfs_rq_of(&rq->curr->se));
> >> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> >> /*
> >> * Update run-time statistics of the 'current'.
> >> */
> >> - update_curr(cfs_rq);
> >> + __update_curr(cfs_rq, true);
> >>
> >> /*
> >> * Ensure that runnable average is periodically updated.
> >> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> >> * validating it and just reschedule.
> >> */
> >> if (queued) {
> >> - resched_curr(rq_of(cfs_rq));
> >> + resched_curr_lazy(rq_of(cfs_rq));
> >> return;
> >> }
> >> /*
> >> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
> >> * hierarchy can be throttled
> >> */
> >> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> >> - resched_curr(rq_of(cfs_rq));
> >> + resched_curr_lazy(rq_of(cfs_rq));
> >> }
> >>
> >> static __always_inline
> >> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
> >>
> >> /* Determine whether we need to wake up potentially idle CPU: */
> >> if (rq->curr == rq->idle && rq->cfs.nr_running)
> >> - resched_curr(rq);
> >> + resched_curr_lazy(rq);
> >> }
> >>
> >> #ifdef CONFIG_SMP
> >> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
> >>
> >> if (delta < 0) {
> >> if (task_current(rq, p))
> >> - resched_curr(rq);
> >> + resched_curr_lazy(rq);
> >> return;
> >> }
> >> hrtick_start(rq, delta);
> >> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
> >> * prevents us from potentially nominating it as a false LAST_BUDDY
> >> * below.
> >> */
> >> - if (test_tsk_need_resched(curr))
> >> + if (need_resched())
> >> return;
> >>
> >> /* Idle tasks are by definition preempted by non-idle tasks. */
> >> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
> >> return;
> >>
> >> preempt:
> >> - resched_curr(rq);
> >> + resched_curr_lazy(rq);
> >> }
> >>
> >> #ifdef CONFIG_SMP
> >> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
> >> */
> >> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
> >> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >> - resched_curr(rq);
> >> + resched_curr_lazy(rq);
> >> }
> >>
> >> /*
> >> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
> >> */
> >> if (task_current(rq, p)) {
> >> if (p->prio > oldprio)
> >> - resched_curr(rq);
> >> + resched_curr_lazy(rq);
> >> } else
> >> check_preempt_curr(rq, p, 0);
> >> }
> >> --- a/drivers/acpi/processor_idle.c
> >> +++ b/drivers/acpi/processor_idle.c
> >> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
> >> */
> >> static void __cpuidle acpi_safe_halt(void)
> >> {
> >> - if (!tif_need_resched()) {
> >> + if (!need_resched()) {
> >> raw_safe_halt();
> >> raw_local_irq_disable();
> >> }
> >> --- a/include/linux/sched/idle.h
> >> +++ b/include/linux/sched/idle.h
> >> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
> >> */
> >> smp_mb__after_atomic();
> >>
> >> - return unlikely(tif_need_resched());
> >> + return unlikely(need_resched());
> >> }
> >>
> >> static __always_inline bool __must_check current_clr_polling_and_test(void)
> >> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
> >> */
> >> smp_mb__after_atomic();
> >>
> >> - return unlikely(tif_need_resched());
> >> + return unlikely(need_resched());
> >> }
> >>
> >> #else
> >> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
> >>
> >> static inline bool __must_check current_set_polling_and_test(void)
> >> {
> >> - return unlikely(tif_need_resched());
> >> + return unlikely(need_resched());
> >> }
> >> static inline bool __must_check current_clr_polling_and_test(void)
> >> {
> >> - return unlikely(tif_need_resched());
> >> + return unlikely(need_resched());
> >> }
> >> #endif
> >>
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
> >> ct_cpuidle_enter();
> >>
> >> raw_local_irq_enable();
> >> - while (!tif_need_resched() &&
> >> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
> >> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
> >> cpu_relax();
> >> raw_local_irq_disable();
> >>
> >> --- a/kernel/trace/trace.c
> >> +++ b/kernel/trace/trace.c
> >> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
> >>
> >> if (tif_need_resched())
> >> trace_flags |= TRACE_FLAG_NEED_RESCHED;
> >> + if (tif_need_resched_lazy())
> >> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
> >> if (test_preempt_need_resched())
> >> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
> >> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -271,6 +271,7 @@ config X86
> >> select HAVE_STATIC_CALL
> >> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
> >> select HAVE_PREEMPT_DYNAMIC_CALL
> >> + select HAVE_PREEMPT_AUTO
> >> select HAVE_RSEQ
> >> select HAVE_RUST if X86_64
> >> select HAVE_SYSCALL_TRACEPOINTS
> >> --- a/arch/x86/include/asm/thread_info.h
> >> +++ b/arch/x86/include/asm/thread_info.h
> >> @@ -81,8 +81,9 @@ struct thread_info {
> >> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> >> #define TIF_SIGPENDING 2 /* signal pending */
> >> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> >> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> >> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> >> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */
> >> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> >> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> >> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> >> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> >> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> >> @@ -104,6 +105,7 @@ struct thread_info {
> >> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> >> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> >> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> >> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY)
> >> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> >> #define _TIF_SSBD (1 << TIF_SSBD)
> >> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> >> --- a/kernel/entry/kvm.c
> >> +++ b/kernel/entry/kvm.c
> >> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
> >> return -EINTR;
> >> }
> >>
> >> - if (ti_work & _TIF_NEED_RESCHED)
> >> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >> schedule();
> >>
> >> if (ti_work & _TIF_NOTIFY_RESUME)
> >> --- a/include/linux/trace_events.h
> >> +++ b/include/linux/trace_events.h
> >> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
> >>
> >> enum trace_flag_type {
> >> TRACE_FLAG_IRQS_OFF = 0x01,
> >> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
> >> - TRACE_FLAG_NEED_RESCHED = 0x04,
> >> + TRACE_FLAG_NEED_RESCHED = 0x02,
> >> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04,
> >> TRACE_FLAG_HARDIRQ = 0x08,
> >> TRACE_FLAG_SOFTIRQ = 0x10,
> >> TRACE_FLAG_PREEMPT_RESCHED = 0x20,
> >> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
> >>
> >> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
> >> {
> >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> >> + return tracing_gen_ctx_irq_test(0);
> >> }
> >> static inline unsigned int tracing_gen_ctx(void)
> >> {
> >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> >> + return tracing_gen_ctx_irq_test(0);
> >> }
> >> #endif
> >>
> >> --- a/kernel/trace/trace_output.c
> >> +++ b/kernel/trace/trace_output.c
> >> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
> >> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
> >> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
> >> bh_off ? 'b' :
> >> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> >> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
> >> '.';
> >>
> >> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> >> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
> >> TRACE_FLAG_PREEMPT_RESCHED)) {
> >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> >> + need_resched = 'B';
> >> + break;
> >> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
> >> need_resched = 'N';
> >> break;
> >> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> >> + need_resched = 'L';
> >> + break;
> >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> >> + need_resched = 'b';
> >> + break;
> >> case TRACE_FLAG_NEED_RESCHED:
> >> need_resched = 'n';
> >> break;
> >> + case TRACE_FLAG_NEED_RESCHED_LAZY:
> >> + need_resched = 'l';
> >> + break;
> >> case TRACE_FLAG_PREEMPT_RESCHED:
> >> need_resched = 'p';
> >> break;
> >> --- a/kernel/sched/debug.c
> >> +++ b/kernel/sched/debug.c
> >> @@ -333,6 +333,23 @@ static const struct file_operations sche
> >> .release = seq_release,
> >> };
> >>
> >> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
> >> + size_t cnt, loff_t *ppos)
> >> +{
> >> + unsigned long end = jiffies + 60 * HZ;
> >> +
> >> + for (; time_before(jiffies, end) && !signal_pending(current);)
> >> + cpu_relax();
> >> +
> >> + return cnt;
> >> +}
> >> +
> >> +static const struct file_operations sched_hog_fops = {
> >> + .write = sched_hog_write,
> >> + .open = simple_open,
> >> + .llseek = default_llseek,
> >> +};
> >> +
> >> static struct dentry *debugfs_sched;
> >>
> >> static __init int sched_init_debug(void)
> >> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
> >>
> >> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
> >>
> >> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
> >> +
> >> return 0;
> >> }
> >> late_initcall(sched_init_debug);
> >>
>
>
> --
> ankur
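A usage note on the "hog" debugfs file added in the quoted hunk (path
assumed from the patch, which places the file in the existing sched
debugfs directory): writing anything to it spins the calling task in the
kernel for up to a minute, or until it receives a signal, which is handy
for checking that a busy kernel loop with no cond_resched() still gets
scheduled away under the lazy scheme. With debugfs mounted in the usual
place:

  $ echo 1 > /sys/kernel/debug/sched/hog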
On Wed, Oct 18, 2023 at 10:31:46AM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 15:16:12 +0200
> Thomas Gleixner <[email protected]> wrote:
>
> > > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > > might need to do something for non-preemptible RCU to make
> > > up for the lack of cond_resched() calls. Maybe just drop the
> > > "IS_ENABLED()" and execute the body of the current "if" statement
> > > unconditionally.
>
> Right.
>
> I'm guessing you are talking about this code:
>
> /*
> * In some cases, notably when running on a nohz_full CPU with
> * a stopped tick PREEMPT_RCU has no way to account for QSs.
> * This will eventually cause unwarranted noise as PREEMPT_RCU
> * will force preemption as the means of ending the current
> * grace period. We avoid this problem by calling
> * rcu_momentary_dyntick_idle(), which performs a zero duration
> * EQS allowing PREEMPT_RCU to end the current grace period.
> * This call shouldn't be wrapped inside an RCU critical
> * section.
> *
> * Note that in non PREEMPT_RCU kernels QSs are handled through
> * cond_resched()
> */
> if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
> if (!disable_irq)
> local_irq_disable();
>
> rcu_momentary_dyntick_idle();
>
> if (!disable_irq)
> local_irq_enable();
> }
That is indeed the place!
> /*
> * For the non-preemptive kernel config: let threads runs, if
> * they so wish, unless set not do to so.
> */
> if (!disable_irq && !disable_preemption)
> cond_resched();
>
>
>
> If everything becomes PREEMPT_RCU, then the above should be able to be
> turned into just:
>
> if (!disable_irq)
> local_irq_disable();
>
> rcu_momentary_dyntick_idle();
>
> if (!disable_irq)
> local_irq_enable();
>
> And no cond_resched() is needed.
Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
run_osnoise() is running in kthread context with preemption and everything
else enabled (am I right?), then the change you suggest should work fine.
> > Again. There is no non-preemtible RCU with this model, unless I'm
> > missing something important here.
>
> Daniel?
But very happy to defer to Daniel. ;-)
Thanx, Paul
On Wed, 18 Oct 2023 10:55:02 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > If everything becomes PREEMPT_RCU, then the above should be able to be
> > turned into just:
> >
> > if (!disable_irq)
> > local_irq_disable();
> >
> > rcu_momentary_dyntick_idle();
> >
> > if (!disable_irq)
> > local_irq_enable();
> >
> > And no cond_resched() is needed.
>
> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> run_osnoise() is running in kthread context with preemption and everything
> else enabled (am I right?), then the change you suggest should work fine.
There's a user space option that lets you run that loop with preemption and/or
interrupts disabled.
>
> > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > missing something important here.
> >
> > Daniel?
>
> But very happy to defer to Daniel. ;-)
But Daniel could also correct me ;-)
-- Steve
On Wed, Oct 18, 2023 at 01:41:07PM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 10:19:53 -0700
> "Paul E. McKenney" <[email protected]> wrote:
> >
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach? I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
> >
> > > With this approach the kernel is by definition fully preemptible, which
> > > means means rcu_read_lock() is preemptible too. That's pretty much the
> > > same situation as with PREEMPT_DYNAMIC.
> >
> > Please, just no!!!
>
> Note, when I first read Thomas's proposal, I figured that Paul would no
> longer get to brag that:
>
> "In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are simply
> nops!"
I will still be able to brag that in a fully non-preemptible environment,
rcu_read_lock() and rcu_read_unlock() are simply no-ops. It will
just be that the Linux kernel will no longer be such an environment.
For the moment, anyway, there is still userspace RCU along with a few
other instances of zero-cost RCU readers. ;-)
> But instead, they would be:
>
> static void rcu_read_lock(void)
> {
> preempt_disable();
> }
>
> static void rcu_read_unlock(void)
> {
> preempt_enable();
> }
>
> as it was mentioned that today's preempt_disable() is fast and not an issue
> like it was in older kernels.
And they are already defined as you show above in rcupdate.h, albeit
with leading underscores on the function names.
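For completeness, the !CONFIG_PREEMPT_RCU side of include/linux/rcupdate.h
boils down to roughly this (paraphrased, lockdep and debug hooks omitted):

	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

	static inline void __rcu_read_unlock(void)
	{
		preempt_enable();
	}

which is why non-preemptible readers cost nothing beyond the preempt-count
manipulation, and compile down to bare barriers on kernels without
CONFIG_PREEMPT_COUNT.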
> That would mean that there will still be a "non preempt" version of RCU.
That would be very good!
> As the preempt version of RCU adds a lot more logic when scheduling out in
> an RCU critical section, that I can envision not all workloads would want
> around. Adding "preempt_disable()" is now low overhead, but adding the RCU
> logic to handle preemption isn't as lightweight as that.
>
> Not to mention the logic to boost those threads that were preempted and
> being starved for some time.
Exactly, thank you!
> > > > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > > > or RCU Tasks Rude) would need those pesky cond_resched() calls
> > > > to stick around. The reason is that RCU Tasks readers are ended
> > > > only by voluntary context switches. This means that although a
> > > > preemptible infinite loop in the kernel won't inconvenience a
> > > > real-time task (nor an non-real-time task for all that long),
> > > > and won't delay grace periods for the other flavors of RCU,
> > > > it would indefinitely delay an RCU Tasks grace period.
> > > >
> > > > However, RCU Tasks grace periods seem to be finite in preemptible
> > > > kernels today, so they should remain finite in limited-preemptible
> > > > kernels tomorrow. Famous last words...
> > >
> > > That's an issue which you have today with preempt FULL, right? So if it
> > > turns out to be a problem then it's not a problem of the new model.
> >
> > Agreed, and hence my last three lines of text above. Plus the guy who
> > requested RCU Tasks said that it was OK for its grace periods to take
> > a long time, and I am holding Steven Rostedt to that. ;-)
>
> Matters what your definition of "long time" is ;-)
If RCU Tasks grace-period latency has been acceptable in preemptible
kernels (including all CONFIG_PREEMPT_DYNAMIC=y kernels), your definition
of "long" is sufficiently short. ;-)
Thanx, Paul
On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 10:55:02 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > If everything becomes PREEMPT_RCU, then the above should be able to be
> > > turned into just:
> > >
> > > if (!disable_irq)
> > > local_irq_disable();
> > >
> > > rcu_momentary_dyntick_idle();
> > >
> > > if (!disable_irq)
> > > local_irq_enable();
> > >
> > > And no cond_resched() is needed.
> >
> > Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> > run_osnoise() is running in kthread context with preemption and everything
> > else enabled (am I right?), then the change you suggest should work fine.
>
> There's a user space option that lets you run that loop with preemption and/or
> interrupts disabled.
Ah, thank you. Then as long as this function is not expecting an RCU
reader to span that call to rcu_momentary_dyntick_idle(), all is well.
This is a kthread, so there cannot be something else expecting an RCU
reader to span that call.
> > > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > > missing something important here.
> > >
> > > Daniel?
> >
> > But very happy to defer to Daniel. ;-)
>
> But Daniel could also correct me ;-)
If he figures out a way that it is broken, he gets to fix it. ;-)
Thanx, Paul
Paul E. McKenney <[email protected]> writes:
> On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> Paul!
>>
>> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> > Belatedly calling out some RCU issues. Nothing fatal, just a
>> > (surprisingly) few adjustments that will need to be made. The key thing
>> > to note is that from RCU's viewpoint, with this change, all kernels
>> > are preemptible, though rcu_read_lock() readers remain
>> > non-preemptible.
>>
>> Why? Either I'm confused or you or both of us :)
>
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach? I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.
No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus
PREEMPT_RCU so rcu_read_lock/unlock() would touch the
rcu_read_lock_nesting. Which is identical to what PREEMPT_DYNAMIC does.
>> With this approach the kernel is by definition fully preemptible, which
>> means means rcu_read_lock() is preemptible too. That's pretty much the
>> same situation as with PREEMPT_DYNAMIC.
>
> Please, just no!!!
>
> Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> avoids preempting RCU read-side critical sections. This means that the
> distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> of RCU readers in environments expecting no preemption.
Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU,
preempt=none stubs out the actual preemption via __preempt_schedule.
Okay, I see what you are saying.
(Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none,
_cond_resched() doesn't call rcu_all_qs().)
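(For reference, the stub-out described above is the preempt=none case of
sched_dynamic_update() in kernel/sched/core.c, roughly as below --
paraphrased, and the exact shape depends on the kernel version:

	case preempt_dynamic_none:
		preempt_dynamic_enable(cond_resched);
		preempt_dynamic_disable(might_resched);
		preempt_dynamic_disable(preempt_schedule);
		preempt_dynamic_disable(preempt_schedule_notrace);
		preempt_dynamic_disable(irqentry_exit_cond_resched);
		break;

i.e. cond_resched() stays wired up while the involuntary preemption entry
points are patched out, even though the PREEMPT_RCU flavour is what got
built.)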
>> For throughput sake this fully preemptible kernel provides a mechanism
>> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
>> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
>>
>> That means the preemption points in preempt_enable() and return from
>> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
>> completion either to the point where they call schedule() or when they
>> return to user space. That's pretty much what PREEMPT_NONE does today.
>>
>> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
>> points are not longer required because the scheduler can preempt the
>> long running task by setting NEED_RESCHED instead.
>>
>> That preemption might be suboptimal in some cases compared to
>> cond_resched(), but from my initial experimentation that's not really an
>> issue.
>
> I am not (repeat NOT) arguing for keeping cond_resched(). I am instead
> arguing that the less-preemptible variants of the kernel should continue
> to avoid preempting RCU read-side critical sections.
[ snip ]
>> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> CONFIG_RT or such as it does not really change the preemption
>> model itself. RT just reduces the preemption disabled sections with the
>> lock conversions, forced interrupt threading and some more.
>
> Again, please, no.
>
> There are situations where we still need rcu_read_lock() and
> rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> repectively. Those can be cases selected only by Kconfig option, not
> available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
As far as non-preemptible RCU read-side critical sections are concerned,
are the current
- PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none config
(rcu_read_lock/unlock() do not manipulate preempt_count, but do
stub out preempt_schedule())
- and PREEMPT_NONE=y, TREE_RCU config (rcu_read_lock/unlock() manipulate
preempt_count)?
roughly similar or no?
>> > I am sure that I am missing something, but I have not yet seen any
>> > show-stoppers. Just some needed adjustments.
>>
>> Right. If it works out as I think it can work out the main adjustments
>> are to remove a large amount of #ifdef maze and related gunk :)
>
> Just please don't remove the #ifdef gunk that is still needed!
Always the hard part :).
Thanks
--
ankur
On Wed, Oct 18, 2023 at 01:15:28PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> Paul!
> >>
> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> > Belatedly calling out some RCU issues. Nothing fatal, just a
> >> > (surprisingly) few adjustments that will need to be made. The key thing
> >> > to note is that from RCU's viewpoint, with this change, all kernels
> >> > are preemptible, though rcu_read_lock() readers remain
> >> > non-preemptible.
> >>
> >> Why? Either I'm confused or you or both of us :)
> >
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach? I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
>
> No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus
> PREEMPT_RCU so rcu_read_lock/unlock() would touch the
> rcu_read_lock_nesting. Which is identical to what PREEMPT_DYNAMIC does.
Understood. And we need some way to build a kernel such that RCU
read-side critical sections are non-preemptible. This is a hard
requirement that is not going away anytime soon.
> >> With this approach the kernel is by definition fully preemptible, which
> >> means means rcu_read_lock() is preemptible too. That's pretty much the
> >> same situation as with PREEMPT_DYNAMIC.
> >
> > Please, just no!!!
> >
> > Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> > avoids preempting RCU read-side critical sections. This means that the
> > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> > of RCU readers in environments expecting no preemption.
>
> Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU,
> preempt=none stubs out the actual preemption via __preempt_schedule.
>
> Okay, I see what you are saying.
More to the point, currently, you can build with CONFIG_PREEMPT_DYNAMIC=n
and CONFIG_PREEMPT_NONE=y and have non-preemptible RCU read-side critical
sections.
> (Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none,
> _cond_resched() doesn't call rcu_all_qs().)
I have no idea if anyone runs with CONFIG_PREEMPT_DYNAMIC=y and
preempt=none. We don't do so. ;-)
> >> For throughput sake this fully preemptible kernel provides a mechanism
> >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
> >>
> >> That means the preemption points in preempt_enable() and return from
> >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> >> completion either to the point where they call schedule() or when they
> >> return to user space. That's pretty much what PREEMPT_NONE does today.
> >>
> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> >> points are not longer required because the scheduler can preempt the
> >> long running task by setting NEED_RESCHED instead.
> >>
> >> That preemption might be suboptimal in some cases compared to
> >> cond_resched(), but from my initial experimentation that's not really an
> >> issue.
> >
> > I am not (repeat NOT) arguing for keeping cond_resched(). I am instead
> > arguing that the less-preemptible variants of the kernel should continue
> > to avoid preempting RCU read-side critical sections.
>
> [ snip ]
>
> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> CONFIG_RT or such as it does not really change the preemption
> >> model itself. RT just reduces the preemption disabled sections with the
> >> lock conversions, forced interrupt threading and some more.
> >
> > Again, please, no.
> >
> > There are situations where we still need rcu_read_lock() and
> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > repectively. Those can be cases selected only by Kconfig option, not
> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>
> As far as non-preemptible RCU read-side critical sections are concerned,
> are the current
> - PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none config
> (rcu_read_lock/unlock() do not manipulate preempt_count, but do
> stub out preempt_schedule())
> - and PREEMPT_NONE=y, TREE_RCU config (rcu_read_lock/unlock() manipulate
> preempt_count)?
>
> roughly similar or no?
No.
There is still considerable exposure to preemptible-RCU code paths,
for example, when current->rcu_read_unlock_special.b.blocked is set.
> >> > I am sure that I am missing something, but I have not yet seen any
> >> > show-stoppers. Just some needed adjustments.
> >>
> >> Right. If it works out as I think it can work out the main adjustments
> >> are to remove a large amount of #ifdef maze and related gunk :)
> >
> > Just please don't remove the #ifdef gunk that is still needed!
>
> Always the hard part :).
Hey, we wouldn't want to insult your intelligence by letting you work
on too easy of a problem! ;-)
Thanx, Paul
On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:
Can you folks please trim your replies. It's annoying to scroll
through hundreds of quoted lines to figure out that nothing is there.
>> This probably allows for more configuration flexibility across archs?
>> Would allow for TREE_RCU=y, for instance. That said, so far I've only
>> been working with PREEMPT_RCU=y.)
>
> Then this is a bug that needs to be fixed. We need a way to make
> RCU readers non-preemptible.
Why?
On Thu, Oct 19, 2023 at 12:53:05AM +0200, Thomas Gleixner wrote:
> On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:
>
> Can you folks please trim your replies. It's annoying to scroll
> through hundreds of quoted lines to figure out that nothing is there.
>
> >> This probably allows for more configuration flexibility across archs?
> >> Would allow for TREE_RCU=y, for instance. That said, so far I've only
> >> been working with PREEMPT_RCU=y.)
> >
> > Then this is a bug that needs to be fixed. We need a way to make
> > RCU readers non-preemptible.
>
> Why?
So that we don't get tail latencies from preempted RCU readers that
result in memory-usage spikes on systems that have good and sufficient
quantities of memory, but which do not have enough memory to tolerate
readers being preempted.
Thanx, Paul
Paul!
On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> > Belatedly calling out some RCU issues. Nothing fatal, just a
>> > (surprisingly) few adjustments that will need to be made. The key thing
>> > to note is that from RCU's viewpoint, with this change, all kernels
>> > are preemptible, though rcu_read_lock() readers remain
>> > non-preemptible.
>>
>> Why? Either I'm confused or you or both of us :)
>
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach? I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.
Sure, but that's an orthogonal problem, really.
>> With this approach the kernel is by definition fully preemptible, which
>> means means rcu_read_lock() is preemptible too. That's pretty much the
>> same situation as with PREEMPT_DYNAMIC.
>
> Please, just no!!!
>
> Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> avoids preempting RCU read-side critical sections. This means that the
> distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> of RCU readers in environments expecting no preemption.
It does not _avoid_ it, it simply _prevents_ it by not preempting in
preempt_enable() and on return from interrupt so whatever sets
NEED_RESCHED has to wait for a voluntary invocation of schedule(),
cond_resched() or return to user space.
But under the hood RCU is fully preemptible and the boosting logic is
active, but it does not have an effect until one of those preemption
points is reached, which makes the boosting moot.
>> For throughput sake this fully preemptible kernel provides a mechanism
>> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
>> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
>>
>> That means the preemption points in preempt_enable() and return from
>> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
>> completion either to the point where they call schedule() or when they
>> return to user space. That's pretty much what PREEMPT_NONE does today.
>>
>> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
>> points are not longer required because the scheduler can preempt the
>> long running task by setting NEED_RESCHED instead.
>>
>> That preemption might be suboptimal in some cases compared to
>> cond_resched(), but from my initial experimentation that's not really an
>> issue.
>
> I am not (repeat NOT) arguing for keeping cond_resched(). I am instead
> arguing that the less-preemptible variants of the kernel should continue
> to avoid preempting RCU read-side critical sections.
That's the whole point of the lazy mechanism:
It avoids (repeat AVOIDS) preemption of any kernel code as much as it
can by _not_ setting NEED_RESCHED.
The only difference is that it does not _prevent_ it like
preempt=none does. It will preempt when NEED_RESCHED is set.
Now the question is when will NEED_RESCHED be set?
1) If the preempting task belongs to a scheduling class above
SCHED_OTHER
This is a PoC implementation detail. The lazy mechanism can be
extended to any other scheduling class w/o a big effort.
I deliberately did not do that because:
A) I'm lazy
B) More importantly I wanted to demonstrate that as long as
there are only SCHED_OTHER tasks involved there is no forced
(via NEED_RESCHED) preemption unless the to be preempted task
ignores the lazy resched request, which proves that
cond_resched() can be avoided.
At the same time such a kernel allows a RT task to preempt at
any time.
2) If the to be preempted task does not react within a certain time
frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY
request, which is the prerequisite to get rid of cond_resched()
and related muck.
That's obviously mandatory for getting rid of cond_resched() and
related muck, no?
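To make those two points concrete, a minimal sketch of how the lazy
request and its escalation could fit together. This is not the PoC code:
resched_curr_lazy() and TIF_NEED_RESCHED_LAZY are taken from the quoted
patch, everything else, including the lazy_pending_for_a_tick() helper,
is made up for illustration:

	static void resched_curr_lazy(struct rq *rq)
	{
		/* Anything above SCHED_OTHER still gets the real thing. */
		if (rq->curr->sched_class != &fair_sched_class) {
			resched_curr(rq);
			return;
		}

		/* SCHED_OTHER: only note the request; no IPI, no forced
		 * preemption at the next preempt_enable()/irq return. */
		set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
	}

	/* Tick side, also paraphrased: if the lazy request has been ignored
	 * for a full tick, escalate it so the next preemption point really
	 * reschedules. */
	if (test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY) &&
	    lazy_pending_for_a_tick(rq))
		set_tsk_need_resched(rq->curr);

The quoted hunks earlier do the equivalent through update_curr() and
entity_tick(), where update_deadline() is told whether it runs from the
tick and presumably drives the escalation.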
I concede that there are a lot of details to be discussed before we get
there, but I don't see a real show stopper yet.
The important point is that the details are basically boiling down to
policy decisions in the scheduler which are aided by hints from the
programmer.
As I said before we might end up with something like
preempt_me_not_if_not_absolutely_required();
....
preempt_me_I_dont_care();
(+/- name bike shedding) to give the scheduler a better understanding of
the context.
Something like that has distinct advantages over the current situation
with all the cond_resched() muck:
1) It is clearly scope based
2) It is properly nesting
3) It can be easily made implicit for existing scope constructs like
rcu_read_lock/unlock() or regular locking mechanisms.
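Purely to illustrate what "scope based" and "properly nesting" could look
like, a hypothetical sketch; the function names are the placeholders from
this mail and the per-task counter does not exist anywhere:

	static inline void preempt_me_not_if_not_absolutely_required(void)
	{
		current->lazy_nesting++;	/* hypothetical per-task counter */
	}

	static inline void preempt_me_I_dont_care(void)
	{
		if (!--current->lazy_nesting && need_resched())
			schedule();		/* scheduler regains control here */
	}

With something of that shape, rcu_read_lock()/rcu_read_unlock() or the
regular locking primitives could invoke the hints implicitly, which is
what point 3 above means.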
The important point is that at the very end the scheduler has the
ultimate power to say: "No longer Mr. Nice Guy" without the risk of any
random damage due to the fact that preemption count is functional, which
makes your life easier as well as you admitted already. But that does
not mean you can eat the cake and still have it. :)
That said, I completely understand your worries about the consequences,
but please take the step back and look at it from a conceptual point of
view.
The goal is to replace the hard coded (Kconfig or DYNAMIC) policy
mechanisms with a flexible scheduler controlled policy mechanism.
That allows you to focus on one consolidated model and optimize that
for particular policy scenarios instead of dealing with optimizing the
hell out of hardcoded policies which force you to come up with
horrible workaround for each of them.
Of course the policies have to be defined (scheduling classes affected
depending on model, hint/annotation meaning etc.), but that's way more
palatable than what we have now. Let me give you a simple example:
Right now the only way out on preempt=none when a rogue code path
which lacks a cond_resched() fails to release the CPU is a big fat
stall splat and a hosed machine.
I rather prefer to have the fully controlled hammer ready which keeps
the machine usable and the situation debuggable.
You still can yell in dmesg, but that again is a flexible policy
decision and not hard coded by any means.
>> > 3. For nohz_full CPUs that run for a long time in the kernel,
>> > there are no scheduling-clock interrupts. RCU reaches for
>> > the resched_cpu() hammer a few jiffies into the grace period.
>> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
>> > interrupt-entry code will re-enable its scheduling-clock interrupt
>> > upon receiving the resched_cpu() IPI.
>>
>> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
>> will cause it to preempt.
>
> That is not sufficient for nohz_full CPUs executing in userspace,
That's not what I was talking about. You said:
>> > 3. For nohz_full CPUs that run for a long time in the kernel,
^^^^^^
Duh! I did not realize that you meant user space. For user space there
is zero difference to the current situation. Once the task is out in
user space it's out of RCU side critical sections, so that's obviously
not a problem.
As I said: I might be confused. :)
>> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> CONFIG_RT or such as it does not really change the preemption
>> model itself. RT just reduces the preemption disabled sections with the
>> lock conversions, forced interrupt threading and some more.
>
> Again, please, no.
>
> There are situations where we still need rcu_read_lock() and
> rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> repectively. Those can be cases selected only by Kconfig option, not
> available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
Why are you so fixated on making everything hardcoded instead of making
it a proper policy decision problem. See above.
>> > 8. As has been noted elsewhere, in this new limited-preemption
>> > mode of operation, rcu_read_lock() readers remain preemptible.
>> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
>>
>> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
>
> That is in fact the problem. Preemption can be good, but it is possible
> to have too much of a good thing, and preemptible RCU read-side critical
> sections definitely is in that category for some important workloads. ;-)
See above.
>> > 10. The cond_resched_rcu() function must remain because we still
>> > have non-preemptible rcu_read_lock() readers.
>>
>> Where?
>
> In datacenters.
See above.
>> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
>> > might need to do something for non-preemptible RCU to make
>> > up for the lack of cond_resched() calls. Maybe just drop the
>> > "IS_ENABLED()" and execute the body of the current "if" statement
>> > unconditionally.
>>
>> Again. There is no non-preemtible RCU with this model, unless I'm
>> missing something important here.
>
> And again, there needs to be non-preemptible RCU with this model.
See above.
Thanks,
tglx
On 10/18/23 20:13, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
>> On Wed, 18 Oct 2023 10:55:02 -0700
>> "Paul E. McKenney" <[email protected]> wrote:
>>
>>>> If everything becomes PREEMPT_RCU, then the above should be able to be
>>>> turned into just:
>>>>
>>>> if (!disable_irq)
>>>> local_irq_disable();
>>>>
>>>> rcu_momentary_dyntick_idle();
>>>>
>>>> if (!disable_irq)
>>>> local_irq_enable();
>>>>
>>>> And no cond_resched() is needed.
>>>
>>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
>>> run_osnoise() is running in kthread context with preemption and everything
>>> else enabled (am I right?), then the change you suggest should work fine.
>>
>> There's a user space option that lets you run that loop with preemption and/or
>> interrupts disabled.
>
> Ah, thank you. Then as long as this function is not expecting an RCU
> reader to span that call to rcu_momentary_dyntick_idle(), all is well.
> This is a kthread, so there cannot be something else expecting an RCU
> reader to span that call.
Sorry for the delay, this thread is quite long (and I admit I should be paying
attention to it).
It seems that you both figured it out without me anyway. This piece of
code is preemptible unless a config option is set to disable irqs or preemption
(as Steven mentioned). That call is just a ping to RCU to say that things
are fine.
So Steven's suggestion should work.
>>>>> Again. There is no non-preemtible RCU with this model, unless I'm
>>>>> missing something important here.
>>>>
>>>> Daniel?
>>>
>>> But very happy to defer to Daniel. ;-)
>>
>> But Daniel could also correct me ;-)
>
> If he figures out a way that it is broken, he gets to fix it. ;-)
It works for me, keep me in the loop for the patches and I can test and
adjust osnoise accordingly. osnoise should not be a reason to block more
important things like this patch set, and we can find a way out on
the osnoise tracer side. (I might need assistance from the RCU
people, but I know I can count on them :-).
Thanks!
-- Daniel
> Thanx, Paul
On Thu, Oct 19, 2023 at 02:37:23PM +0200, Daniel Bristot de Oliveira wrote:
> On 10/18/23 20:13, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
> >> On Wed, 18 Oct 2023 10:55:02 -0700
> >> "Paul E. McKenney" <[email protected]> wrote:
> >>
> >>>> If everything becomes PREEMPT_RCU, then the above should be able to be
> >>>> turned into just:
> >>>>
> >>>> if (!disable_irq)
> >>>> local_irq_disable();
> >>>>
> >>>> rcu_momentary_dyntick_idle();
> >>>>
> >>>> if (!disable_irq)
> >>>> local_irq_enable();
> >>>>
> >>>> And no cond_resched() is needed.
> >>>
> >>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> >>> run_osnoise() is running in kthread context with preemption and everything
> >>> else enabled (am I right?), then the change you suggest should work fine.
> >>
> >> There's a user space option that lets you run that loop with preemption and/or
> >> interrupts disabled.
> >
> > Ah, thank you. Then as long as this function is not expecting an RCU
> > reader to span that call to rcu_momentary_dyntick_idle(), all is well.
> > This is a kthread, so there cannot be something else expecting an RCU
> > reader to span that call.
>
> Sorry for the delay, this thread is quite long (and I admit I should be paying
> attention to it).
>
> It seems that you both figured it out without me anyway. This piece of
> code is preemptible unless a config option is set to disable irqs or preemption
> (as Steven mentioned). That call is just a ping to RCU to say that things
> are fine.
>
> So Steven's suggestion should work.
Very good!
> >>>>> Again. There is no non-preemtible RCU with this model, unless I'm
> >>>>> missing something important here.
> >>>>
> >>>> Daniel?
> >>>
> >>> But very happy to defer to Daniel. ;-)
> >>
> >> But Daniel could also correct me ;-)
> >
> > If he figures out a way that it is broken, he gets to fix it. ;-)
>
> It works for me, keep me in the loop for the patches and I can test and
> adjust osnoise accordingly. osnoise should not be a reason to block more
> important things like this patch set, and we can find a way out on
> the osnoise tracer side. (I might need assistance from the RCU
> people, but I know I can count on them :-).
For good or for bad, we will be here. ;-)
Thanx, Paul
Thomas!
On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> Paul!
>
> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> > Belatedly calling out some RCU issues. Nothing fatal, just a
> >> > (surprisingly) few adjustments that will need to be made. The key thing
> >> > to note is that from RCU's viewpoint, with this change, all kernels
> >> > are preemptible, though rcu_read_lock() readers remain
> >> > non-preemptible.
> >>
> >> Why? Either I'm confused or you or both of us :)
> >
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach? I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
>
> Sure, but that's an orthogonal problem, really.
Orthogonal, parallel, skew, whatever, it and its friends still need to
be addressed.
> >> With this approach the kernel is by definition fully preemptible, which
> >> means means rcu_read_lock() is preemptible too. That's pretty much the
> >> same situation as with PREEMPT_DYNAMIC.
> >
> > Please, just no!!!
> >
> > Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> > avoids preempting RCU read-side critical sections. This means that the
> > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> > of RCU readers in environments expecting no preemption.
>
> It does not _avoid_ it, it simply _prevents_ it by not preempting in
> preempt_enable() and on return from interrupt so whatever sets
> NEED_RESCHED has to wait for a voluntary invocation of schedule(),
> cond_resched() or return to user space.
A distinction without a difference. ;-)
> But under the hood RCU is fully preemptible and the boosting logic is
> active, but it does not have an effect until one of those preemption
> points is reached, which makes the boosting moot.
And for many distros, this appears to be just fine, not that I personally
know of anyone running large numbers of systems in production with
kernels built with CONFIG_PREEMPT_DYNAMIC=y and booted with preempt=none.
And let's face it, if you want exactly the same binary to support both
modes, you are stuck with the fully-preemptible implementation of RCU.
But we should not make a virtue of such a distro's necessity.
And some of us are not afraid to build our own kernels, which allows
us to completely avoid the added code required to make RCU read-side
critical sections be preemptible.
> >> For throughput sake this fully preemptible kernel provides a mechanism
> >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
> >>
> >> That means the preemption points in preempt_enable() and return from
> >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> >> completion either to the point where they call schedule() or when they
> >> return to user space. That's pretty much what PREEMPT_NONE does today.
> >>
> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> >> points are not longer required because the scheduler can preempt the
> >> long running task by setting NEED_RESCHED instead.
> >>
> >> That preemption might be suboptimal in some cases compared to
> >> cond_resched(), but from my initial experimentation that's not really an
> >> issue.
> >
> > I am not (repeat NOT) arguing for keeping cond_resched(). I am instead
> > arguing that the less-preemptible variants of the kernel should continue
> > to avoid preempting RCU read-side critical sections.
>
> That's the whole point of the lazy mechanism:
>
> It avoids (repeat AVOIDS) preemption of any kernel code as much as it
> can by _not_ setting NEED_RESCHED.
>
> The only difference is that it does not _prevent_ it like
> preempt=none does. It will preempt when NEED_RESCHED is set.
>
> Now the question is when will NEED_RESCHED be set?
>
> 1) If the preempting task belongs to a scheduling class above
> SCHED_OTHER
>
> This is a PoC implementation detail. The lazy mechanism can be
> extended to any other scheduling class w/o a big effort.
>
> I deliberately did not do that because:
>
> A) I'm lazy
>
> B) More importantly I wanted to demonstrate that as long as
> there are only SCHED_OTHER tasks involved there is no forced
> (via NEED_RESCHED) preemption unless the to be preempted task
> ignores the lazy resched request, which proves that
> cond_resched() can be avoided.
>
> At the same time such a kernel allows a RT task to preempt at
> any time.
>
> 2) If the to be preempted task does not react within a certain time
> frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY
> request, which is the prerequisite to get rid of cond_resched()
> and related muck.
>
> That's obviously mandatory for getting rid of cond_resched() and
> related muck, no?
Keeping firmly in mind that there are no cond_resched() calls within RCU
read-side critical sections, sure. Or, if you prefer, any such calls
are bugs. And agreed, outside of atomic contexts (in my specific case,
including RCU readers), there does eventually need to be a preemption.
> I concede that there are a lot of details to be discussed before we get
> there, but I don't see a real show stopper yet.
Which is what I have been saying as well, at least as long as we can
have a way of building a kernel with a non-preemptible build of RCU.
And not just a preemptible RCU in which the scheduler (sometimes?)
refrains from preempting the RCU read-side critical sections, but
really only having the CONFIG_PREEMPT_RCU=n code built.
Give or take the needs of the KLP guys, but again, I must defer to
them.
> The important point is that the details are basically boiling down to
> policy decisions in the scheduler which are aided by hints from the
> programmer.
>
> As I said before we might end up with something like
>
> preempt_me_not_if_not_absolutely_required();
> ....
> preempt_me_I_dont_care();
>
> (+/- name bike shedding) to give the scheduler a better understanding of
> the context.
>
> Something like that has distinct advantages over the current situation
> with all the cond_resched() muck:
>
> 1) It is clearly scope based
>
> 2) It is properly nesting
>
> 3) It can be easily made implicit for existing scope constructs like
> rcu_read_lock/unlock() or regular locking mechanisms.
You know, I was on board with throwing cond_resched() overboard (again,
give or take whatever KLP might need) when I first read of this in that
LWN article. You therefore cannot possibly gain anything by continuing
to sell it to me, and, worse yet, you might provoke an heretofore-innocent
bystander into pushing some bogus but convincing argument against. ;-)
Yes, there are risks due to additional state space exposed by the
additional preemption. However, at least some of this is already covered
by quite a few people running preemptible kernels. There will be some
not covered, given our sensitivity to low-probability bugs, but there
should also be some improvements in tail latency. The process of getting
the first cond_resched()-free kernel deployed will therefore likely be
a bit painful, but overall the gains should be worth the pain.
> The important point is that at the very end the scheduler has the
> ultimate power to say: "No longer Mr. Nice Guy" without the risk of any
> random damage due to the fact that preemption count is functional, which
> makes your life easier as well as you admitted already. But that does
> not mean you can eat the cake and still have it. :)
Which is exactly why I need rcu_read_lock() to map to preempt_disable()
and rcu_read_unlock() to preempt_enable(). ;-)
> That said, I completely understand your worries about the consequences,
> but please take the step back and look at it from a conceptual point of
> view.
Conceptual point of view? That sounds suspiciously academic. Who are
you and what did you do with the real Thomas Gleixner? ;-)
But yes, consequences are extremely important, as always.
> The goal is to replace the hard coded (Kconfig or DYNAMIC) policy
> mechanisms with a flexible scheduler controlled policy mechanism.
Are you saying that CONFIG_PREEMPT_RT will also be selected at boot time
and/or via debugfs?
> That allows you to focus on one consolidated model and optimize that
> for particular policy scenarios instead of dealing with optimizing the
> hell out of hardcoded policies which force you to come up with
> horrible workaround for each of them.
>
> Of course the policies have to be defined (scheduling classes affected
> depending on model, hint/annotation meaning etc.), but that's way more
> palatable than what we have now. Let me give you a simple example:
>
> Right now the only way out on preempt=none when a rogue code path
> which lacks a cond_resched() fails to release the CPU is a big fat
> stall splat and a hosed machine.
>
> I rather prefer to have the fully controlled hammer ready which keeps
> the machine usable and the situation debuggable.
>
> You still can yell in dmesg, but that again is a flexible policy
> decision and not hard coded by any means.
And I have agreed from my first read of that LWN article that allowing
preemption of code where preempt_count()=0 is a good thing.
The only thing that I am pushing back on is specifically your wish to
always be running the CONFIG_PREEMPT_RCU=y RCU code. Yes, that is what
single-binary distros will do, just as they do now. But again, some of
us are happy to build our own kernels.
There might be other things that I should be pushing back on, but that
is all that I am aware of right now. ;-)
> >> > 3. For nohz_full CPUs that run for a long time in the kernel,
> >> > there are no scheduling-clock interrupts. RCU reaches for
> >> > the resched_cpu() hammer a few jiffies into the grace period.
> >> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> >> > interrupt-entry code will re-enable its scheduling-clock interrupt
> >> > upon receiving the resched_cpu() IPI.
> >>
> >> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
> >> will cause it to preempt.
> >
> > That is not sufficient for nohz_full CPUs executing in userspace,
>
> That's not what I was talking about. You said:
>
> >> > 3. For nohz_full CPUs that run for a long time in the kernel,
> ^^^^^^
> Duh! I did not realize that you meant user space. For user space there
> is zero difference to the current situation. Once the task is out in
> user space it's out of RCU side critical sections, so that's obviously
> not a problem.
>
> As I said: I might be confused. :)
And I might well also be confused. Here is my view for nohz_full CPUs:
o Running in userspace. RCU will ignore them without disturbing
the CPU, courtesy of context tracking. As you say, there is
no way (absent extremely strange sidechannel attacks) to
have a kernel RCU read-side critical section here.
These CPUs will ignore NEED_RESCHED until they exit usermode
one way or another. This exit will usually be supplied by
the scheduler's wakeup IPI for the newly awakened task.
But just setting NEED_RESCHED without otherwise getting the
CPU's full attention won't have any effect.
o Running in the kernel entry/exit code. RCU will ignore them
without disturbing the CPU, courtesy of context tracking.
Unlike usermode, you can type rcu_read_lock(), but if you do,
lockdep will complain bitterly.
Assuming the time in the kernel is sharply bounded, as it
usually will be, these CPUs will respond to NEED_RESCHED in a
timely manner. For longer times in the kernel, please see below.
o Running in the kernel in deep idle, that is, where RCU is not
watching. RCU will ignore them without disturbing the CPU,
courtesy of context tracking. As with the entry/exit code,
you can type rcu_read_lock(), but if you do, lockdep will
complain bitterly.
The exact response to NEED_RESCHED depends on the type of idle
loop, with (as I understand it) polling idle loops responding
quickly and other idle loops needing some event to wake up
the CPU. This event is typically an IPI, as is the case when
the scheduler wakes up a task on the CPU in question.
o Running in other parts of the kernel, but with scheduling
clock interrupt enabled. The next scheduling clock interrupt
will take care of both RCU and NEED_RESCHED. Give or take
policy decisions, as you say above.
o Running in other parts of the kernel, but with scheduling clock
interrupt disabled. If there is a grace period waiting on this
CPU, RCU will eventually set a flag and invoke resched_cpu(),
which will get the CPU's attention via an IPI and will also turn
the scheduling clock interrupt back on.
I believe that a wakeup from the scheduler has the same effect,
and that it uses an IPI to get the CPU's attention when needed,
but it has been one good long time since I traced out all the
details.
However, given that there is to be no cond_resched(), setting
NEED_RESCHED without doing something like an IPI to get that
CPU's attention will still not be guaranteed to have any effect,
just as with the nohz_full CPU executing in userspace, correct?
Did I miss anything?
> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> CONFIG_RT or such as it does not really change the preemption
> >> model itself. RT just reduces the preemption disabled sections with the
> >> lock conversions, forced interrupt threading and some more.
> >
> > Again, please, no.
> >
> > There are situations where we still need rcu_read_lock() and
> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > repectively. Those can be cases selected only by Kconfig option, not
> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>
> Why are you so fixated on making everything hardcoded instead of making
> it a proper policy decision problem. See above.
Because I am one of the people who will bear the consequences.
In that same vein, why are you so opposed to continuing to provide
the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
is already in place, is extremely well tested, and you need to handle
preempt_disable()/preempt_enable() regions of code in any case. What is
the real problem here?
> >> > 8. As has been noted elsewhere, in this new limited-preemption
> >> > mode of operation, rcu_read_lock() readers remain preemptible.
> >> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
> >>
> >> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
> >
> > That is in fact the problem. Preemption can be good, but it is possible
> > to have too much of a good thing, and preemptible RCU read-side critical
> > sections definitely is in that category for some important workloads. ;-)
>
> See above.
>
> >> > 10. The cond_resched_rcu() function must remain because we still
> >> > have non-preemptible rcu_read_lock() readers.
> >>
> >> Where?
> >
> > In datacenters.
>
> See above.
>
> >> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> >> > might need to do something for non-preemptible RCU to make
> >> > up for the lack of cond_resched() calls. Maybe just drop the
> >> > "IS_ENABLED()" and execute the body of the current "if" statement
> >> > unconditionally.
> >>
> >> Again. There is no non-preemtible RCU with this model, unless I'm
> >> missing something important here.
> >
> > And again, there needs to be non-preemptible RCU with this model.
>
> See above.
And back at you with all three instances of "See above". ;-)
Thanx, Paul
On Thu, Oct 19, 2023 at 12:13:31PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> > On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> > > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> > >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
[ . . . ]
> > >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> > >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> > >> CONFIG_RT or such as it does not really change the preemption
> > >> model itself. RT just reduces the preemption disabled sections with the
> > >> lock conversions, forced interrupt threading and some more.
> > >
> > > Again, please, no.
> > >
> > > There are situations where we still need rcu_read_lock() and
> > > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > > repectively. Those can be cases selected only by Kconfig option, not
> > > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> >
> > Why are you so fixated on making everything hardcoded instead of making
> > it a proper policy decision problem. See above.
>
> Because I am one of the people who will bear the consequences.
>
> In that same vein, why are you so opposed to continuing to provide
> the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
> is already in place, is extremely well tested, and you need to handle
> preempt_disable()/preempt_enable() regions of code in any case. What is
> the real problem here?
I should hasten to add that from a conceptual viewpoint, I do support
the eventual elimination of CONFIG_PREEMPT_RCU=n code, but with emphasis
on the word "eventual". Although preemptible RCU is plenty reliable if
you are running only a few thousand servers (and maybe even a few tens
of thousands), it has some improving to do before I will be comfortable
recommending its use in large-scale datacenters.
And yes, I know about Android deployments. But those devices tend
to spend very little time in the kernel, in fact, many of them tend to
spend very little time powered up. Plus they tend to have relatively few
CPUs, at least by 2020s standards. So it takes a rather large number of
Android devices to impose the same stress on the kernel that is imposed
by a single mid-sized server.
And we are working on making preemptible RCU more reliable. One nice
change over the past 5-10 years is that more people are getting serious
about digging into the RCU code, testing it, and reporting and fixing the
resulting bugs. I am also continuing to make rcutorture more vicious,
and of course I am greatly helped by the easier availability of hardware
with which to test RCU.
If this level of activity continues for another five years, then maybe
preemptible RCU will be ready for large datacenter deployments.
But I am guessing that you had something in mind in addition to code
consolidation.
Thanx, Paul
Paul E. McKenney <[email protected]> writes:
> Thomas!
>
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> Paul!
>>
>> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
>> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> >> CONFIG_RT or such as it does not really change the preemption
>> >> model itself. RT just reduces the preemption disabled sections with the
>> >> lock conversions, forced interrupt threading and some more.
>> >
>> > Again, please, no.
>> >
>> > There are situations where we still need rcu_read_lock() and
>> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> respectively. Those can be cases selected only by Kconfig option, not
>> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>>
>> Why are you so fixated on making everything hardcoded instead of making
>> it a proper policy decision problem. See above.
>
> Because I am one of the people who will bear the consequences.
>
> In that same vein, why are you so opposed to continuing to provide
> the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
> is already in place, is extremely well tested, and you need to handle
> preempt_disable()/preempt_enable() regions of code in any case. What is
> the real problem here?
I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y?
I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from
2015, stating that the only possible choice for PREEMPTION=y kernels
is PREEMPT_RCU=y:
The RCU implementation is chosen based on PREEMPT and SMP config options
and is not really a user-selectable choice. This commit removes the
menu entry, given that there is not much point in calling something a
choice when there is in fact no choice.. The TINY_RCU, TREE_RCU, and
PREEMPT_RCU Kconfig options continue to be selected based solely on the
values of the PREEMPT and SMP options.
As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
stronger forward progress guarantees with respect to rcu readers (in
that they can't be preempted.)
So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
obvious there.
Thanks
--
ankur
On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > Thomas!
> >
> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> Paul!
> >>
> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> >> CONFIG_RT or such as it does not really change the preemption
> >> >> model itself. RT just reduces the preemption disabled sections with the
> >> >> lock conversions, forced interrupt threading and some more.
> >> >
> >> > Again, please, no.
> >> >
> >> > There are situations where we still need rcu_read_lock() and
> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> >> > respectively. Those can be cases selected only by Kconfig option, not
> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> >>
> >> Why are you so fixated on making everything hardcoded instead of making
> >> it a proper policy decision problem. See above.
> >
> > Because I am one of the people who will bear the consequences.
> >
> > In that same vein, why are you so opposed to continuing to provide
> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
> > is already in place, is extremely well tested, and you need to handle
> > preempt_disable()/preempt_enable() regions of code in any case. What is
> > the real problem here?
>
> I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y?
This Kconfig block in kernel/rcu/Kconfig:
------------------------------------------------------------------------
config PREEMPT_RCU
        bool
        default y if PREEMPTION
        select TREE_RCU
        help
          This option selects the RCU implementation that is
          designed for very large SMP systems with hundreds or
          thousands of CPUs, but for which real-time response
          is also required. It also scales down nicely to
          smaller systems.

          Select this option if you are unsure.
------------------------------------------------------------------------
There is no prompt string after the "bool", so it is not user-settable.
Therefore, it is driven directly off of the value of PREEMPTION, taking
the global default of "n" if PREEMPTION is not set and "y" otherwise.
You could change the second line to read:
bool "Go ahead! Make my day!"
or preferably something more helpful. This change would allow a
preemptible kernel to be built with non-preemptible RCU and vice versa,
as used to be the case long ago. However, it might be way better to drive
the choice from some other Kconfig option and leave out the prompt string.
> I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from
> 2015, stating that the only possible choice for PREEMPTION=y kernels
> is PREEMPT_RCU=y:
>
> The RCU implementation is chosen based on PREEMPT and SMP config options
> and is not really a user-selectable choice. This commit removes the
> menu entry, given that there is not much point in calling something a
> choice when there is in fact no choice.. The TINY_RCU, TREE_RCU, and
> PREEMPT_RCU Kconfig options continue to be selected based solely on the
> values of the PREEMPT and SMP options.
The main point of this commit was to reduce testing effort and sysadm
confusion by removing choices that were not necessary back then.
> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
> stronger forward progress guarantees with respect to rcu readers (in
> that they can't be preempted.)
TREE_RCU=y is absolutely required if you want a kernel to run on a system
with more than one CPU, and for that matter, if you want preemptible RCU,
even on a single-CPU system.
> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
> obvious there.
If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
can run any combination:
PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible
kernels, so this works just fine (famous last words).
PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible
RCU, so that rcu_read_lock() is preempt_disable() and
rcu_read_unlock() is preempt_enable(). This should just work,
except for the fact that cond_resched() disappears, which
stymies some of RCU's forward-progress mechanisms. And this
was the topic of our earlier discussion on this thread. The
fixes should not be too hard.
Of course, this has not been either tested or used for at least
eight years, so there might be some bitrot. If so, I will of
course be happy to help fix it.
!PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible
RCU. Although this particular combination of Kconfig
options has not been tested for at least eight years, giving
a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
kernel boot parameter gets you pretty close. Again, there is
likely to be some bitrot somewhere, but way fewer bits to rot
than for PREEMPTION && !PREEMPT_RCU. Outside of the current
CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
combination, but if there is a need and if it is broken, I will
be happy to help fix it.
!PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible
RCU, which is what we use today for non-preemptible kernels built
with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last
words, this works just fine.
Does that help, or am I missing the point of your question?
Thanx, Paul
Paul E. McKenney <[email protected]> writes:
> On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
>>
>> Paul E. McKenney <[email protected]> writes:
>>
>> > Thomas!
>> >
>> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> >> Paul!
>> >>
>> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
>> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> >> >> CONFIG_RT or such as it does not really change the preemption
>> >> >> model itself. RT just reduces the preemption disabled sections with the
>> >> >> lock conversions, forced interrupt threading and some more.
>> >> >
>> >> > Again, please, no.
>> >> >
>> >> > There are situations where we still need rcu_read_lock() and
>> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> >> > respectively. Those can be cases selected only by Kconfig option, not
>> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>> >>
>> >> Why are you so fixated on making everything hardcoded instead of making
>> >> it a proper policy decision problem. See above.
>> >
>> > Because I am one of the people who will bear the consequences.
>> >
>> > In that same vein, why are you so opposed to continuing to provide
>> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
>> > is already in place, is extremely well tested, and you need to handle
> >> > preempt_disable()/preempt_enable() regions of code in any case. What is
>> > the real problem here?
>>
[ snip ]
>> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
>> stronger forward progress guarantees with respect to rcu readers (in
>> that they can't be preempted.)
>
> TREE_RCU=y is absolutely required if you want a kernel to run on a system
> with more than one CPU, and for that matter, if you want preemptible RCU,
> even on a single-CPU system.
>
>> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
>> obvious there.
>
> If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
> can run any combination:
Sorry, yes I did. Should have said "can PREEMPTION=y run with (TREE_RCU=y,
PREEMPT_RCU=n)?"
> PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible
> kernels, so this works just fine (famous last words).
>
> PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible
> RCU, so that rcu_read_lock() is preempt_disable() and
> rcu_read_unlock() is preempt_enable(). This should just work,
> except for the fact that cond_resched() disappears, which
> stymies some of RCU's forward-progress mechanisms. And this
> was the topic of our earlier discussion on this thread. The
> fixes should not be too hard.
>
> Of course, this has not been either tested or used for at least
> eight years, so there might be some bitrot. If so, I will of
> course be happy to help fix it.
>
>
> !PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible
> RCU. Although this particular combination of Kconfig
> options has not been tested for at least eight years, giving
> a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
> kernel boot parameter gets you pretty close. Again, there is
> likely to be some bitrot somewhere, but way fewer bits to rot
> than for PREEMPTION && !PREEMPT_RCU. Outside of the current
> CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
> combination, but if there is a need and if it is broken, I will
> be happy to help fix it.
>
> !PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible
> RCU, which is what we use today for non-preemptible kernels built
> with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last
> words, this works just fine.
>
> Does that help, or am I missing the point of your question?
It does indeed. What I was going for is that this series (or, at
least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION
in spirit, while doing away with it as a compile-time config option.
That it does, as TGLX mentioned upthread, by moving all of the policy
to the scheduler, which can be tuned by user-space (via sched-features.)
So, my question was in response to this:
>> > In that same vein, why are you so opposed to continuing to provide
>> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
>> > is already in place, is extremely well tested, and you need to handle
> preempt_disable()/preempt_enable() regions of code in any case. What is
>> > the real problem here?
Based on your response, the (PREEMPT_RCU=n, TREE_RCU=y) configuration
seems to be eminently usable with this series.
(Or maybe I missed the point of that discussion.)
On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n,
TREE_RCU=y) kernel some hours ago. Nothing broken (yet!).
--
ankur
On Fri, Oct 20, 2023 at 06:05:21PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
> >>
> >> Paul E. McKenney <[email protected]> writes:
> >>
> >> > Thomas!
> >> >
> >> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> >> Paul!
> >> >>
> >> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> >> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> >> >> CONFIG_RT or such as it does not really change the preemption
> >> >> >> model itself. RT just reduces the preemption disabled sections with the
> >> >> >> lock conversions, forced interrupt threading and some more.
> >> >> >
> >> >> > Again, please, no.
> >> >> >
> >> >> > There are situations where we still need rcu_read_lock() and
> >> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> >> > respectively. Those can be cases selected only by Kconfig option, not
> >> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> >> >>
> >> >> Why are you so fixated on making everything hardcoded instead of making
> >> >> it a proper policy decision problem. See above.
> >> >
> >> > Because I am one of the people who will bear the consequences.
> >> >
> >> > In that same vein, why are you so opposed to continuing to provide
> >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
> >> > is already in place, is extremely well tested, and you need to handle
> >> > preempt_disable()/preempt_enable() regions of code in any case. What is
> >> > the real problem here?
> >>
>
> [ snip ]
>
> >> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
> >> stronger forward progress guarantees with respect to rcu readers (in
> >> that they can't be preempted.)
> >
> > TREE_RCU=y is absolutely required if you want a kernel to run on a system
> > with more than one CPU, and for that matter, if you want preemptible RCU,
> > even on a single-CPU system.
> >
> >> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
> >> obvious there.
> >
> > If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
> > can run any combination:
>
> Sorry, yes I did. Should have said "can PREEMPTION=y run with (TREE_RCU=y,
> PREEMPT_RCU=n)?"
>
> > PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible
> > kernels, so this works just fine (famous last words).
> >
> > PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible
> > RCU, so that rcu_read_lock() is preempt_disable() and
> > rcu_read_unlock() is preempt_enable(). This should just work,
> > except for the fact that cond_resched() disappears, which
> > stymies some of RCU's forward-progress mechanisms. And this
> > was the topic of our earlier discussion on this thread. The
> > fixes should not be too hard.
> >
> > Of course, this has not been either tested or used for at least
> > eight years, so there might be some bitrot. If so, I will of
> > course be happy to help fix it.
> >
> >
> > !PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible
> > RCU. Although this particular combination of Kconfig
> > options has not been tested for at least eight years, giving
> > a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
> > kernel boot parameter gets you pretty close. Again, there is
> > likely to be some bitrot somewhere, but way fewer bits to rot
> > than for PREEMPTION && !PREEMPT_RCU. Outside of the current
> > CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
> > combination, but if there is a need and if it is broken, I will
> > be happy to help fix it.
> >
> > !PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible
> > RCU, which is what we use today for non-preemptible kernels built
> > with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last
> > words, this works just fine.
> >
> > Does that help, or am I missing the point of your question?
>
> It does indeed. What I was going for is that this series (or, at
> least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION
> in spirit, while doing away with it as a compile-time config option.
>
> That it does, as TGLX mentioned upthread, by moving all of the policy
> to the scheduler, which can be tuned by user-space (via sched-features.)
>
> So, my question was in response to this:
>
> >> > In that same vein, why are you so opposed to continuing to provide
> >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code
> >> > is already in place, is extremely well tested, and you need to handle
> >> > preempt_disable()/preempt_enable() regions of code in any case. What is
> >> > the real problem here?
>
> Based on your response, the (PREEMPT_RCU=n, TREE_RCU=y) configuration
> seems to be eminently usable with this series.
>
> (Or maybe I missed the point of that discussion.)
>
> On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n,
> TREE_RCU=y) kernel some hours ago. Nothing broken (yet!).
Thank you, and here is hoping! ;-)
Thanx, Paul
Paul!
On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote:
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> The important point is that at the very end the scheduler has the
>> ultimate power to say: "No longer Mr. Nice Guy" without the risk of any
>> random damage due to the fact that preemption count is functional, which
>> makes your life easier as well as you admitted already. But that does
>> not mean you can eat the cake and still have it. :)
>
> Which is exactly why I need rcu_read_lock() to map to preempt_disable()
> and rcu_read_unlock() to preempt_enable(). ;-)
After reading back in the thread, I think we greatly talked past each
other mostly due to the different expectations and the resulting
dependencies which seem to be hardwired into our brains.
I'm pleading guilty as charged as I failed completely to read your
initial statement
"The key thing to note is that from RCU's viewpoint, with this change,
all kernels are preemptible, though rcu_read_lock() readers remain
non-preemptible."
with that in mind and instead of dissecting it properly I committed the
fallacy of stating exactly the opposite, which obviously reflects only
the point of view I'm coming from.
With a fresh view, this turns out to be a complete non-problem because
there is no semantical dependency between the preemption model and the
RCU flavour.
The unified kernel preemption model has the following properties:
1) It provides full preemptive multitasking.
2) Preemptability is limited by implicit and explicit mechanisms.
3) The ability to avoid overeager preemption for SCHED_OTHER tasks via
the PREEMPT_LAZY mechanism.
This emulates the NONE/VOLUNTARY preemption models which
semantically provide collaborative multitasking.
This emulation is not breaking the semantical properties of full
preemptive multitasking because the scheduler still has the ability
to enforce immediate preemption under consideration of #2.
Which in turn is a prerequisite for removing the semantically
ill-defined cond/might_resched() constructs.
The compile time selectable RCU flavour (preemptible/non-preemptible) is
not imposing a semantical change on this unified preemption model.
The selection of the RCU flavour is solely affecting the preemptability
(#2 above). Selecting non-preemptible RCU reduces preemptability by
adding an implicit restriction via mapping rcu_read_lock()
to preempt_disable().
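As a rough user-space model of that difference (an illustration only, not the
kernel's actual rcupdate.h code), per-thread counters can stand in for the
preempt counter and for preemptible RCU's per-task nesting count:

#include <stdio.h>

static __thread int preempt_count;          /* stand-in for the preempt counter */
static __thread int rcu_read_lock_nesting;  /* what preemptible RCU tracks instead */

/* PREEMPT_RCU=n flavour: a reader is simply a preempt-disabled region. */
static void rcu_read_lock_flat(void)    { preempt_count++; }
static void rcu_read_unlock_flat(void)  { preempt_count--; }

/* PREEMPT_RCU=y flavour: only the nesting depth is recorded; preemption
   of the reader remains legal. */
static void rcu_read_lock_nested(void)   { rcu_read_lock_nesting++; }
static void rcu_read_unlock_nested(void) { rcu_read_lock_nesting--; }

int main(void)
{
        rcu_read_lock_flat();
        printf("non-preemptible reader: preempt_count=%d\n", preempt_count);
        rcu_read_unlock_flat();

        rcu_read_lock_nested();
        printf("preemptible reader:     preempt_count=%d nesting=%d\n",
               preempt_count, rcu_read_lock_nesting);
        rcu_read_unlock_nested();
        return 0;
}

Only the first flavour widens the preempt-disabled region; the second merely
records the nesting depth and leaves preemption legal, which is exactly the
difference in preemptability described above.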
IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n
is only enforced by the lack of the full preempt counter in
PREEMPTION=n configs. Once the preemption counter is always enabled this
hardwired dependency goes away.
Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because
with PREEMPT_DYNAMIC the preemption counter is unconditionally
available.
So that makes these hardwired dependencies go away in practice and
hopefully soon from our mental models too :)
RT will keep its hard dependency on RCU_PREEMPT in the same way it
depends hard on forced interrupt threading and other minor details to
enable the spinlock substitution.
>> That said, I completely understand your worries about the consequences,
>> but please take the step back and look at it from a conceptual point of
>> view.
>
> Conceptual point of view? That sounds suspiciously academic.
Hehehe.
> Who are you and what did you do with the real Thomas Gleixner? ;-)
The point I'm trying to make is not really academic; it comes from a
very practical point of view. As you know, for almost two decades I'm
mostly busy with janitoring and mopping up the kernel.
A major takeaway from this eclectic experience is that there is a
tendency to implement very specialized solutions for different classes
of use cases.
The reasons to do so in the first place:
1) Avoid breaking the existing and established solutions:
E.g. the initial separation of x86_64 and i386
2) Enforcement due to dependencies on mechanisms, which are
considered "harmful" for particular use cases
E.g. Preemptible RCU, which is separate also due to #1
3) Because we can and something is sooo special
You probably remember the full day we both spent in a room with SoC
people to make them understand that their SoCs are not so special at
all. :)
So there are perfectly valid reasons (#1, #2) to separate things, but we
really need to go back from time to time and think hard about the
question whether a particular separation is still justified. This is
especially true when dependencies or prerequisites change.
But in many cases we just keep going, take the separation as set in
stone forever and add features and workarounds on all ends without
rethinking whether we could unify these things for the better. The real
bad thing about this is that the more we add to the separation the
harder consolidation or unification becomes.
Granted that my initial take of consolidating on preemptible RCU might
be too brisk or too naive, but I still think that with the prospect of
a unified preemption model it's at least worth having a very close
look at this question.
Not asking such questions or dismissing them upfront is a real danger
for the long term sustainability and maintainability of the kernel in my
opinion. Especially when the few people who actively "janitor" these
things are massively outnumbered by people who indulge in
specialization. :)
That said, the real Thomas Gleixner and his grumpy self are still there,
just slightly tired of handling the slurry brush all day long :)
Thanks,
tglx
On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner <[email protected]> wrote:
> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead, which is only evaluated in the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.
I'm looking into extending this concept to user space and to VMs.
I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")
The idea is this. Have VMs/user space share a memory region with the
kernel that is per thread/vCPU. This would be registered via a syscall or
ioctl on some defined file or whatever. Then, when entering user space /
VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
checks if the thread has this memory region and a special bit in it is
set, and if it does, it does not schedule. It will treat it like a long
kernel system call.
The kernel will then set another bit in the shared memory region that will
tell user space / VM that the kernel wanted to schedule, but is allowing it
to finish its critical section. When user space / VM is done with the
critical section, it will check the bit that may be set by the kernel and
if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
now schedule it.
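To make the handshake concrete, the shared per-thread region could be as
small as one word of flag bits. The structure and names below are purely
illustrative (no ABI is being proposed here); they just encode the two bits
described above:

/* Hypothetical per-thread region shared between the kernel and user
 * space / the VM; the names and bit assignments are made up for
 * illustration. */
struct ests_area {
        unsigned long flags;
};

#define ESTS_REQUEST_EXTENSION  (1UL << 0)  /* set by user space / the VM */
#define ESTS_EXTENSION_GRANTED  (1UL << 1)  /* set by the kernel when it defers a reschedule */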
What about DoS, you say? It's no different than running a long system call.
No task can run forever. It's not a "preempt disable", it's just "give me
some more time". A "NEED_RESCHED" will always schedule, just like a kernel
system call that takes a long time. The goal is to allow user space to get
out of critical sections that we know can cause problems if they get
preempted. Usually it's a user space / VM lock is held or maybe a VM
interrupt handler that needs to wake up a task on another vCPU.
If we are worried about abuse, we could even punish tasks that don't call
sched_yield() by the time their extended time slice is taken. Even without
that punishment, if we have EEVDF, this extension will make it less
eligible the next time around.
The goal is to prevent a thread / vCPU being preempted while holding a lock
or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest issue with performance in user
space and VMs.
I'm going to work on a POC, and see if I can get some benchmarks on how
much this could help tasks like databases and VMs in general.
-- Steve
On Tue, Oct 24, 2023 at 02:15:25PM +0200, Thomas Gleixner wrote:
> Paul!
>
> On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote:
> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> The important point is that at the very end the scheduler has the
> >> ultimate power to say: "No longer Mr. Nice Guy" without the risk of any
> >> random damage due to the fact that preemption count is functional, which
> >> makes your life easier as well as you admitted already. But that does
> >> not mean you can eat the cake and still have it. :)
> >
> > Which is exactly why I need rcu_read_lock() to map to preempt_disable()
> > and rcu_read_unlock() to preempt_enable(). ;-)
>
> After reading back in the thread, I think we greatly talked past each
> other mostly due to the different expectations and the resulting
> dependencies which seem to be hardwired into our brains.
>
> I'm pleading guilty as charged as I failed completely to read your
> initial statement
>
> "The key thing to note is that from RCU's viewpoint, with this change,
> all kernels are preemptible, though rcu_read_lock() readers remain
> non-preemptible."
>
> with that in mind and instead of dissecting it properly I committed the
> fallacy of stating exactly the opposite, which obviously reflects only
> the point of view I'm coming from.
>
> With a fresh view, this turns out to be a complete non-problem because
> there is no semantical dependency between the preemption model and the
> RCU flavour.
Agreed, and been there and done that myself, as you well know! ;-)
> The unified kernel preemption model has the following properties:
>
> 1) It provides full preemptive multitasking.
>
> 2) Preemptability is limited by implicit and explicit mechanisms.
>
> 3) The ability to avoid overeager preemption for SCHED_OTHER tasks via
> the PREEMPT_LAZY mechanism.
>
> This emulates the NONE/VOLUNTARY preemption models which
> semantically provide collaborative multitasking.
>
> This emulation is not breaking the semantical properties of full
> preemptive multitasking because the scheduler still has the ability
> to enforce immediate preemption under consideration of #2.
>
> Which in turn is a prerequisite for removing the semantically
> ill-defined cond/might_resched() constructs.
>
> The compile time selectable RCU flavour (preemptible/non-preemptible) is
> not imposing a semantical change on this unified preemption model.
>
> The selection of the RCU flavour is solely affecting the preemptability
> (#2 above). Selecting non-preemptible RCU reduces preemptability by
> adding an implicit restriction via mapping rcu_read_lock()
> to preempt_disable().
>
> IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n
> is only enforced by the lack of the full preempt counter in
> PREEMPTION=n configs. Once the preemption counter is always enabled this
> hardwired dependency goes away.
>
> Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because
> with PREEMPT_DYNAMIC the preemption counter is unconditionally
> available.
>
> So that makes these hardwired dependencies go away in practice and
> hopefully soon from our mental models too :)
The real reason for tying RCU_PREEMPT to PREEMPTION back in the day was
that there were no real-world uses of RCU_PREEMPT not matching PREEMPTION,
so those combinations were ruled out in order to reduce the number of
rcutorture scenarios.
But now it appears that we do have a use case for PREEMPTION=y and
RCU_PREEMPT=n, plus I have access to way more test hardware, so that
the additional rcutorture scenarios are less of a testing burden.
> RT will keep its hard dependency on RCU_PREEMPT in the same way it
> depends hard on forced interrupt threading and other minor details to
> enable the spinlock substitution.
"other minor details". ;-)
Making PREEMPT_RT select RCU_PREEMPT makes sense to me!
> >> That said, I completely understand your worries about the consequences,
> >> but please take the step back and look at it from a conceptual point of
> >> view.
> >
> > Conceptual point of view? That sounds suspiciously academic.
>
> Hehehe.
>
> > Who are you and what did you do with the real Thomas Gleixner? ;-)
>
> The point I'm trying to make is not really academic; it comes from a
> very practical point of view. As you know, for almost two decades I'm
> mostly busy with janitoring and mopping up the kernel.
>
> A major takeaway from this eclectic experience is that there is a
> tendency to implement very specialized solutions for different classes
> of use cases.
>
> The reasons to do so in the first place:
>
> 1) Avoid breaking the existing and established solutions:
>
> E.g. the initial separation of x86_64 and i386
>
> 2) Enforcement due to dependencies on mechanisms, which are
> considered "harmful" for particular use cases
>
> E.g. Preemptible RCU, which is separate also due to #1
>
> 3) Because we can and something is sooo special
>
> You probably remember the full day we both spent in a room with SoC
> people to make them understand that their SoCs are not so special at
> all. :)
4) Because we don't see a use for a given combination, and we
want to keep test time down to a dull roar, as noted above.
> So there are perfectly valid reasons (#1, #2) to separate things, but we
> really need to go back from time to time and think hard about the
> question whether a particular separation is still justified. This is
> especially true when dependencies or prerequisites change.
>
> But in many cases we just keep going, take the separation as set in
> stone forever and add features and workarounds on all ends without
> rethinking whether we could unify these things for the better. The real
> bad thing about this is that the more we add to the separation the
> harder consolidation or unification becomes.
>
> Granted that my initial take of consolidating on preemptible RCU might
> be too brisk or too naive, but I still think that with the prospect of
> a unified preemption model it's at least worth having a very close
> look at this question.
>
> Not asking such questions or dismissing them upfront is a real danger
> for the long term sustainability and maintainability of the kernel in my
> opinion. Especially when the few people who actively "janitor" these
> things are massively outnumbered by people who indulge in
> specialization. :)
Longer term, I do agree in principle with the notion of simplifying the
Linux-kernel RCU implementation by eliminating the PREEMPT_RCU=n code.
In the near term practice, here are the reasons for holding off on
this consolidation:
1. Preemptible RCU needs more work for datacenter deployments,
as mentioned earlier. I also reiterate that if you only have
a few thousand (or maybe even a few tens of thousands) servers,
preemptible RCU will be just fine for you. Give or take the
safety criticality of your application.
2. RCU priority boosting has not yet been really tested and tuned
for systems that are adequately but not generously endowed with
memory. Boost too soon and you needlessly burn cycles and
preempt important tasks. Boost too late and it is OOM for you!
3. To the best of my knowledge, the scheduler doesn't take memory
footprint into account. In particular, if a long-running RCU
reader is preempted in a memory-naive fashion, all we gain
is turning a potentially unimportant latency outlier into a
definitely important OOM.
4. There are probably a few gotchas that I haven't thought of or
that I am forgetting. More likely, more than a few. As always!
But to your point, yes, these are things that we should be able to do
something about, given appropriate time and effort. My guess is five
years, with the long pole being the reliability. Preemptible RCU has
been gone through line by line recently, which is an extremely good
thing and an extremely welcome change from past practice, but that is
just a start. That effort was getting people familiar with the code,
and should not be mistaken for a find-lots-of-bugs review session,
let alone a find-all-bugs review session.
> That said, the real Thomas Gleixner and his grumpy self are still there,
> just slightly tired of handling the slurry brush all day long :)
Whew!!! Good to hear that the real Thomas Gleixner is still with us!!! ;-)
Thanx, Paul
On Tue, 24 Oct 2023 10:34:26 -0400
Steven Rostedt <[email protected]> wrote:
> I'm going to work on a POC, and see if I can get some benchmarks on how
> much this could help tasks like databases and VMs in general.
And that was much easier than I thought it would be. It also shows some
great results!
I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches
So you need to select:
CONFIG_PREEMPT_AUTO
The below is my proof of concept patch. It still has debugging in it, and
I'm sure the interface will need to be changed.
There's now a new file: /sys/kernel/extend_sched
Attached is a program that tests it. It mmaps that file, with:
struct extend_map {
        unsigned long flags;
};
static __thread struct extend_map *extend_map;
That is, there's this structure for every thread. It's assigned with:
fd = open("/sys/kernel/extend_sched", O_RDWR);
extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
I don't actually like this interface, as it wastes a full page for just two
bits :-p
Anyway, to tell the kernel to "extend" the time slice if possible because
it's in a critical section, we have:
static void extend(void)
{
        if (!extend_map)
                return;

        extend_map->flags = 1;
}
And to say that's it's done:
static void unextend(void)
{
        unsigned long prev;

        if (!extend_map)
                return;

        prev = xchg(&extend_map->flags, 0);
        if (prev & 2)
                sched_yield();
}
So, bit 1 is for user space to tell the kernel "please extend me", and bit
2 is for the kernel to tell user space "OK, I extended you, but call
sched_yield() when done".
This test program creates 1 + number-of-CPUs threads that run in a loop
for 5 seconds. Each thread will grab a user space spin lock (not a futex,
but just shared memory). Before grabbing the lock it will call "extend()";
if it fails to grab the lock, it calls "unextend()" and spins on the lock
until it's free, at which point it will try again. Then after it gets the lock, it
will update a counter, and release the lock, calling "unextend()" as well.
Then it will spin on the counter until it increments again to allow another
task to get into the critical section.
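In other words, each worker's loop looks roughly like the sketch below. This
is a reconstruction from the description above, not the attached
extend-sched.c; it assumes the extend()/unextend() helpers and the mmap setup
shown earlier, with the lock and the counter living in memory shared by all
the threads:

/* Reconstruction of the described loop; not the attached program. */
static void lock_loop(volatile unsigned long *the_lock,
                      volatile unsigned long *counter,
                      const volatile int *stop)   /* set by main() after 5 seconds */
{
        unsigned long seen;

        while (!*stop) {
                extend();               /* ask for extra time before taking the lock */
                while (__atomic_exchange_n(the_lock, 1, __ATOMIC_ACQUIRE)) {
                        unextend();     /* lost the race: give the extension back */
                        while (*the_lock && !*stop)
                                ;       /* spin until the lock looks free */
                        extend();
                }

                seen = ++(*counter);    /* the critical section */
                __atomic_store_n(the_lock, 0, __ATOMIC_RELEASE);
                unextend();             /* done: sched_yield() if the kernel asked for it */

                while (*counter == seen && !*stop)
                        ;               /* let another thread take its turn */
        }
}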
With the init of the extend_map disabled, so that it doesn't use the extend
code, it ends with:
Ran for 3908165 times
Total wait time: 33.965654
I can give you stdev and all that too, but the above is pretty much the
same after several runs.
After enabling the extend code, it has:
Ran for 4829340 times
Total wait time: 32.635407
It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!
The wait time for getting into the critical section also dropped by a
total of over a second (4% improvement).
I ran a traceeval tool on it (still a work in progress, but I can post it when
it's done) over the following trace, which includes the writes to trace_marker
(via tracefs_printf):
trace-cmd record -e sched_switch ./extend-sched
It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!
Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch):
Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file, and builds with:

gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.
-- Steve
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b13b7d4f1d3..fb540dd0dec0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,10 @@ struct kmap_ctrl {
#endif
};
+struct extend_map {
+ long flags;
+};
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -802,6 +806,8 @@ struct task_struct {
unsigned int core_occupation;
#endif
+ struct extend_map *extend_map;
+
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..21d0e4d81d33 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -147,17 +147,32 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
+ unsigned long ignore_mask;
+
/*
* Before returning to user space ensure that all pending work
* items have been completed.
*/
while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ ignore_mask = 0;
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+ if (ti_work & _TIF_NEED_RESCHED) {
schedule();
+ } else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+ if (!current->extend_map ||
+ !(current->extend_map->flags & 1)) {
+ schedule();
+ } else {
+ trace_printk("Extend!\n");
+ /* Allow to leave with NEED_RESCHED_LAZY still set */
+ ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+ current->extend_map->flags |= 2;
+ }
+ }
+
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
@@ -184,6 +199,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
tick_nohz_user_enter_prepare();
ti_work = read_thread_flags();
+ ti_work &= ~ignore_mask;
}
/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..ddf89ec9ab62 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,6 +906,13 @@ void __noreturn do_exit(long code)
if (tsk->io_context)
exit_io_context(tsk);
+ if (tsk->extend_map) {
+ unsigned long addr = (unsigned long)tsk->extend_map;
+
+ virt_to_page(addr)->mapping = NULL;
+ free_page(addr);
+ }
+
if (tsk->splice_pipe)
free_pipe_info(tsk->splice_pipe);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..da2214082d25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
tsk->wake_q.next = NULL;
tsk->worker_private = NULL;
+ tsk->extend_map = NULL;
+
kcov_task_init(tsk);
kmsan_task_create(tsk);
kmap_local_fork(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..297061cfa08d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@ obj-y += core.o
obj-y += fair.o
obj-y += build_policy.o
obj-y += build_utility.o
+obj-y += extend.o
diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c
new file mode 100644
index 000000000000..a632e1a8f57b
--- /dev/null
+++ b/kernel/sched/extend.c
@@ -0,0 +1,90 @@
+#include <linux/kobject.h>
+#include <linux/pagemap.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+#ifdef CONFIG_SYSFS
+static ssize_t extend_sched_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t len)
+{
+ static const char output[] = "Extend scheduling time slice\n";
+
+ printk("%s:%d\n", __func__, __LINE__);
+ if (off >= sizeof(output))
+ return 0;
+
+ strscpy(buf, output + off, len);
+ return min((ssize_t)len, sizeof(output) - off - 1);
+}
+
+static ssize_t extend_sched_write(struct file *file, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t len)
+{
+ printk("%s:%d\n", __func__, __LINE__);
+ return -EINVAL;
+}
+
+static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf)
+{
+ vm_fault_t ret = VM_FAULT_SIGBUS;
+
+ trace_printk("%s:%d\n", __func__, __LINE__);
+ /* Only has one page */
+ if (vmf->pgoff || !current->extend_map)
+ return ret;
+
+ vmf->page = virt_to_page(current->extend_map);
+
+ get_page(vmf->page);
+ vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+ vmf->page->index = vmf->pgoff;
+
+ return 0;
+}
+
+static void extend_sched_mmap_open(struct vm_area_struct *vma)
+{
+ printk("%s:%d\n", __func__, __LINE__);
+ WARN_ON(!current->extend_map);
+}
+
+static const struct vm_operations_struct extend_sched_vmops = {
+ .open = extend_sched_mmap_open,
+ .fault = extend_sched_mmap_fault,
+};
+
+static int extend_sched_mmap(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr,
+ struct vm_area_struct *vma)
+{
+ if (current->extend_map)
+ return -EBUSY;
+
+ current->extend_map = page_to_virt(alloc_page(GFP_USER | __GFP_ZERO));
+ if (!current->extend_map)
+ return -ENOMEM;
+
+ vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0);
+ vma->vm_ops = &extend_sched_vmops;
+
+ return 0;
+}
+
+static struct bin_attribute extend_sched_attr = {
+ .attr = {
+ .name = "extend_sched",
+ .mode = 0777,
+ },
+ .read = &extend_sched_read,
+ .write = &extend_sched_write,
+ .mmap = &extend_sched_mmap,
+};
+
+static __init int extend_init(void)
+{
+ return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr);
+}
+late_initcall(extend_init);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
resched_curr(rq);
} else {
/* Did the task ignore the lazy reschedule request? */
- if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+ if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+ trace_printk("Force resched?\n");
resched_curr(rq);
- else
+ } else
resched_curr_lazy(rq);
}
clear_buddies(cfs_rq, se);
On (23/10/24 10:34), Steven Rostedt wrote:
> On Tue, 19 Sep 2023 01:42:03 +0200
> Thomas Gleixner <[email protected]> wrote:
>
> > 2) When the scheduler wants to set NEED_RESCHED, it sets
> > NEED_RESCHED_LAZY instead, which is only evaluated in the return to
> > user space preemption points.
> >
> > As NEED_RESCHED_LAZY is not folded into the preemption count the
> > preemption count won't become zero, so the task can continue until
> > it hits return to user space.
> >
> > That preserves the existing behaviour.
>
> I'm looking into extending this concept to user space and to VMs.
>
> I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")
>
> The idea is this. Have VMs/user space share a memory region with the
> kernel that is per thread/vCPU. This would be registered via a syscall or
> ioctl on some defined file or whatever. Then, when entering user space /
> VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
> checks if the thread has this memory region and a special bit in it is
> set, and if it does, it does not schedule. It will treat it like a long
> kernel system call.
>
> The kernel will then set another bit in the shared memory region that will
> tell user space / VM that the kernel wanted to schedule, but is allowing it
> to finish its critical section. When user space / VM is done with the
> critical section, it will check the bit that may be set by the kernel and
> if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
> now schedule it.
>
> What about DoS, you say? It's no different than running a long system call.
> No task can run forever. It's not a "preempt disable", it's just "give me
> some more time". A "NEED_RESCHED" will always schedule, just like a kernel
> system call that takes a long time. The goal is to allow user space to get
> out of critical sections that we know can cause problems if they get
> preempted. Usually it's a user space / VM lock is held or maybe a VM
> interrupt handler that needs to wake up a task on another vCPU.
>
> If we are worried about abuse, we could even punish tasks that don't call
> sched_yield() by the time their extended time slice is taken. Even without
> that punishment, if we have EEVDF, this extension will make it less
> eligible the next time around.
>
> The goal is to prevent a thread / vCPU being preempted while holding a lock
> or resource that other threads / vCPUs will want. That is, prevent
> contention, as that's usually the biggest issue with performance in user
> space and VMs.
I think some time ago we tried to check guest's preempt count on each vm-exit
and we'd vm-enter if guest exited from a critical section (those that bump
preempt count) so that it can hopefully finish whatever is was going to
do and vmexit again. We didn't look into covering guest's RCU read-side
critical sections.
Can you educate me, is your PoC significantly different from guest preempt
count check?
On Thu, 26 Oct 2023 16:50:16 +0900
Sergey Senozhatsky <[email protected]> wrote:
> > The goal is to prevent a thread / vCPU being preempted while holding a lock
> > or resource that other threads / vCPUs will want. That is, prevent
> > contention, as that's usually the biggest issue with performance in user
> > space and VMs.
>
> I think some time ago we tried to check guest's preempt count on each vm-exit
> and we'd vm-enter if guest exited from a critical section (those that bump
> preempt count) so that it can hopefully finish whatever it was going to
> do and vmexit again. We didn't look into covering guest's RCU read-side
> critical sections.
>
> Can you educate me, is your PoC significantly different from guest preempt
> count check?
No, it's probably very similar. Just the mechanism to allow it to run
longer may be different.
-- Steve