2023-09-19 03:22:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> On preempt_model_none() or preempt_model_voluntary() configurations
> rescheduling of kernel threads happens only when they allow it, and
> only at explicit preemption points, via calls to cond_resched() or
> similar.
>
> That leaves out contexts where it is not convenient to periodically
> call cond_resched() -- for instance when executing a potentially long
> running primitive (such as REP; STOSB.)
>
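
For context: under those models a long-running kernel loop has to
sprinkle explicit preemption points into itself. A minimal sketch of
that pattern (the function below is illustrative; cond_resched() and
clear_highpage() are the real kernel APIs):

	/* Under preempt=none/voluntary, a long kernel loop must offer
	 * to reschedule at explicit points. */
	static void zero_pages(struct page **pages, unsigned long nr)
	{
		unsigned long i;

		for (i = 0; i < nr; i++) {
			clear_highpage(pages[i]);
			cond_resched();	/* explicit preemption point */
		}
	}

A single REP; STOSB over a large range has no loop body to hook such a
call into, hence the proposed flag.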

So I said this not too long ago in the context of Xen PV, but maybe it's time to ask it in general:

Why do we support anything other than full preempt? I can think of two reasons, neither of which I think is very good:

1. Once upon a time, tracking preempt state was expensive. But we fixed that.

2. Folklore suggests that there's a latency vs throughput tradeoff, and serious workloads, for some definition of serious, want throughput, so they should run without full preemption.

I think #2 is a bit silly. If you want throughput, and you're busy waiting for a CPU that wants to run you but can't, because it's running some low-priority non-preemptible thing (because preempt is set to none or voluntary), you're not getting throughput. And if you want to keep some I/O resource busy to get throughput, but you have excessive latency getting scheduled, you don't get throughput either.

If the actual problem is that there's a workload that performs better when scheduling is delayed (which preempt=none and preempt=voluntary do, essentially at random), then maybe someone should identify that workload and fix the scheduler.

So maybe we should just very strongly encourage everyone to run with full preempt and simplify the kernel?


2023-09-19 09:23:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:

> Why do we support anything other than full preempt? I can think of
> two reasons, neither of which I think is very good:
>
> 1. Once upon a time, tracking preempt state was expensive. But we fixed that.
>
> 2. Folklore suggests that there's a latency vs throughput tradeoff,
> and serious workloads, for some definition of serious, want
> throughput, so they should run without full preemption.

It's absolutely not folklore. Run to completion has well-known
benefits: it avoids contention and avoids the overhead of scheduling
in a large number of scenarios.

We've seen that painfully in PREEMPT_RT before we came up with the
concept of lazy preemption for throughput oriented tasks.
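
To make that concrete: lazy preemption keeps a separate resched flag
for SCHED_OTHER tasks, so a wakeup does not immediately preempt the
task currently running in the kernel. A rough sketch (the helper name
is made up for illustration; TIF_NEED_RESCHED_LAZY and the RT/fair
split follow the RT patches):

	/* On wakeup: RT tasks preempt the current task immediately;
	 * fair tasks only set the lazy flag, which is acted on at
	 * return to user space or at the next tick, so kernel-side
	 * work can run to completion. */
	static void resched_curr_maybe_lazy(struct rq *rq, struct task_struct *p)
	{
		if (rt_task(p))
			set_tsk_need_resched(rq->curr);
		else
			set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
	}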

Thanks,

tglx

2023-09-19 17:20:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


* Thomas Gleixner <[email protected]> wrote:

> On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> > On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
>
> > Why do we support anything other than full preempt? I can think of
> > two reasons, neither of which I think is very good:
> >
> > 1. Once upon a time, tracking preempt state was expensive. But we fixed that.
> >
> > 2. Folklore suggests that there's a latency vs throughput tradeoff,
> > and serious workloads, for some definition of serious, want
> > throughput, so they should run without full preemption.
>
> It's absolutely not folklore. Run to completion has well-known
> benefits: it avoids contention and avoids the overhead of scheduling
> in a large number of scenarios.
>
> We've seen that painfully in PREEMPT_RT before we came up with the
> concept of lazy preemption for throughput oriented tasks.

Yeah, for a large majority of workloads a reduction in preemption
increases batching and improves cache locality. Most
scalability-conscious enterprise users want longer timeslices & better
cache locality, not shorter timeslices with spread-out cache use.

There are microbenchmarks that fit mostly in cache and benefit if work
is immediately processed by freshly woken tasks - but that's not true
for most workloads with a substantial real-life cache footprint.

Thanks,

Ingo