2016-03-02 07:49:16

by Michael Turquette

Subject: Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

Hi,

I'm still catching up on the plurality of scheduler/cpufreq threads but
I thought I would chime in with some historical reasons for why
cpufreq_sched.c looks the way it does today.

Quoting Steve Muckle (2016-02-25 16:34:23)
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
> > Besides, governors traditionally go to drivers/cpufreq. Why is this different?
>
> This predates my involvement but I'm guessing this was done because the
> long term vision was for cpufreq to eventually be removed from the
> picture. But it may be quite some time before that happens - I think
> communicating directly with drivers may be difficult due to the need to
> synchronize with thermal or userspace min/max requests etc, plus there's
> the fact that we've got a longstanding API exposed to userspace.
>
> I'm happy to move this to drivers/cpufreq.

Originally it was put in kernel/sched/ because cpufreq_sched.c needed
access to kernel/sched/sched.h. Rafael's implementation flips the
push-pull relationship between the governor and scheduler so that this
is not necessary. If that requirement is removed from Steve's series
then there is no need to put the file outside of drivers/cpufreq.

...
> > One thing I personally like in the RCU-based approach is its universality. The
> > callbacks may be installed by different entities in a uniform way: intel_pstate
> > can do that, the old governors can do that, my experimental schedutil code can
> > do that and your code could have done that too in principle. And this is very
> > nice, because it is a common underlying mechanism that can be used by everybody
> > regardless of their particular implementations on the other side.
> >
> > Why would I want to use something different, then?
>
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

I'm very happy to see the return of capacity/util ops. I had these in my
initial prototype back in 2014, but Peter shot them down, partly due to
the performance hit of indirect function call overhead in the fast path:

http://permalink.gmane.org/gmane.linux.kernel/1803410
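
For reference, a minimal sketch of what RCU-installed callbacks combined
with a static key could look like; the names and signatures here are
illustrative, not the actual code from Rafael's series:

#include <linux/jump_label.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

struct update_util_data {
	void (*func)(struct update_util_data *data, u64 time,
		     unsigned long util, unsigned long max);
};

static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
static DEFINE_STATIC_KEY_FALSE(cpufreq_update_util_key);

void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
{
	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);

	/*
	 * Naive enable/disable: a real implementation would need to
	 * track registrations across all CPUs before disabling.
	 */
	if (data)
		static_branch_enable(&cpufreq_update_util_key);
	else
		static_branch_disable(&cpufreq_update_util_key);
}

/* Scheduler fast path: a single patched branch when nothing is registered. */
static inline void cpufreq_update_util(u64 time, unsigned long util,
				       unsigned long max)
{
	struct update_util_data *data;

	if (!static_branch_unlikely(&cpufreq_update_util_key))
		return;

	data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
	if (data)
		data->func(data, time, util, max);
}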

...
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> >
> > OK, where does this number come from?
>
> Someone's posterior :) .

Indeed, this margin is based on a posterior-derived value, in particular
the 25% imbalance_pct in struct sched_domain:

include/linux/sched.h:
struct sched_domain {
	/* These fields must be setup */
	...
	unsigned int imbalance_pct;	/* No balance until over watermark */
	...

and,

kernel/sched/core.c:
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
{
	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
	...
	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 32,
		.imbalance_pct		= 125,
	...
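
To make the arithmetic concrete, here is a hypothetical sketch of how a
25% margin could be applied to a utilization request; the helper name is
mine, not necessarily what Steve's series uses:

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)

/* 1280/1024 ~= 1.25, mirroring imbalance_pct = 125 */
static unsigned int capacity_margin = 1280;

/*
 * Inflate a raw utilization value by ~25%, so the CPU ends up running
 * at roughly 80% of the requested capacity (~20% idle headroom).
 */
static unsigned long add_capacity_margin(unsigned long util)
{
	return (util * capacity_margin) >> SCHED_CAPACITY_SHIFT;
}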

>
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

It should definitely be tunable.
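
Even a trivial knob would do for experimentation; a hypothetical sketch,
assuming a capacity_margin variable like the one above:

#include <linux/moduleparam.h>

/* tweakable at runtime via /sys/module/<module>/parameters/capacity_margin */
module_param(capacity_margin, uint, 0644);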

>
> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Yeah, it was just for prototyping.

...
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?

More historical anecdotes: The RT/DL stuff below is absolutely worth
talking about, but if your question is, "why did the design choose
kthreads over wq's", the answer is legacy.

Originally back in 2014 Morten played with queuing work up via wq within
schedule() context, but hit a problem where that caused re-entering
schedule() from schedule(), and entering reentrancy hell, and entering
reentrancy hell, and entering reentrancy hell ... well, you get the
idea.

I came up with an ugly workaround: waking up a dormant kthread with
wake_up_process(), called from an arch-specific callback for changing
cpu frequency. Later, Juri suggested using irq_work to kick the kthread,
which is still the method used in Steve's patch series.
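
For reference, the irq_work-to-kthread kick looks roughly like this (a
simplified sketch with made-up names, not Steve's code verbatim):

#include <linux/irq_work.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct task_struct *freq_change_task;	/* the dormant kthread */
static struct irq_work freq_irq_work;

static void freq_irq_work_fn(struct irq_work *work)
{
	/* irq_work runs in a context where a wakeup is safe */
	wake_up_process(freq_change_task);
}

/*
 * Called from scheduler context, where calling wake_up_process()
 * directly would re-enter the scheduler.
 */
static void kick_freq_change_thread(void)
{
	irq_work_queue(&freq_irq_work);
}

/* at init time: init_irq_work(&freq_irq_work, freq_irq_work_fn); */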

It does seem fine to toss work onto a workqueue from irq_work context,
however, which is nice, except that the RT thread approach yields better
latency. I'll let you guys sort out the issues around RT/DL starvation,
which seem like a real concern.

> >
> > I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> > in theory and then the frequency won't be updated, but there's much more kernel
> > stuff run from workqueues and if that is starved, you won't get very far anyway.
> >
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special? Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
>
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.
>
> >
> > Plus if this really is the problem here, then it also affects the other cpufreq
> > governors, so maybe it should be solved for everybody in some common way?
>
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I do not have any data to back up a case for stalls caused by RT/DL
starvation, but conceptually I would say that latency is fundamentally
more important in a scheduler-driven cpu frequency selection scenario,
versus the legacy timer-based governors.

In the latter case we get woken up by a timer (prior to Rafael's recent
"cpufreq: governor: Replace timers with utilization update callbacks"
patch), we sample idleness/busyness, and change frequency, all in one go
and all from process context.

In the case of the scheduler selecting frequency in the hot path, with
hardware that takes a long time to transition frequency (and thus must
be done in a slow path), we want to minimize the time delta between the
scheduler picking a frequency and the thread that executes that change
actually being run.

In my over-simplified view of the scheduler, it would be great if we
could have a backdoor mechanism to place the frequency transition
kthread onto a runqueue from within the schedule() context and dispense
with the irq_work stuff in Steve's series altogether.

Regards,
Mike


2016-03-03 02:49:21

by Rafael J. Wysocki

Subject: Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

On Wed, Mar 2, 2016 at 8:49 AM, Michael Turquette
<[email protected]> wrote:
>

[cut]

> I do not have any data to back up a case for stalls caused by RT/DL
> starvation, but conceptually I would say that latency is fundamentally
> more important in a scheduler-driven cpu frequency selection scenario,
> versus the legacy timer-based governors.
>
> In the latter case we get woken up by a timer (prior to Rafael's recent
> "cpufreq: governor: Replace timers with utilization update callbacks"
> patch), we sample idleness/busyness, and change frequency, all in one go
> and all from process context.
>
> In the case of the scheduler selecting frequency in the hot path, with
> hardware that takes a long time to transition frequency (and thus must
> be done in a slow path), we want to minimize the time delta between the
> scheduler picking a frequency and the thread that executes that change
> actually being run.

That is a good point. However, Peter's point about RT tasks having to
run at max util, and affecting frequency control that way, is a good
one too.

I'm not actually sure if RT is the right answer here. DL may be a
better choice. After all, we want the thing to happen shortly, but
not necessarily at full speed.

So something like a DL workqueue would be quite useful here it seems.

> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

Well, I use irq_work now in schedutil and in ondemand/conservative too
for queuing up work items, and it gets the job done.

Thanks,
Rafael

2016-03-03 03:50:09

by Steve Muckle

Subject: Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> I'm not actually sure if RT is the right answer here. DL may be a
> better choice. After all, we want the thing to happen shortly, but
> not necessarily at full speed.
>
> So something like a DL workqueue would be quite useful here it seems.

The DL idea seems like a good one to me.

It would also prevent cpufreq changes from being delayed by other RT or
DL tasks.

thanks,
Steve

2016-03-03 09:33:53

by Juri Lelli

Subject: Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

On 02/03/16 19:50, Steve Muckle wrote:
> On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> > I'm not actually sure if RT is the right answer here. DL may be a
> > better choice. After all, we want the thing to happen shortly, but
> > not necessarily at full speed.
> >
> > So something like a DL workqueue would be quite useful here it seems.
>
> The DL idea seems like a good one to me.
>
> It would also prevent cpufreq changes from being delayed by other RT or
> DL tasks.
>

Dimensioning period and runtime could require some experimenting, but
it's worth a try, I agree.
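
Something along these lines, perhaps (the helper name and the numbers
are placeholders; the real runtime/period would come out of that
experimenting):

#include <linux/sched.h>

static void make_freq_thread_deadline(struct task_struct *p)
{
	struct sched_attr attr = {
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 1000000ULL,	/* 1 ms, placeholder */
		.sched_deadline	= 5000000ULL,	/* 5 ms, placeholder */
		.sched_period	= 5000000ULL,	/* 5 ms, placeholder */
	};

	sched_setattr(p, &attr);
}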

Best,

- Juri

2016-03-03 13:03:19

by Peter Zijlstra

Subject: Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

On Tue, Mar 01, 2016 at 11:49:10PM -0800, Michael Turquette wrote:
>
> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

This is actually very very hard :/

So while there is something similar for workqueues,
try_to_wake_up_local(), that will not work for the cpufreq stuff.

The main problem is that schedule() is done with rq->lock held, while
wakeups need p->pi_lock, and it so happens that rq->lock nests inside of
p->pi_lock. A wakeup from within schedule() would therefore take
p->pi_lock while already holding rq->lock, inverting the established
lock order.
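
Roughly, the wakeup path takes the locks outer-to-inner like this (a
simplified sketch, not the real try_to_wake_up()):

static int try_to_wake_up_sketch(struct task_struct *p)
{
	unsigned long flags;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, flags);	/* outer lock */
	rq = __task_rq_lock(p);				/* inner: rq->lock */

	/* ... the actual wakeup work happens here ... */

	__task_rq_unlock(rq);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	return 0;
}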

Now, the workqueue stuff with try_to_wake_up_local() can get away with
dropping rq->lock, because of where it is called, way early in
schedule() before we really muck things up.

The cpufreq hook otoh is called all over the place.

The second problem is that doing a wakeup will in fact also end up
calling the cpufreq hook, so you're back in recursion hell.

The third problem is that cpufreq is called from wakeups, which would
want to do another wakeup (see point 2), but this also means we have to
nest p->pi_lock, and we can't really do that either.