2022-03-08 18:26:21

by Paul Bone

Subject: Scheduling for heterogeneous computers


Are there plans for power-aware scheduling on heterogeneous computers that
processes & threads can opt-in to?

Several mainstream devices now offer power-aware heterogeneous scheduling:

* Lots of ARM (and therefore android) devices offer big.LITTLE cores.
* Apple's M1 CPU has "gold" and "silver" cores. The gold cores are faster
and have more cache. I think there are other microarchitectural
differences.
* Intel's Alder Lake CPUs have P and E cores. I'm told that the E cores
don't save power though since each core type still gets the same work
done per Watt, it's just that the P cores are bigger and faster.
* Multicore CPUs that offer frequency scaling could get some power savings
by switching off turbo boost and similar features. The work/watt
improves at the cost of throughput & responsiveness.

I'm aware that Linux does some Energy Aware Scheduling
https://docs.kernel.org/scheduler/sched-energy.html, however what I'm
looking for is an API that processes (but ideally threads) can opt in to
(and out of, unlike nice) to say that the work they're currently doing is
bulk work. It needs to get done but it doesn't have a deadline and
therefore can be done on a smaller / more power efficient core. The idea is
that the same work gets done eventually, but for a background task (eg
Garbage Collection) it can be done in a greener or more
battery-charge-extending way.

macOS has added an API for this:
pthread_set_qos_class_self_np()
https://developer.apple.com/documentation/apple-silicon/tuning-your-code-s-performance-for-apple-silicon?preferredLanguage=occ

Windows has:
ThreadPowerThrottling
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadinformation

I'm not aware of anything for Linux and I've been unable to find anything.
Are there any plans to implement this?

Cheers.



2022-03-21 21:16:39

by Qais Yousef

Subject: Re: Scheduling for heterogeneous computers

Hi Paul

On 03/08/22 20:21, Paul Bone wrote:
>
> Are there plans for power-aware scheduling on heterogeneous computers that
> processes & threads can opt-in to?
>
> Several mainstream devices now offer power-aware heterogeneous scheduling:
>
> * Lots of ARM (and therefore android) devices offer big.LITTLE cores.
> * Apple's M1 CPU has "gold" and "silver" cores. The gold cores are faster
> and have more cache. I think there are other microarchitectural
> differences.
> * Intel's Alder Lake CPUs have P and E cores. I'm told that the E cores
> don't save power though since each core type still gets the same work
> done per Watt, it's just that the P cores are bigger and faster.
> * Multicore CPUs that offer frequency scaling could get some power savings
> by switching off turbo boost and similar features. The work/watt
> improves at the cost of throughput & responsiveness.
>
> I'm aware that Linux does some Energy Aware Scheduling
> https://docs.kernel.org/scheduler/sched-energy.html, however what I'm
> looking for is an API that processes (but ideally threads) can opt in-to
> (and out-of (unlike nice)) to say that the work they're currently doing is
> bulk work. It needs to get done but it doesn't have a deadline and
> therefore can be done on a smaller / more power efficient core. The idea is
> that the same work gets done eventually, but for a background task (eg
> Garbage Collection) it can be done in a greener or more
> battery-charge-extending way.
>
> MacOS has added an API for this as:
> pthread_set_qos_class_self_np()
> https://developer.apple.com/documentation/apple-silicon/tuning-your-code-s-performance-for-apple-silicon?preferredLanguage=occ
>
> Windows has:
> ThreadPowerThrottling
> https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadinformation
>
> I'm not aware of anything for Linux and I've been unable to find anything.
> Are there any plans to implement this?

We do actually have a feature called util clamp (uclamp for short) that allows
you to do that.

There's a new field in sched_setattr() to set UCLAMP_MIN and UCLAMP_MAX.

UCLAMP_MIN hints towards performance, i.e. it tells the system that this task
needs at least this performance level. The scheduler translates this into task
placement and frequency selection when the task is running.

UCLAMP_MAX hints towards efficiency, i.e. it tells the system that this task
does not need to operate above this performance level. Like UCLAMP_MIN, this
affects task placement and frequency selection when the task is running.
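
A minimal sketch of what such a call could look like from user space, written
in Python with ctypes because no libc wrapper is assumed. The flag values and
the struct layout follow include/uapi/linux/sched.h; the syscall-number table
(two architectures only) and the helper names pack_uclamp_attr() and
set_uclamp() are illustrative assumptions, not an existing API.

```python
import ctypes
import errno
import os
import platform
import struct

# Flag values as defined in include/uapi/linux/sched.h.
SCHED_FLAG_KEEP_POLICY = 0x08
SCHED_FLAG_KEEP_PARAMS = 0x10
SCHED_FLAG_UTIL_CLAMP_MIN = 0x20
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40

# Assumption: only these two architectures are covered by this sketch.
SYS_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}

# struct sched_attr: u32 size, u32 policy, u64 flags, s32 nice, u32 priority,
# u64 runtime, u64 deadline, u64 period, u32 util_min, u32 util_max.
SCHED_ATTR = struct.Struct("=IIQiIQQQII")


def pack_uclamp_attr(util_min=None, util_max=None):
    """Build a sched_attr that only changes the requested clamp values."""
    # KEEP_POLICY and KEEP_PARAMS leave the task's policy and priority alone.
    flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS
    for value, flag in ((util_min, SCHED_FLAG_UTIL_CLAMP_MIN),
                        (util_max, SCHED_FLAG_UTIL_CLAMP_MAX)):
        if value is None:
            continue
        if not 0 <= value <= 1024:
            raise ValueError("clamp values must be within 0..1024")
        flags |= flag
    return SCHED_ATTR.pack(SCHED_ATTR.size, 0, flags, 0, 0, 0, 0, 0,
                           util_min or 0, util_max or 0)


def set_uclamp(util_min=None, util_max=None, tid=0):
    """Apply clamps to thread `tid` (0 means the calling thread)."""
    nr = SYS_SCHED_SETATTR.get(platform.machine())
    if nr is None:
        raise OSError(errno.ENOSYS, "syscall number not known to this sketch")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    buf = ctypes.create_string_buffer(pack_uclamp_attr(util_min, util_max))
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_int(tid), buf,
                       ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Calling set_uclamp(util_max=256) at the start of a background thread would ask
for that thread to be capped; it raises OSError if the kernel rejects the
request.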

There's a tool called uclampset in util-linux v2.37.2 that allows you to play
with this. See this commit message for an example:

https://lore.kernel.org/lkml/[email protected]/

There are some issues that you might need to be aware of though.

1. UCLAMP_MAX effectiveness issues when there are multiple tasks with
different demands running on the same CPU.

This LPC talk will explain the problem:
https://www.youtube.com/watch?v=i5BdYn6SNQc&t=680s

2. fits_capacity() is not uclamp aware yet, which means the task
placement bias will not work as well as it should.

I am working on both these issues and kernel documentation to help better
explain the feature. There's a cgroup interface in the cpu controller
(cpu.uclamp.min/max).
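
As a rough sketch of the cgroup route: the cgroup v2 files take a percentage
(or "max") rather than the 0..1024 scale used by sched_setattr(), so a small
conversion helps. The function names and the cgroup directory used in the
usage note are hypothetical; the directory must already exist and have the cpu
controller enabled.

```python
from pathlib import Path

SCHED_CAPACITY_SCALE = 1024


def clamp_to_percent(value):
    """Convert a 0..1024 utilization value to the cgroup file format."""
    if not 0 <= value <= SCHED_CAPACITY_SCALE:
        raise ValueError("clamp values must be within 0..1024")
    if value == SCHED_CAPACITY_SCALE:
        return "max"
    # The files accept a percentage with up to two decimal places.
    return f"{value * 100 / SCHED_CAPACITY_SCALE:.2f}"


def write_uclamp(cgroup_dir, util_min=None, util_max=None):
    """Write the requested clamps into an existing cgroup directory."""
    cgroup_dir = Path(cgroup_dir)
    if util_min is not None:
        (cgroup_dir / "cpu.uclamp.min").write_text(clamp_to_percent(util_min))
    if util_max is not None:
        (cgroup_dir / "cpu.uclamp.max").write_text(clamp_to_percent(util_max))
```

For example, write_uclamp("/sys/fs/cgroup/background", util_max=512) would
write "50.00" to cpu.uclamp.max of a (hypothetical) "background" group.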

You need to use the schedutil cpufreq governor.
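
One way to check this from a program is to read the governor of every cpufreq
policy from sysfs. This is only a sketch: it assumes the usual
/sys/devices/system/cpu/cpufreq/policy*/scaling_governor layout, and the
function names are made up for illustration.

```python
from pathlib import Path


def governors(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Map each cpufreq policy directory name to its current governor."""
    pattern = "cpufreq/policy*/scaling_governor"
    return {
        gov_file.parent.name: gov_file.read_text().strip()
        for gov_file in sorted(Path(sysfs_cpu_root).glob(pattern))
    }


def all_schedutil(sysfs_cpu_root="/sys/devices/system/cpu"):
    """True only if cpufreq policies exist and all of them use schedutil."""
    found = governors(sysfs_cpu_root)
    return bool(found) and all(g == "schedutil" for g in found.values())
```

A caller could skip setting clamps, or log a hint, when all_schedutil()
returns False.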

There was an LWN article on the feature that might help with more background:

https://lwn.net/Articles/762043/

HTH.

Cheers

--
Qais Yousef

2022-05-25 19:41:35

by Qais Yousef

Subject: Re: Scheduling for heterogeneous computers

Hi Paul

On 05/24/22 15:23, Paul Bone wrote:
> Hi Qais,
>
> That's excellent.
>
> I'll definitely check out those links. This could be very interesting for
> people using firefox on a phone/tablet, where we can run background tasks with
> a lower UCLAMP_MAX

If you're running on Android, you might find that you won't have permission to
use uclamp directly. Android restricts access and requires you to use higher
level APIs sometimes.

I'm also not sure whether they have an API that lets you do what you want. I've
seen the concept of creating Foreground and Background jobs in one of their
Google I/O presentations, but I'm not sure whether this is tied to uclamp_max.
It might still give you similar results regardless of the underlying
mechanism.

If you're running on a mainline kernel, then the biggest issue you might
encounter is that the sched_setattr() syscall is not part of any libc yet. So
you need to create your own wrapper - look at uclampset for an example.

Laptops can still benefit from this by the way. Hopefully everyone is moving to
schedutil by default, which is a prerequisite for using uclamp. It can also help
in SMP environments to avoid driving the frequency high for tasks that don't
really care about performance but are otherwise busy.

You can also use UCLAMP_MIN to boost bursty tasks that are not busy but need to
get work done within a certain amount of time, where DVFS delays can prevent
them from running at an adequate frequency. UCLAMP_MIN ensures that when they
wake up they always see at least the performance point it specifies.

RT tasks respect uclamp values too. You can opt in to run at a frequency other
than the MAX frequency, which leads to high power consumption on battery
powered devices. RT tasks always run at a constant frequency, so they need to
be controlled with UCLAMP_MIN only.

Happy hacking ;-)

Cheers

--
Qais Yousef

2022-05-28 19:31:23

by Paul Bone

Subject: Re: Scheduling for heterogeneous computers

On Wed, May 25, 2022 at 04:29:56PM +0100, Qais Yousef wrote:
> Hi Paul
>
> On 05/24/22 15:23, Paul Bone wrote:
> > Hi Qais,
> >
> > That's excellent.
> >
> > I'll definitely check out those links. This could be very interesting for
> > people using firefox on a phone/tablet, where we can run background tasks with
> > a lower UCLAMP_MAX
>
> If you're running on Android, you might find that you won't have permission to
> use uclamp directly. Android restricts access and requires you to use higher
> level APIs sometimes.
>
> And I'm not sure if they have API to allow you to do what you want. I've seen
> they have the concept of creating Foreground and Background jobs in one of
> their Google IO presentations. But not sure if this will be tied to uclamp_max.
> It might give you similar results still though regardless of the underlying
> mechanism.

We want to support both desktop and Android. I have been assuming there's
an API for Android already; I vaguely remember hearing about it before. We
might already be using it (at least for processes but not yet for individual
threads).

I was searching online now, but not for very long, and didn't find something
like this, so maybe Android doesn't expose it, or at least not in one of the
APIs they encourage you to use.

What I'd really like is an API where I can choose one of:

* This task is user-interactive, as-quick-as-possible please.
* This task is not user-interactive, but does have a deadline.
* This task doesn't have a deadline.

That is, rather than choosing a suitable UCLAMP_MAX myself. I'll experiment
with the numbers, but choosing "400" on one system might mean something
different from "400" on another system. But I guess that's the problem: there
are gray areas between my discrete options above. A deadline could be "finish
doing GC before we run out of memory" (where the GC can give feedback about
whether it's on target), or "finish encoding this video before the client wants
to publish it", or "finish rendering this frame of a video game before the next
VBLANK". Depending on how on-target any of these are we could decrease or
increase clock speed, because decreasing will always save power as long as
things get done by their deadline.

> If you're running on mainline kernel, then the biggest issue you might
> encounter is that sched_setattr() syscall is not part of any libc yet. So you
> need to create your own wrapper - look at uclampset for an example.

Okay thanks.

> Laptops can still benefit from this by the way. Hopefully everyone is moving to
> schedutil by default which is a pre-requisite to using uclamp. It can also help
> in SMP environments to avoid driving frequency high for tasks that don't really
> care about performance but otherwise busy.

Indeed, servers too.

> You can also use UCLAMP_MIN to boost bursty tasks that are not busy but require
> to get work done within a certain amount of time and DVFS delays can prevent
> them from running at adequate frequency. UCLAMP_MIN will ensure they always
> perceive a performance point specified by UCLAMP_MIN at a minimum when they
> wakeup.

That could be very useful for something on a UI deadline.

> RT tasks respect uclamp values too. You can opt-in to run at a different
> frequency than MAX frequency which leads to high power consumption on battery
> powered devices. RT tasks always run at constant frequency, so need to be
> controlled with UCLAMP_MIN only.
>
> Happy hacking ;-)

Thank you.



2022-06-01 19:27:02

by Qais Yousef

Subject: Re: Scheduling for heterogeneous computers

On 05/27/22 15:45, Paul Bone wrote:
> On Wed, May 25, 2022 at 04:29:56PM +0100, Qais Yousef wrote:
> > Hi Paul
> >
> > On 05/24/22 15:23, Paul Bone wrote:
> > > Hi Qais,
> > >
> > > That's excellent.
> > >
> > > I'll definitely check out those links. This could be very interesting for
> > > people using firefox on a phone/tablet, where we can run background tasks with
> > > a lower UCLAMP_MAX
> >
> > If you're running on Android, you might find that you won't have permission to
> > use uclamp directly. Android restricts access and requires you to use higher
> > level APIs sometimes.
> >
> > And I'm not sure if they have API to allow you to do what you want. I've seen
> > they have the concept of creating Foreground and Background jobs in one of
> > their Google IO presentations. But not sure if this will be tied to uclamp_max.
> > It might give you similar results still though regardless of the underlying
> > mechanism.
>
> We want to support both desktop and android. I have been assuming there's
> an API for android already, I vaguely remember hearing about it before. We
> might already be using it (at least for processes but not yet for individual
> threads).
>
> I was searching online now, but not for very long, and didn't find something
> like this, so maybe Android doesn't expose it, or at least not in one of the
> APIs they encourage you to use.

I am not an Android developer, so don't take this as guidance :-)

But what I've seen that seemed related is this:

* https://developer.android.com/guide/background
* https://www.youtube.com/watch?v=IqnCqHyu1E4

I don't know the inner plumbing of these APIs; these are just some relevant
things I've come across. I hope they get attached to a background cgroup and
benefit from uclamp indirectly that way.

>
> What I'd really like is an API where I can choose one of:
>
> * This task is user-interactive, as-quick-as-possible please.
> * This task is not user-interactive, but does have a deadline.

What's the difference between the two?

Is as-quickly-as-possible about wake up latency or DVFS latency?

If the former, then we had several discussions about that at OSPM and LPC. The
latest proposal to help tag tasks that care about wake up latency is here:

https://lore.kernel.org/lkml/[email protected]/

If the latter, then uclamp_min should help you tell the kernel what performance
you need to get your work done in time. You can dynamically adjust it, or set
it once after a short discovery period assuming your workload is constant for
the duration of its lifetime. The goal is to keep it as small as possible to
avoid wasting power, yet without missing a deadline.
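
The "short discovery period" could, for instance, be a binary search for the
smallest uclamp_min that still meets the deadline. The sketch below assumes
that run time does not get worse as the clamp is raised, and that `measure` is
a caller-supplied (hypothetical) callback which applies the clamp, runs one
iteration of the workload and returns the elapsed time. Real measurements are
noisy, so a practical version would likely average several runs.

```python
def find_min_clamp(measure, deadline, lo=0, hi=1024):
    """Return the smallest clamp in [lo, hi] that meets `deadline`.

    Returns None if even the highest clamp misses the deadline.
    """
    if measure(hi) > deadline:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(mid) <= deadline:
            hi = mid        # mid is fast enough, try something smaller
        else:
            lo = mid + 1    # mid is too slow, a higher clamp is needed
    return lo
```

The result would then be passed as UCLAMP_MIN for the rest of the workload's
lifetime.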

> * This task doesn't have a deadline.

I think we have enough plumbing in the kernel to provide these classifications.
It'd be nice to have a library that provides a higher level API for end users.
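
Purely as an illustration of what such a library layer might look like, the
sketch below maps the three classes from the list above onto
(uclamp_min, uclamp_max) pairs. The class names and every number in the table
are hypothetical placeholders; suitable values depend on the target system.

```python
from enum import Enum


class WorkClass(Enum):
    USER_INTERACTIVE = "user-interactive"
    DEADLINE = "deadline"
    BULK = "bulk"


# Hypothetical starting values on the 0..1024 scale, not recommendations.
DEFAULT_CLAMPS = {
    WorkClass.USER_INTERACTIVE: (512, 1024),  # boosted, never capped
    WorkClass.DEADLINE: (0, 1024),            # left to the scheduler's estimate
    WorkClass.BULK: (0, 256),                 # capped to favour efficiency
}


def clamps_for(work_class, table=DEFAULT_CLAMPS):
    """Look up the (uclamp_min, uclamp_max) pair for a work class."""
    util_min, util_max = table[work_class]
    if not 0 <= util_min <= util_max <= 1024:
        raise ValueError("invalid clamp pair in table")
    return util_min, util_max
```

A thread would call clamps_for(WorkClass.BULK) when it starts background work
and hand the pair to whatever mechanism sets the clamps.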

> Rather than choosing a suitable UCLAMP_MAX, I'll experiment with the numbers
> but choosing "400" on one system might mean something different from "400"
> on another system. But I guess that's the problem, there are gray areas
> between my discrete options above. A deadline could be "finish doing GC
> before we run out of memory" (which can have feedback from the GC about if
> it's on target), or "Finish encoding this video before the client wants to
> publish it", or "finish rendering this frame of a video game before the next
> VBLANK". Depending on how on-target any of these are we could decrease or
> increase clock speed, because decreasing will always save power as long as
> things get done by their deadline.

Yep. I'm glad you're aware that "400" could mean anything and depends on the
target.

Any feedback on how to make this more useful will be appreciated. I think
picking the middle (512) and then expanding or shrinking based on how much
headroom you have (or performance you're willing to sacrifice) might be a good
starting point. I think steering away from the top perf points will yield good
results in general. To squeeze out more, maybe we'll need to expose more info to
allow for portable code. If you know your system, you can make some assumptions.
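
A simple controller along those lines might start at 512 and move in fixed
steps. In this sketch "headroom" is assumed to be the fraction of the deadline
left unused by the last run (negative when the deadline was missed); the step
size and the 10%..30% target band are arbitrary illustrative choices.

```python
START_CLAMP = 512


def next_clamp(current, headroom, step=64, low=0.1, high=0.3, lo=0, hi=1024):
    """Return the clamp to use for the next run.

    Raise the clamp when headroom falls below `low`, lower it when headroom
    exceeds `high`, and keep it unchanged inside the band.
    """
    if headroom < low:
        current += step
    elif headroom > high:
        current -= step
    # Keep the result within the valid utilization range.
    return max(lo, min(hi, current))
```

For example, a run that finished with half of its deadline to spare would move
the clamp from 512 down to 448 for the next run.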

I'd be interested to know how well you can do with simple controls like these.

Cheers

--
Qais Yousef