2020-03-11 20:30:18

by Daniel Lezcano

Subject: [PATCH V2] sched: fair: Use the earliest break even

In the idle CPU selection process occurring in the slow path via the
find_idlest_group_cpu() function, we preferentially pick an idle CPU
with the shallowest idle state; otherwise we fall back to the least
loaded CPU.

In order to be more energy efficient without impacting performance,
let's use another criterion: the break even deadline.

At idle time, when we store the idle state the CPU is entering, we
compute the next deadline after which the CPU can be woken up without
the sleep having cost more energy than it saved.

At selection time, we still pick the CPU with the shallowest idle
state, but among those we choose the one with the earliest break even
deadline instead of relying on idle_timestamp. When the CPU is idle,
the timestamp has little meaning because the CPU could have woken up
and gone back to sleep several times without exiting the idle loop. In
that case the break even deadline is more relevant, as it increases
the probability of choosing a CPU which has reached its break even
point.
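The tie-break described above can be sketched in user space as follows.
This is a simplified, standalone model of the selection logic, not the
kernel code: the struct and function names are hypothetical stand-ins
for the rq/idle-state data the patch actually uses.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Simplified model: each idle CPU carries the exit latency of its idle
 * state and the break even deadline computed when it entered idle
 * (entry time + exit latency).
 */
struct cpu_idle_info {
	int idle;                  /* 1 if the CPU is in an idle state */
	unsigned int exit_latency; /* exit latency of that state */
	int64_t break_even;        /* absolute break even deadline, ns */
};

static int pick_idlest_cpu(const struct cpu_idle_info *cpus, int nr)
{
	unsigned int min_exit_latency = UINT_MAX;
	int64_t min_break_even = INT64_MAX;
	int best = -1;

	for (int i = 0; i < nr; i++) {
		if (!cpus[i].idle)
			continue;
		if (cpus[i].exit_latency < min_exit_latency) {
			/* A shallower idle state always wins. */
			min_exit_latency = cpus[i].exit_latency;
			min_break_even = cpus[i].break_even;
			best = i;
		} else if (cpus[i].exit_latency == min_exit_latency &&
			   cpus[i].break_even < min_break_even) {
			/* Same depth: prefer the earliest break even. */
			min_break_even = cpus[i].break_even;
			best = i;
		}
	}
	return best;
}
```

With equal idle-state depths the earliest break even deadline decides,
which is the behavioral change this patch introduces over the plain
idle_stamp comparison.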

Tested on:
- a SynQuacer: 24 cores, 6 sched domains
- a Hikey960: HMP, 8 cores, 2 sched domains, with EAS and the energy probe

sched/perf and messaging do not show a performance regression. Ran
schbench, adrestia and forkbench 50 times each.

The tools are described at https://lwn.net/Articles/724935/

---------------------------------------------------------------
| Synquacer           | With break even | Without break even  |
---------------------------------------------------------------
| schbench *99.0th    |         14844.8 |             15017.6 |
| adrestia / periodic |           57.95 |                  57 |
| adrestia / single   |            49.3 |                55.4 |
---------------------------------------------------------------
| Hikey960            | With break even | Without break even  |
---------------------------------------------------------------
| schbench *99.0th    |         56140.8 |               56256 |
| schbench energy     |         153.575 |             152.676 |
| adrestia / periodic |            4.98 |                 5.2 |
| adrestia / single   |            9.02 |                9.12 |
| adrestia energy     |            1.18 |               1.233 |
| forkbench           |           7.971 |                8.05 |
| forkbench energy    |            9.37 |                9.42 |
---------------------------------------------------------------

Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++--
kernel/sched/idle.c | 8 +++++++-
kernel/sched/sched.h | 20 ++++++++++++++++++++
3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b5d5e5e701e..8bd6ea148db7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
{
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
+ s64 min_break_even = S64_MAX;
u64 latest_idle_timestamp = 0;
int least_loaded_cpu = this_cpu;
int shallowest_idle_cpu = -1;
@@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
if (available_idle_cpu(i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
+ s64 break_even = idle_get_break_even(rq);
+
if (idle && idle->exit_latency < min_exit_latency) {
/*
* We give priority to a CPU whose idle state
@@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
* of any idle timestamp.
*/
min_exit_latency = idle->exit_latency;
+ min_break_even = break_even;
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
- } else if ((!idle || idle->exit_latency == min_exit_latency) &&
- rq->idle_stamp > latest_idle_timestamp) {
+ } else if ((idle && idle->exit_latency == min_exit_latency) &&
+ break_even < min_break_even) {
+ /*
+ * We give priority to the shallowest
+ * idle states with the minimal break
+ * even deadline to decrease the
+ * probability to choose a CPU which
+ * did not reach its break even yet
+ */
+ min_break_even = break_even;
+ shallowest_idle_cpu = i;
+ } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
/*
* If equal or no active idle state, then
* the most recently idled CPU might have
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..3342e7bae072 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
*/
void sched_idle_set_state(struct cpuidle_state *idle_state)
{
- idle_set_state(this_rq(), idle_state);
+ struct rq *rq = this_rq();
+
+ idle_set_state(rq, idle_state);
+
+ if (idle_state)
+ idle_set_break_even(rq, ktime_get_ns() +
+ idle_state->exit_latency_ns);
}

static int __read_mostly cpu_idle_force_poll;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0caf394dd4..eef1e535e2c2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ struct rq {
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
+ s64 break_even;
#endif
};

@@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)

return rq->idle_state;
}
+
+static inline void idle_set_break_even(struct rq *rq, s64 break_even)
+{
+ WRITE_ONCE(rq->break_even, break_even);
+}
+
+static inline s64 idle_get_break_even(struct rq *rq)
+{
+ return READ_ONCE(rq->break_even);
+}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@@ -1860,6 +1871,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
return NULL;
}
+
+static inline void idle_set_break_even(struct rq *rq, s64 break_even)
+{
+}
+
+static inline s64 idle_get_break_even(struct rq *rq)
+{
+ return 0;
+}
#endif

extern void schedule_idle(void);
--
2.17.1


2020-03-12 08:38:48

by Vincent Guittot

Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

Hi Daniel,

On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
>
> In the idle CPU selection process occuring in the slow path via the
> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> with the shallowest idle state otherwise we fall back to the least
> loaded CPU.

The idea makes sense, but this path is only used by fork and exec, so
I'm not sure about the real impact.

>
> In order to be more energy efficient but without impacting the
> performances, let's use another criteria: the break even deadline.
>
> At idle time, when we store the idle state the CPU is entering in, we
> compute the next deadline where the CPU could be woken up without
> spending more energy to sleep.
>
> At the selection process, we use the shallowest CPU but in addition we
> choose the one with the minimal break even deadline instead of relying
> on the idle_timestamp. When the CPU is idle, the timestamp has less
> meaning because the CPU could have wake up and sleep again several times
> without exiting the idle loop. In this case the break even deadline is
> more relevant as it increases the probability of choosing a CPU which
> reached its break even.
>
> Tested on:
> - a synquacer 24 cores, 6 sched domains
> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>
> sched/perf and messaging does not show a performance regression. Ran
> 50 times schbench, adrestia and forkbench.
>
> The tools described at https://lwn.net/Articles/724935/
>
> --------------------------------------------------------------
> | Synquacer | With break even | Without break even |
> --------------------------------------------------------------
> | schbench *99.0th | 14844.8 | 15017.6 |
> | adrestia / periodic | 57.95 | 57 |
> | adrestia / single | 49.3 | 55.4 |
> --------------------------------------------------------------

Have you got some figures or cpuidle statistics for the SynQuacer?


> | Hikey960 | With break even | Without break even |
> --------------------------------------------------------------
> | schbench *99.0th | 56140.8 | 56256 |
> | schbench energy | 153.575 | 152.676 |
> | adrestia / periodic | 4.98 | 5.2 |
> | adrestia / single | 9.02 | 9.12 |
> | adrestia energy | 1.18 | 1.233 |
> | forkbench | 7.971 | 8.05 |
> | forkbench energy | 9.37 | 9.42 |
> --------------------------------------------------------------
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> kernel/sched/fair.c | 18 ++++++++++++++++--
> kernel/sched/idle.c | 8 +++++++-
> kernel/sched/sched.h | 20 ++++++++++++++++++++
> 3 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b5d5e5e701e..8bd6ea148db7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> {
> unsigned long load, min_load = ULONG_MAX;
> unsigned int min_exit_latency = UINT_MAX;
> + s64 min_break_even = S64_MAX;
> u64 latest_idle_timestamp = 0;
> int least_loaded_cpu = this_cpu;
> int shallowest_idle_cpu = -1;
> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> if (available_idle_cpu(i)) {
> struct rq *rq = cpu_rq(i);
> struct cpuidle_state *idle = idle_get_state(rq);
> + s64 break_even = idle_get_break_even(rq);
> +
> if (idle && idle->exit_latency < min_exit_latency) {
> /*
> * We give priority to a CPU whose idle state
> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> * of any idle timestamp.
> */
> min_exit_latency = idle->exit_latency;
> + min_break_even = break_even;
> latest_idle_timestamp = rq->idle_stamp;
> shallowest_idle_cpu = i;
> - } else if ((!idle || idle->exit_latency == min_exit_latency) &&
> - rq->idle_stamp > latest_idle_timestamp) {
> + } else if ((idle && idle->exit_latency == min_exit_latency) &&
> + break_even < min_break_even) {
> + /*
> + * We give priority to the shallowest
> + * idle states with the minimal break
> + * even deadline to decrease the
> + * probability to choose a CPU which
> + * did not reach its break even yet
> + */
> + min_break_even = break_even;
> + shallowest_idle_cpu = i;
> + } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
> /*
> * If equal or no active idle state, then
> * the most recently idled CPU might have
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..3342e7bae072 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
> */
> void sched_idle_set_state(struct cpuidle_state *idle_state)
> {
> - idle_set_state(this_rq(), idle_state);
> + struct rq *rq = this_rq();
> +
> + idle_set_state(rq, idle_state);

Shouldn't the state be set after setting the break even? Otherwise you
will have a time window with idle_state != NULL but break_even still
set to the previous value.

> +
> + if (idle_state)
> + idle_set_break_even(rq, ktime_get_ns() +

What worries me a bit is that it adds one ktime_get() call each time a
CPU enters idle.

> + idle_state->exit_latency_ns);
> }
>
> static int __read_mostly cpu_idle_force_poll;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2a0caf394dd4..eef1e535e2c2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1015,6 +1015,7 @@ struct rq {
> #ifdef CONFIG_CPU_IDLE
> /* Must be inspected within a rcu lock section */
> struct cpuidle_state *idle_state;
> + s64 break_even;
> #endif
> };
>
> @@ -1850,6 +1851,16 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>
> return rq->idle_state;
> }
> +
> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
> +{
> + WRITE_ONCE(rq->break_even, break_even);
> +}
> +
> +static inline s64 idle_get_break_even(struct rq *rq)
> +{
> + return READ_ONCE(rq->break_even);
> +}
> #else
> static inline void idle_set_state(struct rq *rq,
> struct cpuidle_state *idle_state)
> @@ -1860,6 +1871,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> {
> return NULL;
> }
> +
> +static inline void idle_set_break_even(struct rq *rq, s64 break_even)
> +{
> +}
> +
> +static inline s64 idle_get_break_even(struct rq *rq)
> +{
> + return 0;
> +}
> #endif
>
> extern void schedule_idle(void);
> --
> 2.17.1
>

2020-03-12 10:05:38

by Daniel Lezcano

Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 12/03/2020 09:36, Vincent Guittot wrote:
> Hi Daniel,
>
> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
>>
>> In the idle CPU selection process occuring in the slow path via the
>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>> with the shallowest idle state otherwise we fall back to the least
>> loaded CPU.
>
> The idea makes sense but this path is only used by fork and exec so
> I'm not sure about the real impact

I agree the fork / exec path is called much less often than the wake
path, but this is where the decision makes the most sense.

>> In order to be more energy efficient but without impacting the
>> performances, let's use another criteria: the break even deadline.
>>
>> At idle time, when we store the idle state the CPU is entering in, we
>> compute the next deadline where the CPU could be woken up without
>> spending more energy to sleep.
>>
>> At the selection process, we use the shallowest CPU but in addition we
>> choose the one with the minimal break even deadline instead of relying
>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>> meaning because the CPU could have wake up and sleep again several times
>> without exiting the idle loop. In this case the break even deadline is
>> more relevant as it increases the probability of choosing a CPU which
>> reached its break even.
>>
>> Tested on:
>> - a synquacer 24 cores, 6 sched domains
>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>>
>> sched/perf and messaging does not show a performance regression. Ran
>> 50 times schbench, adrestia and forkbench.
>>
>> The tools described at https://lwn.net/Articles/724935/
>>
>> --------------------------------------------------------------
>> | Synquacer | With break even | Without break even |
>> --------------------------------------------------------------
>> | schbench *99.0th | 14844.8 | 15017.6 |
>> | adrestia / periodic | 57.95 | 57 |
>> | adrestia / single | 49.3 | 55.4 |
>> --------------------------------------------------------------
>
> Have you got some figures or cpuidle statistics for the syncquacer ?

No, and we just noticed the SynQuacer has a bug in its firmware and
does not actually enter the idle states.


>> | Hikey960 | With break even | Without break even |
>> --------------------------------------------------------------
>> | schbench *99.0th | 56140.8 | 56256 |
>> | schbench energy | 153.575 | 152.676 |
>> | adrestia / periodic | 4.98 | 5.2 |
>> | adrestia / single | 9.02 | 9.12 |
>> | adrestia energy | 1.18 | 1.233 |
>> | forkbench | 7.971 | 8.05 |
>> | forkbench energy | 9.37 | 9.42 |
>> --------------------------------------------------------------
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> kernel/sched/fair.c | 18 ++++++++++++++++--
>> kernel/sched/idle.c | 8 +++++++-
>> kernel/sched/sched.h | 20 ++++++++++++++++++++
>> 3 files changed, 43 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4b5d5e5e701e..8bd6ea148db7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>> {
>> unsigned long load, min_load = ULONG_MAX;
>> unsigned int min_exit_latency = UINT_MAX;
>> + s64 min_break_even = S64_MAX;
>> u64 latest_idle_timestamp = 0;
>> int least_loaded_cpu = this_cpu;
>> int shallowest_idle_cpu = -1;
>> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>> if (available_idle_cpu(i)) {
>> struct rq *rq = cpu_rq(i);
>> struct cpuidle_state *idle = idle_get_state(rq);
>> + s64 break_even = idle_get_break_even(rq);
>> +
>> if (idle && idle->exit_latency < min_exit_latency) {
>> /*
>> * We give priority to a CPU whose idle state
>> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>> * of any idle timestamp.
>> */
>> min_exit_latency = idle->exit_latency;
>> + min_break_even = break_even;
>> latest_idle_timestamp = rq->idle_stamp;
>> shallowest_idle_cpu = i;
>> - } else if ((!idle || idle->exit_latency == min_exit_latency) &&
>> - rq->idle_stamp > latest_idle_timestamp) {
>> + } else if ((idle && idle->exit_latency == min_exit_latency) &&
>> + break_even < min_break_even) {
>> + /*
>> + * We give priority to the shallowest
>> + * idle states with the minimal break
>> + * even deadline to decrease the
>> + * probability to choose a CPU which
>> + * did not reach its break even yet
>> + */
>> + min_break_even = break_even;
>> + shallowest_idle_cpu = i;
>> + } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
>> /*
>> * If equal or no active idle state, then
>> * the most recently idled CPU might have
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..3342e7bae072 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>> */
>> void sched_idle_set_state(struct cpuidle_state *idle_state)
>> {
>> - idle_set_state(this_rq(), idle_state);
>> + struct rq *rq = this_rq();
>> +
>> + idle_set_state(rq, idle_state);
>
> Shouldn't the state be set after setting break even otherwise you will
> have a time window with an idle_state != null but the break_even still
> set to the previous value

IIUC we are protected in this section. Otherwise the routine above
[if (idle && idle->exit_latency)] would also be wrong, no?

>> +
>> + if (idle_state)
>> + idle_set_break_even(rq, ktime_get_ns() +
>
> What worries me a bit is that it adds one ktime_get call each time a
> cpu enters idle

Right, we can improve this in the future by folding the local_clock()
call made in cpuidle when entering idle into this ktime_get().

>> + idle_state->exit_latency_ns);
>> }

[ ... ]


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2020-03-12 12:30:12

by Vincent Guittot

Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Thu, 12 Mar 2020 at 11:04, Daniel Lezcano <[email protected]> wrote:
>
> On 12/03/2020 09:36, Vincent Guittot wrote:
> > Hi Daniel,
> >
> > On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> >>
> >> In the idle CPU selection process occuring in the slow path via the
> >> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> >> with the shallowest idle state otherwise we fall back to the least
> >> loaded CPU.
> >
> > The idea makes sense but this path is only used by fork and exec so
> > I'm not sure about the real impact
>
> I agree the fork / exec path is called much less often than the wake
> path but it makes more sense for the decision.
>
> >> In order to be more energy efficient but without impacting the
> >> performances, let's use another criteria: the break even deadline.
> >>
> >> At idle time, when we store the idle state the CPU is entering in, we
> >> compute the next deadline where the CPU could be woken up without
> >> spending more energy to sleep.
> >>
> >> At the selection process, we use the shallowest CPU but in addition we
> >> choose the one with the minimal break even deadline instead of relying
> >> on the idle_timestamp. When the CPU is idle, the timestamp has less
> >> meaning because the CPU could have wake up and sleep again several times
> >> without exiting the idle loop. In this case the break even deadline is
> >> more relevant as it increases the probability of choosing a CPU which
> >> reached its break even.
> >>
> >> Tested on:
> >> - a synquacer 24 cores, 6 sched domains
> >> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
> >>
> >> sched/perf and messaging does not show a performance regression. Ran
> >> 50 times schbench, adrestia and forkbench.
> >>
> >> The tools described at https://lwn.net/Articles/724935/
> >>
> >> --------------------------------------------------------------
> >> | Synquacer | With break even | Without break even |
> >> --------------------------------------------------------------
> >> | schbench *99.0th | 14844.8 | 15017.6 |
> >> | adrestia / periodic | 57.95 | 57 |
> >> | adrestia / single | 49.3 | 55.4 |
> >> --------------------------------------------------------------
> >
> > Have you got some figures or cpuidle statistics for the syncquacer ?
>
> No, and we just noticed the syncquacer has a bug in the firmware and
> does not actually go to the idle states.
>
>
> >> | Hikey960 | With break even | Without break even |
> >> --------------------------------------------------------------
> >> | schbench *99.0th | 56140.8 | 56256 |
> >> | schbench energy | 153.575 | 152.676 |
> >> | adrestia / periodic | 4.98 | 5.2 |
> >> | adrestia / single | 9.02 | 9.12 |
> >> | adrestia energy | 1.18 | 1.233 |
> >> | forkbench | 7.971 | 8.05 |
> >> | forkbench energy | 9.37 | 9.42 |
> >> --------------------------------------------------------------
> >>
> >> Signed-off-by: Daniel Lezcano <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 18 ++++++++++++++++--
> >> kernel/sched/idle.c | 8 +++++++-
> >> kernel/sched/sched.h | 20 ++++++++++++++++++++
> >> 3 files changed, 43 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 4b5d5e5e701e..8bd6ea148db7 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >> {
> >> unsigned long load, min_load = ULONG_MAX;
> >> unsigned int min_exit_latency = UINT_MAX;
> >> + s64 min_break_even = S64_MAX;
> >> u64 latest_idle_timestamp = 0;
> >> int least_loaded_cpu = this_cpu;
> >> int shallowest_idle_cpu = -1;
> >> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >> if (available_idle_cpu(i)) {
> >> struct rq *rq = cpu_rq(i);
> >> struct cpuidle_state *idle = idle_get_state(rq);
> >> + s64 break_even = idle_get_break_even(rq);
> >> +
> >> if (idle && idle->exit_latency < min_exit_latency) {
> >> /*
> >> * We give priority to a CPU whose idle state
> >> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >> * of any idle timestamp.
> >> */
> >> min_exit_latency = idle->exit_latency;
> >> + min_break_even = break_even;
> >> latest_idle_timestamp = rq->idle_stamp;
> >> shallowest_idle_cpu = i;
> >> - } else if ((!idle || idle->exit_latency == min_exit_latency) &&
> >> - rq->idle_stamp > latest_idle_timestamp) {
> >> + } else if ((idle && idle->exit_latency == min_exit_latency) &&
> >> + break_even < min_break_even) {
> >> + /*
> >> + * We give priority to the shallowest
> >> + * idle states with the minimal break
> >> + * even deadline to decrease the
> >> + * probability to choose a CPU which
> >> + * did not reach its break even yet
> >> + */
> >> + min_break_even = break_even;
> >> + shallowest_idle_cpu = i;
> >> + } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
> >> /*
> >> * If equal or no active idle state, then
> >> * the most recently idled CPU might have
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index b743bf38f08f..3342e7bae072 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
> >> */
> >> void sched_idle_set_state(struct cpuidle_state *idle_state)
> >> {
> >> - idle_set_state(this_rq(), idle_state);
> >> + struct rq *rq = this_rq();
> >> +
> >> + idle_set_state(rq, idle_state);
> >
> > Shouldn't the state be set after setting break even otherwise you will
> > have a time window with an idle_state != null but the break_even still
> > set to the previous value
>
> IIUC we are protected in this section. Otherwise the routine above would
> be also wrong [if (idle && idle->exit_latency)], no?

No, they are not the same: that code uses the idle pointer to read
exit_latency, so we are sure to use the exit_latency related to that
idle pointer.

In your case it checks that idle is not NULL, but then it uses rq to
read break_even, which might not have been updated yet.

>
> >> +
> >> + if (idle_state)
> >> + idle_set_break_even(rq, ktime_get_ns() +
> >
> > What worries me a bit is that it adds one ktime_get call each time a
> > cpu enters idle
>
> Right, we can improve this in the future by folding the local_clock() in
> cpuidle when entering idle with this ktime_get.

Using local_clock() would be more latency-friendly.

>
> >> + idle_state->exit_latency_ns);
> >> }
>
> [ ... ]
>
>

2020-03-13 12:16:58

by Daniel Lezcano

Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 12/03/2020 13:27, Vincent Guittot wrote:
> On Thu, 12 Mar 2020 at 11:04, Daniel Lezcano <[email protected]> wrote:
>>
>> On 12/03/2020 09:36, Vincent Guittot wrote:
>>> Hi Daniel,
>>>
>>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
>>>>
>>>> In the idle CPU selection process occuring in the slow path via the
>>>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>>>> with the shallowest idle state otherwise we fall back to the least
>>>> loaded CPU.
>>>
>>> The idea makes sense but this path is only used by fork and exec so
>>> I'm not sure about the real impact
>>
>> I agree the fork / exec path is called much less often than the wake
>> path but it makes more sense for the decision.
>>
>>>> In order to be more energy efficient but without impacting the
>>>> performances, let's use another criteria: the break even deadline.
>>>>
>>>> At idle time, when we store the idle state the CPU is entering in, we
>>>> compute the next deadline where the CPU could be woken up without
>>>> spending more energy to sleep.
>>>>
>>>> At the selection process, we use the shallowest CPU but in addition we
>>>> choose the one with the minimal break even deadline instead of relying
>>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>>>> meaning because the CPU could have wake up and sleep again several times
>>>> without exiting the idle loop. In this case the break even deadline is
>>>> more relevant as it increases the probability of choosing a CPU which
>>>> reached its break even.
>>>>
>>>> Tested on:
>>>> - a synquacer 24 cores, 6 sched domains
>>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>>>>
>>>> sched/perf and messaging does not show a performance regression. Ran
>>>> 50 times schbench, adrestia and forkbench.
>>>>
>>>> The tools described at https://lwn.net/Articles/724935/
>>>>
>>>> --------------------------------------------------------------
>>>> | Synquacer | With break even | Without break even |
>>>> --------------------------------------------------------------
>>>> | schbench *99.0th | 14844.8 | 15017.6 |
>>>> | adrestia / periodic | 57.95 | 57 |
>>>> | adrestia / single | 49.3 | 55.4 |
>>>> --------------------------------------------------------------
>>>
>>> Have you got some figures or cpuidle statistics for the syncquacer ?
>>
>> No, and we just noticed the syncquacer has a bug in the firmware and
>> does not actually go to the idle states.
>>
>>
>>>> | Hikey960 | With break even | Without break even |
>>>> --------------------------------------------------------------
>>>> | schbench *99.0th | 56140.8 | 56256 |
>>>> | schbench energy | 153.575 | 152.676 |
>>>> | adrestia / periodic | 4.98 | 5.2 |
>>>> | adrestia / single | 9.02 | 9.12 |
>>>> | adrestia energy | 1.18 | 1.233 |
>>>> | forkbench | 7.971 | 8.05 |
>>>> | forkbench energy | 9.37 | 9.42 |
>>>> --------------------------------------------------------------
>>>>
>>>> Signed-off-by: Daniel Lezcano <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 18 ++++++++++++++++--
>>>> kernel/sched/idle.c | 8 +++++++-
>>>> kernel/sched/sched.h | 20 ++++++++++++++++++++
>>>> 3 files changed, 43 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4b5d5e5e701e..8bd6ea148db7 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>> {
>>>> unsigned long load, min_load = ULONG_MAX;
>>>> unsigned int min_exit_latency = UINT_MAX;
>>>> + s64 min_break_even = S64_MAX;
>>>> u64 latest_idle_timestamp = 0;
>>>> int least_loaded_cpu = this_cpu;
>>>> int shallowest_idle_cpu = -1;
>>>> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>> if (available_idle_cpu(i)) {
>>>> struct rq *rq = cpu_rq(i);
>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>> + s64 break_even = idle_get_break_even(rq);
>>>> +
>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>> /*
>>>> * We give priority to a CPU whose idle state
>>>> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>> * of any idle timestamp.
>>>> */
>>>> min_exit_latency = idle->exit_latency;
>>>> + min_break_even = break_even;
>>>> latest_idle_timestamp = rq->idle_stamp;
>>>> shallowest_idle_cpu = i;
>>>> - } else if ((!idle || idle->exit_latency == min_exit_latency) &&
>>>> - rq->idle_stamp > latest_idle_timestamp) {
>>>> + } else if ((idle && idle->exit_latency == min_exit_latency) &&
>>>> + break_even < min_break_even) {
>>>> + /*
>>>> + * We give priority to the shallowest
>>>> + * idle states with the minimal break
>>>> + * even deadline to decrease the
>>>> + * probability to choose a CPU which
>>>> + * did not reach its break even yet
>>>> + */
>>>> + min_break_even = break_even;
>>>> + shallowest_idle_cpu = i;
>>>> + } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
>>>> /*
>>>> * If equal or no active idle state, then
>>>> * the most recently idled CPU might have
>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>>> index b743bf38f08f..3342e7bae072 100644
>>>> --- a/kernel/sched/idle.c
>>>> +++ b/kernel/sched/idle.c
>>>> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>>> */
>>>> void sched_idle_set_state(struct cpuidle_state *idle_state)
>>>> {
>>>> - idle_set_state(this_rq(), idle_state);
>>>> + struct rq *rq = this_rq();
>>>> +
>>>> + idle_set_state(rq, idle_state);
>>>
>>> Shouldn't the state be set after setting break even otherwise you will
>>> have a time window with an idle_state != null but the break_even still
>>> set to the previous value
>>
>> IIUC we are protected in this section. Otherwise the routine above would
>> be also wrong [if (idle && idle->exit_latency)], no?
>
> no there are not the same because it uses the idle pointer to read
> exit_latency so we are sure to use exit_latency related to the idle
> pointer.
>
> In your case it checks idle is not null but then it uses rq to read
> break_even but it might not have been already updated

Ok I will invert the lines.
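For illustration, inverting the lines amounts to publishing break_even
before the idle_state pointer, so an observer that sees a non-NULL
idle_state also sees the matching break even. The sketch below is a
standalone user-space model of that ordering, with hypothetical struct
and function names mirroring the kernel code, not the kernel code
itself (and without the WRITE_ONCE/READ_ONCE accessors the real patch
uses).

```c
#include <stddef.h>
#include <stdint.h>

struct idle_state_model { int64_t exit_latency_ns; };

struct rq_model {
	struct idle_state_model *idle_state;
	int64_t break_even;
};

static void set_idle_state(struct rq_model *rq,
			   struct idle_state_model *state,
			   int64_t now_ns)
{
	/* Compute the break even deadline first... */
	if (state)
		rq->break_even = now_ns + state->exit_latency_ns;

	/* ...and only then publish the idle state pointer. */
	rq->idle_state = state;
}
```

When leaving idle (state == NULL), break_even is deliberately left
stale: it is only ever read when idle_state is non-NULL.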

>>>> +
>>>> + if (idle_state)
>>>> + idle_set_break_even(rq, ktime_get_ns() +
>>>
>>> What worries me a bit is that it adds one ktime_get call each time a
>>> cpu enters idle
>>
>> Right, we can improve this in the future by folding the local_clock() in
>> cpuidle when entering idle with this ktime_get.
>
> Using local_clock() would be more latency friendly

Unfortunately we are comparing the deadline across CPUs, so the
local_clock() can not be used here.

But if we have one ktime_get() instead of a local_clock() + ktime_get(),
that should be fine, no?

>>>> + idle_state->exit_latency_ns);
>>>> }
>>
>> [ ... ]
>>
>>
>> --
>> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>>
>> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
>> <http://twitter.com/#!/linaroorg> Twitter |
>> <http://www.linaro.org/linaro-blog/> Blog
>>



2020-03-13 13:17:40

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Fri, 13 Mar 2020 at 13:15, Daniel Lezcano <[email protected]> wrote:
>
> On 12/03/2020 13:27, Vincent Guittot wrote:
> > On Thu, 12 Mar 2020 at 11:04, Daniel Lezcano <[email protected]> wrote:
> >>
> >> On 12/03/2020 09:36, Vincent Guittot wrote:
> >>> Hi Daniel,
> >>>
> >>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> >>>>
> >>>> In the idle CPU selection process occurring in the slow path via the
> >>>> find_idlest_group_cpu() function, we pick in priority an idle CPU with
> >>>> the shallowest idle state; otherwise we fall back to the least loaded
> >>>> CPU.
> >>>
> >>> The idea makes sense but this path is only used by fork and exec so
> >>> I'm not sure about the real impact
> >>
> >> I agree the fork / exec path is called much less often than the wake
> >> path but it makes more sense for the decision.
> >>
> >>>> In order to be more energy efficient without impacting performance,
> >>>> let's use another criterion: the break even deadline.
> >>>>
> >>>> At idle time, when we store the idle state the CPU is entering, we
> >>>> compute the next deadline at which the CPU can be woken up without the
> >>>> sleep having cost more energy than it saved.
> >>>>
> >>>> At the selection process, we use the shallowest CPU but in addition we
> >>>> choose the one with the minimal break even deadline instead of relying
> >>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
> >>>> meaning because the CPU could have woken up and gone back to sleep
> >>>> several times without exiting the idle loop. In this case the break
> >>>> even deadline is more relevant as it increases the probability of
> >>>> choosing a CPU which has reached its break even.
> >>>>
> >>>> Tested on:
> >>>> - a synquacer 24 cores, 6 sched domains
> >>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
> >>>>
> >>>> sched/perf and messaging do not show a performance regression. Ran
> >>>> 50 times schbench, adrestia and forkbench.
> >>>>
> >>>> The tools described at https://lwn.net/Articles/724935/
> >>>>
> >>>> --------------------------------------------------------------
> >>>> | Synquacer | With break even | Without break even |
> >>>> --------------------------------------------------------------
> >>>> | schbench *99.0th | 14844.8 | 15017.6 |
> >>>> | adrestia / periodic | 57.95 | 57 |
> >>>> | adrestia / single | 49.3 | 55.4 |
> >>>> --------------------------------------------------------------
> >>>
> >>> Have you got some figures or cpuidle statistics for the synquacer?
> >>
> >> No, and we just noticed the synquacer has a bug in the firmware and
> >> does not actually enter the idle states.
> >>
> >>
> >>>> | Hikey960 | With break even | Without break even |
> >>>> --------------------------------------------------------------
> >>>> | schbench *99.0th | 56140.8 | 56256 |
> >>>> | schbench energy | 153.575 | 152.676 |
> >>>> | adrestia / periodic | 4.98 | 5.2 |
> >>>> | adrestia / single | 9.02 | 9.12 |
> >>>> | adrestia energy | 1.18 | 1.233 |
> >>>> | forkbench | 7.971 | 8.05 |
> >>>> | forkbench energy | 9.37 | 9.42 |
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Signed-off-by: Daniel Lezcano <[email protected]>
> >>>> ---
> >>>> kernel/sched/fair.c | 18 ++++++++++++++++--
> >>>> kernel/sched/idle.c | 8 +++++++-
> >>>> kernel/sched/sched.h | 20 ++++++++++++++++++++
> >>>> 3 files changed, 43 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 4b5d5e5e701e..8bd6ea148db7 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -5793,6 +5793,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >>>> {
> >>>> unsigned long load, min_load = ULONG_MAX;
> >>>> unsigned int min_exit_latency = UINT_MAX;
> >>>> + s64 min_break_even = S64_MAX;
> >>>> u64 latest_idle_timestamp = 0;
> >>>> int least_loaded_cpu = this_cpu;
> >>>> int shallowest_idle_cpu = -1;
> >>>> @@ -5810,6 +5811,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >>>> if (available_idle_cpu(i)) {
> >>>> struct rq *rq = cpu_rq(i);
> >>>> struct cpuidle_state *idle = idle_get_state(rq);
> >>>> + s64 break_even = idle_get_break_even(rq);
> >>>> +
> >>>> if (idle && idle->exit_latency < min_exit_latency) {
> >>>> /*
> >>>> * We give priority to a CPU whose idle state
> >>>> @@ -5817,10 +5820,21 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >>>> * of any idle timestamp.
> >>>> */
> >>>> min_exit_latency = idle->exit_latency;
> >>>> + min_break_even = break_even;
> >>>> latest_idle_timestamp = rq->idle_stamp;
> >>>> shallowest_idle_cpu = i;
> >>>> - } else if ((!idle || idle->exit_latency == min_exit_latency) &&
> >>>> - rq->idle_stamp > latest_idle_timestamp) {
> >>>> + } else if ((idle && idle->exit_latency == min_exit_latency) &&
> >>>> + break_even < min_break_even) {
> >>>> + /*
> >>>> + * We give priority to the shallowest
> >>>> + * idle states with the minimal break
> >>>> + * even deadline to decrease the
> >>>> + * probability to choose a CPU which
> >>>> + * did not reach its break even yet
> >>>> + */
> >>>> + min_break_even = break_even;
> >>>> + shallowest_idle_cpu = i;
> >>>> + } else if (!idle && rq->idle_stamp > latest_idle_timestamp) {
> >>>> /*
> >>>> * If equal or no active idle state, then
> >>>> * the most recently idled CPU might have
> >>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>>> index b743bf38f08f..3342e7bae072 100644
> >>>> --- a/kernel/sched/idle.c
> >>>> +++ b/kernel/sched/idle.c
> >>>> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
> >>>> */
> >>>> void sched_idle_set_state(struct cpuidle_state *idle_state)
> >>>> {
> >>>> - idle_set_state(this_rq(), idle_state);
> >>>> + struct rq *rq = this_rq();
> >>>> +
> >>>> + idle_set_state(rq, idle_state);
> >>>
> >>> Shouldn't the state be set after setting break even otherwise you will
> >>> have a time window with an idle_state != null but the break_even still
> >>> set to the previous value
> >>
> >> IIUC we are protected in this section. Otherwise the routine above would
> >> be also wrong [if (idle && idle->exit_latency)], no?
> >
> > No, they are not the same, because it uses the idle pointer to read
> > exit_latency, so we are sure to use the exit_latency related to the idle
> > pointer.
> >
> > In your case it checks that idle is not NULL, but then it uses rq to read
> > break_even, which might not have been updated yet
>
> Ok I will invert the lines.
>
> >>>> +
> >>>> + if (idle_state)
> >>>> + idle_set_break_even(rq, ktime_get_ns() +
> >>>
> >>> What worries me a bit is that it adds one ktime_get call each time a
> >>> cpu enters idle
> >>
> >> Right, we can improve this in the future by folding the local_clock() in
> >> cpuidle when entering idle with this ktime_get.
> >
> > Using local_clock() would be more latency friendly
>
> Unfortunately we are comparing the deadline across CPUs, so the
> local_clock() can not be used here.
>
> But if we have one ktime_get() instead of a local_clock() + ktime_get(),
> that should be fine, no?

Can't this computation of break_even be done in the cpuidle framework
while computing the other statistics used for selecting the idle state?
cpuidle already uses ktime_get for the next hrtimer, for example. So
cpuidle would compute break_even and make it available to the scheduler,
like exit_latency. And I can imagine that a system-wide time value will
also be needed when looking at the next wakeup event of a cluster/group
of CPUs

>
> >>>> + idle_state->exit_latency_ns);
> >>>> }
> >>
> >> [ ... ]
> >>
> >>

2020-03-13 13:19:12

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 13/03/2020 14:15, Vincent Guittot wrote:
> On Fri, 13 Mar 2020 at 13:15, Daniel Lezcano <[email protected]> wrote:

[ ... ]

>>>>>> +
>>>>>> + if (idle_state)
>>>>>> + idle_set_break_even(rq, ktime_get_ns() +
>>>>>
>>>>> What worries me a bit is that it adds one ktime_get call each time a
>>>>> cpu enters idle
>>>>
>>>> Right, we can improve this in the future by folding the local_clock() in
>>>> cpuidle when entering idle with this ktime_get.
>>>
>>> Using local_clock() would be more latency friendly
>>
>> Unfortunately we are comparing the deadline across CPUs, so the
>> local_clock() can not be used here.
>>
>> But if we have one ktime_get() instead of a local_clock() + ktime_get(),
>> that should be fine, no?
>
> Can't this computation of break_even be done in the cpuidle framework
> while computing the other statistics used for selecting the idle state?
> cpuidle already uses ktime_get for the next hrtimer, for example. So
> cpuidle would compute break_even and make it available to the scheduler,
> like exit_latency. And I can imagine that a system-wide time value will
> also be needed when looking at the next wakeup event of a cluster/group
> of CPUs

Ok, so you suggest to revisit and consolidate the whole time capture in
cpuidle? I think that makes sense.



2020-03-13 13:22:21

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Fri, 13 Mar 2020 at 14:17, Daniel Lezcano <[email protected]> wrote:
>
> On 13/03/2020 14:15, Vincent Guittot wrote:
> > On Fri, 13 Mar 2020 at 13:15, Daniel Lezcano <[email protected]> wrote:
>
> [ ... ]
>
> >>>>>> +
> >>>>>> + if (idle_state)
> >>>>>> + idle_set_break_even(rq, ktime_get_ns() +
> >>>>>
> >>>>> What worries me a bit is that it adds one ktime_get call each time a
> >>>>> cpu enters idle
> >>>>
> >>>> Right, we can improve this in the future by folding the local_clock() in
> >>>> cpuidle when entering idle with this ktime_get.
> >>>
> >>> Using local_clock() would be more latency friendly
> >>
> >> Unfortunately we are comparing the deadline across CPUs, so the
> >> local_clock() can not be used here.
> >>
> >> But if we have one ktime_get() instead of a local_clock() + ktime_get(),
> >> that should be fine, no?
> >
> > Can't this computation of break_even be done in the cpuidle framework
> > while computing the other statistics used for selecting the idle state?
> > cpuidle already uses ktime_get for the next hrtimer, for example. So
> > cpuidle would compute break_even and make it available to the scheduler,
> > like exit_latency. And I can imagine that a system-wide time value will
> > also be needed when looking at the next wakeup event of a cluster/group
> > of CPUs
>
> Ok, so you suggest to revisit and consolidate the whole time capture in
> cpuidle? I think that makes sense.

Yes


2020-03-17 07:58:13

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

Hi Daniel,

First, I think letting the scheduler know about desired minimum idle
times is an interesting optimization if the overhead can be kept at a
minimum. I do have a few comments about the patch though.

On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> On 12/03/2020 09:36, Vincent Guittot wrote:
> > Hi Daniel,
> >
> > On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> >>
> >> In the idle CPU selection process occurring in the slow path via the
> >> find_idlest_group_cpu() function, we pick in priority an idle CPU with
> >> the shallowest idle state; otherwise we fall back to the least loaded
> >> CPU.
> >
> > The idea makes sense but this path is only used by fork and exec so
> > I'm not sure about the real impact
>
> I agree the fork / exec path is called much less often than the wake
> path but it makes more sense for the decision.

Looking at the flow in find_idlest_cpu(), AFAICT,
find_idlest_group_cpu() is not actually making the final choice of CPU,
so going through a lot of trouble there looking at idle states is
pointless. Is there something I don't see?

We follow sd->child until groups == CPUs, which means that
find_idlest_group() actually makes the final choice, as the final group
passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
like that for years, even before you added the initial idle-state
awareness.

I agree with Vincent, if this should really make a difference it should
include wake-ups of existing tasks too. Although I'm aware it would be a
more invasive change. As said from the beginning, the idea is fine, but
the current implementation should not make any measurable difference?

>
> >> In order to be more energy efficient without impacting performance,
> >> let's use another criterion: the break even deadline.
> >>
> >> At idle time, when we store the idle state the CPU is entering, we
> >> compute the next deadline at which the CPU can be woken up without the
> >> sleep having cost more energy than it saved.

I don't follow the argument that sleeping longer should improve energy
consumption. The patch doesn't affect the number of idle state
enter/exit cycles, so you spend the same amount of energy on those
transitions. The main change is that idle time gets spread out, so CPUs
are less likely to be in the process of entering an idle state when they
are asked to wake back up again.

Isn't it fair to say that we expect the total number of wake-ups remains
unchanged? Total busy and idle times across all CPUs should remain the
same too? Unless chosen idle-state is changed, which I don't think we
expect either, there should be no net effect on energy? The main benefit
is reduced wake-up latency I think.

Regarding chosen idle state, I'm wondering how this patch affects the
cpuidle governor's idle state selection. Could the spreading of wake-ups
trick governor to pick a shallower idle-state for some idle CPUs because
we actively spread wake-ups rather than consolidating them? Just a
thought.

> >>
> >> At the selection process, we use the shallowest CPU but in addition we
> >> choose the one with the minimal break even deadline instead of relying
> >> on the idle_timestamp. When the CPU is idle, the timestamp has less
> >> meaning because the CPU could have woken up and gone back to sleep
> >> several times without exiting the idle loop. In this case the break
> >> even deadline is more relevant as it increases the probability of
> >> choosing a CPU which has reached its break even.

I guess you could improve the idle time stamping without adding the
break-even time, they don't have to go together?

> >>
> >> Tested on:
> >> - a synquacer 24 cores, 6 sched domains
> >> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
> >>
> >> sched/perf and messaging do not show a performance regression. Ran
> >> 50 times schbench, adrestia and forkbench.
> >>
> >> The tools described at https://lwn.net/Articles/724935/
> >>
> >> --------------------------------------------------------------
> >> | Synquacer | With break even | Without break even |
> >> --------------------------------------------------------------
> >> | schbench *99.0th | 14844.8 | 15017.6 |
> >> | adrestia / periodic | 57.95 | 57 |
> >> | adrestia / single | 49.3 | 55.4 |
> >> --------------------------------------------------------------
> >
> > Have you got some figures or cpuidle statistics for the synquacer?
>
> No, and we just noticed the synquacer has a bug in the firmware and
> does not actually enter the idle states.

I would also like some statistics to help understanding what actually
changes.

I did some measurements on TX2, which only has one idle-state. I don't
see the same trends as you do. adrestia single seems to be most affected
by the patch, but _increases_ with the break_even patch rather than
decreasing. I don't trust adrestia too much though, as the time resolution
is low on TX2.

TX2 tip break_even
----------------------------------------------------
adrestia / single 5.21 5.51
adrestia / periodic 5.75 5.67
schbench 99.0th 45465.6 45376.0
hackbench 27.9851 27.9775

Notes:
adrestia: Avg of 100 runs: adrestia -l 25000
schbench: Avg of 10 runs: schbench -m16 -t64
hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000

Morten

2020-03-17 10:57:23

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even


Hi Daniel,

One more comment on the break even itself, ignoring the rest:

On Wed, Mar 11 2020, Daniel Lezcano wrote:
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..3342e7bae072 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
> */
> void sched_idle_set_state(struct cpuidle_state *idle_state)
> {
> - idle_set_state(this_rq(), idle_state);
> + struct rq *rq = this_rq();
> +
> + idle_set_state(rq, idle_state);
> +
> + if (idle_state)
> + idle_set_break_even(rq, ktime_get_ns() +
> + idle_state->exit_latency_ns);

I'm not sure I follow why we go for entry time + exit latency. If this
is based on the minimum residency, shouldn't this be something depending
on the entry latency? i.e. something like

break_even = now + entry_latency + idling_time
\_________________________/
min-residency

or am I missing something?

> }
>
> static int __read_mostly cpu_idle_force_poll;

2020-03-17 13:50:32

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even


Hi Morten,

On 17/03/2020 08:56, Morten Rasmussen wrote:
> Hi Daniel,
>
> First, I think letting the scheduler know about desired minimum idle
> times is an interesting optimization if the overhead can be kept at a
> minimum. I do have a few comments about the patch though.
>
> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
>> On 12/03/2020 09:36, Vincent Guittot wrote:
>>> Hi Daniel,
>>>
>>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
>>>>
>>>> In the idle CPU selection process occurring in the slow path via the
>>>> find_idlest_group_cpu() function, we pick in priority an idle CPU with
>>>> the shallowest idle state; otherwise we fall back to the least loaded
>>>> CPU.
>>>
>>> The idea makes sense but this path is only used by fork and exec so
>>> I'm not sure about the real impact
>>
>> I agree the fork / exec path is called much less often than the wake
>> path but it makes more sense for the decision.
>
> Looking at the flow in find_idlest_cpu(), AFAICT,
> find_idlest_group_cpu() is not actually making the final choice of CPU,
> so going through a lot of trouble there looking at idle states is
> pointless. Is there something I don't see?
>
> We fellow sd->child until groups == CPUs which which means that
> find_idlest_group() actually makes the final choice as the final group
> passed to find_idlest_group_cpu() is single-CPU group. The flow has been
> like that for years. Even before you added the initial idle-state
> awareness.
>
> I agree with Vincent, if this should really make a difference it should
> include wake-ups existing tasks too. Although I'm aware it would be a
> more invasive change. As said from the beginning, the idea is fine, but
> the current implementation should not make any measurable difference?

I see the wake-up path as so sensitive that I'm not comfortable making any
changes in it. That is the reason why the patch only changes the slow path.

>>>> In order to be more energy efficient without impacting performance,
>>>> let's use another criterion: the break even deadline.
>>>>
>>>> At idle time, when we store the idle state the CPU is entering, we
>>>> compute the next deadline at which the CPU can be woken up without the
>>>> sleep having cost more energy than it saved.
>
> I don't follow the argument that sleeping longer should improve energy
> consumption.

Maybe it is not explained correctly.

The patch is about selecting the CPU with the smallest break even deadline
value. In a group of idle CPUs in the same idle state, we will pick the
one with the smallest break even deadline, which is the one with the
highest probability of having already reached its target residency.

It is best effort.
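
As a rough userspace sketch of that selection rule (mock data and names, not the kernel code): shallowest exit latency wins, with the smallest break even deadline as tie-breaker.

```c
#include <stdint.h>

/* Mock per-CPU idle snapshot -- illustration only. */
struct idle_cpu {
	int cpu;
	uint64_t exit_latency;	/* depth of the CPU's idle state */
	int64_t break_even;	/* deadline computed at idle entry */
};

/*
 * Prefer the shallowest idle state; among equally shallow CPUs,
 * prefer the smallest break even deadline, i.e. the CPU most
 * likely to have already amortized its idle state entry cost.
 */
static int pick_idlest_cpu(const struct idle_cpu *cpus, int n)
{
	uint64_t min_exit = UINT64_MAX;
	int64_t min_be = INT64_MAX;
	int best = -1;
	int i;

	for (i = 0; i < n; i++) {
		if (cpus[i].exit_latency < min_exit ||
		    (cpus[i].exit_latency == min_exit &&
		     cpus[i].break_even < min_be)) {
			min_exit = cpus[i].exit_latency;
			min_be = cpus[i].break_even;
			best = cpus[i].cpu;
		}
	}
	return best;
}
```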

> The patch doesn't affect the number of idle state
> > enter/exit cycles, so you spend the same amount of energy on those
> > transitions. The main change is that idle time gets spread out, so CPUs
> are less likely to be in the process of entering an idle state when they
> are asked to wake back up again.
>
> Isn't it fair to say that we expect the total number of wake-ups remains
> unchanged? Total busy and idle times across all CPUs should remain the
> same too? Unless chosen idle-state is changed, which I don't think we
> expect either, there should be no net effect on energy? The main benefit
> is reduced wake-up latency I think.
>
> Regarding chosen idle state, I'm wondering how this patch affects the
> cpuidle governor's idle state selection. Could the spreading of wake-ups
> trick governor to pick a shallower idle-state for some idle CPUs because
> we actively spread wake-ups rather than consolidating them? Just a
> thought.

Maybe I missed the point: why are we spreading the tasks?

We are taking the decision on the same sched domain, no?

>>>> At the selection process, we use the shallowest CPU but in addition we
>>>> choose the one with the minimal break even deadline instead of relying
>>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>>>> meaning because the CPU could have woken up and gone back to sleep
>>>> several times without exiting the idle loop. In this case the break
>>>> even deadline is more relevant as it increases the probability of
>>>> choosing a CPU which has reached its break even.
>
> I guess you could improve the idle time stamping without adding the
> break-even time, they don't have to go together?

Yes, we can add the idle start time when entering idle in the
cpuidle_enter() function, which is different from idle_timestamp, which
reflects the idle task scheduling. I sent an RFC for that [1].

However, each time we would like to inspect the deadline, we would have
to compute it, so IMO it makes more sense to pre-compute it when
entering idle, in addition to the idle start time.

[1] https://lkml.org/lkml/2020/3/16/902

>>>> Tested on:
>>>> - a synquacer 24 cores, 6 sched domains
>>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>>>>
>>>> sched/perf and messaging do not show a performance regression. Ran
>>>> 50 times schbench, adrestia and forkbench.
>>>>
>>>> The tools described at https://lwn.net/Articles/724935/
>>>>
>>>> --------------------------------------------------------------
>>>> | Synquacer | With break even | Without break even |
>>>> --------------------------------------------------------------
>>>> | schbench *99.0th | 14844.8 | 15017.6 |
>>>> | adrestia / periodic | 57.95 | 57 |
>>>> | adrestia / single | 49.3 | 55.4 |
>>>> --------------------------------------------------------------
>>>
>>> Have you got some figures or cpuidle statistics for the synquacer?
>>
>> No, and we just noticed the synquacer has a bug in the firmware and
>> does not actually enter the idle states.
>
> I would also like some statistics to help understanding what actually
> changes.
>
> I did some measurements on TX2, which only has one idle-state. I don't
> see the same trends as you do. adrestia single seems to be most affected
> by the patch, but _increases_ with the break_even patch rather than
> decrease. I don't trust adrestia too much though as the time resolution
> is low on TX2.
>
> TX2 tip break_even
> ----------------------------------------------------
> adrestia / single 5.21 5.51
> adrestia / periodic 5.75 5.67
> schbench 99.0th 45465.6 45376.0
> hackbench 27.9851 27.9775
>
> Notes:
> adrestia: Avg of 100 runs: adrestia -l 25000
> schbench: Avg of 10 runs: schbench -m16 -t64
> hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000

Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
case, IIRC, it has some kind of switcher for the CPUs in the firmware; I
don't know how that might interact with the testing.




2020-03-17 14:00:13

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 17/03/2020 11:56, Valentin Schneider wrote:
>
> Hi Daniel,
>
> One more comment on the break even itself, ignoring the rest:
>
> On Wed, Mar 11 2020, Daniel Lezcano wrote:
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..3342e7bae072 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -19,7 +19,13 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
>> */
>> void sched_idle_set_state(struct cpuidle_state *idle_state)
>> {
>> - idle_set_state(this_rq(), idle_state);
>> + struct rq *rq = this_rq();
>> +
>> + idle_set_state(rq, idle_state);
>> +
>> + if (idle_state)
>> + idle_set_break_even(rq, ktime_get_ns() +
>> + idle_state->exit_latency_ns);
>
> I'm not sure I follow why we go for entry time + exit latency. If this
> is based on the minimum residency, shouldn't this be something depending
> on the entry latency? i.e. something like
>
> break_even = now + entry_latency + idling_time
> \_________________________/
> min-residency
>
> or am I missing something?

Oh no, it is a stupid mistake on my part. Thanks for pointing this out!

It should be:

break_even = now + idle_state->target_residency_ns;

- Documentation/devicetree/bindings/arm/idle-states.yaml

* min-residency: Minimum period, including preparation and entry, for a
given idle state to be worthwhile energywise.

(target_residency_ns == min-residency).

>
>> }
>>
>> static int __read_mostly cpu_idle_force_poll;



2020-03-17 14:31:45

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
>
> Hi Morten,
>
> On 17/03/2020 08:56, Morten Rasmussen wrote:
> > Hi Daniel,
> >
> > First, I think letting the scheduler know about desired minimum idle
> > times is an interesting optimization if the overhead can be kept at a
> > minimum. I do have a few comments about the patch though.
> >
> > On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> >> On 12/03/2020 09:36, Vincent Guittot wrote:
> >>> Hi Daniel,
> >>>
> >>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> >>>>
> >>>> In the idle CPU selection process occurring in the slow path via the
> >>>> find_idlest_group_cpu() function, we pick in priority an idle CPU with
> >>>> the shallowest idle state; otherwise we fall back to the least loaded
> >>>> CPU.
> >>>
> >>> The idea makes sense but this path is only used by fork and exec so
> >>> I'm not sure about the real impact
> >>
> >> I agree the fork / exec path is called much less often than the wake
> >> path but it makes more sense for the decision.
> >
> > Looking at the flow in find_idlest_cpu(), AFAICT,
> > find_idlest_group_cpu() is not actually making the final choice of CPU,
> > so going through a lot of trouble there looking at idle states is
> > pointless. Is there something I don't see?
> >
> > We follow sd->child until groups == CPUs, which means that
> > find_idlest_group() actually makes the final choice, as the final group
> > passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
> > like that for years, even before you added the initial idle-state
> > awareness.
> >
> > I agree with Vincent, if this should really make a difference it should
> > include wake-ups of existing tasks too. Although I'm aware it would be a
> > more invasive change. As said from the beginning, the idea is fine, but
> > the current implementation should not make any measurable difference?
>
> I see the wake-up path as so sensitive that I'm not comfortable making any
> changes in it. That is the reason why the patch only changes the slow path.

Right. I'm not against being cautious at all. It would be interesting to
evaluate how bad it really is. The extra time-stamping cost is the same
either way, so it really comes down to how much we dare to use the
information in the fast path and change the CPU selection policy. And of
course, how much can be gained by the change.

>
> >>>> In order to be more energy efficient without impacting performance,
> >>>> let's use another criterion: the break even deadline.
> >>>>
> >>>> At idle time, when we store the idle state the CPU is entering, we
> >>>> compute the next deadline at which the CPU can be woken up without the
> >>>> sleep having cost more energy than it saved.
> >
> > I don't follow the argument that sleeping longer should improve energy
> > consumption.
>
> Maybe it is not explained correctly.
>
> The patch is about selecting the CPU with the smallest break even deadline
> value. In a group of idle CPUs in the same idle state, we will pick the
> one with the smallest break even deadline, which is the one with the
> highest probability of having already reached its target residency.
>
> It is best effort.

Indeed. I get what the patch does, I just don't see how the patch
improves energy efficiency.

>
> > The patch doesn't affect the number of idle state
> > enter/exit cycles, so you spend the same amount of energy on those
> > transitions. The main change is that idle time gets spread out, so CPUs
> > are less likely to be in the process of entering an idle state when they
> > are asked to wake back up again.
> >
> > Isn't it fair to say that we expect the total number of wake-ups remains
> > unchanged? Total busy and idle times across all CPUs should remain the
> > same too? Unless chosen idle-state is changed, which I don't think we
> > expect either, there should be no net effect on energy? The main benefit
> > is reduced wake-up latency I think.
> >
> > Regarding chosen idle state, I'm wondering how this patch affects the
> > cpuidle governor's idle state selection. Could the spreading of wake-ups
> > trick governor to pick a shallower idle-state for some idle CPUs because
> > we actively spread wake-ups rather than consolidating them? Just a
> > thought.
>
> Maybe I missed the point: why are we spreading the tasks?

Picking the CPU with the smallest break-even time-stamp means you pick
the CPU that has been idle longest in the shallowest idle-state. If you
periodically spawn one-shot tasks at a period long enough that
the shallowest state is the same for several CPUs, you would end up
picking the least recently used CPU each time, effectively spreading the
wake-ups across all the CPUs in the same state.

Thinking more about it, it might not be a real problem as if one of the
CPUs suddenly chooses a shallower idle-state, it would become the target
of all new tasks from that point onwards.

> We are taking the decision on the same sched domain, no?

I'm not sure I get the relation to the sched_domain?

>
> >>>> At the selection process, we use the shallowest CPU but in addition we
> >>>> choose the one with the minimal break even deadline instead of relying
> >>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
> >>>> meaning because the CPU could have woken up and slept again several times
> >>>> without exiting the idle loop. In this case the break even deadline is
> >>>> more relevant as it increases the probability of choosing a CPU which
> >>>> reached its break even.
> >
> > I guess you could improve the idle time stamping without adding the
> > break-even time, they don't have to go together?
>
> Yes, we can add the idle start time when entering idle in the
> cpuidle_enter function, which is different from idle_timestamp, which
> records when the idle task was scheduled. I sent an RFC for that [1].
>
> However, each time we would like to inspect the deadline, we will have
> to compute it, so IMO it makes more sense to pre-compute it when
> entering idle in addition to the idle start.
>
> [1] https://lkml.org/lkml/2020/3/16/902

Yes, I saw that patch too. Seems to make sense :-)

>
> >>>> Tested on:
> >>>> - a synquacer 24 cores, 6 sched domains
> >>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
> >>>>
> >>>> sched/perf and messaging does not show a performance regression. Ran
> >>>> 50 times schbench, adrestia and forkbench.
> >>>>
> >>>> The tools described at https://lwn.net/Articles/724935/
> >>>>
> >>>> --------------------------------------------------------------
> >>>> | Synquacer | With break even | Without break even |
> >>>> --------------------------------------------------------------
> >>>> | schbench *99.0th | 14844.8 | 15017.6 |
> >>>> | adrestia / periodic | 57.95 | 57 |
> >>>> | adrestia / single | 49.3 | 55.4 |
> >>>> --------------------------------------------------------------
> >>>
> >>> Have you got some figures or cpuidle statistics for the SynQuacer?
> >>
> >> No, and we just noticed the SynQuacer has a bug in the firmware and
> >> does not actually go to the idle states.
> >
> > I would also like some statistics to help understanding what actually
> > changes.
> >
> > I did some measurements on TX2, which only has one idle-state. I don't
> > see the same trends as you do. adrestia single seems to be most affected
> > by the patch, but _increases_ with the break_even patch rather than
> > decrease. I don't trust adrestia too much though as the time resolution
> > is low on TX2.
> >
> > TX2 tip break_even
> > ----------------------------------------------------
> > adrestia / single 5.21 5.51
> > adrestia / periodic 5.75 5.67
> > schbench 99.0th 45465.6 45376.0
> > hackbench 27.9851 27.9775
> >
> > Notes:
> > adrestia: Avg of 100 runs: adrestia -l 25000
> > schbench: Avg of 10 runs: schbench -m16 -t64
> > hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000
>
> Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
> case, IIRC, it has some kind of switcher for the CPUs in the firmware, I
> don't know how that can interact with the testing.

Sorry, I should have been clearer. It is a ThunderX2. 2x 32-core (128
threads) = 256 HW threads.

Morten

2020-03-17 17:10:39

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 17/03/2020 15:30, Morten Rasmussen wrote:
> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
>>
>> Hi Morten,
>>
>> On 17/03/2020 08:56, Morten Rasmussen wrote:
>>> Hi Daniel,
>>>
>>> First, I think letting the scheduler know about desired minimum idle
>>> times is an interesting optimization if the overhead can be kept at a
>>> minimum. I do have a few comments about the patch though.
>>>
>>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
>>>> On 12/03/2020 09:36, Vincent Guittot wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
>>>>>>
>>>>>> In the idle CPU selection process occurring in the slow path via the
>>>>>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>>>>>> with the shallowest idle state otherwise we fall back to the least
>>>>>> loaded CPU.
>>>>>
>>>>> The idea makes sense but this path is only used by fork and exec so
>>>>> I'm not sure about the real impact
>>>>
>>>> I agree the fork / exec path is called much less often than the wake
>>>> path, but it is where this decision makes the most sense.
>>>
>>> Looking at the flow in find_idlest_cpu(), AFAICT,
>>> find_idlest_group_cpu() is not actually making the final choice of CPU,
>>> so going through a lot of trouble there looking at idle states is
>>> pointless. Is there something I don't see?
>>>
>>> We follow sd->child until groups == CPUs, which means that
>>> find_idlest_group() actually makes the final choice as the final group
>>> passed to find_idlest_group_cpu() is single-CPU group. The flow has been
>>> like that for years. Even before you added the initial idle-state
>>> awareness.
>>>
>>> I agree with Vincent, if this should really make a difference it should
>>> include wake-ups of existing tasks too. Although I'm aware it would be a
>>> more invasive change. As said from the beginning, the idea is fine, but
>>> the current implementation should not make any measurable difference?
>>
>> I find the wake-up path so sensitive that I'm not comfortable making any
>> changes in it. That is the reason why the patch only changes the slow path.
>
> Right. I'm not against being cautious at all. It would be interesting to
> evaluate how bad it really is. The extra time-stamping business cost is
> the same, so it really comes down to how much we dare to use the information in
> the fast-path and change the CPU selection policy. And of course, how
> much can be gained by the change.

If the change provided by this patch is acceptable, I can give it a try
with the fast path.

>>>>>> In order to be more energy efficient but without impacting the
>>>>>> performance, let's use another criterion: the break-even deadline.
>>>>>>
>>>>>> At idle time, when we store the idle state the CPU is entering in, we
>>>>>> compute the next deadline where the CPU could be woken up without
>>>>>> spending more energy to sleep.
>>>
>>> I don't follow the argument that sleeping longer should improve energy
>>> consumption.
>>
>> Maybe it is not explained correctly.
>>
>> The patch is about selecting a CPU with the smallest break even deadline
>> value. In a group of idle CPUs in the same idle state, we will pick the
>> one with the smallest break-even deadline, which is the one with the
>> highest probability of having already reached its target residency.
>>
>> It is best effort.
>
> Indeed. I get what the patch does, I just don't see how the patch
> improves energy efficiency.

If the CPU is woken up before it reaches its break-even point, the idle
state costs more energy than it saved.

Am I misunderstanding your point?

>>> The patch doesn't affect the number of idle state
>>> enter/exit cycles, so you spend the same amount of energy on those
>>> transitions. The main change is that idle time gets spread out, so CPUs
>>> are less likely to be in the process of entering an idle state when they
>>> are asked to wake back up again.
>>>
>>> Isn't it fair to say that we expect the total number of wake-ups remains
>>> unchanged? Total busy and idle times across all CPUs should remain the
>>> same too? Unless chosen idle-state is changed, which I don't think we
>>> expect either, there should be no net effect on energy? The main benefit
>>> is reduced wake-up latency I think.
>>>
>>> Regarding chosen idle state, I'm wondering how this patch affects the
>>> cpuidle governor's idle state selection. Could the spreading of wake-ups
>>> trick governor to pick a shallower idle-state for some idle CPUs because
>>> we actively spread wake-ups rather than consolidating them? Just a
>>> thought.
>>
>> Maybe I missed the point: why are we spreading the tasks?
>
> Picking the CPU with the smallest break-even time-stamp means you pick
> the CPU that has been idle longest in the shallowest idle-state. If you
> periodically spawn one-shot tasks at a period long enough that
> the shallowest state is the same for several CPUs, you would end up
> picking the least recently used CPU each time, effectively spreading the
> wake-ups across all the CPUs in the same state.

Ok, thanks for the clarification.

> Thinking more about it, it might not be a real problem as if one of the
> CPUs suddenly chooses a shallower idle-state, it would become the target
> of all new tasks from that point onwards.

Right.

>> We are taking the decision on the same sched domain, no?
>
> I'm not sure I get the relation to the sched_domain?

Never mind. I misunderstood your comment.

>>>>>> At the selection process, we use the shallowest CPU but in addition we
>>>>>> choose the one with the minimal break even deadline instead of relying
>>>>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>>>>>> meaning because the CPU could have woken up and slept again several times
>>>>>> without exiting the idle loop. In this case the break even deadline is
>>>>>> more relevant as it increases the probability of choosing a CPU which
>>>>>> reached its break even.
>>>
>>> I guess you could improve the idle time stamping without adding the
>>> break-even time, they don't have to go together?
>>
>> Yes, we can add the idle start time when entering idle in the
>> cpuidle_enter function, which is different from idle_timestamp, which
>> records when the idle task was scheduled. I sent an RFC for that [1].
>>
>> However, each time we would like to inspect the deadline, we will have
>> to compute it, so IMO it makes more sense to pre-compute it when
>> entering idle in addition to the idle start.
>>
>> [1] https://lkml.org/lkml/2020/3/16/902
>
> Yes, I saw that patch too. Seems to make sense :-)
>
>>
>>>>>> Tested on:
>>>>>> - a synquacer 24 cores, 6 sched domains
>>>>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>>>>>>
>>>>>> sched/perf and messaging does not show a performance regression. Ran
>>>>>> 50 times schbench, adrestia and forkbench.
>>>>>>
>>>>>> The tools described at https://lwn.net/Articles/724935/
>>>>>>
>>>>>> --------------------------------------------------------------
>>>>>> | Synquacer | With break even | Without break even |
>>>>>> --------------------------------------------------------------
>>>>>> | schbench *99.0th | 14844.8 | 15017.6 |
>>>>>> | adrestia / periodic | 57.95 | 57 |
>>>>>> | adrestia / single | 49.3 | 55.4 |
>>>>>> --------------------------------------------------------------
>>>>>
>>>>> Have you got some figures or cpuidle statistics for the SynQuacer?
>>>>
>>>> No, and we just noticed the SynQuacer has a bug in the firmware and
>>>> does not actually go to the idle states.
>>>
>>> I would also like some statistics to help understanding what actually
>>> changes.
>>>
>>> I did some measurements on TX2, which only has one idle-state. I don't
>>> see the same trends as you do. adrestia single seems to be most affected
>>> by the patch, but _increases_ with the break_even patch rather than
>>> decrease. I don't trust adrestia too much though as the time resolution
>>> is low on TX2.
>>>
>>> TX2 tip break_even
>>> ----------------------------------------------------
>>> adrestia / single 5.21 5.51
>>> adrestia / periodic 5.75 5.67
>>> schbench 99.0th 45465.6 45376.0
>>> hackbench 27.9851 27.9775
>>>
>>> Notes:
>>> adrestia: Avg of 100 runs: adrestia -l 25000
>>> schbench: Avg of 10 runs: schbench -m16 -t64
>>> hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000
>>
>> Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
>> case, IIRC, it has some kind of switcher for the CPUs in the firmware, I
>> don't know how that can interact with the testing.
>
> Sorry, I should have been clearer. It is a ThunderX2. 2x 32-core (128
> threads) = 256 HW threads.

Indeed, that is different :)

I have a SynQuacer but we found a bug in the firmware; hopefully it
can be fixed.



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2020-03-18 07:51:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Tue, 17 Mar 2020 at 15:30, Morten Rasmussen <[email protected]> wrote:
>
> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
> >
> > Hi Morten,
> >
> > On 17/03/2020 08:56, Morten Rasmussen wrote:
> > > Hi Daniel,
> > >
> > > First, I think letting the scheduler know about desired minimum idle
> > > times is an interesting optimization if the overhead can be kept at a
> > > minimum. I do have a few comments about the patch though.
> > >
> > > On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> > >> On 12/03/2020 09:36, Vincent Guittot wrote:
> > >>> Hi Daniel,
> > >>>
> > >>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> > >>>>
> > >>>> In the idle CPU selection process occurring in the slow path via the
> > >>>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> > >>>> with the shallowest idle state otherwise we fall back to the least
> > >>>> loaded CPU.
> > >>>
> > >>> The idea makes sense but this path is only used by fork and exec so
> > >>> I'm not sure about the real impact
> > >>
> > >> I agree the fork / exec path is called much less often than the wake
> > >> path, but it is where this decision makes the most sense.
> > >
> > > Looking at the flow in find_idlest_cpu(), AFAICT,
> > > find_idlest_group_cpu() is not actually making the final choice of CPU,
> > > so going through a lot of trouble there looking at idle states is
> > > pointless. Is there something I don't see?
> > >
> > > We follow sd->child until groups == CPUs, which means that
> > > find_idlest_group() actually makes the final choice as the final group
> > > passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
> > > like that for years. Even before you added the initial idle-state
> > > awareness.
> > >
> > > I agree with Vincent, if this should really make a difference it should
> > > include wake-ups of existing tasks too. Although I'm aware it would be a
> > > more invasive change. As said from the beginning, the idea is fine, but
> > > the current implementation should not make any measurable difference?
> >
> > I find the wake-up path so sensitive that I'm not comfortable making any
> > changes in it. That is the reason why the patch only changes the slow path.
>
> Right. I'm not against being cautious at all. It would be interesting to
> evaluate how bad it really is. The extra time-stamping business cost is
> the same, so it really comes down to how much we dare to use the information in
> the fast-path and change the CPU selection policy. And of course, how
> much can be gained by the change.

Hmm... the fast path does not even parse all idle CPUs right now, so it
will be difficult to make any comparison. Regarding the wakeup path, the
only place that could make sense IMO is the EAS slow wakeup path,
find_energy_efficient_cpu(), which balances performance vs energy
consumption.

>
> >
> > >>>> In order to be more energy efficient but without impacting the
> > >>>> performance, let's use another criterion: the break-even deadline.
> > >>>>
> > >>>> At idle time, when we store the idle state the CPU is entering in, we
> > >>>> compute the next deadline where the CPU could be woken up without
> > >>>> spending more energy to sleep.
> > >
> > > I don't follow the argument that sleeping longer should improve energy
> > > consumption.
> >
> > Maybe it is not explained correctly.
> >
> > The patch is about selecting a CPU with the smallest break even deadline
> > value. In a group of idle CPUs in the same idle state, we will pick the
> > one with the smallest break-even deadline, which is the one with the
> > highest probability of having already reached its target residency.
> >
> > It is best effort.
>
> Indeed. I get what the patch does, I just don't see how the patch
> improves energy efficiency.
>
> >
> > > The patch doesn't affect the number of idle state
> > > enter/exit cycles, so you spend the same amount of energy on those
> > > transitions. The main change is that idle time gets spread out, so CPUs
> > > are less likely to be in the process of entering an idle state when they
> > > are asked to wake back up again.
> > >
> > > Isn't it fair to say that we expect the total number of wake-ups remains
> > > unchanged? Total busy and idle times across all CPUs should remain the
> > > same too? Unless chosen idle-state is changed, which I don't think we
> > > expect either, there should be no net effect on energy? The main benefit
> > > is reduced wake-up latency I think.
> > >
> > > Regarding chosen idle state, I'm wondering how this patch affects the
> > > cpuidle governor's idle state selection. Could the spreading of wake-ups
> > > trick governor to pick a shallower idle-state for some idle CPUs because
> > > we actively spread wake-ups rather than consolidating them? Just a
> > > thought.
> >
> > Maybe I missed the point: why are we spreading the tasks?
>
> Picking the CPU with the smallest break-even time-stamp means you pick
> the CPU that has been idle longest in the shallowest idle-state. If you
> periodically spawn one-shot tasks at a period long enough that
> the shallowest state is the same for several CPUs, you would end up
> picking the least recently used CPU each time, effectively spreading the
> wake-ups across all the CPUs in the same state.
>
> Thinking more about it, it might not be a real problem as if one of the
> CPUs suddenly chooses a shallower idle-state, it would become the target
> of all new tasks from that point onwards.
>
> > We are taking the decision on the same sched domain, no?
>
> I'm not sure I get the relation to the sched_domain?
>
> >
> > >>>> At the selection process, we use the shallowest CPU but in addition we
> > >>>> choose the one with the minimal break even deadline instead of relying
> > >>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
> > >>>> meaning because the CPU could have woken up and slept again several times
> > >>>> without exiting the idle loop. In this case the break even deadline is
> > >>>> more relevant as it increases the probability of choosing a CPU which
> > >>>> reached its break even.
> > >
> > > I guess you could improve the idle time stamping without adding the
> > > break-even time, they don't have to go together?
> >
> > Yes, we can add the idle start time when entering idle in the
> > cpuidle_enter function, which is different from idle_timestamp, which
> > records when the idle task was scheduled. I sent an RFC for that [1].
> >
> > However, each time we would like to inspect the deadline, we will have
> > to compute it, so IMO it makes more sense to pre-compute it when
> > entering idle in addition to the idle start.
> >
> > [1] https://lkml.org/lkml/2020/3/16/902
>
> Yes, I saw that patch too. Seems to make sense :-)
>
> >
> > >>>> Tested on:
> > >>>> - a synquacer 24 cores, 6 sched domains
> > >>>> - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
> > >>>>
> > >>>> sched/perf and messaging does not show a performance regression. Ran
> > >>>> 50 times schbench, adrestia and forkbench.
> > >>>>
> > >>>> The tools described at https://lwn.net/Articles/724935/
> > >>>>
> > >>>> --------------------------------------------------------------
> > >>>> | Synquacer | With break even | Without break even |
> > >>>> --------------------------------------------------------------
> > >>>> | schbench *99.0th | 14844.8 | 15017.6 |
> > >>>> | adrestia / periodic | 57.95 | 57 |
> > >>>> | adrestia / single | 49.3 | 55.4 |
> > >>>> --------------------------------------------------------------
> > >>>
> > >>> Have you got some figures or cpuidle statistics for the SynQuacer?
> > >>
> > >> No, and we just noticed the SynQuacer has a bug in the firmware and
> > >> does not actually go to the idle states.
> > >
> > > I would also like some statistics to help understanding what actually
> > > changes.
> > >
> > > I did some measurements on TX2, which only has one idle-state. I don't
> > > see the same trends as you do. adrestia single seems to be most affected
> > > by the patch, but _increases_ with the break_even patch rather than
> > > decrease. I don't trust adrestia too much though as the time resolution
> > > is low on TX2.
> > >
> > > TX2 tip break_even
> > > ----------------------------------------------------
> > > adrestia / single 5.21 5.51
> > > adrestia / periodic 5.75 5.67
> > > schbench 99.0th 45465.6 45376.0
> > > hackbench 27.9851 27.9775
> > >
> > > Notes:
> > > adrestia: Avg of 100 runs: adrestia -l 25000
> > > schbench: Avg of 10 runs: schbench -m16 -t64
> > > hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000
> >
> > Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
> > case, IIRC, it has some kind of switcher for the CPUs in the firmware, I
> > don't know how that can interact with the testing.
>
> Sorry, I should have been clearer. It is a ThunderX2. 2x 32-core (128
> threads) = 256 HW threads.
>
> Morten

2020-03-18 08:26:36

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Tue, Mar 17, 2020 at 06:07:43PM +0100, Daniel Lezcano wrote:
> On 17/03/2020 15:30, Morten Rasmussen wrote:
> > On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
> >> On 17/03/2020 08:56, Morten Rasmussen wrote:
> >>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> >>>>>> In order to be more energy efficient but without impacting the
> >>>>>> performance, let's use another criterion: the break-even deadline.
> >>>>>>
> >>>>>> At idle time, when we store the idle state the CPU is entering in, we
> >>>>>> compute the next deadline where the CPU could be woken up without
> >>>>>> spending more energy to sleep.
> >>>
> >>> I don't follow the argument that sleeping longer should improve energy
> >>> consumption.
> >>
> >> Maybe it is not explained correctly.
> >>
> >> The patch is about selecting a CPU with the smallest break even deadline
> >> value. In a group of idle CPUs in the same idle state, we will pick the
> >> one with the smallest break-even deadline, which is the one with the
> >> highest probability of having already reached its target residency.
> >>
> >> It is best effort.
> >
> > Indeed. I get what the patch does, I just don't see how the patch
> > improves energy efficiency.
>
> If the CPU is woken up before it reaches its break-even point, the idle
> state costs more energy than it saved.
>
> Am I misunderstanding your point?

Considering just the waking CPU, then yes, it reaches its energy
break-even. However, considering all the CPUs in the system, it just
moves the idle entry/exit energy cost to a different CPU; it doesn't go
away.

Whether you have:

             |-BE-|
         ____     ____
CPU0:  _/    \___/    \_____

CPU1:  _____________________

Or:
             |-BE-|
         ____
CPU0:  _/    \______________
                    ____
CPU1:  ____________/    \___

__ (upper line) = CPU busy       = P_{busy}
__ (lower line) = CPU idle       = P_{idle}
/               = CPU idle exit  = P_{exit}
\               = CPU idle entry = P_{entry}

The sum of areas under the curves is the same, i.e. the total energy is
unchanged.

Morten

2020-03-18 08:35:17

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Wed, Mar 18, 2020 at 08:51:04AM +0100, Vincent Guittot wrote:
> On Tue, 17 Mar 2020 at 15:30, Morten Rasmussen <[email protected]> wrote:
> > On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
> > > On 17/03/2020 08:56, Morten Rasmussen wrote:
> > > > On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> > > >> On 12/03/2020 09:36, Vincent Guittot wrote:
> > > >>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <[email protected]> wrote:
> > > >>>>
> > > >>>> In the idle CPU selection process occurring in the slow path via the
> > > >>>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
> > > >>>> with the shallowest idle state otherwise we fall back to the least
> > > >>>> loaded CPU.
> > > >>>
> > > >>> The idea makes sense but this path is only used by fork and exec so
> > > >>> I'm not sure about the real impact
> > > >>
> > > >> I agree the fork / exec path is called much less often than the wake
> > > >> path, but it is where this decision makes the most sense.
> > > >
> > > > Looking at the flow in find_idlest_cpu(), AFAICT,
> > > > find_idlest_group_cpu() is not actually making the final choice of CPU,
> > > > so going through a lot of trouble there looking at idle states is
> > > > pointless. Is there something I don't see?
> > > >
> > > > We follow sd->child until groups == CPUs, which means that
> > > > find_idlest_group() actually makes the final choice as the final group
> > > > passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
> > > > like that for years. Even before you added the initial idle-state
> > > > awareness.
> > > >
> > > > I agree with Vincent, if this should really make a difference it should
> > > > include wake-ups of existing tasks too. Although I'm aware it would be a
> > > > more invasive change. As said from the beginning, the idea is fine, but
> > > > the current implementation should not make any measurable difference?
> > >
> > > I find the wake-up path so sensitive that I'm not comfortable making any
> > > changes in it. That is the reason why the patch only changes the slow path.
> >
> > Right. I'm not against being cautious at all. It would be interesting to
> > evaluate how bad it really is. The extra time-stamping business cost is
> > the same, so it really comes down to how much we dare to use the information in
> > the fast-path and change the CPU selection policy. And of course, how
> > much can be gained by the change.
>
> Hmm... the fast path does not even parse all idle CPUs right now, so it
> will be difficult to make any comparison. Regarding the wakeup path, the
> only place that could make sense IMO is the EAS slow wakeup path,
> find_energy_efficient_cpu(), which balances performance vs energy
> consumption.

Agreed, it is not really compatible with the current fast-path policies.
I would start with just hacking it in to evaluate the potential
improvements to have an idea about what is on the table.

As you say, it could live somewhere slightly out of the way like the EAS
path, but I'm not sure it actually fits well under EAS. EAS is disabled
for symmetric CPU capacities, which are probably the systems that would
benefit the most. Also, as mentioned already in this thread, I don't see
how the idea brings any energy savings?

Morten

2020-03-18 10:18:44

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On 18/03/2020 09:24, Morten Rasmussen wrote:
> On Tue, Mar 17, 2020 at 06:07:43PM +0100, Daniel Lezcano wrote:
>> On 17/03/2020 15:30, Morten Rasmussen wrote:
>>> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
>>>> On 17/03/2020 08:56, Morten Rasmussen wrote:
>>>>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
>>>>>>>> In order to be more energy efficient but without impacting the
>>>>>>>> performance, let's use another criterion: the break-even deadline.
>>>>>>>>
>>>>>>>> At idle time, when we store the idle state the CPU is entering in, we
>>>>>>>> compute the next deadline where the CPU could be woken up without
>>>>>>>> spending more energy to sleep.
>>>>>
>>>>> I don't follow the argument that sleeping longer should improve energy
>>>>> consumption.
>>>>
>>>> Maybe it is not explained correctly.
>>>>
>>>> The patch is about selecting a CPU with the smallest break even deadline
>>>> value. In a group of idle CPUs in the same idle state, we will pick the
>>>> one with the smallest break-even deadline, which is the one with the
>>>> highest probability of having already reached its target residency.
>>>>
>>>> It is best effort.
>>>
>>> Indeed. I get what the patch does, I just don't see how the patch
>>> improves energy efficiency.
>>
>> If the CPU is woken up before it reaches its break-even point, the idle
>> state costs more energy than it saved.
>>
>> Am I misunderstanding your point?
>
> Considering just the waking CPU, then yes, it reaches its energy
> break-even. However, considering all the CPUs in the system, it just
> moves the idle entry/exit energy cost to a different CPU; it doesn't go
> away.
>
> Whether you have:
>
>              |-BE-|
>          ____     ____
> CPU0:  _/    \___/    \_____
>
> CPU1:  _____________________
>
> Or:
>              |-BE-|
>          ____
> CPU0:  _/    \______________
>                     ____
> CPU1:  ____________/    \___
>
> __ (upper line) = CPU busy       = P_{busy}
> __ (lower line) = CPU idle       = P_{idle}
> /               = CPU idle exit  = P_{exit}
> \               = CPU idle entry = P_{entry}
>
> The sum of areas under the curves is the same, i.e. the total energy is
> unchanged.

It is a counter-intuitive comment; now I get it, thanks for the
clarification. It is a good point.

Taking the dynamics into consideration: in case #1 the break-even is
not reached, the idle durations are shorter, and that leads the governor
to choose shallower idle states afterwards; consequently CPU0 will be
used in priority. We end up with CPU0 in a shallow state and CPU1 in a
deep state.

With case #2, we can have both CPUs in the deep state, and the governor
should keep choosing the same idle state.

I don't know what is more energy/perf efficient. IMO this is very
workload dependent. The only way to check is to test. Hopefully I can
find a platform for that.



2020-03-18 14:39:42

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even

On Wed, Mar 18, 2020 at 11:17:49AM +0100, Daniel Lezcano wrote:
> On 18/03/2020 09:24, Morten Rasmussen wrote:
> > On Tue, Mar 17, 2020 at 06:07:43PM +0100, Daniel Lezcano wrote:
> >> On 17/03/2020 15:30, Morten Rasmussen wrote:
> >>> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
> >>>> On 17/03/2020 08:56, Morten Rasmussen wrote:
> >>>>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
> >>>>>>>> In order to be more energy efficient but without impacting the
> >>>>>>>> performance, let's use another criterion: the break-even deadline.
> >>>>>>>>
> >>>>>>>> At idle time, when we store the idle state the CPU is entering in, we
> >>>>>>>> compute the next deadline where the CPU could be woken up without
> >>>>>>>> spending more energy to sleep.
> >>>>>
> >>>>> I don't follow the argument that sleeping longer should improve energy
> >>>>> consumption.
> >>>>
> >>>> May be it is not explained correctly.
> >>>>
> >>>> The patch is about selecting a CPU with the smallest break even deadline
> >>>> value. In a group of idle CPUs in the same idle state, we will pick the
> >>>> one with the smallest break even deadline, which is the one with the
> >>>> highest probability of having already reached its target residency.
> >>>>
> >>>> It is best effort.
> >>>
> >>> Indeed. I get what the patch does, I just don't see how the patch
> >>> improves energy efficiency.
> >>
> >> If the CPU is woken up before it reached the break even, the idle state
> >> cost in energy is greater than the energy it saved.
> >>
> >> Am I misunderstanding your point?
> >
> > Considering just the waking CPU, then yes, it reaches energy break-even.
> > However, considering all the CPUs in the system, it just moves the idle
> > entry/exit energy cost to a different CPU, it doesn't go away.
> >
> > Whether you have:
> >
> > |-BE-|
> > ____ ____
> > CPU0: ___/ \__/ \___
> >
> > CPU1: ____________________
> >
> > Or:
> > |-BE-|
> > ____
> > CPU0: ___/ \___________
> > ____
> > CPU1: ___________/ \___
> >
> > ¯ = CPU busy = P_{busy}
> > _ = CPU idle = P_{idle}
> > / = CPU idle exit = P_{exit}
> > \ = CPU idle entry = P_{entry}
> >
> > The sum of areas under the curves is the same, i.e. the total energy is
> > unchanged.
>
> It is a counter-intuitive comment, now I get it, thanks for the
> clarification. It is a good point.

No problem.

> Taking the dynamics into consideration: in case #1, the break even is
> not reached and the idle durations are shorter, which leads the governor
> to choose shallower idle states afterwards, so CPU0 keeps being picked
> in priority. We end up with CPU0 in a shallow state and CPU1 in a deep state.

Indeed. I was speculating earlier whether the opposite could happen too.
If we extended the second case to form a repeating pattern, could we
somehow prevent CPU1 from reaching a deeper state? Could we have a
pattern that would keep both CPUs in a shallow state where it would have
been more efficient to consolidate the wake-ups on CPU0 and let CPU1
enter deeper states?

>
> With case #2, we can have both CPUs in a deep state and the
> governor should keep choosing the same idle state.

Ideally yes. However it depends on the break-even times of the deeper
states and when the next wake-ups happen.

> I don't know what is more energy/perf efficient. IMO this is very
> workload dependent. The only way to check is to test. Hopefully I can
> find a platform for that.

Moving the wake-up shouldn't impact energy directly, although it may
have a positive latency impact, as you are more likely to avoid waking
up CPUs that haven't finished the idle entry sequence. However, changing the
wake-up pattern could have an indirect energy impact, positive or
negative. It isn't clear to me either what outcome to expect.

Morten