2023-09-29 18:47:40

by Shrikanth Hegde

Subject: [PATCH v5 0/2] sched: EAS changes for EM complexity and sysctl

In Brief:
[PATCH v5 1/2] sched/topology: Remove EM_MAX_COMPLEXITY limit
Since the EAS complexity was greatly reduced, bigger platforms can
handle EAS. To reflect this improvement, remove the EAS complexity check.

[PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware
based on the platform
Depending on the platform, the sysctl will either enable/disable EAS, or be
a no-op if EAS is not supported.

The patchset contains these two patches. The second patch depends on the
first patch being applied first.

v4->v5:
sched_is_eas_possible missed handling the case when the EM complexity was high.
Dietmar pointed out that there was already work done which removes these
checks. Since it makes sched_is_eas_possible cleaner, picked up that
patch along with v4 and made it a patchset.
Instead of using the first CPU in cpu_active_mask, do a simple loop across
all CPUs in cpu_active_mask to check whether any CPU has asymmetric
capacity, since using just the first CPU was breaking EAS capabilities
over CPUSET islands.
v3->v4:
Valentin suggested it would be better to consider the simpler approach that
was mentioned in v2. It is a standard approach to keep the knob visible
but change how read and write are handled. Did that, and refactored the
code to use a common function in build_perf_domains and in the sysctl handler.
v2->v3:
Chen Yu and Pierre Gondois both pointed out that if the platform becomes
capable of EAS later, this patch was not allowing that to happen.
Addressed that by using a variable to indicate the sysctl change
and re-worded the commit message with the desired behaviour.
v1->v2:
Chen Yu had pointed out that, with v1, changing the sysctl would not destroy
the perf domains on architectures where EAS is supported.
[v1] Link: https://lore.kernel.org/lkml/[email protected]/
[v2] Link: https://lore.kernel.org/lkml/[email protected]/
[v3] Link: https://lore.kernel.org/lkml/[email protected]/
[v4] Link: https://lore.kernel.org/lkml/[email protected]/

Pierre Gondois (1):
sched/topology: Remove EM_MAX_COMPLEXITY limit

Shrikanth Hegde (1):
change behaviour of sysctl sched_energy_aware based on the platform

Documentation/admin-guide/sysctl/kernel.rst | 3 +-
Documentation/scheduler/sched-energy.rst | 29 +---
kernel/sched/topology.c | 151 ++++++++++----------
3 files changed, 82 insertions(+), 101 deletions(-)

--
2.39.3


2023-09-29 18:52:47

by Shrikanth Hegde

Subject: [PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform

sysctl sched_energy_aware is available for the admin to disable/enable
energy aware scheduling(EAS). EAS is enabled only if few conditions are
met by the platform. They are, asymmetric CPU capacity, no SMT,
schedutil CPUfreq governor, frequency invariant load tracking etc.
A platform may boot without EAS capability, but could gain such
capability at runtime For example, changing/registering the CPUfreq
governor to schedutil.

At present, though platform doesn't support EAS, this sysctl returns 1
and it ends up calling build_perf_domains on write to 1 and
NOP when writing to 0. That is confusing and un-necessary.

Desired behavior would be to, have this sysctl to enable/disable the EAS
on supported platform. On Non supported platform write to the sysctl
would return not supported error and read of the sysctl would return
empty. So
sched_energy_aware returns empty - EAS is not possible at this moment
This will include EAS capable platforms which have at least one EAS
condition false during startup, e.g. using a Performance CPUfreq governor
sched_energy_aware returns 0 - EAS is supported but disabled by admin.
sched_energy_aware returns 1 - EAS is supported and enabled.

User can find out the reason why EAS is not possible by checking
info messages. sched_is_eas_possible returns true if the platform
can do EAS at this moment.
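
For illustration, a minimal userspace sketch of how the three states could
be told apart (assumes the usual sysctl path
/proc/sys/kernel/sched_energy_aware; not part of this patch):

#include <stdio.h>

int main(void)
{
	char buf[8];
	FILE *f = fopen("/proc/sys/kernel/sched_energy_aware", "r");

	if (!f)
		return 1;
	if (!fgets(buf, sizeof(buf), f))	/* empty read */
		printf("EAS is not possible on this platform right now\n");
	else if (buf[0] == '0')
		printf("EAS is supported but disabled by admin\n");
	else
		printf("EAS is supported and enabled\n");
	fclose(f);
	return 0;
}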

Depends on [PATCH v5 1/2] sched/topology: Remove EM_MAX_COMPLEXITY limit
to be applied first.

Signed-off-by: Shrikanth Hegde <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 3 +-
kernel/sched/topology.c | 112 +++++++++++++-------
2 files changed, 76 insertions(+), 39 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index cf33de56da27..d89ac2bd8dc4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
platforms with asymmetric CPU topologies and having an Energy
Model available). If your platform happens to meet the
requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On Non-EAS platforms, write operation fails and
+read doesn't return anything.

task_delayacct
===============
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e0b9920e7e3e..a654d0186ac0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -212,6 +212,70 @@ static unsigned int sysctl_sched_energy_aware = 1;
static DEFINE_MUTEX(sched_energy_mutex);
static bool sched_energy_update;

+extern struct cpufreq_governor schedutil_gov;
+static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
+{
+ bool any_asym_capacity = false;
+ struct cpufreq_policy *policy;
+ struct cpufreq_governor *gov;
+ int i;
+
+ /* EAS is enabled for asymmetric CPU capacity topologies. */
+ for_each_cpu(i, cpu_mask) {
+ if (per_cpu(sd_asym_cpucapacity, i)) {
+ any_asym_capacity = true;
+ break;
+ }
+ }
+ if (!any_asym_capacity) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
+ cpumask_pr_args(cpu_mask));
+ }
+ return false;
+ }
+
+ /* EAS definitely does *not* handle SMT */
+ if (sched_smt_active()) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
+ cpumask_pr_args(cpu_mask));
+ }
+ return false;
+ }
+
+ if (!arch_scale_freq_invariant()) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
+ cpumask_pr_args(cpu_mask));
+ }
+ return false;
+ }
+
+ /* Do not attempt EAS if schedutil is not being used. */
+ for_each_cpu(i, cpu_mask) {
+ policy = cpufreq_cpu_get(i);
+ if (!policy) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
+ cpumask_pr_args(cpu_mask), i);
+ }
+ return false;
+ }
+ gov = policy->governor;
+ cpufreq_cpu_put(policy);
+ if (gov != &schedutil_gov) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
+ cpumask_pr_args(cpu_mask));
+ }
+ return false;
+ }
+ }
+
+ return true;
+}
+
void rebuild_sched_domains_energy(void)
{
mutex_lock(&sched_energy_mutex);
@@ -231,6 +295,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
return -EPERM;

ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (!sched_is_eas_possible(cpu_active_mask)) {
+ if (write) {
+ return -EOPNOTSUPP;
+ } else {
+ *lenp = 0;
+ return 0;
+ }
+ }
+
if (!ret && write) {
state = static_branch_unlikely(&sched_energy_present);
if (state != sysctl_sched_energy_aware)
@@ -351,61 +424,24 @@ static void sched_energy_set(bool has_eas)
* 4. schedutil is driving the frequency of all CPUs of the rd;
* 5. frequency invariance support is present;
*/
-extern struct cpufreq_governor schedutil_gov;
static bool build_perf_domains(const struct cpumask *cpu_map)
{
int i;
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
- struct cpufreq_policy *policy;
- struct cpufreq_governor *gov;

if (!sysctl_sched_energy_aware)
goto free;

- /* EAS is enabled for asymmetric CPU capacity topologies. */
- if (!per_cpu(sd_asym_cpucapacity, cpu)) {
- if (sched_debug()) {
- pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
- cpumask_pr_args(cpu_map));
- }
- goto free;
- }
-
- /* EAS definitely does *not* handle SMT */
- if (sched_smt_active()) {
- pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n",
- cpumask_pr_args(cpu_map));
- goto free;
- }
-
- if (!arch_scale_freq_invariant()) {
- if (sched_debug()) {
- pr_warn("rd %*pbl: Disabling EAS: frequency-invariant load tracking not yet supported",
- cpumask_pr_args(cpu_map));
- }
+ if (!sched_is_eas_possible(cpu_map))
goto free;
- }

for_each_cpu(i, cpu_map) {
/* Skip already covered CPUs. */
if (find_pd(pd, i))
continue;

- /* Do not attempt EAS if schedutil is not being used. */
- policy = cpufreq_cpu_get(i);
- if (!policy)
- goto free;
- gov = policy->governor;
- cpufreq_cpu_put(policy);
- if (gov != &schedutil_gov) {
- if (rd->pd)
- pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
- cpumask_pr_args(cpu_map));
- goto free;
- }
-
/* Create the new pd and add it to the local list. */
tmp = pd_init(i);
if (!tmp)
--
2.39.3

2023-09-30 20:07:29

by Shrikanth Hegde

Subject: [PATCH v5 1/2] sched/topology: Remove EM_MAX_COMPLEXITY limit

From: Pierre Gondois <[email protected]>

The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform. To avoid having
a slow wake-up path, EAS is only enabled if this complexity is low
enough.

The current complexity limit was set in:
commit b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate
platforms")
based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, cf:
commit 390031e4c309 ("sched/fair: Introduce an energy estimation helper
function").
But the complexity of EAS was reduced in:
commit eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups")
and the find_energy_efficient_cpu() (feec) algorithm was updated in:
commit 3e8c6c9aac42 ("sched/fair: Remove task_util from effective
utilization in feec()").

find_energy_efficient_cpu() (feec) is now doing:
feec()
\_ for_each_pd(pd) [0]
// get max_spare_cap_cpu and compute_prev_delta
\_ for_each_cpu(pd) [1]

\_ eenv_pd_busy_time(pd) [2]
\_ for_each_cpu(pd)

// compute_energy(pd) without the task
\_ eenv_pd_max_util(pd, -1) [3.0]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, -1)
\_ for_each_ps(pd)

// compute_energy(pd) with the task on prev_cpu
\_ eenv_pd_max_util(pd, prev_cpu) [3.1]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, prev_cpu)
\_ for_each_ps(pd)

// compute_energy(pd) with the task on max_spare_cap_cpu
\_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, max_spare_cap_cpu)
\_ for_each_ps(pd)

[3.1] happens only once since prev_cpu is unique. With the same
definitions for nr_pd, nr_cpus and nr_ps, the complexity is:
nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
+ ([nr_cpus in pd] + [nr_ps in pd])

[0] * ( [1] + [2] + [3.0] + [3.2] )
+ [3.1]

= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
+ [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:
commit b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate
platforms")
to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:
16 * (4 + 2 * 7) + 1 + 7 = 296

Since the EAS complexity was greatly reduced, bigger platforms can
handle EAS. For instance, a platform with 112 CPUs with 7 performance
states each would still not reach the old 2048 limit:
112 * (4 + 2 * 7) + 1 + 7 = 2024
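
The arithmetic above can be double-checked with a small standalone
program (illustrative only, not part of this patch; the figures assume
per-CPU DVFS, i.e. one CPU per performance domain):

#include <stdio.h>

/*
 * Wake-up path complexity per the formula above:
 *   nr_pd * (4 * cpus_per_pd + 2 * ps_per_pd) + cpus_per_pd + ps_per_pd
 * The trailing terms account for the single [3.1] pass on prev_cpu's pd.
 */
static unsigned int feec_complexity(unsigned int nr_pd,
				    unsigned int cpus_per_pd,
				    unsigned int ps_per_pd)
{
	return nr_pd * (4 * cpus_per_pd + 2 * ps_per_pd)
		+ cpus_per_pd + ps_per_pd;
}

int main(void)
{
	printf("%u\n", feec_complexity(16, 1, 7));	/* 296 */
	printf("%u\n", feec_complexity(112, 1, 7));	/* 2024 */
	return 0;
}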

To reflect this improvement, remove the EAS complexity check.
Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

Signed-off-by: Pierre Gondois <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
---
Documentation/scheduler/sched-energy.rst | 29 ++----------------
kernel/sched/topology.c | 39 ++----------------------
2 files changed, 6 insertions(+), 62 deletions(-)

diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
index fc853c8cc346..70e2921ef725 100644
--- a/Documentation/scheduler/sched-energy.rst
+++ b/Documentation/scheduler/sched-energy.rst
@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'.
6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The task wake-up path is very latency-sensitive. When the EM of a platform is
-too complex (too many CPUs, too many performance domains, too many performance
-states, ...), the cost of using it in the wake-up path can become prohibitive.
-The energy-aware wake-up algorithm has a complexity of:
-
- C = Nd * (Nc + Ns)
-
-with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
-total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
-
-A complexity check is performed at the root domain level, when scheduling
-domains are built. EAS will not start on a root domain if its C happens to be
-higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
-time of writing).
-
-If you really want to use EAS but the complexity of your platform's Energy
-Model is too high to be used with a single root domain, you're left with only
-two possible options:
-
- 1. split your system into separate, smaller, root domains using exclusive
- cpusets and enable EAS locally on each of them. This option has the
- benefit to work out of the box but the drawback of preventing load
- balance between root domains, which can result in an unbalanced system
- overall;
- 2. submit patches to reduce the complexity of the EAS wake-up algorithm,
- hence enabling it to cope with larger EMs in reasonable time.
+EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
+restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
+the energy estimation.


6.4 - Schedutil governor
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a7b50bba7829..e0b9920e7e3e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -348,32 +348,13 @@ static void sched_energy_set(bool has_eas)
* 1. an Energy Model (EM) is available;
* 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
* 3. no SMT is detected.
- * 4. the EM complexity is low enough to keep scheduling overheads low;
- * 5. schedutil is driving the frequency of all CPUs of the rd;
- * 6. frequency invariance support is present;
- *
- * The complexity of the Energy Model is defined as:
- *
- * C = nr_pd * (nr_cpus + nr_ps)
- *
- * with parameters defined as:
- * - nr_pd: the number of performance domains
- * - nr_cpus: the number of CPUs
- * - nr_ps: the sum of the number of performance states of all performance
- * domains (for example, on a system with 2 performance domains,
- * with 10 performance states each, nr_ps = 2 * 10 = 20).
- *
- * It is generally not a good idea to use such a model in the wake-up path on
- * very complex platforms because of the associated scheduling overheads. The
- * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
- * with per-CPU DVFS and less than 8 performance states each, for example.
+ * 4. schedutil is driving the frequency of all CPUs of the rd;
+ * 5. frequency invariance support is present;
*/
-#define EM_MAX_COMPLEXITY 2048
-
extern struct cpufreq_governor schedutil_gov;
static bool build_perf_domains(const struct cpumask *cpu_map)
{
- int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
+ int i;
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
@@ -431,20 +412,6 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
goto free;
tmp->next = pd;
pd = tmp;
-
- /*
- * Count performance domains and performance states for the
- * complexity check.
- */
- nr_pd++;
- nr_ps += em_pd_nr_perf_states(pd->em_pd);
- }
-
- /* Bail out if the Energy Model complexity is too high. */
- if (nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY) {
- WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n",
- cpumask_pr_args(cpu_map));
- goto free;
}

perf_domain_debug(cpu_map, pd);
--
2.39.3

2023-10-03 09:21:25

by Pierre Gondois

Subject: Re: [PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform

Hello Shrikanth,
Some NITs about the commit message:

On 9/29/23 17:52, Shrikanth Hegde wrote:
> sysctl sched_energy_aware is available for the admin to disable/enable
> energy aware scheduling(EAS). EAS is enabled only if few conditions are
> met by the platform. They are, asymmetric CPU capacity, no SMT,
> schedutil CPUfreq governor, frequency invariant load tracking etc.
> A platform may boot without EAS capability, but could gain such
> capability at runtime For example, changing/registering the CPUfreq

Missing dot I think: 'runtime. For example,'

> governor to schedutil.
>
> At present, though platform doesn't support EAS, this sysctl returns 1
> and it ends up calling build_perf_domains on write to 1 and
> NOP when writing to 0. That is confusing and un-necessary.

I'm not sure I fully understand the sentence:
- it sounds that the user is writing a value to either 1/0
(I think the user is writing 1/0 to the sysctl)
- aren't the sched domain rebuilt even when writing 0 to the sysctl ?
I'm not sure I understand to what the NOP is referring to exactly.

What about:
Platforms without EAS capability currently advertise this sysctl.
Its effect (i.e. rebuilding sched-domains) is unnecessary on
such platforms and its presence can be confusing.

>
> Desired behavior would be to, have this sysctl to enable/disable the EAS

Unnecessary comma I think

> on supported platform. On Non supported platform write to the sysctl

Non supported -> non-supported

> would return not supported error and read of the sysctl would return
> empty. So
> sched_energy_aware returns empty - EAS is not possible at this moment
> This will include EAS capable platforms which have at least one EAS
> condition false during startup, e.g. using a Performance CPUfreq governor

Just a remark, using the performance governor is not exactly a condition
disabling EAS, it is more 'not using the schedutil CPUfreq governor'

> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
> sched_energy_aware returns 1 - EAS is supported and enabled.
>
> User can find out the reason why EAS is not possible by checking
> info messages. sched_is_eas_possible returns true if the platform
> can do EAS at this moment.
>
> Depends on [PATCH v5 1/2] sched/topology: Remove EM_MAX_COMPLEXITY limit
> to be applied first.

I think it's implied as the 2 patches are sent together.

Otherwise:
Tested-by: Pierre Gondois <[email protected]>

>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 3 +-
> kernel/sched/topology.c | 112 +++++++++++++-------
> 2 files changed, 76 insertions(+), 39 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index cf33de56da27..d89ac2bd8dc4 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
> platforms with asymmetric CPU topologies and having an Energy
> Model available). If your platform happens to meet the
> requirements for EAS but you do not want to use it, change
> -this value to 0.
> +this value to 0. On Non-EAS platforms, write operation fails and
> +read doesn't return anything.
>
> task_delayacct
> ===============
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index e0b9920e7e3e..a654d0186ac0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -212,6 +212,70 @@ static unsigned int sysctl_sched_energy_aware = 1;
> static DEFINE_MUTEX(sched_energy_mutex);
> static bool sched_energy_update;
>
> +extern struct cpufreq_governor schedutil_gov;
> +static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> +{
> + bool any_asym_capacity = false;
> + struct cpufreq_policy *policy;
> + struct cpufreq_governor *gov;
> + int i;
> +
> + /* EAS is enabled for asymmetric CPU capacity topologies. */
> + for_each_cpu(i, cpu_mask) {
> + if (per_cpu(sd_asym_cpucapacity, i)) {
> + any_asym_capacity = true;
> + break;
> + }
> + }
> + if (!any_asym_capacity) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
> + cpumask_pr_args(cpu_mask));
> + }
> + return false;
> + }
> +
> + /* EAS definitely does *not* handle SMT */
> + if (sched_smt_active()) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> + cpumask_pr_args(cpu_mask));
> + }
> + return false;
> + }
> +
> + if (!arch_scale_freq_invariant()) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> + cpumask_pr_args(cpu_mask));
> + }
> + return false;
> + }
> +
> + /* Do not attempt EAS if schedutil is not being used. */
> + for_each_cpu(i, cpu_mask) {
> + policy = cpufreq_cpu_get(i);
> + if (!policy) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
> + cpumask_pr_args(cpu_mask), i);
> + }
> + return false;
> + }
> + gov = policy->governor;
> + cpufreq_cpu_put(policy);
> + if (gov != &schedutil_gov) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> + cpumask_pr_args(cpu_mask));
> + }
> + return false;
> + }
> + }
> +
> + return true;
> +}
> +
> void rebuild_sched_domains_energy(void)
> {
> mutex_lock(&sched_energy_mutex);
> @@ -231,6 +295,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
> return -EPERM;
>
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> + if (!sched_is_eas_possible(cpu_active_mask)) {
> + if (write) {
> + return -EOPNOTSUPP;
> + } else {
> + *lenp = 0;
> + return 0;
> + }
> + }
> +
> if (!ret && write) {
> state = static_branch_unlikely(&sched_energy_present);
> if (state != sysctl_sched_energy_aware)
> @@ -351,61 +424,24 @@ static void sched_energy_set(bool has_eas)
> * 4. schedutil is driving the frequency of all CPUs of the rd;
> * 5. frequency invariance support is present;
> */
> -extern struct cpufreq_governor schedutil_gov;
> static bool build_perf_domains(const struct cpumask *cpu_map)
> {
> int i;
> struct perf_domain *pd = NULL, *tmp;
> int cpu = cpumask_first(cpu_map);
> struct root_domain *rd = cpu_rq(cpu)->rd;
> - struct cpufreq_policy *policy;
> - struct cpufreq_governor *gov;
>
> if (!sysctl_sched_energy_aware)
> goto free;
>
> - /* EAS is enabled for asymmetric CPU capacity topologies. */
> - if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> - if (sched_debug()) {
> - pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
> - cpumask_pr_args(cpu_map));
> - }
> - goto free;
> - }
> -
> - /* EAS definitely does *not* handle SMT */
> - if (sched_smt_active()) {
> - pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n",
> - cpumask_pr_args(cpu_map));
> - goto free;
> - }
> -
> - if (!arch_scale_freq_invariant()) {
> - if (sched_debug()) {
> - pr_warn("rd %*pbl: Disabling EAS: frequency-invariant load tracking not yet supported",
> - cpumask_pr_args(cpu_map));
> - }
> + if (!sched_is_eas_possible(cpu_map))
> goto free;
> - }
>
> for_each_cpu(i, cpu_map) {
> /* Skip already covered CPUs. */
> if (find_pd(pd, i))
> continue;
>
> - /* Do not attempt EAS if schedutil is not being used. */
> - policy = cpufreq_cpu_get(i);
> - if (!policy)
> - goto free;
> - gov = policy->governor;
> - cpufreq_cpu_put(policy);
> - if (gov != &schedutil_gov) {
> - if (rd->pd)
> - pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
> - cpumask_pr_args(cpu_map));
> - goto free;
> - }
> -
> /* Create the new pd and add it to the local list. */
> tmp = pd_init(i);
> if (!tmp)
> --
> 2.39.3
>

2023-10-03 12:28:03

by Shrikanth Hegde

Subject: Re: [PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform



On 10/3/23 2:50 PM, Pierre Gondois wrote:
> Hello Shrikanth,
> Some NITs about the commit message:
>

Hi Pierre.


> On 9/29/23 17:52, Shrikanth Hegde wrote:
>> sysctl sched_energy_aware is available for the admin to disable/enable
>> energy aware scheduling(EAS). EAS is enabled only if few conditions are
>> met by the platform. They are, asymmetric CPU capacity, no SMT,
>> schedutil CPUfreq governor, frequency invariant load tracking etc.
>> A platform may boot without EAS capability, but could gain such
>> capability at runtime For example, changing/registering the CPUfreq
>
> Missing dot I think: 'runtime. For example,'

ok.

>
>> governor to schedutil.
>>
>> At present, though platform doesn't support EAS, this sysctl returns 1
>> and it ends up calling build_perf_domains on write to 1 and
>> NOP when writing to 0. That is confusing and un-necessary.
>
This is the current problematic behavior that patch 2/2 tries to address.

> I'm not sure I fully understand the sentence:
> - it sounds that the user is writing a value to either 1/0
>   (I think the user is writing 1/0 to the sysctl)

Yes, any user with root
privileges can read and write this sysctl file.

> - aren't the sched domain rebuilt even when writing 0 to the sysctl ?
>   I'm not sure I understand to what the NOP is referring to exactly.
>

Complete sched domains aren't rebuilt in this case, as it goes to the match1 and match2 statements.

> What about:
> Platforms without EAS capability currently advertise this sysctl.
> Its effect (i.e. rebuilding sched-domains) is unnecessary on
> such platforms and its presence can be confusing.
>
Looks ok. The changelog had described it in detail IMHO.


>>
>> Desired behavior would be to, have this sysctl to enable/disable the EAS
>
> Unnecessary comma I think
>
>> on supported platform. On Non supported platform write to the sysctl
>
> Non supported  -> non-supported

ok for the above two nits.

>
>> would return not supported error and read of the sysctl would return
>> empty. So
>> sched_energy_aware returns empty - EAS is not possible at this moment
>> This will include EAS capable platforms which have at least one EAS
>> condition false during startup, e.g. using a Performance CPUfreq governor
>
> Just a remark, using the performance governor is not exactly a condition
> disabling EAS, it is more 'not using the schedutil CPUfreq governor'
>

ok.

>> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
>> sched_energy_aware returns 1 - EAS is supported and enabled.
>>
>> User can find out the reason why EAS is not possible by checking
>> info messages. sched_is_eas_possible returns true if the platform
>> can do EAS at this moment.
>>
>> Depends on [PATCH v5 1/2] sched/topology: Remove EM_MAX_COMPLEXITY limit
>> to be applied first.
>
> I think it's implied as the 2 patches are sent together.
>

Yes. Did mention it explicitly since b4 mbox can try to apply 2/2 first.
Had run into similar issues recently.

> Otherwise:
> Tested-by: Pierre Gondois <[email protected]>
>
>>

Thank you very much for testing it and providing the tag.

>> Signed-off-by: Shrikanth Hegde <[email protected]>
>> ---
>>   Documentation/admin-guide/sysctl/kernel.rst |   3 +-
>>   kernel/sched/topology.c                     | 112 +++++++++++++-------
>>   2 files changed, 76 insertions(+), 39 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst
>> b/Documentation/admin-guide/sysctl/kernel.rst
>> index cf33de56da27..d89ac2bd8dc4 100644
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -1182,7 +1182,8 @@ automatically on platforms where it can run
>> (that is,
>>   platforms with asymmetric CPU topologies and having an Energy
>>   Model available). If your platform happens to meet the
>>   requirements for EAS but you do not want to use it, change
>> -this value to 0.
>> +this value to 0. On Non-EAS platforms, write operation fails and
>> +read doesn't return anything.
>>
>>   task_delayacct
>>   ===============
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index e0b9920e7e3e..a654d0186ac0 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -212,6 +212,70 @@ static unsigned int sysctl_sched_energy_aware = 1;
>>   static DEFINE_MUTEX(sched_energy_mutex);
>>   static bool sched_energy_update;
>>
>> +extern struct cpufreq_governor schedutil_gov;
>> +static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
>> +{
>> +    bool any_asym_capacity = false;
>> +    struct cpufreq_policy *policy;
>> +    struct cpufreq_governor *gov;
>> +    int i;
>> +
>> +    /* EAS is enabled for asymmetric CPU capacity topologies. */
>> +    for_each_cpu(i, cpu_mask) {
>> +        if (per_cpu(sd_asym_cpucapacity, i)) {
>> +            any_asym_capacity = true;
>> +            break;
>> +        }
>> +    }
>> +    if (!any_asym_capacity) {
>> +        if (sched_debug()) {
>> +            pr_info("rd %*pbl: Checking EAS, CPUs do not have
>> asymmetric capacities\n",
>> +                cpumask_pr_args(cpu_mask));
>> +        }
>> +        return false;
>> +    }
>> +
>> +    /* EAS definitely does *not* handle SMT */
>> +    if (sched_smt_active()) {
>> +        if (sched_debug()) {
>> +            pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
>> +                cpumask_pr_args(cpu_mask));
>> +        }
>> +        return false;
>> +    }
>> +
>> +    if (!arch_scale_freq_invariant()) {
>> +        if (sched_debug()) {
>> +            pr_info("rd %*pbl: Checking EAS: frequency-invariant load
>> tracking not yet supported",
>> +                cpumask_pr_args(cpu_mask));
>> +        }
>> +        return false;
>> +    }
>> +
>> +    /* Do not attempt EAS if schedutil is not being used. */
>> +    for_each_cpu(i, cpu_mask) {
>> +        policy = cpufreq_cpu_get(i);
>> +        if (!policy) {
>> +            if (sched_debug()) {
>> +                pr_info("rd %*pbl: Checking EAS, cpufreq policy not
>> set for CPU: %d",
>> +                    cpumask_pr_args(cpu_mask), i);
>> +            }
>> +            return false;
>> +        }
>> +        gov = policy->governor;
>> +        cpufreq_cpu_put(policy);
>> +        if (gov != &schedutil_gov) {
>> +            if (sched_debug()) {
>> +                pr_info("rd %*pbl: Checking EAS, schedutil is
>> mandatory\n",
>> +                    cpumask_pr_args(cpu_mask));
>> +            }
>> +            return false;
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>   void rebuild_sched_domains_energy(void)
>>   {
>>       mutex_lock(&sched_energy_mutex);
>> @@ -231,6 +295,15 @@ static int sched_energy_aware_handler(struct
>> ctl_table *table, int write,
>>           return -EPERM;
>>
>>       ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +    if (!sched_is_eas_possible(cpu_active_mask)) {
>> +        if (write) {
>> +            return -EOPNOTSUPP;
>> +        } else {
>> +            *lenp = 0;
>> +            return 0;
>> +        }
>> +    }
>> +
>>       if (!ret && write) {
>>           state = static_branch_unlikely(&sched_energy_present);
>>           if (state != sysctl_sched_energy_aware)
>> @@ -351,61 +424,24 @@ static void sched_energy_set(bool has_eas)
>>    *    4. schedutil is driving the frequency of all CPUs of the rd;
>>    *    5. frequency invariance support is present;
>>    */
>> -extern struct cpufreq_governor schedutil_gov;
>>   static bool build_perf_domains(const struct cpumask *cpu_map)
>>   {
>>       int i;
>>       struct perf_domain *pd = NULL, *tmp;
>>       int cpu = cpumask_first(cpu_map);
>>       struct root_domain *rd = cpu_rq(cpu)->rd;
>> -    struct cpufreq_policy *policy;
>> -    struct cpufreq_governor *gov;
>>
>>       if (!sysctl_sched_energy_aware)
>>           goto free;
>>
>> -    /* EAS is enabled for asymmetric CPU capacity topologies. */
>> -    if (!per_cpu(sd_asym_cpucapacity, cpu)) {
>> -        if (sched_debug()) {
>> -            pr_info("rd %*pbl: CPUs do not have asymmetric
>> capacities\n",
>> -                    cpumask_pr_args(cpu_map));
>> -        }
>> -        goto free;
>> -    }
>> -
>> -    /* EAS definitely does *not* handle SMT */
>> -    if (sched_smt_active()) {
>> -        pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n",
>> -            cpumask_pr_args(cpu_map));
>> -        goto free;
>> -    }
>> -
>> -    if (!arch_scale_freq_invariant()) {
>> -        if (sched_debug()) {
>> -            pr_warn("rd %*pbl: Disabling EAS: frequency-invariant
>> load tracking not yet supported",
>> -                cpumask_pr_args(cpu_map));
>> -        }
>> +    if (!sched_is_eas_possible(cpu_map))
>>           goto free;
>> -    }
>>
>>       for_each_cpu(i, cpu_map) {
>>           /* Skip already covered CPUs. */
>>           if (find_pd(pd, i))
>>               continue;
>>
>> -        /* Do not attempt EAS if schedutil is not being used. */
>> -        policy = cpufreq_cpu_get(i);
>> -        if (!policy)
>> -            goto free;
>> -        gov = policy->governor;
>> -        cpufreq_cpu_put(policy);
>> -        if (gov != &schedutil_gov) {
>> -            if (rd->pd)
>> -                pr_warn("rd %*pbl: Disabling EAS, schedutil is
>> mandatory\n",
>> -                        cpumask_pr_args(cpu_map));
>> -            goto free;
>> -        }
>> -
>>           /* Create the new pd and add it to the local list. */
>>           tmp = pd_init(i);
>>           if (!tmp)
>> --
>> 2.39.3
>>

Will send out v6 with these changes to the changelog and the Tested-by tag.
Will wait for a while to see if there are any concerns or comments.

2023-10-04 11:29:14

by Valentin Schneider

Subject: Re: [PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform

On 29/09/23 21:22, Shrikanth Hegde wrote:
> +static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> +{
> + bool any_asym_capacity = false;
> + struct cpufreq_policy *policy;
> + struct cpufreq_governor *gov;
> + int i;
> +
> + /* EAS is enabled for asymmetric CPU capacity topologies. */
> + for_each_cpu(i, cpu_mask) {
> + if (per_cpu(sd_asym_cpucapacity, i)) {

Lockdep should complain here in the sysctl path - this is an RCU-protected
pointer.

rcu_access_pointer() should do since you're not dereferencing the pointer.

> + any_asym_capacity = true;
> + break;
> + }
> + }
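
That is, roughly (just a sketch, keeping the rest of sched_is_eas_possible()
as it is in this patch):

	/*
	 * sd_asym_cpucapacity is an __rcu pointer; rcu_access_pointer() is
	 * enough here since the value is only tested, never dereferenced.
	 */
	for_each_cpu(i, cpu_mask) {
		if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, i))) {
			any_asym_capacity = true;
			break;
		}
	}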

> @@ -231,6 +295,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
> return -EPERM;
>
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

Shouldn't this happen after we check sched_is_eas_possible()? Otherwise
AFAICT a write can actually happen despite !sched_is_eas_possible().

> + if (!sched_is_eas_possible(cpu_active_mask)) {
> + if (write) {
> + return -EOPNOTSUPP;
> + } else {
> + *lenp = 0;
> + return 0;
> + }
> + }

But now this is making me wonder, why not bite the bullet and store
somewhere whether we ever managed to enable EAS? Something like so?
(I didn't bother making this yet another static key given this is not a hot
path at all)
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e0b9920e7e3e4..abd950f434206 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -209,6 +209,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
DEFINE_STATIC_KEY_FALSE(sched_energy_present);
static unsigned int sysctl_sched_energy_aware = 1;
+static bool __read_mostly sched_energy_once;
static DEFINE_MUTEX(sched_energy_mutex);
static bool sched_energy_update;

@@ -230,6 +231,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;

+ if (!sched_energy_once) {
+ if (write) {
+ return -EOPNOTSUPP;
+ } else {
+ *lenp = 0;
+ return 0;
+ }
+ }
+
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (!ret && write) {
state = static_branch_unlikely(&sched_energy_present);
@@ -340,6 +350,8 @@ static void sched_energy_set(bool has_eas)
if (sched_debug())
pr_info("%s: starting EAS\n", __func__);
static_branch_enable_cpuslocked(&sched_energy_present);
+ // Record that we managed to enable EAS at least once
+ sched_energy_once = true;
}
}


2023-10-04 15:01:13

by Shrikanth Hegde

Subject: Re: [PATCH v5 2/2] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform



On 10/4/23 4:57 PM, Valentin Schneider wrote:
> On 29/09/23 21:22, Shrikanth Hegde wrote:

Hi Valentin, Thanks for taking a look at this patchset.

>> +static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
>> +{
>> + bool any_asym_capacity = false;
>> + struct cpufreq_policy *policy;
>> + struct cpufreq_governor *gov;
>> + int i;
>> +
>> + /* EAS is enabled for asymmetric CPU capacity topologies. */
>> + for_each_cpu(i, cpu_mask) {
>> + if (per_cpu(sd_asym_cpucapacity, i)) {
>
> Lockdep should complain here in the sysctl path - this is an RCU-protected
> pointer.
>
> rcu_access_pointer() should do since you're not dereferencing the pointer.

Yes. I did miss that, since I mostly copied the snippets from build_perf_domains.

>
>> + any_asym_capacity = true;
>> + break;
>> + }
>> + }
>
>> @@ -231,6 +295,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>> return -EPERM;
>>
>> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>
> Shouldn't this happen after we check sched_is_eas_possible()? Otherwise
> AFAICT a write can actually happen despite !sched_is_eas_possible().

Yes. That's right. Will change that.
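
Something along these lines, for illustration (sketch only, not the final
v6 handler):

static int sched_energy_aware_handler(struct ctl_table *table, int write,
				      void *buffer, size_t *lenp, loff_t *ppos)
{
	int ret, state;

	if (write && !capable(CAP_SYS_ADMIN))
		return -EPERM;

	/* Check EAS feasibility before letting the write go through. */
	if (!sched_is_eas_possible(cpu_active_mask)) {
		if (write)
			return -EOPNOTSUPP;
		*lenp = 0;
		return 0;
	}

	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	if (!ret && write) {
		state = static_branch_unlikely(&sched_energy_present);
		if (state != sysctl_sched_energy_aware)
			rebuild_sched_domains_energy();
	}

	return ret;
}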

>
>> + if (!sched_is_eas_possible(cpu_active_mask)) {
>> + if (write) {
>> + return -EOPNOTSUPP;
>> + } else {
>> + *lenp = 0;
>> + return 0;
>> + }
>> + }
>
> But now this is making me wonder, why not bite the bullet and store
> somewhere whether we ever managed to enable EAS? Something like so?
> (I didn't bother making this yet another static key given this is not a hot
> path at all)

IIUC, the problem with this is that a platform which can do EAS now may not be able to do EAS
sometime later.
For example, if the frequency governor is changed from performance to
schedutil, EAS can be enabled and sched_energy_once will be set; but later the governor can be
set back to performance. In that case saying the platform is EAS capable is wrong.

> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index e0b9920e7e3e4..abd950f434206 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -209,6 +209,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> static unsigned int sysctl_sched_energy_aware = 1;
> +static bool __read_mostly sched_energy_once;
> static DEFINE_MUTEX(sched_energy_mutex);
> static bool sched_energy_update;
>
> @@ -230,6 +231,15 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
> if (write && !capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> + if (!sched_energy_once) {
> + if (write) {
> + return -EOPNOTSUPP;
> + } else {
> + *lenp = 0;
> + return 0;
> + }
> + }
> +
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> if (!ret && write) {
> state = static_branch_unlikely(&sched_energy_present);
> @@ -340,6 +350,8 @@ static void sched_energy_set(bool has_eas)
> if (sched_debug())
> pr_info("%s: starting EAS\n", __func__);
> static_branch_enable_cpuslocked(&sched_energy_present);
> + // Record that we managed to enable EAS at least once
> + sched_energy_once = true;
> }
> }
>
>