2018-08-29 13:22:20

by Vincent Guittot

Subject: [PATCH 0/4] sched/numa: remove unused code

With:
commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")
some code and fields are no longer used.

Patches 1-2 remove the unused code.

Patch 3 goes a bit further and removes the smt_gain field from sched_domain.

Patch 4 removes the sd parameter from arch_scale_cpu_capacity() once its last
user has been removed by patch 3.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected] (open list)

Vincent Guittot (4):
sched/numa: remove unused code from update_numa_stats()
sched/numa: remove unused nr_running field
sched/numa: remove smt_gain
sched/topology: remove unused sd param from arch_scale_cpu_capacity()

arch/arm/kernel/topology.c | 2 +-
drivers/base/arch_topology.c | 6 +++---
include/linux/arch_topology.h | 2 +-
include/linux/sched/topology.h | 1 -
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 32 +++++---------------------------
kernel/sched/pelt.c | 2 +-
kernel/sched/sched.h | 7 ++-----
kernel/sched/topology.c | 2 --
10 files changed, 15 insertions(+), 43 deletions(-)

--
2.7.4



2018-08-29 13:20:50

by Vincent Guittot

Subject: [PATCH 1/4] sched/numa: remove unused code from update_numa_stats()

With:
commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

the local variables smt, cpus and capacity, and the results they compute,
are no longer used in update_numa_stats() now that numa_has_capacity() is
gone.

Remove this unused code.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected] (open list)
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..c2b8bf4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1463,8 +1463,7 @@ struct numa_stats {
*/
static void update_numa_stats(struct numa_stats *ns, int nid)
{
- int smt, cpu, cpus = 0;
- unsigned long capacity;
+ int cpu;

memset(ns, 0, sizeof(*ns));
for_each_cpu(cpu, cpumask_of_node(nid)) {
@@ -1473,26 +1472,8 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
ns->nr_running += rq->nr_running;
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
-
- cpus++;
}

- /*
- * If we raced with hotplug and there are no CPUs left in our mask
- * the @ns structure is NULL'ed and task_numa_compare() will
- * not find this node attractive.
- *
- * We'll detect a huge imbalance and bail there.
- */
- if (!cpus)
- return;
-
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
- capacity = cpus / smt; /* cores */
-
- capacity = min_t(unsigned, capacity,
- DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
}

struct task_numa_env {
--
2.7.4
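
For readers wondering what the removed tail of update_numa_stats() was
estimating, here is a minimal userspace sketch (plain C, not kernel code) of
the dead computation, with illustrative numbers assumed and the final min_t()
clamp omitted:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	/* Example node: 8 SMT2 threads, each with the old per-thread
	 * capacity of 1178 / 2 = 589, so compute_capacity = 8 * 589 = 4712.
	 */
	int cpus = 8;
	unsigned long compute_capacity = 8 * 589;
	int smt;
	unsigned long capacity;

	/* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, compute_capacity);
	capacity = cpus / smt;	/* cores: 8 / 2 = 4 */

	printf("smt=%d cores=%lu\n", smt, capacity);
	return 0;
}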


2018-08-29 13:20:54

by Vincent Guittot

Subject: [RFC PATCH 3/4] sched/topology: remove smt_gain

smt_gain is used to compute the capacity of the CPUs of an SMT core with the
constraint 1 < smt_gain < 2 in order to be able to compute the number of CPUs
per core. The field has_free_capacity of struct numa_stats, which was the
last user of this computation of the number of CPUs per core, has been removed
by:

commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

We can now remove this constraint on core capacity and use the default value
SCHED_CAPACITY_SCALE for SMT CPUs. With this removal, SCHED_CAPACITY_SCALE
becomes the maximum compute capacity of CPUs on every system. This should
help to simplify some code and remove fields like rd->max_cpu_capacity.

Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other
places in the code when it needs the capacity of a CPU to scale
some metrics, like in pelt, deadline or schedutil. In the case of SMT, the value
returned is then not the capacity of the SMT CPUs but the default
SCHED_CAPACITY_SCALE.

Remove the smt_gain field from the sched_domain struct.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected] (open list)
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched/topology.h | 1 -
kernel/sched/sched.h | 3 ---
kernel/sched/topology.c | 2 --
3 files changed, 6 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2634774..212792b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -89,7 +89,6 @@ struct sched_domain {
unsigned int newidle_idx;
unsigned int wake_idx;
unsigned int forkexec_idx;
- unsigned int smt_gain;

int nohz_idle; /* NOHZ IDLE status */
int flags; /* See SD_* */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..b1715b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1758,9 +1758,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
return SCHED_CAPACITY_SCALE;
}
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56a0fed..069c924 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1129,7 +1129,6 @@ sd_init(struct sched_domain_topology_level *tl,

.last_balance = jiffies,
.balance_interval = sd_weight,
- .smt_gain = 0,
.max_newidle_lb_cost = 0,
.next_decay_max_lb_cost = jiffies,
.child = child,
@@ -1155,7 +1154,6 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_SHARE_CPUCAPACITY) {
sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
- sd->smt_gain = 1178; /* ~15% */

} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->flags |= SD_PREFER_SIBLING;
--
2.7.4
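
To see what this patch changes in practice, a minimal userspace sketch
(plain C, not kernel code) of the per-thread capacity before and after,
assuming the old default smt_gain = 1178:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

/* Old behaviour: per-thread capacity derived from smt_gain. */
static unsigned long old_scale_cpu_capacity(unsigned int smt_gain,
					    unsigned int span_weight)
{
	if (span_weight > 1)
		return smt_gain / span_weight;
	return SCHED_CAPACITY_SCALE;
}

/* New behaviour: every CPU reports SCHED_CAPACITY_SCALE. */
static unsigned long new_scale_cpu_capacity(void)
{
	return SCHED_CAPACITY_SCALE;
}

int main(void)
{
	/* SMT2: 1178 / 2 = 589, SMT8: 1178 / 8 = 147 per thread. */
	printf("SMT2 old=%lu new=%lu\n",
	       old_scale_cpu_capacity(1178, 2), new_scale_cpu_capacity());
	printf("SMT8 old=%lu new=%lu\n",
	       old_scale_cpu_capacity(1178, 8), new_scale_cpu_capacity());
	return 0;
}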


2018-08-29 13:21:07

by Vincent Guittot

Subject: [PATCH 2/4] sched/numa: remove unused nr_running field

nr_running in struct numa_stats is not used anywhere in the code.

Remove it.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected] (open list)
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2b8bf4..cff1682 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1454,8 +1454,6 @@ struct numa_stats {

/* Total compute capacity of CPUs on a node */
unsigned long compute_capacity;
-
- unsigned int nr_running;
};

/*
@@ -1469,7 +1467,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
for_each_cpu(cpu, cpumask_of_node(nid)) {
struct rq *rq = cpu_rq(cpu);

- ns->nr_running += rq->nr_running;
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
}
--
2.7.4


2018-08-29 13:21:15

by Vincent Guittot

Subject: [RFC PATCH 4/4] sched/topology: remove unused sd param from arch_scale_cpu_capacity()

The struct sched_domain *sd parameter is no longer used in
arch_scale_cpu_capacity(), so we can remove it.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Russell King <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Vincent Guittot <[email protected]>

---
arch/arm/kernel/topology.c | 2 +-
drivers/base/arch_topology.c | 6 +++---
include/linux/arch_topology.h | 2 +-
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 8 ++++----
kernel/sched/pelt.c | 2 +-
kernel/sched/sched.h | 4 ++--
8 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 24ac3ca..d3d75c5 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -175,7 +175,7 @@ static void update_cpu_capacity(unsigned int cpu)
topology_set_cpu_scale(cpu, cpu_capacity(cpu) / middle_capacity);

pr_info("CPU%u: update cpu_capacity %lu\n",
- cpu, topology_get_cpu_scale(NULL, cpu));
+ cpu, topology_get_cpu_scale(cpu));
}

#else
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6..6dc9339 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -44,7 +44,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
{
struct cpu *cpu = container_of(dev, struct cpu, dev);

- return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
+ return sprintf(buf, "%lu\n", topology_get_cpu_scale(cpu->dev.id));
}

static ssize_t cpu_capacity_store(struct device *dev,
@@ -124,7 +124,7 @@ void topology_normalize_cpu_scale(void)
/ capacity_scale;
topology_set_cpu_scale(cpu, capacity);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
- cpu, topology_get_cpu_scale(NULL, cpu));
+ cpu, topology_get_cpu_scale(cpu));
}
mutex_unlock(&cpu_scale_mutex);
}
@@ -194,7 +194,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);

for_each_cpu(cpu, policy->related_cpus) {
- raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
+ raw_capacity[cpu] = topology_get_cpu_scale(cpu) *
policy->cpuinfo.max_freq / 1000UL;
capacity_scale = max(raw_capacity[cpu], capacity_scale);
}
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 2b70941..5df6773 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -17,7 +17,7 @@ DECLARE_PER_CPU(unsigned long, cpu_scale);

struct sched_domain;
static inline
-unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu)
+unsigned long topology_get_cpu_scale(int cpu)
{
return per_cpu(cpu_scale, cpu);
}
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3..01b95057 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -202,7 +202,7 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
struct rq *rq = cpu_rq(sg_cpu->cpu);
unsigned long util, irq, max;

- sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->max = max = arch_scale_cpu_capacity(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);

if (rt_rq_is_runnable(&rq->rt))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 997ea7b..5f763b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1196,7 +1196,7 @@ static void update_curr_dl(struct rq *rq)
&curr->dl);
} else {
unsigned long scale_freq = arch_scale_freq_capacity(cpu);
- unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);

scaled_delta_exec = cap_scale(delta_exec, scale_freq);
scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cff1682..2eeac7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -748,7 +748,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
struct sched_avg *sa = &se->avg;
- long cpu_scale = arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+ long cpu_scale = arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq)));
long cap = (long)(cpu_scale - cfs_rq->avg.util_avg) / 2;

if (cap > 0) {
@@ -3175,7 +3175,7 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
* is not we rescale running_sum 1st
*/
running_sum = se->avg.util_sum /
- arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+ arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq)));
runnable_sum = max(runnable_sum, running_sum);

load_sum = (s64)se_weight(se) * runnable_sum;
@@ -7462,7 +7462,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long max = arch_scale_cpu_capacity(cpu);
unsigned long used, free;
unsigned long irq;

@@ -7487,7 +7487,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;

- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
+ cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);

if (!capacity)
capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 35475c0..5efa152 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -114,7 +114,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
u64 periods;

scale_freq = arch_scale_freq_capacity(cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ scale_cpu = arch_scale_cpu_capacity(cpu);

delta += sa->period_contrib;
periods = delta / 1024; /* A period is 1024us (~1ms) */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1715b8..8b306ce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1756,7 +1756,7 @@ unsigned long arch_scale_freq_capacity(int cpu)
#ifdef CONFIG_SMP
#ifndef arch_scale_cpu_capacity
static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
@@ -1764,7 +1764,7 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
#else
#ifndef arch_scale_cpu_capacity
static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
--
2.7.4
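
For architectures that override the hook, the override must follow the new
single-parameter prototype as well; on arm this boils down to something like
the following sketch (the exact header location is an assumption here):

/* In the arch's asm/topology.h (sketch): the override now simply
 * aliases the one-parameter per-CPU scale accessor from arch_topology.
 */
#define arch_scale_cpu_capacity topology_get_cpu_scale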


2018-08-29 14:11:03

by Qais Yousef

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On 29/08/18 14:19, Vincent Guittot wrote:
> smt_gain is used to compute the capacity of the CPUs of an SMT core with the
> constraint 1 < smt_gain < 2 in order to be able to compute the number of CPUs
> per core. The field has_free_capacity of struct numa_stats, which was the
> last user of this computation of the number of CPUs per core, has been removed
> by:
>
> commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")
>
> We can now remove this constraint on core capacity and use the default value
> SCHED_CAPACITY_SCALE for SMT CPUs. With this removal, SCHED_CAPACITY_SCALE
> becomes the maximum compute capacity of CPUs on every system. This should
> help to simplify some code and remove fields like rd->max_cpu_capacity.
>
> Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other
> places in the code when it needs the capacity of a CPU to scale
> some metrics, like in pelt, deadline or schedutil. In the case of SMT, the value
> returned is then not the capacity of the SMT CPUs but the default
> SCHED_CAPACITY_SCALE.
>
> Remove the smt_gain field from the sched_domain struct.

You beat me to it. I got confused by smt_gain recently when I stumbled
on it, as I found out that on ARM it's not used, and had to spend some time
convincing myself it's not really necessary to use it.

It was hard to track the history of this and *why* it's needed.

The only 'theoretical' case I found where smt_gain can be useful is when you
have an asymmetric system, for example:

Node_A: 1 Core 2 Threads
Node_B: 1 Core 4 Threads

Then with smt_gain the group_capacity at the core level will be limited
to smt_gain. But without smt_gain Node_B will look twice as powerful as
Node_A - which will affect balancing AFAICT, causing Node_B's single core
to be oversubscribed, as the 4 threads will still have to share the same
underlying hardware resources. I don't think such systems exist in practice,
or even make sense, though.
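
As a quick standalone sketch (plain C, illustrative numbers assumed) of the
group capacities in this example, with and without the old default
smt_gain = 1178:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024
#define SMT_GAIN 1178	/* old default, ~15% above SCHED_CAPACITY_SCALE */

int main(void)
{
	unsigned int threads_a = 2, threads_b = 4;	/* Node_A, Node_B */

	/* With smt_gain, each core's threads sum to roughly SMT_GAIN. */
	printf("with smt_gain:    A=%u B=%u\n",
	       threads_a * (SMT_GAIN / threads_a),
	       threads_b * (SMT_GAIN / threads_b));

	/* Without it, Node_B looks twice as powerful as Node_A. */
	printf("without smt_gain: A=%u B=%u\n",
	       threads_a * SCHED_CAPACITY_SCALE,
	       threads_b * SCHED_CAPACITY_SCALE);
	return 0;
}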

So +1 from my side for the removal.

> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: [email protected] (open list)
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> include/linux/sched/topology.h | 1 -
> kernel/sched/sched.h | 3 ---
> kernel/sched/topology.c | 2 --
> 3 files changed, 6 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 2634774..212792b 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -89,7 +89,6 @@ struct sched_domain {
> unsigned int newidle_idx;
> unsigned int wake_idx;
> unsigned int forkexec_idx;
> - unsigned int smt_gain;
>
> int nohz_idle; /* NOHZ IDLE status */
> int flags; /* See SD_* */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4a2e8ca..b1715b8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1758,9 +1758,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
> static __always_inline
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> - if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> - return sd->smt_gain / sd->span_weight;
> -
> return SCHED_CAPACITY_SCALE;
> }
> #endif
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 56a0fed..069c924 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1129,7 +1129,6 @@ sd_init(struct sched_domain_topology_level *tl,
>
> .last_balance = jiffies,
> .balance_interval = sd_weight,
> - .smt_gain = 0,
> .max_newidle_lb_cost = 0,
> .next_decay_max_lb_cost = jiffies,
> .child = child,
> @@ -1155,7 +1154,6 @@ sd_init(struct sched_domain_topology_level *tl,
> if (sd->flags & SD_SHARE_CPUCAPACITY) {
> sd->flags |= SD_PREFER_SIBLING;
> sd->imbalance_pct = 110;
> - sd->smt_gain = 1178; /* ~15% */
>
> } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> sd->flags |= SD_PREFER_SIBLING;


--
Qais Yousef


2018-08-29 14:43:35

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On Wed, 29 Aug 2018 at 16:08, Qais Yousef <[email protected]> wrote:
>

> You beat me to it. I got confused by smt_gain recently when I stumbled
> on it, as I found out that on ARM it's not used, and had to spend some time
> convincing myself it's not really necessary to use it.
>
> It was hard to track the history of this and *why* it's needed.
>
> The only 'theoretical' case I found where smt_gain can be useful is when you
> have an asymmetric system, for example:
>
> Node_A: 1 Core 2 Threads
> Node_B: 1 Core 4 Threads
>
> Then with smt_gain the group_capacity at the core level will be limited
> to smt_gain. But without smt_gain Node_B will look twice as powerful as
> Node_A - which will affect balancing AFAICT, causing Node_B's single core
> to be oversubscribed, as the 4 threads will still have to share the same
> underlying hardware resources. I don't think such systems exist in practice,
> or even make sense, though.

Yes, I came to the same conclusion.

>
> So +1 from my side for the removal.

Thanks

>
>
> --
> Qais Yousef
>

2018-09-04 06:40:50

by Srikar Dronamraju

Subject: Re: [PATCH 1/4] sched/numa: remove unused code from update_numa_stats()

> With:
> commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")
>
> the local variables smt, cpus and capacity, and the results they compute,
> are no longer used in update_numa_stats() now that numa_has_capacity() is
> gone.
>
> Remove this unused code.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: [email protected] (open list)
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 21 +--------------------
> 1 file changed, 1 insertion(+), 20 deletions(-)

Looks good to me.

Reviewed-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju


2018-09-04 06:41:58

by Srikar Dronamraju

Subject: Re: [PATCH 2/4] sched/numa: remove unused nr_running field

* Vincent Guittot <[email protected]> [2018-08-29 15:19:10]:

> nr_running in struct numa_stats is not used anywhere in the code.
>
> Remove it.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Srikar Dronamraju <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: [email protected] (open list)
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 3 ---
> 1 file changed, 3 deletions(-)

Looks good to me.

Reviewed-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju


2018-09-04 08:26:10

by Srikar Dronamraju

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: [email protected] (open list)
> Signed-off-by: Vincent Guittot <[email protected]>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4a2e8ca..b1715b8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1758,9 +1758,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
> static __always_inline
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> - if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> - return sd->smt_gain / sd->span_weight;
> -
> return SCHED_CAPACITY_SCALE;

Without this change, the capacity_orig of an SMT thread would have been
based on the number of threads.
For example on SMT 2, capacity_orig would have been 589 and
for SMT 8, capacity_orig would have been 147.

However after this change, the capacity_orig of each SMT thread would be
1024. For example an SMT 8 core's capacity_orig would now be 8192.

smt_gain was supposed to make a multi-threaded core look slightly more
powerful than a single-threaded core. I suspect that sometimes hurts
us when doing load balance between 2 cores, i.e. at the MC or DIE sched
domain. Even with 2 threads running on a core, the core might look
lightly loaded, 2048/8192. Hence it might dissuade movement to an idle core.

I always wonder why arch_scale_cpu_capacity() is called with a NULL
sched_domain in scale_rt_capacity(). This way capacity might actually
be more than capacity_orig. I am always under the impression that
capacity_orig >= capacity. Or am I misunderstanding that?

--
Thanks and Regards
Srikar Dronamraju


2018-09-04 09:37:52

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

Hi Srikar,

On Tuesday 04 Sep 2018 at 01:24:24 (-0700), Srikar Dronamraju wrote:
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: [email protected] (open list)
> > Signed-off-by: Vincent Guittot <[email protected]>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 4a2e8ca..b1715b8 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1758,9 +1758,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
> > static __always_inline
> > unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> > {
> > - if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> > - return sd->smt_gain / sd->span_weight;
> > -
> > return SCHED_CAPACITY_SCALE;
>
> Without this change, the capacity_orig of an SMT thread would have been
> based on the number of threads.
> For example on SMT 2, capacity_orig would have been 589 and
> for SMT 8, capacity_orig would have been 147.
>
> However after this change, the capacity_orig of each SMT thread would be
> 1024. For example an SMT 8 core's capacity_orig would now be 8192.
>
> smt_gain was supposed to make a multi-threaded core look slightly more
> powerful than a single-threaded core. I suspect that sometimes hurts

Are there systems with both single-threaded and multi-threaded cores?
That was the main open point for me (and for Qais too)


> us when doing load balance between 2 cores, i.e. at the MC or DIE sched
> domain. Even with 2 threads running on a core, the core might look
> lightly loaded, 2048/8192. Hence it might dissuade movement to an idle core.

Then, there is the sibling flag at the SMT level that normally ensures 1 task per
core for such a use case.

>
> I always wonder why arch_scale_cpu_capacity() is called with a NULL
> sched_domain in scale_rt_capacity(). This way capacity might actually

Probably because until this v4.19-rcxx version, the rt scaling was done
relative to the local cpu capacity:
capacity = arch_scale_cpu() * scale_rt_capacity / SCHED_CAPACITY_SCALE

Whereas now, it directly returns the remaining capacity.
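
To make the difference concrete, a minimal userspace sketch (plain C, not
kernel code, illustrative numbers assumed) of why the NULL sd argument only
became a problem with the new convention:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

int main(void)
{
	unsigned long capacity_orig = 589;	/* SMT2 thread: 1178 / 2 */
	unsigned long rt_pressure = 256;	/* capacity eaten by rt/irq */

	/* Old convention: scale_rt_capacity() returned a scale factor that
	 * was applied to capacity_orig, so starting from the NULL-sd value
	 * of SCHED_CAPACITY_SCALE was harmless.
	 */
	unsigned long factor = SCHED_CAPACITY_SCALE - rt_pressure;
	unsigned long old = capacity_orig * factor / SCHED_CAPACITY_SCALE;

	/* New convention: the remaining capacity is returned directly, so
	 * subtracting from the NULL-sd default can exceed capacity_orig.
	 */
	unsigned long buggy = SCHED_CAPACITY_SCALE - rt_pressure;
	unsigned long fixed = capacity_orig - rt_pressure;

	printf("old=%lu buggy=%lu (> %lu!) fixed=%lu\n",
	       old, buggy, capacity_orig, fixed);
	return 0;
}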

> be more than capacity_orig. I am always under the impression that
> capacity_orig >= capacity. Or am I misunderstanding that?

You are right, there is a bug for SMT and the patch below should fix it.
Nevertheless, we still have the problem in some other places in the code.

Subject: [PATCH] sched/fair: fix scale_rt_capacity() for SMT

Since commit:
commit 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
scale_rt_capacity() returns the remaining capacity and not a scale factor
to apply to cpu_capacity_orig. arch_scale_cpu() is directly called by
scale_rt_capacity(), so we must pass it the sched_domain argument.

Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
Reported-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..c73e1fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7481,10 +7481,10 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long scale_rt_capacity(int cpu)
+static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long max = arch_scale_cpu_capacity(sd, cpu);
unsigned long used, free;
unsigned long irq;

@@ -7506,7 +7506,7 @@ static unsigned long scale_rt_capacity(int cpu)

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = scale_rt_capacity(cpu);
+ unsigned long capacity = scale_rt_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;

cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
--
2.7.4

>
> --
> Thanks and Regards
> Srikar Dronamraju
>

2018-09-04 10:39:20

by Srikar Dronamraju

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

* Vincent Guittot <[email protected]> [2018-09-04 11:36:26]:

> Hi Srikar,
>
> On Tuesday 04 Sep 2018 at 01:24:24 (-0700), Srikar Dronamraju wrote:
> > However after this change, the capacity_orig of each SMT thread would be
> > 1024. For example an SMT 8 core's capacity_orig would now be 8192.
> >
> > smt_gain was supposed to make a multi-threaded core look slightly more
> > powerful than a single-threaded core. I suspect that sometimes hurts
>
> Are there systems with both single-threaded and multi-threaded cores?
> That was the main open point for me (and for Qais too)
>

I don't know of any systems that come with both single-threaded and
multi-threaded cores. However, a user can still offline a few threads in a
core while leaving other cores untouched. I don't really know why somebody
would want to do that. For example, some customer was toying with SMT 3
mode in an SMT 8 POWER8 box.

>
> > us when doing load balance between 2 cores, i.e. at the MC or DIE sched
> > domain. Even with 2 threads running on a core, the core might look
> > lightly loaded, 2048/8192. Hence it might dissuade movement to an idle core.
>
> Then, there is the sibling flag at the SMT level that normally ensures 1 task per
> core for such a use case.
>

Agree.

> >
> > I always wonder why arch_scale_cpu_capacity() is called with a NULL
> > sched_domain in scale_rt_capacity(). This way capacity might actually
>
> Probably because until this v4.19-rcxx version, the rt scaling was done
> relative to the local cpu capacity:
> capacity = arch_scale_cpu() * scale_rt_capacity / SCHED_CAPACITY_SCALE
>
> Whereas now, it directly returns the remaining capacity.
>
> > be more than capacity_orig. I am always under the impression that
> > capacity_orig >= capacity. Or am I misunderstanding that?
>
> You are right, there is a bug for SMT and the patch below should fix it.
> Nevertheless, we still have the problem in some other places in the code.
>
> Subject: [PATCH] sched/fair: fix scale_rt_capacity() for SMT
>
> Since commit:
> commit 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
> scale_rt_capacity() returns the remaining capacity and not a scale factor
> to apply to cpu_capacity_orig. arch_scale_cpu() is directly called by
> scale_rt_capacity(), so we must pass it the sched_domain argument.
>
> Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
> Reported-by: Srikar Dronamraju <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>

Reviewed-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju


2018-09-05 07:38:16

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On Tue, 4 Sep 2018 at 12:37, Srikar Dronamraju
<[email protected]> wrote:
>
> * Vincent Guittot <[email protected]> [2018-09-04 11:36:26]:
>
> > Hi Srikar,
> >
> > On Tuesday 04 Sep 2018 at 01:24:24 (-0700), Srikar Dronamraju wrote:
> > > However after this change, the capacity_orig of each SMT thread would be
> > > 1024. For example an SMT 8 core's capacity_orig would now be 8192.
> > >
> > > smt_gain was supposed to make a multi-threaded core look slightly more
> > > powerful than a single-threaded core. I suspect that sometimes hurts
> >
> > Are there systems with both single-threaded and multi-threaded cores?
> > That was the main open point for me (and for Qais too)
> >
>
> I don't know of any systems that come with both single-threaded and
> multi-threaded cores. However, a user can still offline a few threads in a
> core while leaving other cores untouched. I don't really know why somebody
> would want to do that. For example, some customer was toying with SMT 3
> mode in an SMT 8 POWER8 box.

In this case, it means that we have the same core capacity whatever the
number of CPUs, and a core in SMT 3 mode will be set to the same compute
capacity as a core in SMT 8 mode.
Does it still make sense?

>
> >
> > > us when doing load balance between 2 cores, i.e. at the MC or DIE sched
> > > domain. Even with 2 threads running on a core, the core might look
> > > lightly loaded, 2048/8192. Hence it might dissuade movement to an idle core.
> >
> > Then, there is the sibling flag at the SMT level that normally ensures 1 task per
> > core for such a use case.
> >
>
> Agree.
>
> > >
> > > I always wonder why arch_scale_cpu_capacity() is called with a NULL
> > > sched_domain in scale_rt_capacity(). This way capacity might actually
> >
> > Probably because until this v4.19-rcxx version, the rt scaling was done
> > relative to the local cpu capacity:
> > capacity = arch_scale_cpu() * scale_rt_capacity / SCHED_CAPACITY_SCALE
> >
> > Whereas now, it directly returns the remaining capacity.
> >
> > > be more than capacity_orig. I am always under the impression that
> > > capacity_orig >= capacity. Or am I misunderstanding that?
> >
> > You are right, there is a bug for SMT and the patch below should fix it.
> > Nevertheless, we still have the problem in some other places in the code.
> >
> > Subject: [PATCH] sched/fair: fix scale_rt_capacity() for SMT
> >
> > Since commit:
> > commit 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
> > scale_rt_capacity() returns the remaining capacity and not a scale factor
> > to apply to cpu_capacity_orig. arch_scale_cpu() is directly called by
> > scale_rt_capacity(), so we must pass it the sched_domain argument.
> >
> > Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
> > Reported-by: Srikar Dronamraju <[email protected]>
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> Reviewed-by: Srikar Dronamraju <[email protected]>
>
> --
> Thanks and Regards
> Srikar Dronamraju
>

2018-09-05 08:51:55

by Srikar Dronamraju

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

* Vincent Guittot <[email protected]> [2018-09-05 09:36:42]:

> >
> > I don't know of any systems that come with both single-threaded and
> > multi-threaded cores. However, a user can still offline a few threads in a
> > core while leaving other cores untouched. I don't really know why somebody
> > would want to do that. For example, some customer was toying with SMT 3
> > mode in an SMT 8 POWER8 box.
>
> In this case, it means that we have the same core capacity whatever the
> number of CPUs, and a core in SMT 3 mode will be set to the same compute
> capacity as a core in SMT 8 mode.
> Does it still make sense?
>

To me it makes sense, at least from a POWER8 perspective, because per-thread
performance goes SMT 1 > SMT 2 > SMT 4 > SMT 8. So if one core is configured
for SMT 2 and another core is configured for SMT 4, all threads being busy,
the individual threads running on the SMT 2 core will complete more work than
the SMT 4 core's threads.

--
Thanks and Regards
Srikar Dronamraju


2018-09-05 09:13:14

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On Wed, 5 Sep 2018 at 10:50, Srikar Dronamraju
<[email protected]> wrote:
>
> * Vincent Guittot <[email protected]> [2018-09-05 09:36:42]:
>
> > >
> > > I don't know of any systems that come with both single-threaded and
> > > multi-threaded cores. However, a user can still offline a few threads in a
> > > core while leaving other cores untouched. I don't really know why somebody
> > > would want to do that. For example, some customer was toying with SMT 3
> > > mode in an SMT 8 POWER8 box.
> >
> > In this case, it means that we have the same core capacity whatever the
> > number of CPUs, and a core in SMT 3 mode will be set to the same compute
> > capacity as a core in SMT 8 mode.
> > Does it still make sense?
> >
>
> To me it makes sense, at least from a POWER8 perspective, because per-thread
> performance goes SMT 1 > SMT 2 > SMT 4 > SMT 8. So if one core is configured
> for SMT 2 and another core is configured for SMT 4, all threads being busy,
> the individual threads running on the SMT 2 core will complete more work than
> the SMT 4 core's threads.

I agree for individual thread capacity, but at the core group level the
SMT 1 core will have the same capacity as the SMT 8 core group, so load
balancing will try to balance the tasks evenly between the 2 cores,
whereas core SMT 8 > core SMT 1, isn't it?

>
> --
> Thanks and Regards
> Srikar Dronamraju
>

2018-09-05 11:16:32

by Srikar Dronamraju

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

* Vincent Guittot <[email protected]> [2018-09-05 11:11:35]:

> On Wed, 5 Sep 2018 at 10:50, Srikar Dronamraju
> <[email protected]> wrote:
> >
> > * Vincent Guittot <[email protected]> [2018-09-05 09:36:42]:
> >
> > > >
> > > > I don't know of any systems that come with both single-threaded and
> > > > multi-threaded cores. However, a user can still offline a few threads in a
> > > > core while leaving other cores untouched. I don't really know why somebody
> > > > would want to do that. For example, some customer was toying with SMT 3
> > > > mode in an SMT 8 POWER8 box.
> > >
> > > In this case, it means that we have the same core capacity whatever the
> > > number of CPUs, and a core in SMT 3 mode will be set to the same compute
> > > capacity as a core in SMT 8 mode.
> > > Does it still make sense?
> > >
> >
> > To me it makes sense, at least from a POWER8 perspective, because per-thread
> > performance goes SMT 1 > SMT 2 > SMT 4 > SMT 8. So if one core is configured
> > for SMT 2 and another core is configured for SMT 4, all threads being busy,
> > the individual threads running on the SMT 2 core will complete more work than
> > the SMT 4 core's threads.
>
> I agree for individual thread capacity, but at the core group level the
> SMT 1 core will have the same capacity as the SMT 8 core group, so load
> balancing will try to balance the tasks evenly between the 2 cores,
> whereas core SMT 8 > core SMT 1, isn't it?
>

I believe that core capacity should be similar irrespective of the number
of threads. We wanted to give a small benefit if the core has multiple
threads, and that was smt_gain. Let's say we have 8 equal SW threads
running on 2 cores, one being SMT 2 and the other being SMT 4:
then 4 threads should be spread to each core, so that we would be fair
to each of the 8 SW threads.
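
As a rough sketch (plain C, illustrative numbers assumed) of the
capacity-proportional split the balancer would aim for in that example:

#include <stdio.h>

static void split(const char *tag, double cap_a, double cap_b)
{
	double nr_tasks = 8.0;	/* 8 equal runnable SW threads */

	printf("%s core_a=%.1f core_b=%.1f tasks\n", tag,
	       nr_tasks * cap_a / (cap_a + cap_b),
	       nr_tasks * cap_b / (cap_a + cap_b));
}

int main(void)
{
	/* With smt_gain both cores expose ~1178 whatever the thread count,
	 * so the capacity-proportional target is an even 4/4 split.
	 */
	split("with smt_gain:   ", 2 * 589.0, 4 * 294.0);

	/* Without smt_gain the SMT4 core exposes twice the capacity of the
	 * SMT2 core, pulling the target toward roughly 2.7/5.3.
	 */
	split("without smt_gain:", 2 * 1024.0, 4 * 1024.0);
	return 0;
}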

--
Thanks and Regards
Srikar Dronamraju


2018-09-06 09:23:03

by Vincent Guittot

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On Wed, 5 Sep 2018 at 13:14, Srikar Dronamraju
<[email protected]> wrote:
>
> * Vincent Guittot <[email protected]> [2018-09-05 11:11:35]:
>
> > On Wed, 5 Sep 2018 at 10:50, Srikar Dronamraju
> > <[email protected]> wrote:
> > >
> > > * Vincent Guittot <[email protected]> [2018-09-05 09:36:42]:
> > >
> > > > >
> > > > > I don't know of any systems that come with both single-threaded and
> > > > > multi-threaded cores. However, a user can still offline a few threads in a
> > > > > core while leaving other cores untouched. I don't really know why somebody
> > > > > would want to do that. For example, some customer was toying with SMT 3
> > > > > mode in an SMT 8 POWER8 box.
> > > >
> > > > In this case, it means that we have the same core capacity whatever the
> > > > number of CPUs, and a core in SMT 3 mode will be set to the same compute
> > > > capacity as a core in SMT 8 mode.
> > > > Does it still make sense?
> > > >
> > >
> > > To me it makes sense, at least from a POWER8 perspective, because per-thread
> > > performance goes SMT 1 > SMT 2 > SMT 4 > SMT 8. So if one core is configured
> > > for SMT 2 and another core is configured for SMT 4, all threads being busy,
> > > the individual threads running on the SMT 2 core will complete more work than
> > > the SMT 4 core's threads.
> >
> > I agree for individual thread capacity, but at the core group level the
> > SMT 1 core will have the same capacity as the SMT 8 core group, so load
> > balancing will try to balance the tasks evenly between the 2 cores,
> > whereas core SMT 8 > core SMT 1, isn't it?
> >
>
> I believe that core capacity should be similar irrespective of the number
> of threads. We wanted to give a small benefit if the core has multiple
> threads, and that was smt_gain. Let's say we have 8 equal SW threads
> running on 2 cores, one being SMT 2 and the other being SMT 4:
> then 4 threads should be spread to each core, so that we would be fair
> to each of the 8 SW threads.

Do you mean that it would be the same with SMT 2 and SMT 8?
Evenly spreading the 8 SW threads between the 2 cores would be better
than 2 SW threads on the SMT 2 core and 6 on the SMT 8 core?

>
> --
> Thanks and Regards
> Srikar Dronamraju
>

2018-09-07 13:09:28

by Peter Zijlstra

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On Wed, Aug 29, 2018 at 03:08:38PM +0100, Qais Yousef wrote:
> It was hard to track the history of this and *why* it's needed.

Yeah, my bad..

So at some point I wanted to do dynamic capacity and dynamic smt gain by
using the x86 APERF/MPERF stuff. But it never quite worked and we got
stuck with the remnants.

Subject: [tip:sched/core] sched/fair: Fix scale_rt_capacity() for SMT

Commit-ID: 287cdaac5700c5b8970d739f73d742d863d3e2ca
Gitweb: https://git.kernel.org/tip/287cdaac5700c5b8970d739f73d742d863d3e2ca
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 4 Sep 2018 11:36:26 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 10:13:47 +0200

sched/fair: Fix scale_rt_capacity() for SMT

Since commit:

523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")

scale_rt_capacity() returns the remaining capacity and not a scale factor
to apply to cpu_capacity_orig. arch_scale_cpu() is directly called by
scale_rt_capacity(), so we must pass it the sched_domain argument.

Reported-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6b7d6daab20..f12d004be6a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7488,10 +7488,10 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long scale_rt_capacity(int cpu)
+static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long max = arch_scale_cpu_capacity(sd, cpu);
unsigned long used, free;
unsigned long irq;

@@ -7513,7 +7513,7 @@ static unsigned long scale_rt_capacity(int cpu)

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = scale_rt_capacity(cpu);
+ unsigned long capacity = scale_rt_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;

cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);

Subject: [tip:sched/core] sched/numa: Remove unused code from update_numa_stats()

Commit-ID: d90707ebebe03596e19de3abbf79b766e72a3465
Gitweb: https://git.kernel.org/tip/d90707ebebe03596e19de3abbf79b766e72a3465
Author: Vincent Guittot <[email protected]>
AuthorDate: Wed, 29 Aug 2018 15:19:09 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:55 +0200

sched/numa: Remove unused code from update_numa_stats()

With:

commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

the local variables 'smt', 'cpus' and 'capacity', and the results they
compute, are no longer used in update_numa_stats() now that
numa_has_capacity() is gone.

Remove this unused code.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06ff75f4ac7b..b65596fae06b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1463,8 +1463,7 @@ struct numa_stats {
*/
static void update_numa_stats(struct numa_stats *ns, int nid)
{
- int smt, cpu, cpus = 0;
- unsigned long capacity;
+ int cpu;

memset(ns, 0, sizeof(*ns));
for_each_cpu(cpu, cpumask_of_node(nid)) {
@@ -1473,26 +1472,8 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
ns->nr_running += rq->nr_running;
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
-
- cpus++;
}

- /*
- * If we raced with hotplug and there are no CPUs left in our mask
- * the @ns structure is NULL'ed and task_numa_compare() will
- * not find this node attractive.
- *
- * We'll detect a huge imbalance and bail there.
- */
- if (!cpus)
- return;
-
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
- capacity = cpus / smt; /* cores */
-
- capacity = min_t(unsigned, capacity,
- DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
}

struct task_numa_env {

Subject: [tip:sched/core] sched/numa: Remove unused numa_stats::nr_running field

Commit-ID: 7477a3504e619768c9e972dafe2907e6b8ed9823
Gitweb: https://git.kernel.org/tip/7477a3504e619768c9e972dafe2907e6b8ed9823
Author: Vincent Guittot <[email protected]>
AuthorDate: Wed, 29 Aug 2018 15:19:10 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:56 +0200

sched/numa: Remove unused numa_stats::nr_running field

nr_running in struct numa_stats is not used anywhere in the code.

Remove it.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b65596fae06b..6bd142d19549 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1454,8 +1454,6 @@ struct numa_stats {

/* Total compute capacity of CPUs on a node */
unsigned long compute_capacity;
-
- unsigned int nr_running;
};

/*
@@ -1469,7 +1467,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
for_each_cpu(cpu, cpumask_of_node(nid)) {
struct rq *rq = cpu_rq(cpu);

- ns->nr_running += rq->nr_running;
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
}

2018-09-10 11:07:15

by Qais Yousef

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On 04/09/18 11:37, Srikar Dronamraju wrote:
> * Vincent Guittot <[email protected]> [2018-09-04 11:36:26]:
>
>> Hi Srikar,
>>
>> On Tuesday 04 Sep 2018 at 01:24:24 (-0700), Srikar Dronamraju wrote:
>>> However after this change, the capacity_orig of each SMT thread would be
>>> 1024. For example an SMT 8 core's capacity_orig would now be 8192.
>>>
>>> smt_gain was supposed to make a multi-threaded core look slightly more
>>> powerful than a single-threaded core. I suspect that sometimes hurts
>> Are there systems with both single-threaded and multi-threaded cores?
>> That was the main open point for me (and for Qais too)
>>
> I don't know of any systems that come with both single-threaded and
> multi-threaded cores. However, a user can still offline a few threads in a
> core while leaving other cores untouched. I don't really know why somebody
> would want to do that. For example, some customer was toying with SMT 3
> mode in an SMT 8 POWER8 box.

What was the customer trying to achieve by this? Did it end up being
useful?

If we get mixed SMT 3 and SMT 8 in the system we might have issues, but
again I don't see how this makes sense from a system point of view. I
don't think the power savings from turning off a few hardware threads
will be significant, since the core, which I'd expect to be the main
energy consumer, is still on. From a performance perspective, SMT 3 might
have less contention (hence better 'performance'), but we don't do
anything special when scheduling work in this case to take advantage of
that. So I don't think the setup is useful to cater for or encourage.

--
Qais Yousef


2018-09-10 11:25:32

by Qais Yousef

Subject: Re: [RFC PATCH 3/4] sched/topology: remove smt_gain

On 07/09/18 13:42, Peter Zijlstra wrote:
> On Wed, Aug 29, 2018 at 03:08:38PM +0100, Qais Yousef wrote:
>> It was hard to track the history of this and *why* it's needed.
> Yeah, my bad..
>
> So at some point I wanted to do dynamic capacity and dynamic smt gain by
> using the x86 APERF/MPERF stuff. But it never quite worked and we got
> stuck with the remnants.


OK, thanks for the info. I think that was 10 years ago? With all the code
moves and renaming, git log -L was getting more difficult to use as I dug
deeper into history :p

Is this work officially dead then? If so, did it end up not being useful,
or did you just never get around to finishing it?

Thanks

--
Qais Yousef


Subject: [tip:sched/core] sched/topology: Remove the ::smt_gain field from 'struct sched_domain'

Commit-ID: 765d0af19f5f388a34bf4533378f8398b72ded46
Gitweb: https://git.kernel.org/tip/765d0af19f5f388a34bf4533378f8398b72ded46
Author: Vincent Guittot <[email protected]>
AuthorDate: Wed, 29 Aug 2018 15:19:11 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 11 Dec 2018 15:16:57 +0100

sched/topology: Remove the ::smt_gain field from 'struct sched_domain'

::smt_gain is used to compute the capacity of the CPUs of an SMT core with the
constraint 1 < ::smt_gain < 2 in order to be able to compute the number of CPUs
per core. The field has_free_capacity of struct numa_stats, which was the
last user of this computation of the number of CPUs per core, has been removed
by:

2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

We can now remove this constraint on core capacity and use the default value
SCHED_CAPACITY_SCALE for SMT CPUs. With this removal, SCHED_CAPACITY_SCALE
becomes the maximum compute capacity of CPUs on every system. This should
help to simplify some code and remove fields like rd->max_cpu_capacity.

Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other
places in the code when it needs the capacity of a CPU to scale
some metrics, like in pelt, deadline or schedutil. In the case of SMT, the value
returned is then not the capacity of the SMT CPUs but the default
SCHED_CAPACITY_SCALE.

So remove it.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/topology.h | 1 -
kernel/sched/sched.h | 3 ---
kernel/sched/topology.c | 2 --
3 files changed, 6 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 6b9976180c1e..7fa0bc17cd8c 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -89,7 +89,6 @@ struct sched_domain {
unsigned int newidle_idx;
unsigned int wake_idx;
unsigned int forkexec_idx;
- unsigned int smt_gain;

int nohz_idle; /* NOHZ IDLE status */
int flags; /* See SD_* */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9bde60a11805..ceb896404869 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1864,9 +1864,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
return SCHED_CAPACITY_SCALE;
}
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8d7f15ba5916..7364e0b427b7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1133,7 +1133,6 @@ sd_init(struct sched_domain_topology_level *tl,

.last_balance = jiffies,
.balance_interval = sd_weight,
- .smt_gain = 0,
.max_newidle_lb_cost = 0,
.next_decay_max_lb_cost = jiffies,
.child = child,
@@ -1164,7 +1163,6 @@ sd_init(struct sched_domain_topology_level *tl,

if (sd->flags & SD_SHARE_CPUCAPACITY) {
sd->imbalance_pct = 110;
- sd->smt_gain = 1178; /* ~15% */

} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->imbalance_pct = 117;