2020-04-16 01:08:49

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()

I've been staring at select_task_rq_fair() for some time now, and have come
to "hate" the usage of the sd_flag parameter. It is used both as an
indicator of the wakeup type (ttwu/fork/exec) and as a domain walk search
target. CFS is the only class doing this; the other classes just need the
wakeup type but get passed a domain flag instead.

This series gets rid of select_task_rq()'s sd_flag parameter and also tries
to optimize select_task_rq_fair().

This is based on top of v5.7-rc1.

Patches
=======

o Patch 1 is a simple dead parameter cleanup
o Patches 2-4 get rid of SD_LOAD_BALANCE
o Patches 5-6 involve getting rid of the sd_flag parameter for
select_task_rq().
o Patch 7 is an extra cleanup in the select_task_rq_fair() region.
o Patches 8-9 split the want_affine vs !want_affine paths of
select_task_rq_fair(), which unearths a small optimization. Sort of a
single patch split in two for the sake of review.

Testing
=======

Testing was done against a slightly older tip/sched/core at:
25ac227a25ac ("sched/fair: Remove wake_cap()")

I got annoyed by the variance in my 500-iteration runs, so I scripted
something to run batches of 5k iterations. It looks pretty stable from one
batch to another. I also stared at some boxplots to convince myself I
wasn't needlessly making things worse - you too can stare at them here [1].

Note: the 'X%' stats are the percentiles, so 50% is the 50th percentile.

Juno r0
-------

2+4 big.LITTLE. SD levels are {MC, DIE}.

Hackbench
~~~~~~~~~

15000 iterations of
$ hackbench
(lower is better):

| | -PATCH | +PATCH | DELTA |
|-------+----------+----------+--------|
| mean | 0.622235 | 0.618834 | -0.55% |
| std | 0.018121 | 0.017579 | -2.99% |
| min | 0.571000 | 0.573000 | +0.35% |
| 50% | 0.620000 | 0.617000 | -0.48% |
| 75% | 0.633000 | 0.629000 | -0.63% |
| 99% | 0.674000 | 0.669000 | -0.74% |
| max | 0.818000 | 0.756000 | -7.58% |

The boxplot shows a single outlier to the very left for both commits, which
correspond to the minimums reported above. Other than that, +PATCH has
somewhat lower outliers on the right-hand side: worst cases are a tad better.

Sysbench
~~~~~~~~

15000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=6 run
(higher is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 15318.954000 | 15628.416933 | +2.02% |
| std | 235.466202 | 205.162730 | -12.87% |
| min | 13025.000000 | 13554.000000 | +4.06% |
| 50% | 15366.000000 | 15681.000000 | +2.05% |
| 75% | 15497.000000 | 15765.000000 | +1.73% |
| 99% | 15651.000000 | 15893.000000 | +1.55% |
| max | 15716.000000 | 15972.000000 | +1.63% |

That's overall a tad better.

Dual-socket Xeon E5
-------------------

Each socket is 10 cores w/ SMT2 - 40 CPUs total. SD levels are
{SMT, MC, NUMA}.

Hackbench
~~~~~~~~~

25000 iterations of
$ hackbench -l 1000
(lower is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+--------|
| mean | 0.946312 | 0.944685 | -0.17% |
| std | 0.006419 | 0.006447 | +0.44% |
| min | 0.906000 | 0.897000 | -0.99% |
| 50% | 0.947000 | 0.945000 | -0.21% |
| 75% | 0.950000 | 0.949000 | -0.11% |
| 99% | 0.959000 | 0.958000 | -0.10% |
| max | 0.988000 | 0.967000 | -2.13% |

The boxplot shows that the min improvement is some sort of fluke - it's a
single point standing out on the left. The mean *is* slightly lowered,
which most likely comes from +PATCH having fewer high-value outliers.

I looked into some statistical tests, but the distribution of my samples
isn't a normal distribution (which is a requirement for most of them). This
actually can't happen by construction according to [2]: hackbench outputs
the maximum of a set of random variables, and such maxima converge to an
extreme-value distribution, not a normal one. We could instead use the
average over all sender/receiver pairs, or even the individual time taken
by each pair; that being said, I don't think the values produced by the
pairs can be seen as independent variables, given that there will be more
than one task per CPU.

Wilcoxon's test [3] gives me a p-value of ~1e-182, so there *is* a
significant difference between the two datasets, but it does not say
whether that difference lies in the mean, the variance, or some other
parameter of the distribution.

Sysbench
~~~~~~~~

25000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=40 run
(higher is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 23937.937560 | 24280.668640 | +1.43% |
| std | 547.139948 | 484.963639 | -11.36% |
| min | 21526.000000 | 21917.000000 | +1.82% |
| 50% | 24032.000000 | 24374.000000 | +1.42% |
| 75% | 24355.000000 | 24638.000000 | +1.16% |
| 99% | 24869.010000 | 25084.000000 | +0.86% |
| max | 25410.000000 | 25623.000000 | +0.84% |

As with the Juno, that's overall a tad better.

Takeaway
--------

The TL;DR for those fancy stats seems to be: it's hard to say much about
hackbench, and sysbench likes the change a bit. The important thing for me
is to not introduce regressions with my stupid change, and AFAICT that is
the case.

Links
=====

[1]: https://htmlpreview.github.io/?https://gist.githubusercontent.com/valschneider/433b3772d1776c52214dd05be2ab2f03/raw/316fbd9f774fa381c60731511c881a3360111563/streamline_v2_bplots.html
[2]: https://en.wikipedia.org/wiki/Fisher%E2%80%93Tippett%E2%80%93Gnedenko_theorem
[3]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon

Revisions
=========

v2 -> v3
--------
o Rebased on top of v5.7-rc1 (didn't re-run performance tests)
o Collected Reviewed-by (Dietmar)
o Updated changelog of 3/9 (Dietmar)

v1 -> v2
--------
o Removed the 'RFC' tag
o Made the sd_flags sysctl read-only
o Removed the SD_LOAD_BALANCE flag
o Cleaned up ugly changes thanks to the above

Valentin Schneider (9):
sched/fair: find_idlest_group(): Remove unused sd_flag parameter
sched/debug: Make sd->flags sysctl read-only
sched: Remove checks against SD_LOAD_BALANCE
sched/topology: Kill SD_LOAD_BALANCE
sched: Add WF_TTWU, WF_EXEC wakeup flags
sched: Kill select_task_rq()'s sd_flag parameter
sched/fair: Dissociate wakeup decisions from SD flag value
sched/fair: Split select_task_rq_fair want_affine logic
sched/topology: Define and use shortcut pointers for wakeup sd_flag scan

include/linux/sched/topology.h | 29 +++++++------
kernel/sched/core.c | 10 ++---
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 75 +++++++++++++++++++---------------
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 4 +-
kernel/sched/sched.h | 13 ++++--
kernel/sched/stop_task.c | 2 +-
kernel/sched/topology.c | 43 +++++++++----------
10 files changed, 98 insertions(+), 86 deletions(-)

--
2.24.0


2020-04-16 01:08:50

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 4/9] sched/topology: Kill SD_LOAD_BALANCE

That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/topology.h | 29 ++++++++++++++---------------
kernel/sched/topology.c | 3 +--
2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af9319e4cfb9..e07380669ee5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -11,21 +11,20 @@
*/
#ifdef CONFIG_SMP

-#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
-#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
-#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
-#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
-#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
-#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
-#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
-#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
+#define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
+#define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
+#define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
+#define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
+#define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
+#define SD_SHARE_POWERDOMAIN 0x0080 /* Domain members share power domain */
+#define SD_SHARE_PKG_RESOURCES 0x0100 /* Domain members share CPU pkg resources */
+#define SD_SERIALIZE 0x0200 /* Only a single load balancing instance */
+#define SD_ASYM_PACKING 0x0400 /* Place busy groups earlier in the domain */
+#define SD_PREFER_SIBLING 0x0800 /* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP 0x1000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x2000 /* cross-node balancing */

#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a9dc34a0ebc1..1d7b446fac7d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1341,8 +1341,7 @@ sd_init(struct sched_domain_topology_level *tl,

.cache_nice_tries = 0,

- .flags = 1*SD_LOAD_BALANCE
- | 1*SD_BALANCE_NEWIDLE
+ .flags = 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
| 0*SD_BALANCE_WAKE
--
2.24.0

2020-04-16 01:08:52

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 2/9] sched/debug: Make sd->flags sysctl read-only

Writing to a sched_domain's flags sysctl directly updates the value of the
field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to no longer has the relevant flag set).

Explicit domain walks that check for flags will be affected by the write,
but this won't be in sync with the cached pointers, which will still point
to the domains that were cached at the last sched_domain build.

In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers when
written to, but this does not seem to be worth the trouble. Make it
read-only.
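
For reference, the caching in question only happens at sched_domain build
time. A trimmed-down sketch of kernel/sched/topology.c (not the full
function) looks like:

    static void update_top_cache_domain(int cpu)
    {
            struct sched_domain *sd;

            /* Cache the highest domain sharing package resources (LLC) */
            sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
            rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
            /* ... same pattern for sd_numa, sd_asym_packing, etc. ... */
    }

Nothing on the sysctl write path re-runs this, hence the staleness.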

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/debug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a562df57a86e..00796b0a2723 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -258,7 +258,7 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
set_table_entry(&table[2], "busy_factor", &sd->busy_factor, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[4], "cache_nice_tries", &sd->cache_nice_tries, sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0644, proc_dointvec_minmax);
+ set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0444, proc_dointvec_minmax);
set_table_entry(&table[6], "max_newidle_lb_cost", &sd->max_newidle_lb_cost, sizeof(long), 0644, proc_doulongvec_minmax);
set_table_entry(&table[7], "name", sd->name, CORENAME_MAX_SIZE, 0444, proc_dostring);
/* &table[8] is terminator */
--
2.24.0

2020-04-16 01:09:00

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 1/9] sched/fair: find_idlest_group(): Remove unused sd_flag parameter

The last use of that parameter was removed by commit

57abff067a08 ("sched/fair: Rework find_idlest_group()")

Get rid of the parameter.

Reviewed-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..98321d8dde7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
}

static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag);
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);

/*
* find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
@@ -5905,7 +5904,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
continue;
}

- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_idlest_group(sd, p, cpu);
if (!group) {
sd = sd->child;
continue;
@@ -8677,8 +8676,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
* Assumes p is allowed on at least one CPU in sd.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
struct sg_lb_stats local_sgs, tmp_sgs;
--
2.24.0

2020-04-16 01:10:33

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 8/9] sched/fair: Split select_task_rq_fair want_affine logic

The domain loop within select_task_rq_fair() depends on a few bits of
input, namely the SD flag we're looking for and whether we want_affine.

For !want_affine, the domain loop will walk up the hierarchy to reach the
highest domain with the requested sd_flag (SD_BALANCE_{WAKE, FORK, EXEC})
set. In other words, that's a call to highest_flag_domain().
Note that this is static information wrt a given SD hierarchy, so we can
cache it - but that comes in a later patch to ease reviewing.
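
For reference, highest_flag_domain() is roughly the walk below (see
kernel/sched/sched.h; trimmed of its kerneldoc):

    static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
    {
            struct sched_domain *sd, *hsd = NULL;

            for_each_domain(cpu, sd) {
                    /* Stop at the first level missing the flag */
                    if (!(sd->flags & flag))
                            break;
                    hsd = sd;
            }

            return hsd;
    }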

For want_affine, we'll walk up the hierarchy to reach the first domain with
SD_WAKE_AFFINE set that spans the task's prev_cpu (SD_LOAD_BALANCE is gone
as of an earlier patch in this series). We still save a pointer to the last
visited domain that had the requested sd_flag set, which means that if we
fail the affine condition (e.g. no domain had SD_WAKE_AFFINE) we'll use the
same SD as we would have found with !want_affine.

Split the domain loop into !want_affine and want_affine paths. As it is,
this leads to two domain walks instead of a single one, but stay tuned for
the next patch.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f20e5cd6515c..6f8cdb99f4a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6656,26 +6656,33 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
}

rcu_read_lock();
+
+ sd = highest_flag_domain(cpu, sd_flag);
+
+ /*
+ * If !want_affine, we just look for the highest domain where
+ * sd_flag is set.
+ */
+ if (!want_affine)
+ goto scan;
+
+ /*
+ * Otherwise we look for the lowest domain with SD_WAKE_AFFINE and that
+ * spans both 'cpu' and 'prev_cpu'.
+ */
for_each_domain(cpu, tmp) {
- /*
- * If both 'cpu' and 'prev_cpu' are part of this domain,
- * cpu is a valid SD_WAKE_AFFINE target.
- */
- if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+ if ((tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
if (cpu != prev_cpu)
new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

- sd = NULL; /* Prefer wake_affine over balance flags */
+ /* Prefer wake_affine over SD lookup */
+ sd = NULL;
break;
}
-
- if (tmp->flags & sd_flag)
- sd = tmp;
- else if (!want_affine)
- break;
}

+scan:
if (unlikely(sd)) {
/* Slow path */
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
--
2.24.0

2020-04-16 01:10:47

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 5/9] sched: Add WF_TTWU, WF_EXEC wakeup flags

To remove the sd_flag parameter of select_task_rq(), we need another way of
encoding wakeup kinds. There already is a WF_FORK flag, add the missing
ones.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/sched.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..f32c5fa229af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1684,11 +1684,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}

/*
- * wake flags
+ * Wake flags
*/
#define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */
-#define WF_FORK 0x02 /* Child wakeup after fork */
-#define WF_MIGRATED 0x4 /* Internal use, task got migrated */
+#define WF_TTWU 0x02 /* Regular task wakeup */
+#define WF_FORK 0x04 /* Child wakeup after fork */
+#define WF_EXEC 0x08 /* "Fake" wakeup at exec */
+#define WF_MIGRATED 0x10 /* Internal use, task got migrated */

/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution
--
2.24.0

2020-04-16 01:10:54

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 6/9] sched: Kill select_task_rq()'s sd_flag parameter

Only select_task_rq_fair() uses that parameter to do an actual domain
search; the other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. This cleans up the API a
bit, but adds an extra conversion layer within select_task_rq_fair().

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/core.c | 10 +++++-----
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 18 +++++++++++++++---
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 4 ++--
kernel/sched/sched.h | 2 +-
kernel/sched/stop_task.c | 2 +-
7 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a61a3b8eaa9..aea9badd397a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2094,12 +2094,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
*/
static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
{
lockdep_assert_held(&p->pi_lock);

if (p->nr_cpus_allowed > 1)
- cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+ cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
else
cpu = cpumask_any(p->cpus_ptr);

@@ -2612,7 +2612,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
atomic_dec(&task_rq(p)->nr_iowait);
}

- cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+ cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
@@ -2945,7 +2945,7 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+ __set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -3486,7 +3486,7 @@ void sched_exec(void)
int dest_cpu;

raw_spin_lock_irqsave(&p->pi_lock, flags);
- dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f51b0d6..0e96b435c51b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1601,12 +1601,12 @@ static void yield_task_dl(struct rq *rq)
static int find_later_rq(struct task_struct *task);

static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
struct task_struct *curr;
struct rq *rq;

- if (sd_flag != SD_BALANCE_WAKE)
+ if (!(flags & WF_TTWU))
goto out;

rq = cpu_rq(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d34b4e4060f..b0bf98e6798b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6611,7 +6611,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
- * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
+ * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
*
* Balances load by selecting the idlest CPU in the idlest group, or under
@@ -6622,13 +6622,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
+ int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
- int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ int sd_flag;
+
+ switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
+ case WF_TTWU:
+ sd_flag = SD_BALANCE_WAKE;
+ break;
+ case WF_FORK:
+ sd_flag = SD_BALANCE_FORK;
+ break;
+ default:
+ sd_flag = SD_BALANCE_EXEC;
+ }

if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..e9c6a27f0647 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -367,7 +367,7 @@ void cpu_startup_entry(enum cpuhp_state state)

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index df11d88c9895..88427ea0231b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1426,14 +1426,14 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
struct task_struct *curr;
struct rq *rq;
bool test;

/* For anything but wake ups, just return the task_cpu */
- if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+ if (!(flags & (WF_TTWU | WF_FORK)))
goto out;

rq = cpu_rq(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f32c5fa229af..448f5d630544 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1767,7 +1767,7 @@ struct sched_class {

#ifdef CONFIG_SMP
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
- int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

void (*task_woken)(struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4c9e9975684f..4f061ddf8470 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int flags)
{
return task_cpu(p); /* stop tasks as never migrate */
}
--
2.24.0

2020-04-16 01:11:19

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 3/9] sched: Remove checks against SD_LOAD_BALANCE

The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init(). By making the sched_domain->flags sysctl interface read-only, we
have removed the last piece of code that could clear that flag - as such,
it will now always be present. Rather than keep carrying it along, we
can work towards getting rid of it entirely.

cpusets don't need it because they can make CPUs be attached to the NULL
domain (e.g. cpuset with sched_load_balance=0), or to a partitioned
root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
sched_load_balance=1).

isolcpus applies the same "trick": isolated CPUs are explicitly taken out
of the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.

Remove the checks against SD_LOAD_BALANCE.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 14 ++------------
kernel/sched/topology.c | 28 +++++++++-------------------
2 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98321d8dde7e..3d34b4e4060f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6645,9 +6645,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

rcu_read_lock();
for_each_domain(cpu, tmp) {
- if (!(tmp->flags & SD_LOAD_BALANCE))
- break;
-
/*
* If both 'cpu' and 'prev_cpu' are part of this domain,
* cpu is a valid SD_WAKE_AFFINE target.
@@ -9792,9 +9789,8 @@ static int active_load_balance_cpu_stop(void *data)
/* Search for an sd spanning us and the target CPU. */
rcu_read_lock();
for_each_domain(target_cpu, sd) {
- if ((sd->flags & SD_LOAD_BALANCE) &&
- cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
- break;
+ if (cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
+ break;
}

if (likely(sd)) {
@@ -9883,9 +9879,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
}
max_cost += sd->max_newidle_lb_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
/*
* Stop the load balance at this level. There is another
* CPU in our sched group which is doing load balancing more
@@ -10474,9 +10467,6 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int continue_balancing = 1;
u64 t0, domain_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
break;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..a9dc34a0ebc1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -33,14 +33,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_clear(groupmask);

printk(KERN_DEBUG "%*s domain-%d: ", level, "", level);
-
- if (!(sd->flags & SD_LOAD_BALANCE)) {
- printk("does not load-balance\n");
- if (sd->parent)
- printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
- return -1;
- }
-
printk(KERN_CONT "span=%*pbl level=%s\n",
cpumask_pr_args(sched_domain_span(sd)), sd->name);

@@ -151,8 +143,7 @@ static int sd_degenerate(struct sched_domain *sd)
return 1;

/* Following flags need at least 2 groups */
- if (sd->flags & (SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
+ if (sd->flags & (SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
@@ -183,15 +174,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)

/* Flags needing groups don't count if only 1 group in parent */
if (parent->groups == parent->groups->next) {
- pflags &= ~(SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
- SD_BALANCE_FORK |
- SD_BALANCE_EXEC |
- SD_ASYM_CPUCAPACITY |
- SD_SHARE_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ pflags &= ~(SD_BALANCE_NEWIDLE |
+ SD_BALANCE_FORK |
+ SD_BALANCE_EXEC |
+ SD_ASYM_CPUCAPACITY |
+ SD_SHARE_CPUCAPACITY |
+ SD_SHARE_PKG_RESOURCES |
+ SD_PREFER_SIBLING |
+ SD_SHARE_POWERDOMAIN);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
--
2.24.0

2020-04-16 07:44:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 1/9] sched/fair: find_idlest_group(): Remove unused sd_flag parameter

On Wed, 15 Apr 2020 at 23:05, Valentin Schneider
<[email protected]> wrote:
>
> The last use of that parameter was removed by commit
>
> 57abff067a08 ("sched/fair: Rework find_idlest_group()")
>
> Get rid of the parameter.
>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>

This patch is not directly related to the others.

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..98321d8dde7e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5821,8 +5821,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> }
>
> static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> - int this_cpu, int sd_flag);
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
>
> /*
> * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
> @@ -5905,7 +5904,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> continue;
> }
>
> - group = find_idlest_group(sd, p, cpu, sd_flag);
> + group = find_idlest_group(sd, p, cpu);
> if (!group) {
> sd = sd->child;
> continue;
> @@ -8677,8 +8676,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
> * Assumes p is allowed on at least one CPU in sd.
> */
> static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> - int this_cpu, int sd_flag)
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> {
> struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
> struct sg_lb_stats local_sgs, tmp_sgs;
> --
> 2.24.0
>

2020-04-16 07:46:18

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched: Kill select_task_rq()'s sd_flag parameter

On Wed, 15 Apr 2020 at 23:05, Valentin Schneider
<[email protected]> wrote:
>
> Only select_task_rq_fair() uses that parameter to do an actual domain
> search, other classes only care about what kind of wakeup is happening
> (fork, exec, or "regular") and thus just translate the flag into a wakeup
> type.
>
> WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
> encode the wakeup types we care about. This cleans up the API a
> bit, but adds an extra conversion layer within select_task_rq_fair().
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/core.c | 10 +++++-----
> kernel/sched/deadline.c | 4 ++--
> kernel/sched/fair.c | 18 +++++++++++++++---
> kernel/sched/idle.c | 2 +-
> kernel/sched/rt.c | 4 ++--
> kernel/sched/sched.h | 2 +-
> kernel/sched/stop_task.c | 2 +-
> 7 files changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3a61a3b8eaa9..aea9badd397a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2094,12 +2094,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> * The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
> */
> static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
> {
> lockdep_assert_held(&p->pi_lock);
>
> if (p->nr_cpus_allowed > 1)
> - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> + cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
> else
> cpu = cpumask_any(p->cpus_ptr);
>
> @@ -2612,7 +2612,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> atomic_dec(&task_rq(p)->nr_iowait);
> }
>
> - cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> + cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
> if (task_cpu(p) != cpu) {
> wake_flags |= WF_MIGRATED;
> psi_ttwu_dequeue(p);
> @@ -2945,7 +2945,7 @@ void wake_up_new_task(struct task_struct *p)
> * as we're not fully set-up yet.
> */
> p->recent_used_cpu = task_cpu(p);
> - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> + __set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
> #endif
> rq = __task_rq_lock(p, &rf);
> update_rq_clock(rq);
> @@ -3486,7 +3486,7 @@ void sched_exec(void)
> int dest_cpu;
>
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
> + dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
> if (dest_cpu == smp_processor_id())
> goto unlock;
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 504d2f51b0d6..0e96b435c51b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1601,12 +1601,12 @@ static void yield_task_dl(struct rq *rq)
> static int find_later_rq(struct task_struct *task);
>
> static int
> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_dl(struct task_struct *p, int cpu, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
>
> - if (sd_flag != SD_BALANCE_WAKE)
> + if (!(flags & WF_TTWU))
> goto out;
>
> rq = cpu_rq(cpu);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3d34b4e4060f..b0bf98e6798b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6611,7 +6611,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> - * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> + * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> *
> * Balances load by selecting the idlest CPU in the idlest group, or under
> @@ -6622,13 +6622,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> * preempt must be disabled.
> */
> static int
> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> + int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> struct sched_domain *tmp, *sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> int want_affine = 0;
> - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> + int sd_flag;
> +
> + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {

You remove a function parameter, which was directly set with the right
flag, but then you add a switch case to recreate this sd_flag
internally. Not sure we can say that it's a real benefit.

> + case WF_TTWU:
> + sd_flag = SD_BALANCE_WAKE;
> + break;
> + case WF_FORK:
> + sd_flag = SD_BALANCE_FORK;
> + break;
> + default:
> + sd_flag = SD_BALANCE_EXEC;
> + }
>
> if (sd_flag & SD_BALANCE_WAKE) {
> record_wakee(p);
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..e9c6a27f0647 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -367,7 +367,7 @@ void cpu_startup_entry(enum cpuhp_state state)
>
> #ifdef CONFIG_SMP
> static int
> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int cpu, int flags)
> {
> return task_cpu(p); /* IDLE tasks as never migrated */
> }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index df11d88c9895..88427ea0231b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1426,14 +1426,14 @@ static void yield_task_rt(struct rq *rq)
> static int find_lowest_rq(struct task_struct *task);
>
> static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
> bool test;
>
> /* For anything but wake ups, just return the task_cpu */
> - if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> + if (!(flags & (WF_TTWU | WF_FORK)))
> goto out;
>
> rq = cpu_rq(cpu);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f32c5fa229af..448f5d630544 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1767,7 +1767,7 @@ struct sched_class {
>
> #ifdef CONFIG_SMP
> int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> - int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> + int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
> void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
>
> void (*task_woken)(struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 4c9e9975684f..4f061ddf8470 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -11,7 +11,7 @@
>
> #ifdef CONFIG_SMP
> static int
> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int cpu, int flags)
> {
> return task_cpu(p); /* stop tasks as never migrate */
> }
> --
> 2.24.0
>

2020-04-16 10:29:42

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched: Kill select_task_rq()'s sd_flag parameter


Hi Vincent,

Thanks for taking a look at this.

On 16/04/20 08:42, Vincent Guittot wrote:
>> @@ -6622,13 +6622,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>> * preempt must be disabled.
>> */
>> static int
>> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> {
>> + int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> struct sched_domain *tmp, *sd = NULL;
>> int cpu = smp_processor_id();
>> int new_cpu = prev_cpu;
>> int want_affine = 0;
>> - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> + int sd_flag;
>> +
>> + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
>
> You remove a function parameter, which was directly set with the right
> flag, but then you add a switch case to recreate this sd_flag
> internally. Not sure we can say that it's real benefit
>

It is indeed the contentious point of this series (IMO). Still, it has a
few things going for it:

1) only CFS is helped by that extra parameter

2) with patch 9, I need a control flow to pick up the right cached pointer
anyway; the alternative would be to do something like the unsavoury:

DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_flags[3]);
...
update_top_cache_domain()
{
per_cpu(sd_balance_flags[0], cpu) = highest_flag_domain(cpu, SD_BALANCE_EXEC);
per_cpu(sd_balance_flags[1], cpu) = highest_flag_domain(cpu, SD_BALANCE_FORK);
per_cpu(sd_balance_flags[2], cpu) = highest_flag_domain(cpu, SD_BALANCE_WAKE);
}
...
select_task_rq_fair()
{
// Whatever sort of shady constant time conversion you can think of;
// this maps EXEC to index 0, FORK to 1, TTWU to 2
int index = !!(wake_flags & WF_FORK) + 2 * !!(wake_flags & WF_TTWU);
sd_flag = SD_BALANCE_EXEC << index;
sd = per_cpu(sd_balance_flags[index], cpu);
}

>> + case WF_TTWU:
>> + sd_flag = SD_BALANCE_WAKE;
>> + break;
>> + case WF_FORK:
>> + sd_flag = SD_BALANCE_FORK;
>> + break;
>> + default:
>> + sd_flag = SD_BALANCE_EXEC;
>> + }
>>
>> if (sd_flag & SD_BALANCE_WAKE) {
>> record_wakee(p);

2020-04-16 11:06:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()

On Wed, Apr 15, 2020 at 10:05:03PM +0100, Valentin Schneider wrote:
> Valentin Schneider (9):
> sched/fair: find_idlest_group(): Remove unused sd_flag parameter
> sched/debug: Make sd->flags sysctl read-only
> sched: Remove checks against SD_LOAD_BALANCE
> sched/topology: Kill SD_LOAD_BALANCE
> sched: Add WF_TTWU, WF_EXEC wakeup flags

How about I queue the first 5, and you rework these last few?

> sched: Kill select_task_rq()'s sd_flag parameter
> sched/fair: Dissociate wakeup decisions from SD flag value
> sched/fair: Split select_task_rq_fair want_affine logic
> sched/topology: Define and use shortcut pointers for wakeup sd_flag scan


2020-04-16 11:08:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()

On Thu, Apr 16, 2020 at 12:58:28PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 10:05:03PM +0100, Valentin Schneider wrote:
> > Valentin Schneider (9):
> > sched/fair: find_idlest_group(): Remove unused sd_flag parameter
> > sched/debug: Make sd->flags sysctl read-only
> > sched: Remove checks against SD_LOAD_BALANCE
> > sched/topology: Kill SD_LOAD_BALANCE
> > sched: Add WF_TTWU, WF_EXEC wakeup flags
>
> How about I queue two first 5, and you rework these last few?

Argh, 4 of course - that 5th patch doesn't make much sense if we have to
rework those flags like I proposed.

> > sched: Kill select_task_rq()'s sd_flag parameter
> > sched/fair: Dissociate wakeup decisions from SD flag value
> > sched/fair: Split select_task_rq_fair want_affine logic
> > sched/topology: Define and use shortcut pointers for wakeup sd_flag scan
>
>

2020-04-16 11:14:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched: Kill select_task_rq()'s sd_flag parameter

On Thu, Apr 16, 2020 at 09:42:36AM +0200, Vincent Guittot wrote:
> On Wed, 15 Apr 2020 at 23:05, Valentin Schneider
> > @@ -6622,13 +6622,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > * preempt must be disabled.
> > */
> > static int
> > +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > {
> > + int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> > struct sched_domain *tmp, *sd = NULL;
> > int cpu = smp_processor_id();
> > int new_cpu = prev_cpu;
> > int want_affine = 0;
> > - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> > + int sd_flag;
> > +
> > + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
>
> You remove a function parameter, which was directly set with the right
> flag, but then you add a switch case to recreate this sd_flag
> internally. Not sure we can say that it's real benefit
>
> > + case WF_TTWU:
> > + sd_flag = SD_BALANCE_WAKE;
> > + break;
> > + case WF_FORK:
> > + sd_flag = SD_BALANCE_FORK;
> > + break;
> > + default:
> > + sd_flag = SD_BALANCE_EXEC;
> > + }

Agreed, that's a bit yuck, how about something like so instead:


--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -11,10 +11,12 @@
*/
#ifdef CONFIG_SMP

+/* First nibble of SD_flag is shared with WF_flag */
#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
#define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
#define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
+
#define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
#define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
#define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6635,16 +6635,8 @@ select_task_rq_fair(struct task_struct *
int want_affine = 0;
int sd_flag;

- switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
- case WF_TTWU:
- sd_flag = SD_BALANCE_WAKE;
- break;
- case WF_FORK:
- sd_flag = SD_BALANCE_FORK;
- break;
- default:
- sd_flag = SD_BALANCE_EXEC;
- }
+ /* SD_flags and WF_flags share the first nibble */
+ sd_flag = wake_flags & 0xf;

if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1685,11 +1685,12 @@ static inline int task_on_rq_migrating(s
/*
* Wake flags
*/
-#define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */
-#define WF_TTWU 0x02 /* Regular task wakeup */
-#define WF_FORK 0x04 /* Child wakeup after fork */
-#define WF_EXEC 0x08 /* "Fake" wakeup at exec */
-#define WF_MIGRATED 0x10 /* Internal use, task got migrated */
+#define WF_EXEC 0x02 /* SD_BALANCE_EXEC */
+#define WF_FORK 0x04 /* SD_BALANCE_FORK */
+#define WF_TTWU 0x08 /* SD_BALANCE_WAKE */
+
+#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED 0x20 /* Internal use, task got migrated */

/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution

2020-04-16 11:41:45

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()


On 16/04/20 12:00, Peter Zijlstra wrote:
> On Thu, Apr 16, 2020 at 12:58:28PM +0200, Peter Zijlstra wrote:
>> On Wed, Apr 15, 2020 at 10:05:03PM +0100, Valentin Schneider wrote:
>> > Valentin Schneider (9):
>> > sched/fair: find_idlest_group(): Remove unused sd_flag parameter
>> > sched/debug: Make sd->flags sysctl read-only
>> > sched: Remove checks against SD_LOAD_BALANCE
>> > sched/topology: Kill SD_LOAD_BALANCE
>> > sched: Add WF_TTWU, WF_EXEC wakeup flags
>>
>> How about I queue two first 5, and you rework these last few?
>
> Argh, 4 ofcourse, that 5th patch doesn't make much sense if we have to
> rework those flags like I proposed.
>

Was about to comment on that :) Sounds good to me!

>> > sched: Kill select_task_rq()'s sd_flag parameter
>> > sched/fair: Dissociate wakeup decisions from SD flag value
>> > sched/fair: Split select_task_rq_fair want_affine logic
>> > sched/topology: Define and use shortcut pointers for wakeup sd_flag scan
>>
>>

2020-04-16 13:06:05

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()

On Thu, 16 Apr 2020 at 13:01, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Apr 16, 2020 at 12:58:28PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 15, 2020 at 10:05:03PM +0100, Valentin Schneider wrote:
> > > Valentin Schneider (9):
> > > sched/fair: find_idlest_group(): Remove unused sd_flag parameter
> > > sched/debug: Make sd->flags sysctl read-only
> > > sched: Remove checks against SD_LOAD_BALANCE
> > > sched/topology: Kill SD_LOAD_BALANCE
> > > sched: Add WF_TTWU, WF_EXEC wakeup flags
> >
> > How about I queue two first 5, and you rework these last few?
>
> Argh, 4 ofcourse, that 5th patch doesn't make much sense if we have to
> rework those flags like I proposed.

Looks good to me too

>
> > > sched: Kill select_task_rq()'s sd_flag parameter
> > > sched/fair: Dissociate wakeup decisions from SD flag value
> > > sched/fair: Split select_task_rq_fair want_affine logic
> > > sched/topology: Define and use shortcut pointers for wakeup sd_flag scan
> >
> >

2020-04-16 17:26:00

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] sched: Kill select_task_rq()'s sd_flag parameter


On 16/04/20 11:47, Peter Zijlstra wrote:
> On Thu, Apr 16, 2020 at 09:42:36AM +0200, Vincent Guittot wrote:
>> On Wed, 15 Apr 2020 at 23:05, Valentin Schneider
>> > @@ -6622,13 +6622,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>> > * preempt must be disabled.
>> > */
>> > static int
>> > +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> > {
>> > + int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> > struct sched_domain *tmp, *sd = NULL;
>> > int cpu = smp_processor_id();
>> > int new_cpu = prev_cpu;
>> > int want_affine = 0;
>> > - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> > + int sd_flag;
>> > +
>> > + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
>>
>> You remove a function parameter, which was directly set with the right
>> flag, but then you add a switch case to recreate this sd_flag
>> internally. Not sure we can say that it's real benefit
>>
>> > + case WF_TTWU:
>> > + sd_flag = SD_BALANCE_WAKE;
>> > + break;
>> > + case WF_FORK:
>> > + sd_flag = SD_BALANCE_FORK;
>> > + break;
>> > + default:
>> > + sd_flag = SD_BALANCE_EXEC;
>> > + }
>
> Agreed, that's a bit yuck, how about something like so instead:
>

Aligning the SD_* and WF_* flags sure makes it simpler; I just wasn't
daring enough to do that. I suppose we'll want some BUILD_BUG_ON() checks
just for good measure.
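
Something like the below (untested sketch, using the names from your diff)
should do, dropped inside any function that can see both sets of flags:

    BUILD_BUG_ON(WF_EXEC != SD_BALANCE_EXEC);
    BUILD_BUG_ON(WF_FORK != SD_BALANCE_FORK);
    BUILD_BUG_ON(WF_TTWU != SD_BALANCE_WAKE);

That way the build breaks if the two flag sets ever drift apart.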

Also, it doesn't directly solve the switch case of 9/9, but I may get out
of it with some hackery such as what I suggested in my reply to Vincent.

>
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -11,10 +11,12 @@
> */
> #ifdef CONFIG_SMP
>
> +/* First nibble of SD_flag is shared with WF_flag */
> #define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
> #define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
> #define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
> #define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
> +
> #define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
> #define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
> #define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6635,16 +6635,8 @@ select_task_rq_fair(struct task_struct *
> int want_affine = 0;
> int sd_flag;
>
> - switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
> - case WF_TTWU:
> - sd_flag = SD_BALANCE_WAKE;
> - break;
> - case WF_FORK:
> - sd_flag = SD_BALANCE_FORK;
> - break;
> - default:
> - sd_flag = SD_BALANCE_EXEC;
> - }
> + /* SD_flags and WF_flags share the first nibble */
> + sd_flag = wake_flags & 0xf;
>
> if (sd_flag & SD_BALANCE_WAKE) {
> record_wakee(p);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1685,11 +1685,12 @@ static inline int task_on_rq_migrating(s
> /*
> * Wake flags
> */
> -#define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */
> -#define WF_TTWU 0x02 /* Regular task wakeup */
> -#define WF_FORK 0x04 /* Child wakeup after fork */
> -#define WF_EXEC 0x08 /* "Fake" wakeup at exec */
> -#define WF_MIGRATED 0x10 /* Internal use, task got migrated */
> +#define WF_EXEC 0x02 /* SD_BALANCE_EXEC */
> +#define WF_FORK 0x04 /* SD_BALANCE_FORK */
> +#define WF_TTWU 0x08 /* SD_BALANCE_WAKE */
> +
> +#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
> +#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
>
> /*
> * To aid in avoiding the subversion of "niceness" due to uneven distribution

2020-05-01 18:24:26

by tip-bot2 for Valentin Schneider

[permalink] [raw]
Subject: [tip: sched/core] sched/fair: find_idlest_group(): Remove unused sd_flag parameter

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 45da27732b0b9b7a04696653065d5e6037bc5ac0
Gitweb: https://git.kernel.org/tip/45da27732b0b9b7a04696653065d5e6037bc5ac0
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 15 Apr 2020 22:05:04 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 30 Apr 2020 20:14:39 +02:00

sched/fair: find_idlest_group(): Remove unused sd_flag parameter

The last use of that parameter was removed by commit

57abff067a08 ("sched/fair: Rework find_idlest_group()")

Get rid of the parameter.

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63419c6..617ca44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5825,8 +5825,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
}

static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag);
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);

/*
* find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
@@ -5909,7 +5908,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
continue;
}

- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_idlest_group(sd, p, cpu);
if (!group) {
sd = sd->child;
continue;
@@ -8681,8 +8680,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
* Assumes p is allowed on at least one CPU in sd.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
struct sg_lb_stats local_sgs, tmp_sgs;

2020-05-01 18:25:51

by tip-bot2 for Valentin Schneider

[permalink] [raw]
Subject: [tip: sched/core] sched/topology: Kill SD_LOAD_BALANCE

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 36c5bdc4387056af3840adb4478c752faeb9d15e
Gitweb: https://git.kernel.org/tip/36c5bdc4387056af3840adb4478c752faeb9d15e
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 15 Apr 2020 22:05:07 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 30 Apr 2020 20:14:39 +02:00

sched/topology: Kill SD_LOAD_BALANCE

That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched/topology.h | 29 ++++++++++++++---------------
kernel/sched/topology.c | 3 +--
2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 95253ad..fb11091 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -11,21 +11,20 @@
*/
#ifdef CONFIG_SMP

-#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
-#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
-#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
-#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
-#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
-#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
-#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
-#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
+#define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
+#define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
+#define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
+#define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
+#define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
+#define SD_SHARE_POWERDOMAIN 0x0080 /* Domain members share power domain */
+#define SD_SHARE_PKG_RESOURCES 0x0100 /* Domain members share CPU pkg resources */
+#define SD_SERIALIZE 0x0200 /* Only a single load balancing instance */
+#define SD_ASYM_PACKING 0x0400 /* Place busy groups earlier in the domain */
+#define SD_PREFER_SIBLING 0x0800 /* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP 0x1000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x2000 /* cross-node balancing */

#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a9dc34a..1d7b446 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1341,8 +1341,7 @@ sd_init(struct sched_domain_topology_level *tl,

.cache_nice_tries = 0,

- .flags = 1*SD_LOAD_BALANCE
- | 1*SD_BALANCE_NEWIDLE
+ .flags = 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
| 0*SD_BALANCE_WAKE
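
A note for readers tracking the renumbering above: the SD_* flags form a
bitmask, so all that matters is that each value stays a distinct bit;
removing SD_LOAD_BALANCE simply shifts the remaining bits down by one. A
minimal standalone sketch (not kernel code; flag names copied from the
post-patch header) of the 1*FLAG/0*FLAG composition idiom that sd_init()
uses:

#include <stdio.h>

#define SD_BALANCE_NEWIDLE	0x0001
#define SD_BALANCE_EXEC		0x0002
#define SD_BALANCE_FORK		0x0004
#define SD_BALANCE_WAKE		0x0008

int main(void)
{
	/* 1*FLAG includes a flag, 0*FLAG documents its absence. */
	unsigned int flags = 1*SD_BALANCE_NEWIDLE
			   | 1*SD_BALANCE_EXEC
			   | 1*SD_BALANCE_FORK
			   | 0*SD_BALANCE_WAKE;

	printf("balance on wake? %d\n", !!(flags & SD_BALANCE_WAKE)); /* 0 */
	return 0;
}

With SD_LOAD_BALANCE gone, every such mask simply has one fewer always-set
bit.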

2020-05-01 18:27:14

by tip-bot2 for Valentin Schneider

[permalink] [raw]
Subject: [tip: sched/core] sched: Remove checks against SD_LOAD_BALANCE

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e669ac8ab952df2f07dee1e1efbf40647d6de332
Gitweb: https://git.kernel.org/tip/e669ac8ab952df2f07dee1e1efbf40647d6de332
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 15 Apr 2020 22:05:06 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 30 Apr 2020 20:14:39 +02:00

sched: Remove checks against SD_LOAD_BALANCE

The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init(). By making the sched_domain->flags sysctl interface read-only, we
have removed the last piece of code that could clear that flag - as such,
it will now always be present. Rather than keep carrying it along, we
can work towards getting rid of it entirely.

cpusets don't need it because they can make CPUs be attached to the NULL
domain (e.g. cpuset with sched_load_balance=0), or to a partitioned
root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
sched_load_balance=1).

isolcpus applies the same "trick": isolated CPUs are explicitly taken out of
the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.

Remove the checks against SD_LOAD_BALANCE.

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 14 ++------------
kernel/sched/topology.c | 28 +++++++++-------------------
2 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 617ca44..4b959c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6649,9 +6649,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

rcu_read_lock();
for_each_domain(cpu, tmp) {
- if (!(tmp->flags & SD_LOAD_BALANCE))
- break;
-
/*
* If both 'cpu' and 'prev_cpu' are part of this domain,
* cpu is a valid SD_WAKE_AFFINE target.
@@ -9790,9 +9787,8 @@ static int active_load_balance_cpu_stop(void *data)
/* Search for an sd spanning us and the target CPU. */
rcu_read_lock();
for_each_domain(target_cpu, sd) {
- if ((sd->flags & SD_LOAD_BALANCE) &&
- cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
- break;
+ if (cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
+ break;
}

if (likely(sd)) {
@@ -9881,9 +9877,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
}
max_cost += sd->max_newidle_lb_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
/*
* Stop the load balance at this level. There is another
* CPU in our sched group which is doing load balancing more
@@ -10472,9 +10465,6 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int continue_balancing = 1;
u64 t0, domain_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
break;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757..a9dc34a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -33,14 +33,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_clear(groupmask);

printk(KERN_DEBUG "%*s domain-%d: ", level, "", level);
-
- if (!(sd->flags & SD_LOAD_BALANCE)) {
- printk("does not load-balance\n");
- if (sd->parent)
- printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
- return -1;
- }
-
printk(KERN_CONT "span=%*pbl level=%s\n",
cpumask_pr_args(sched_domain_span(sd)), sd->name);

@@ -151,8 +143,7 @@ static int sd_degenerate(struct sched_domain *sd)
return 1;

/* Following flags need at least 2 groups */
- if (sd->flags & (SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
+ if (sd->flags & (SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
@@ -183,15 +174,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)

/* Flags needing groups don't count if only 1 group in parent */
if (parent->groups == parent->groups->next) {
- pflags &= ~(SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
- SD_BALANCE_FORK |
- SD_BALANCE_EXEC |
- SD_ASYM_CPUCAPACITY |
- SD_SHARE_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ pflags &= ~(SD_BALANCE_NEWIDLE |
+ SD_BALANCE_FORK |
+ SD_BALANCE_EXEC |
+ SD_ASYM_CPUCAPACITY |
+ SD_SHARE_CPUCAPACITY |
+ SD_SHARE_PKG_RESOURCES |
+ SD_PREFER_SIBLING |
+ SD_SHARE_POWERDOMAIN);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
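
To see why dropping these checks is safe, recall that sd_init() set
SD_LOAD_BALANCE unconditionally, so the removed tests could never fire. A
self-contained toy model of the bottom-up walk that for_each_domain()
performs - simplified types, not kernel code - makes the dead branch
visible:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct sched_domain. */
struct toy_domain {
	unsigned int flags;
	struct toy_domain *parent;
};

#define TOY_LOAD_BALANCE	0x0001	/* stand-in for SD_LOAD_BALANCE */

int main(void)
{
	struct toy_domain die = { .flags = TOY_LOAD_BALANCE, .parent = NULL };
	struct toy_domain mc  = { .flags = TOY_LOAD_BALANCE, .parent = &die };

	/* Mirrors the child-to-parent traversal order of for_each_domain(). */
	for (struct toy_domain *sd = &mc; sd; sd = sd->parent) {
		if (!(sd->flags & TOY_LOAD_BALANCE))
			continue;	/* dead: the flag was always set */
		printf("would load-balance at this level\n");
	}
	return 0;
}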

2020-05-01 18:28:43

by tip-bot2 for Valentin Schneider

[permalink] [raw]
Subject: [tip: sched/core] sched/debug: Make sd->flags sysctl read-only

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9818427c6270a9ce8c52c8621026fe9cebae0f92
Gitweb: https://git.kernel.org/tip/9818427c6270a9ce8c52c8621026fe9cebae0f92
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 15 Apr 2020 22:05:05 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 30 Apr 2020 20:14:39 +02:00

sched/debug: Make sd->flags sysctl read-only

Writing to the sysctl of a sched_domain->flags directly updates the value of
the field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to doesn't have the relevant flag set anymore).

Explicit domain walks that check for flags will be affected by
the write, but this won't be in sync with the cached pointers, which will
still point to the domains that were cached at the last sched_domain
build.

In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers on
write, but this does not seem to be worth the trouble. Make it
read-only.

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/debug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index b3ac1c1..c6cc02a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -258,7 +258,7 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
set_table_entry(&table[2], "busy_factor", &sd->busy_factor, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[4], "cache_nice_tries", &sd->cache_nice_tries, sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0644, proc_dointvec_minmax);
+ set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0444, proc_dointvec_minmax);
set_table_entry(&table[6], "max_newidle_lb_cost", &sd->max_newidle_lb_cost, sizeof(long), 0644, proc_doulongvec_minmax);
set_table_entry(&table[7], "name", sd->name, CORENAME_MAX_SIZE, 0444, proc_dostring);
/* &table[8] is terminator */
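
The staleness described above can be sketched with the same kind of toy
model (simplified types, not kernel code): a pointer cached at domain-build
time keeps naming a domain even after a direct write strips the flag it was
cached for, which is roughly what a rebuild via update_top_cache_domain()
would have had to repair:

#include <stdio.h>
#include <stddef.h>

struct toy_domain {
	unsigned int flags;
	struct toy_domain *parent;
};

#define TOY_WAKE_AFFINE	0x0010	/* stand-in for SD_WAKE_AFFINE */

/* Rough analogue of highest_flag_domain(): remember the highest domain
 * in the hierarchy that has the given flag set. */
static struct toy_domain *cache_domain(struct toy_domain *sd, unsigned int flag)
{
	struct toy_domain *hsd = NULL;

	for (; sd; sd = sd->parent)
		if (sd->flags & flag)
			hsd = sd;
	return hsd;
}

int main(void)
{
	struct toy_domain die = { .flags = TOY_WAKE_AFFINE, .parent = NULL };
	struct toy_domain mc  = { .flags = TOY_WAKE_AFFINE, .parent = &die };
	struct toy_domain *cached = cache_domain(&mc, TOY_WAKE_AFFINE);

	die.flags &= ~TOY_WAKE_AFFINE;	/* what a 0644 sysctl write allowed */

	/* Prints 0: the cached pointer still names 'die', but the flag it
	 * was cached for is gone - the cache is stale. */
	printf("cached domain still wake-affine? %d\n",
	       !!(cached->flags & TOY_WAKE_AFFINE));
	return 0;
}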