2020-03-11 18:17:26

by Valentin Schneider

Subject: [PATCH v2 0/9] sched: Streamline select_task_rq() & select_task_rq_fair()

I've been staring at select_task_rq_fair() for some time now, and have come
to "hate" the usage of the sd_flag parameter. It is used as both an
indicator of the wakeup type (ttwu/fork/exec) and as a domain walk search
target. CFS is the only class doing this, the other classes just need the
wakeup type but get passed a domain flag instead.

This series gets rid of select_task_rq()'s sd_flag parameter and also tries
to optimize select_task_rq_fair().

This is based on tip/sched/core at:
a0f03b617c3b ("sched/numa: Stop an exhastive search if a reasonable swap candidate or idle CPU is found")

Patches
=======

o Patch 1 is a simple dead parameter cleanup
o Patches 2-4 get rid of SD_LOAD_BALANCE
o Patches 5-6 involve getting rid of the sd_flag parameter for
select_task_rq().
o Patch 7 is an extra cleanup in the select_task_rq_fair() region.
o Patches 8-9 split the want_affine vs !want_affine paths of
select_task_rq_fair(), which unearths a small optimization. Sort of a
single patch split in two for the sake of review.

Testing
=======

Testing was done against a slightly older tip/sched/core at:
25ac227a25ac ("sched/fair: Remove wake_cap()")

I got annoyed by the variance in my 500 iteration runs, so I scripted
something to run batches of 5k iterations. It looks pretty stable from one
batch to another. I also stared at some boxplots to convince myself I
wasn't needlessly making things worse - you too can stare at them here [1].

Note: the 'X%' stats are the percentiles, so 50% is the 50th percentile.
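
For reference, the tables below can be reproduced with something along these
lines (a Python/pandas sketch, not the actual script; the result file names
are made up for the example):

  import pandas as pd
  import matplotlib.pyplot as plt

  # One measurement per line (hackbench runtime or sysbench event count),
  # one file per kernel under test. File names are hypothetical.
  base  = pd.read_csv("results-base.txt",  header=None)[0]
  patch = pd.read_csv("results-patch.txt", header=None)[0]

  stats = pd.DataFrame({
          "-PATCH": base.describe(percentiles=[.50, .75, .99]),
          "+PATCH": patch.describe(percentiles=[.50, .75, .99]),
  })
  stats["DELTA"] = (stats["+PATCH"] / stats["-PATCH"] - 1.0) * 100
  print(stats)

  # Boxplots similar to the ones in [1]
  plt.boxplot([base, patch], labels=["-PATCH", "+PATCH"], vert=False)
  plt.show()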

Juno r0
-------

2+4 big.LITTLE. SD levels are {MC, DIE}.

Hackbench
~~~~~~~~~

15000 iterations of
$ hackbench
(lower is better):

| | -PATCH | +PATCH | DELTA |
|-------+----------+----------+--------|
| mean | 0.622235 | 0.618834 | -0.55% |
| std | 0.018121 | 0.017579 | -2.99% |
| min | 0.571000 | 0.573000 | +0.35% |
| 50% | 0.620000 | 0.617000 | -0.48% |
| 75% | 0.633000 | 0.629000 | -0.63% |
| 99% | 0.674000 | 0.669000 | -0.74% |
| max | 0.818000 | 0.756000 | -7.58% |

The boxplot shows a single outlier to the very left for both commits; these
are the minimums reported above. Other than that, +PATCH has somewhat lower
outliers on the right-hand side: worst cases are a tad better.

Sysbench
~~~~~~~~

15000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=6 run
(higher is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 15318.954000 | 15628.416933 | +2.02% |
| std | 235.466202 | 205.162730 | -12.87% |
| min | 13025.000000 | 13554.000000 | +4.06% |
| 50% | 15366.000000 | 15681.000000 | +2.05% |
| 75% | 15497.000000 | 15765.000000 | +1.73% |
| 99% | 15651.000000 | 15893.000000 | +1.55% |
| max | 15716.000000 | 15972.000000 | +1.63% |

That's overall a tad better.

Dual-socket Xeon E5
-------------------

Each socket is 10 cores w/ SMT2 - 40 CPUs total. SD levels are
{SMT, MC, NUMA}.

Hackbench
~~~~~~~~~

25000 iterations of
$ hackbench -l 1000
(lower is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+--------|
| mean | 0.946312 | 0.944685 | -0.17% |
| std | 0.006419 | 0.006447 | +0.44% |
| min | 0.906000 | 0.897000 | -0.99% |
| 50% | 0.947000 | 0.945000 | -0.21% |
| 75% | 0.950000 | 0.949000 | -0.11% |
| 99% | 0.959000 | 0.958000 | -0.10% |
| max | 0.988000 | 0.967000 | -2.13% |

The boxplot shows that the min improvement is some sort of fluke - it's a
single point standing out on the left. The mean *is* slightly lowered,
which most likely comes from +PATCH having fewer high-value outliers.

I looked into some statistical tests, but my samples' distribution isn't a
normal distribution (which is a requirement for most of them). By
construction, according to [2], it actually can't be, since hackbench
outputs the maximum of a set of random variables. We could instead use the
average of all sender/receiver pairs, or even the individual time taken by
each pair; that being said, I don't think the values produced by the pairs
can be seen as independent variables, given that there will be more than
one task per CPU.

Wilcoxon's test [3] gives me a p-value of ~1e-182, so there *is* a
significant difference between the two datasets, but it does not say if the
difference is in the mean, variance, or any other parameter of the
distribution.
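
Roughly, this is how such a p-value can be computed (again a sketch, reusing
the made-up result files from the earlier snippet; wilcoxon() is a paired
test, so both files need the same number of iterations):

  import pandas as pd
  from scipy.stats import wilcoxon

  base  = pd.read_csv("results-base.txt",  header=None)[0]
  patch = pd.read_csv("results-patch.txt", header=None)[0]

  # Two-sided paired test; a small p-value only says the two distributions
  # differ, not in which parameter they differ.
  stat, pvalue = wilcoxon(base, patch)
  print(stat, pvalue)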

Sysbench
~~~~~~~~

25000 iterations of
$ sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=40 run
(higher is better):

| | -PATCH | +PATCH | DELTA |
|-------+--------------+--------------+---------|
| mean | 23937.937560 | 24280.668640 | +1.43% |
| std | 547.139948 | 484.963639 | -11.36% |
| min | 21526.000000 | 21917.000000 | +1.82% |
| 50% | 24032.000000 | 24374.000000 | +1.42% |
| 75% | 24355.000000 | 24638.000000 | +1.16% |
| 99% | 24869.010000 | 25084.000000 | +0.86% |
| max | 25410.000000 | 25623.000000 | +0.84% |

As with the Juno, that's overall a tad better.

Takeaway
--------

The TL;DR for those fancy stats seems to be: it's hard to say much about
hackbench, and sysbench likes it a bit. The important thing for me is to
not introduce regressions with my stupid change, and AFAICT that is the case.

Links
=====

[1]: https://htmlpreview.github.io/?https://gist.githubusercontent.com/valschneider/433b3772d1776c52214dd05be2ab2f03/raw/316fbd9f774fa381c60731511c881a3360111563/streamline_v2_bplots.html
[2]: https://en.wikipedia.org/wiki/Fisher%E2%80%93Tippett%E2%80%93Gnedenko_theorem
[3]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon

Revisions
=========

v1 -> v2
--------
o Removed the 'RFC' tag
o Made the sd_flags sysctl read-only
o Removed the SD_LOAD_BALANCE flag
o Cleaned up ugly changes thanks to the above

Valentin Schneider (9):
sched/fair: find_idlest_group(): Remove unused sd_flag parameter
sched/debug: Make sd->flags sysctl read-only
sched: Remove checks against SD_LOAD_BALANCE
sched/topology: Kill SD_LOAD_BALANCE
sched: Add WF_TTWU, WF_EXEC wakeup flags
sched: Kill select_task_rq()'s sd_flag parameter
sched/fair: Dissociate wakeup decisions from SD flag value
sched/fair: Split select_task_rq_fair want_affine logic
sched/topology: Define and use shortcut pointers for wakeup sd_flag
scan

include/linux/sched/topology.h | 29 +++++++------
kernel/sched/core.c | 10 ++---
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 75 +++++++++++++++++++---------------
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 4 +-
kernel/sched/sched.h | 13 ++++--
kernel/sched/stop_task.c | 2 +-
kernel/sched/topology.c | 43 +++++++++----------
10 files changed, 98 insertions(+), 86 deletions(-)

--
2.24.0


2020-03-11 18:17:36

by Valentin Schneider

Subject: [PATCH v2 1/9] sched/fair: find_idlest_group(): Remove unused sd_flag parameter

The last use of that parameter was removed by commit

57abff067a08 ("sched/fair: Rework find_idlest_group()")

Get rid of the parameter.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc968669aea..1c3311277fb3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5782,8 +5782,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
}

static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag);
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);

/*
* find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
@@ -5866,7 +5865,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
continue;
}

- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_idlest_group(sd, p, cpu);
if (!group) {
sd = sd->child;
continue;
@@ -8624,8 +8623,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
* Assumes p is allowed on at least one CPU in sd.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
struct sg_lb_stats local_sgs, tmp_sgs;
--
2.24.0

2020-03-11 18:17:45

by Valentin Schneider

Subject: [PATCH v2 3/9] sched: Remove checks against SD_LOAD_BALANCE

Potential users of that flag could have been cpusets and isolcpus.

cpusets don't need it because they define exclusive (i.e. non-overlapping)
domain spans, see cpuset.cpu_exclusive and cpuset.sched_load_balance.
If such a cpuset contains a single CPU, it will have the NULL domain
attached to it. If it contains several CPUs, none of their domains will
extend beyond the span of the cpuset.

isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.

The sched_domain sysctl interface was the only way to clear that flag, and
it has just been made read-only. Since sd_init() sets it unconditionally,
remove the checks.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 14 ++------------
kernel/sched/topology.c | 28 +++++++++-------------------
2 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c3311277fb3..f8eb950fbefd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6609,9 +6609,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

rcu_read_lock();
for_each_domain(cpu, tmp) {
- if (!(tmp->flags & SD_LOAD_BALANCE))
- break;
-
/*
* If both 'cpu' and 'prev_cpu' are part of this domain,
* cpu is a valid SD_WAKE_AFFINE target.
@@ -9723,9 +9720,8 @@ static int active_load_balance_cpu_stop(void *data)
/* Search for an sd spanning us and the target CPU. */
rcu_read_lock();
for_each_domain(target_cpu, sd) {
- if ((sd->flags & SD_LOAD_BALANCE) &&
- cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
- break;
+ if (cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
+ break;
}

if (likely(sd)) {
@@ -9814,9 +9810,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
}
max_cost += sd->max_newidle_lb_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
/*
* Stop the load balance at this level. There is another
* CPU in our sched group which is doing load balancing more
@@ -10405,9 +10398,6 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int continue_balancing = 1;
u64 t0, domain_cost;

- if (!(sd->flags & SD_LOAD_BALANCE))
- continue;
-
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
break;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 00911884b7e7..79a85827be2f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -33,14 +33,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
cpumask_clear(groupmask);

printk(KERN_DEBUG "%*s domain-%d: ", level, "", level);
-
- if (!(sd->flags & SD_LOAD_BALANCE)) {
- printk("does not load-balance\n");
- if (sd->parent)
- printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
- return -1;
- }
-
printk(KERN_CONT "span=%*pbl level=%s\n",
cpumask_pr_args(sched_domain_span(sd)), sd->name);

@@ -151,8 +143,7 @@ static int sd_degenerate(struct sched_domain *sd)
return 1;

/* Following flags need at least 2 groups */
- if (sd->flags & (SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
+ if (sd->flags & (SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
@@ -183,15 +174,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)

/* Flags needing groups don't count if only 1 group in parent */
if (parent->groups == parent->groups->next) {
- pflags &= ~(SD_LOAD_BALANCE |
- SD_BALANCE_NEWIDLE |
- SD_BALANCE_FORK |
- SD_BALANCE_EXEC |
- SD_ASYM_CPUCAPACITY |
- SD_SHARE_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ pflags &= ~(SD_BALANCE_NEWIDLE |
+ SD_BALANCE_FORK |
+ SD_BALANCE_EXEC |
+ SD_ASYM_CPUCAPACITY |
+ SD_SHARE_CPUCAPACITY |
+ SD_SHARE_PKG_RESOURCES |
+ SD_PREFER_SIBLING |
+ SD_SHARE_POWERDOMAIN);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
--
2.24.0

2020-03-11 18:17:50

by Valentin Schneider

Subject: [PATCH v2 6/9] sched: Kill select_task_rq()'s sd_flag parameter

Only select_task_rq_fair() uses that parameter to do an actual domain
search; other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. This cleans up the API a
bit, but adds an extra conversion layer within select_task_rq_fair().

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/core.c | 10 +++++-----
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 18 +++++++++++++++---
kernel/sched/idle.c | 2 +-
kernel/sched/rt.c | 4 ++--
kernel/sched/sched.h | 2 +-
kernel/sched/stop_task.c | 2 +-
7 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8e6f38073ab3..f9c1ea734061 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2094,12 +2094,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
*/
static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
{
lockdep_assert_held(&p->pi_lock);

if (p->nr_cpus_allowed > 1)
- cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+ cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
else
cpu = cpumask_any(p->cpus_ptr);

@@ -2618,7 +2618,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
atomic_dec(&task_rq(p)->nr_iowait);
}

- cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+ cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
@@ -2951,7 +2951,7 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+ __set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -3492,7 +3492,7 @@ void sched_exec(void)
int dest_cpu;

raw_spin_lock_irqsave(&p->pi_lock, flags);
- dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43323f875cb9..53216fffb871 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1599,12 +1599,12 @@ static void yield_task_dl(struct rq *rq)
static int find_later_rq(struct task_struct *task);

static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{
struct task_struct *curr;
struct rq *rq;

- if (sd_flag != SD_BALANCE_WAKE)
+ if (!(flags & WF_TTWU))
goto out;

rq = cpu_rq(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8eb950fbefd..920c4e2ce0d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6575,7 +6575,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
- * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
+ * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
*
* Balances load by selecting the idlest CPU in the idlest group, or under
@@ -6586,13 +6586,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
+ int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
- int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ int sd_flag;
+
+ switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
+ case WF_TTWU:
+ sd_flag = SD_BALANCE_WAKE;
+ break;
+ case WF_FORK:
+ sd_flag = SD_BALANCE_FORK;
+ break;
+ default:
+ sd_flag = SD_BALANCE_EXEC;
+ }

if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..e9c6a27f0647 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -367,7 +367,7 @@ void cpu_startup_entry(enum cpuhp_state state)

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55a4a5042292..d2f0931c7d62 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1426,14 +1426,14 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
struct task_struct *curr;
struct rq *rq;
bool test;

/* For anything but wake ups, just return the task_cpu */
- if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+ if (!(flags & (WF_TTWU | WF_FORK)))
goto out;

rq = cpu_rq(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0d1ab04622d0..ad2106245e12 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1737,7 +1737,7 @@ struct sched_class {

#ifdef CONFIG_SMP
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
- int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

void (*task_woken)(struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4c9e9975684f..4f061ddf8470 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int flags)
{
return task_cpu(p); /* stop tasks as never migrate */
}
--
2.24.0

2020-03-11 18:17:51

by Valentin Schneider

Subject: [PATCH v2 5/9] sched: Add WF_TTWU, WF_EXEC wakeup flags

To remove the sd_flag parameter of select_task_rq(), we need another way of
encoding wakeup kinds. There already is a WF_FORK flag; add the missing
ones.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/sched.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0caf394dd4..0d1ab04622d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1654,11 +1654,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}

/*
- * wake flags
+ * Wake flags
*/
#define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */
-#define WF_FORK 0x02 /* Child wakeup after fork */
-#define WF_MIGRATED 0x4 /* Internal use, task got migrated */
+#define WF_TTWU 0x02 /* Regular task wakeup */
+#define WF_FORK 0x04 /* Child wakeup after fork */
+#define WF_EXEC 0x08 /* "Fake" wakeup at exec */
+#define WF_MIGRATED 0x10 /* Internal use, task got migrated */

/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution
--
2.24.0

2020-03-11 18:18:32

by Valentin Schneider

Subject: [PATCH v2 4/9] sched/topology: Kill SD_LOAD_BALANCE

That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/topology.h | 29 ++++++++++++++---------------
kernel/sched/topology.c | 3 +--
2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f341163fedc9..8de2f9744569 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -11,21 +11,20 @@
*/
#ifdef CONFIG_SMP

-#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
-#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
-#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
-#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
-#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
-#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
-#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
-#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
+#define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
+#define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
+#define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
+#define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
+#define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
+#define SD_SHARE_POWERDOMAIN 0x0080 /* Domain members share power domain */
+#define SD_SHARE_PKG_RESOURCES 0x0100 /* Domain members share CPU pkg resources */
+#define SD_SERIALIZE 0x0200 /* Only a single load balancing instance */
+#define SD_ASYM_PACKING 0x0400 /* Place busy groups earlier in the domain */
+#define SD_PREFER_SIBLING 0x0800 /* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP 0x1000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x2000 /* cross-node balancing */

#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79a85827be2f..6077b23f9723 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1333,8 +1333,7 @@ sd_init(struct sched_domain_topology_level *tl,

.cache_nice_tries = 0,

- .flags = 1*SD_LOAD_BALANCE
- | 1*SD_BALANCE_NEWIDLE
+ .flags = 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
| 0*SD_BALANCE_WAKE
--
2.24.0

2020-03-11 18:19:04

by Valentin Schneider

Subject: [PATCH v2 8/9] sched/fair: Split select_task_rq_fair want_affine logic

The domain loop within select_task_rq_fair() depends on a few bits of
input, namely the SD flag we're looking for and whether we want_affine.

For !want_affine, the domain loop will walk up the hierarchy to reach the
highest domain with the requested sd_flag (SD_BALANCE_{WAKE, FORK, EXEC})
set. In other words, that's a call to highest_flag_domain().
Note that this is static information w.r.t. a given SD hierarchy, so we can
cache it - but that comes in a later patch to ease reviewing.

For want_affine, we'll walk up the hierarchy to reach the first domain with
SD_WAKE_AFFINE set that spans the task's prev_cpu. We still save a pointer
to the last visited domain that had the requested
sd_flag set, which means that if we fail to go through the affine
condition (e.g. no domain had SD_WAKE_AFFINE) we'll use the same SD as we
would have found if we had !want_affine.

Split the domain loop in !want_affine and want_affine paths. As it is,
this leads to two domain walks instead of a single one, but stay tuned for
the next patch.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f98dac0c7f82..a6fca6817e92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6620,26 +6620,33 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
}

rcu_read_lock();
+
+ sd = highest_flag_domain(cpu, sd_flag);
+
+ /*
+ * If !want_affine, we just look for the highest domain where
+ * sd_flag is set.
+ */
+ if (!want_affine)
+ goto scan;
+
+ /*
+ * Otherwise we look for the lowest domain with SD_WAKE_AFFINE and that
+ * spans both 'cpu' and 'prev_cpu'.
+ */
for_each_domain(cpu, tmp) {
- /*
- * If both 'cpu' and 'prev_cpu' are part of this domain,
- * cpu is a valid SD_WAKE_AFFINE target.
- */
- if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+ if ((tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
if (cpu != prev_cpu)
new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

- sd = NULL; /* Prefer wake_affine over balance flags */
+ /* Prefer wake_affine over SD lookup */
+ sd = NULL;
break;
}
-
- if (tmp->flags & sd_flag)
- sd = tmp;
- else if (!want_affine)
- break;
}

+scan:
if (unlikely(sd)) {
/* Slow path */
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
--
2.24.0

2020-03-11 18:19:13

by Valentin Schneider

Subject: [PATCH v2 7/9] sched/fair: Dissociate wakeup decisions from SD flag value

The CFS wakeup code will only ever go through EAS / its fast path on
"regular" wakeups (i.e. not on forks or execs). These are currently gated
by a check against 'sd_flag', which would be SD_BALANCE_WAKE at wakeup.

However, we now have a flag that explicitly tells us whether a wakeup is a
"regular" one, so hinge those conditions on that flag instead.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 920c4e2ce0d1..f98dac0c7f82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6606,7 +6606,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
sd_flag = SD_BALANCE_EXEC;
}

- if (sd_flag & SD_BALANCE_WAKE) {
+ if (wake_flags & WF_TTWU) {
record_wakee(p);

if (sched_energy_enabled()) {
@@ -6643,9 +6643,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (unlikely(sd)) {
/* Slow path */
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
- } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
+ } else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
-
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

if (want_affine)
--
2.24.0

2020-03-11 18:19:17

by Valentin Schneider

Subject: [PATCH v2 9/9] sched/topology: Define and use shortcut pointers for wakeup sd_flag scan

Reworking select_task_rq_fair()'s domain walk exposed that !want_affine
wakeups only look for highest sched_domain with the required sd_flag
set. This is something we can cache at sched domain build time to slightly
optimize select_task_rq_fair(). Note that this isn't a "free" optimization:
it costs us 3 pointers per CPU.

Add cached per-CPU pointers for the highest domains with SD_BALANCE_WAKE,
SD_BALANCE_EXEC and SD_BALANCE_FORK. Use them in select_task_rq_fair().

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 25 +++++++++++++------------
kernel/sched/sched.h | 3 +++
kernel/sched/topology.c | 12 ++++++++++++
3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6fca6817e92..40fb97062157 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6595,17 +6595,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
int want_affine = 0;
int sd_flag;

- switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
- case WF_TTWU:
- sd_flag = SD_BALANCE_WAKE;
- break;
- case WF_FORK:
- sd_flag = SD_BALANCE_FORK;
- break;
- default:
- sd_flag = SD_BALANCE_EXEC;
- }
-
if (wake_flags & WF_TTWU) {
record_wakee(p);

@@ -6621,7 +6610,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)

rcu_read_lock();

- sd = highest_flag_domain(cpu, sd_flag);
+ switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
+ case WF_TTWU:
+ sd_flag = SD_BALANCE_WAKE;
+ sd = rcu_dereference(per_cpu(sd_balance_wake, cpu));
+ break;
+ case WF_FORK:
+ sd_flag = SD_BALANCE_FORK;
+ sd = rcu_dereference(per_cpu(sd_balance_fork, cpu));
+ break;
+ default:
+ sd_flag = SD_BALANCE_EXEC;
+ sd = rcu_dereference(per_cpu(sd_balance_exec, cpu));
+ }

/*
* If !want_affine, we just look for the highest domain where
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ad2106245e12..3a0e38f2f713 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1393,6 +1393,9 @@ DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_wake);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_fork);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_exec);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6077b23f9723..0270252a964b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -610,6 +610,9 @@ DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_wake);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_fork);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_exec);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
@@ -636,6 +639,15 @@ static void update_top_cache_domain(int cpu)
sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);

+ sd = highest_flag_domain(cpu, SD_BALANCE_WAKE);
+ rcu_assign_pointer(per_cpu(sd_balance_wake, cpu), sd);
+
+ sd = highest_flag_domain(cpu, SD_BALANCE_FORK);
+ rcu_assign_pointer(per_cpu(sd_balance_fork, cpu), sd);
+
+ sd = highest_flag_domain(cpu, SD_BALANCE_EXEC);
+ rcu_assign_pointer(per_cpu(sd_balance_exec, cpu), sd);
+
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);

--
2.24.0

2020-03-11 18:19:20

by Valentin Schneider

Subject: [PATCH v2 2/9] sched/debug: Make sd->flags sysctl read-only

Writing to the sysctl of a sched_domain->flags directly updates the value of
the field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to doesn't have the relevant flag set anymore).

Explicit domain walks that check for flags will be affected by
the write, but this won't be in sync with the cached pointers which will
still point to the domains that were cached at the last sched_domain
build.

In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers when
written to, but this does not seem to be worth the trouble. Make it
read-only.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/debug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8331bc04aea2..d86f7c21feff 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -258,7 +258,7 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
set_table_entry(&table[2], "busy_factor", &sd->busy_factor, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct, sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[4], "cache_nice_tries", &sd->cache_nice_tries, sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0644, proc_dointvec_minmax);
+ set_table_entry(&table[5], "flags", &sd->flags, sizeof(int), 0444, proc_dointvec_minmax);
set_table_entry(&table[6], "max_newidle_lb_cost", &sd->max_newidle_lb_cost, sizeof(long), 0644, proc_doulongvec_minmax);
set_table_entry(&table[7], "name", sd->name, CORENAME_MAX_SIZE, 0444, proc_dostring);
/* &table[8] is terminator */
--
2.24.0

2020-03-19 10:26:05

by Dietmar Eggemann

Subject: Re: [PATCH v2 1/9] sched/fair: find_idlest_group(): Remove unused sd_flag parameter

On 11.03.20 19:15, Valentin Schneider wrote:
> The last use of that parameter was removed by commit
>
> 57abff067a08 ("sched/fair: Rework find_idlest_group()")
>
> Get rid of the parameter.
>
> Signed-off-by: Valentin Schneider <[email protected]>

[...]

Reviewed-by: Dietmar Eggemann <[email protected]>

2020-03-19 10:26:28

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/9] sched/debug: Make sd->flags sysctl read-only

On 11.03.20 19:15, Valentin Schneider wrote:
> Writing to the sysctl of a sched_domain->flags directly updates the value of
> the field, and goes nowhere near update_top_cache_domain(). This means that
> the cached domain pointers can end up containing stale data (e.g. the
> domain pointed to doesn't have the relevant flag set anymore).
>
> Explicit domain walks that check for flags will be affected by
> the write, but this won't be in sync with the cached pointers which will
> still point to the domains that were cached at the last sched_domain
> build.
>
> In other words, writing to this interface is playing a dangerous game. It
> could be made to trigger an update of the cached sched_domain pointers when
> written to, but this does not seem to be worth the trouble. Make it
> read-only.

As long as I don't change SD flags for which cached SD pointers exist
(SD_SHARE_PKG_RESOURCES, SD_NUMA, SD_ASYM_PACKING or
SD_ASYM_CPUCAPACITY) the write-able interface still could make some sense.

E.g. by enabling SD_BALANCE_WAKE on the fly, I can force !want_affine
wakees into slow path.

The question is, do people use the writable flags interface to tweak
select_task_rq_fair() behavior in this regard?

2020-03-19 10:29:54

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/9] sched: Remove checks against SD_LOAD_BALANCE

On 11.03.20 19:15, Valentin Schneider wrote:
> Potential users of that flag could have been cpusets and isolcpus.
>
> cpusets don't need it because they define exclusive (i.e. non-overlapping)
> domain spans, see cpuset.cpu_exclusive and cpuset.sched_load_balance.
> If such a cpuset contains a single CPU, it will have the NULL domain
> attached to it. If it contains several CPUs, none of their domains will
> extend beyond the span of the cpuset.

There are also non-exclusive cpusets but I assume the statement is the same.

CPUs which are only used in cpusets with cpuset.sched_load_balance=0 are
attached to the NULL sched-domain.

There seems to be no code which alters the SD_LOAD_BALANCE flag.

[...]

2020-03-19 10:30:18

by Dietmar Eggemann

Subject: Re: [PATCH v2 4/9] sched/topology: Kill SD_LOAD_BALANCE

On 11.03.20 19:15, Valentin Schneider wrote:
> That flag is set unconditionally in sd_init(), and no one checks for it
> anymore. Remove it.

Why not merge 3/9 and 4/9 ?

[...]

> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index f341163fedc9..8de2f9744569 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -11,21 +11,20 @@
> */
> #ifdef CONFIG_SMP
>
> -#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
> -#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */

[...]

> -#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
> -#define SD_NUMA 0x4000 /* cross-node balancing */
> +#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */

IMHO, changing the values of the remaining SD flags will break lots of
userspace scripts.

[...]

2020-03-19 10:31:48

by Dietmar Eggemann

Subject: Re: [PATCH v2 8/9] sched/fair: Split select_task_rq_fair want_affine logic

On 11.03.20 19:16, Valentin Schneider wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f98dac0c7f82..a6fca6817e92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6620,26 +6620,33 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> }
>
> rcu_read_lock();
> +
> + sd = highest_flag_domain(cpu, sd_flag);
> +
> + /*
> + * If !want_affine, we just look for the highest domain where
> + * sd_flag is set.
> + */
> + if (!want_affine)
> + goto scan;
> +
> + /*
> + * Otherwise we look for the lowest domain with SD_WAKE_AFFINE and that
> + * spans both 'cpu' and 'prev_cpu'.
> + */
> for_each_domain(cpu, tmp) {
> - /*
> - * If both 'cpu' and 'prev_cpu' are part of this domain,
> - * cpu is a valid SD_WAKE_AFFINE target.
> - */
> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> + if ((tmp->flags & SD_WAKE_AFFINE) &&
> cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> if (cpu != prev_cpu)
> new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
>
> - sd = NULL; /* Prefer wake_affine over balance flags */
> + /* Prefer wake_affine over SD lookup */

I assume that 'balance flags' stands for (wakeup) load balance, i.e.
find_idlest_xxx() path. So why change it?


[...]

2020-03-19 10:48:04

by Dietmar Eggemann

Subject: Re: [PATCH v2 9/9] sched/topology: Define and use shortcut pointers for wakeup sd_flag scan

On 11.03.20 19:16, Valentin Schneider wrote:
> Reworking select_task_rq_fair()'s domain walk exposed that !want_affine
> wakeups only look for highest sched_domain with the required sd_flag
> set. This is something we can cache at sched domain build time to slightly
> optimize select_task_rq_fair(). Note that this isn't a "free" optimization:
> it costs us 3 pointers per CPU.
>
> Add cached per-CPU pointers for the highest domains with SD_BALANCE_WAKE,
> SD_BALANCE_EXEC and SD_BALANCE_FORK. Use them in select_task_rq_fair().
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 25 +++++++++++++------------
> kernel/sched/sched.h | 3 +++
> kernel/sched/topology.c | 12 ++++++++++++
> 3 files changed, 28 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a6fca6817e92..40fb97062157 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6595,17 +6595,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> int want_affine = 0;
> int sd_flag;
>
> - switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
> - case WF_TTWU:
> - sd_flag = SD_BALANCE_WAKE;
> - break;
> - case WF_FORK:
> - sd_flag = SD_BALANCE_FORK;
> - break;
> - default:
> - sd_flag = SD_BALANCE_EXEC;
> - }
> -
> if (wake_flags & WF_TTWU) {
> record_wakee(p);
>
> @@ -6621,7 +6610,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>
> rcu_read_lock();
>
> - sd = highest_flag_domain(cpu, sd_flag);
> + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
> + case WF_TTWU:
> + sd_flag = SD_BALANCE_WAKE;
> + sd = rcu_dereference(per_cpu(sd_balance_wake, cpu));

IMHO, since we hard-code 0*SD_BALANCE_WAKE in sd_init(), sd would always
be NULL, so !want_affine (i.e. wake_wide()) would still go sis().

SD_BALANCE_WAKE is not a topology-related sd_flag so it can't be set from
outside. Since the sd->flags sysctl is now read-only, wouldn't this case
be redundant?

[...]

2020-03-19 12:07:54

by Valentin Schneider

Subject: Re: [PATCH v2 2/9] sched/debug: Make sd->flags sysctl read-only


On Thu, Mar 19 2020, Dietmar Eggemann wrote:
> On 11.03.20 19:15, Valentin Schneider wrote:
>> Writing to the sysctl of a sched_domain->flags directly updates the value of
>> the field, and goes nowhere near update_top_cache_domain(). This means that
>> the cached domain pointers can end up containing stale data (e.g. the
>> domain pointed to doesn't have the relevant flag set anymore).
>>
>> Explicit domain walks that check for flags will be affected by
>> the write, but this won't be in sync with the cached pointers which will
>> still point to the domains that were cached at the last sched_domain
>> build.
>>
>> In other words, writing to this interface is playing a dangerous game. It
>> could be made to trigger an update of the cached sched_domain pointers when
>> written to, but this does not seem to be worth the trouble. Make it
>> read-only.
>
> As long as I don't change SD flags for which cached SD pointers exist
> (SD_SHARE_PKG_RESOURCES, SD_NUMA, SD_ASYM_PACKING or
> SD_ASYM_CPUCAPACITY) the write-able interface still could make some sense.
>
> E.g. by enabling SD_BALANCE_WAKE on the fly, I can force !want_affine
> wakees into slow path.
>

True, although since there is no explicit differentiation between the
cached and !cached flags, this is still a landmined interface.

> The question is, do people use the writable flags interface to tweak
> select_task_rq_fair() behavior in this regard?

I did try asking around on IRC (which admittedly is just a small subset
of people) but didn't find anyone.

2020-03-19 12:08:39

by Valentin Schneider

Subject: Re: [PATCH v2 3/9] sched: Remove checks against SD_LOAD_BALANCE


On Thu, Mar 19 2020, Dietmar Eggemann wrote:
> On 11.03.20 19:15, Valentin Schneider wrote:
>> Potential users of that flag could have been cpusets and isolcpus.
>>
>> cpusets don't need it because they define exclusive (i.e. non-overlapping)
>> domain spans, see cpuset.cpu_exclusive and cpuset.sched_load_balance.
>> If such a cpuset contains a single CPU, it will have the NULL domain
>> attached to it. If it contains several CPUs, none of their domains will
>> extend beyond the span of the cpuset.
>
> There are also non-exclusive cpusets but I assume the statement is the same.
>

Right, AFAICT the cpuset.cpu_exclusive thing doesn't actually impact the
sched_domains, only how CPUs can be allocated to cpusets. The important
bits are:

- the CPUs spanned by the cpuset
- Whether we have cpuset.sched_load_balance

> CPUs which are only used in cpusets with cpuset.sched_load_balance=0 are
> attached to the NULL sched-domain.
>

Indeed, I was only considering the case with root.sched_load_balance=0
where the siblings have cpuset.sched_load_balance=1, in which case
we get separate root domains. If !root cpusets have
sched_load_balance=0, related CPUs will only get the NULL domain
attached to them.

> There seems to be no code which alters the SD_LOAD_BALANCE flag.
>

The sysctl interface would've been the last possible modifier.

Your comments make me realize that changelog isn't great, what about the
following?

---

The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init(). By making the sched_domain->flags syctl interface read-only, we
have removed the last piece of code that could clear that flag - as such,
it will now be always present. Rather than to keep carrying it along, we
can work towards getting rid of it entirely.

cpusets don't need it because they can make CPUs be attached to the NULL
domain (e.g. cpuset with sched_load_balance=0), or to a partitionned
root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
sched_load_balance=1).

isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.

Remove the checks against SD_LOAD_BALANCE.

2020-03-19 12:09:14

by Valentin Schneider

Subject: Re: [PATCH v2 8/9] sched/fair: Split select_task_rq_fair want_affine logic


On Thu, Mar 19 2020, Dietmar Eggemann wrote:
>> + */
>> for_each_domain(cpu, tmp) {
>> - /*
>> - * If both 'cpu' and 'prev_cpu' are part of this domain,
>> - * cpu is a valid SD_WAKE_AFFINE target.
>> - */
>> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
>> + if ((tmp->flags & SD_WAKE_AFFINE) &&
>> cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
>> if (cpu != prev_cpu)
>> new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
>>
>> - sd = NULL; /* Prefer wake_affine over balance flags */
>> + /* Prefer wake_affine over SD lookup */
>
> I assume that 'balance flags' stands for (wakeup) load balance, i.e.
> find_idlest_xxx() path. So why change it?
>
>

You mean the comment part, right? I was hoping to clarify it a bit - if
we go through the want_affine condition, we'll override whatever SD we
picked with the highest_flag_domain() lookup (and the cached version in
9/9). Hence me referring to the SD lookup there.

> [...]

2020-03-19 12:09:15

by Valentin Schneider

Subject: Re: [PATCH v2 4/9] sched/topology: Kill SD_LOAD_BALANCE


On Thu, Mar 19 2020, Dietmar Eggemann wrote:
> On 11.03.20 19:15, Valentin Schneider wrote:
>> That flag is set unconditionally in sd_init(), and no one checks for it
>> anymore. Remove it.
>
> Why not merge 3/9 and 4/9 ?
>
> [...]
>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index f341163fedc9..8de2f9744569 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -11,21 +11,20 @@
>> */
>> #ifdef CONFIG_SMP
>>
>> -#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
>> -#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
>
> [...]
>
>> -#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
>> -#define SD_NUMA 0x4000 /* cross-node balancing */
>> +#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
>
> IMHO, changing the values of the remaining SD flags will break lots of
> userspace scripts.
>

True, and that includes some of my own scripts. That's also part of why
I have this 3/9 and 4/9 split: 4/9 is the externally visible part. If
deemed necessary, we could keep the definition of SD_LOAD_BALANCE but
kill all of its uses.

Alternatively, I was thinking that we could leverage [1] to make
/proc/sys/kernel/sched_domain/cpu*/domain*/flags print
e.g. comma-separated flag names rather than flag values. That way the
userland scripts would no longer have to contain some
{flag_value : flag_name} translation.

[1]: https://lkml.kernel.org/r/[email protected]

> [...]

2020-03-19 12:24:00

by Valentin Schneider

Subject: Re: [PATCH v2 9/9] sched/topology: Define and use shortcut pointers for wakeup sd_flag scan


On Thu, Mar 19 2020, Dietmar Eggemann wrote:
> On 11.03.20 19:16, Valentin Schneider wrote:
>> @@ -6621,7 +6610,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>
>> rcu_read_lock();
>>
>> - sd = highest_flag_domain(cpu, sd_flag);
>> + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
>> + case WF_TTWU:
>> + sd_flag = SD_BALANCE_WAKE;
>> + sd = rcu_dereference(per_cpu(sd_balance_wake, cpu));
>
> IMHO, since we hard-code 0*SD_BALANCE_WAKE in sd_init(), sd would always
> be NULL, so !want_affine (i.e. wake_wide()) would still go sis().
>
> SD_BALANCE_WAKE is not a topology-related sd_flag so it can't be set from
> outside. Since the sd->flags sysctl is now read-only, wouldn't this case
> be redundant?
>

On a purely mainline kernel, yes, 'sd' will always be NULL for
wakeups. I'm playing a bit more conservative here with SD_BALANCE_WAKE,
as I could see some folks using that flag in some ~franken~ downstream
kernels (I am not aware of any, however).

Also, to be fair we've only very recently gotten rid of the last
SD_BALANCE_WAKE user (asymmetric CPU capacity topologies), so I felt
like keeping it around for a bit before entirely killing it would be a
sane thing to do.

> [...]

2020-03-23 14:29:16

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/9] sched: Remove checks against SD_LOAD_BALANCE

On 19.03.20 13:05, Valentin Schneider wrote:
>
> On Thu, Mar 19 2020, Dietmar Eggemann wrote:
>> On 11.03.20 19:15, Valentin Schneider wrote:

[...]

> Your comments make me realize that changelog isn't great, what about the
> following?
>
> ---
>
> The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> sd_init(). By making the sched_domain->flags syctl interface read-only, we
> have removed the last piece of code that could clear that flag - as such,
> it will now be always present. Rather than to keep carrying it along, we
> can work towards getting rid of it entirely.
>
> cpusets don't need it because they can make CPUs be attached to the NULL
> domain (e.g. cpuset with sched_load_balance=0), or to a partitionned

s/partitionned/partitioned

> root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
> system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
> sched_load_balance=1).
>
> isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
> the sched_domain rebuild (using housekeeping_cpumask()), so they get the
> NULL domain treatment as well.
>
> Remove the checks against SD_LOAD_BALANCE.

Sounds better to me:

Essentially, I was referring to examples like:

Hikey960 - 2x4

(A) exclusive cpusets:

root@h960:/sys/fs/cgroup/cpuset# mkdir cs1
root@h960:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
root@h960:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
root@h960:/sys/fs/cgroup/cpuset# echo 0-2 > cs1/cpuset.cpus
root@h960:/sys/fs/cgroup/cpuset# mkdir cs2
root@h960:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
root@h960:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
root@h960:/sys/fs/cgroup/cpuset# echo 3-5 > cs2/cpuset.cpus
root@h960:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance

root@h960:/proc/sys/kernel# tree -d sched_domain

├── cpu0
│ └── domain0
├── cpu1
│ └── domain0
├── cpu2
│ └── domain0
├── cpu3
│ └── domain0
├── cpu4
│ ├── domain0
│ └── domain1
├── cpu5
│ ├── domain0
│ └── domain1
├── cpu6
└── cpu7

(B) non-exclusive cpuset:

root@h960:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance

[ 8661.240385] CPU1 attaching NULL sched-domain.
[ 8661.244802] CPU2 attaching NULL sched-domain.
[ 8661.249255] CPU3 attaching NULL sched-domain.
[ 8661.253623] CPU4 attaching NULL sched-domain.
[ 8661.257989] CPU5 attaching NULL sched-domain.
[ 8661.262363] CPU6 attaching NULL sched-domain.
[ 8661.266730] CPU7 attaching NULL sched-domain.

root@h960:/sys/fs/cgroup/cpuset# mkdir cs1
root@h960:/sys/fs/cgroup/cpuset# echo 0-5 > cs1/cpuset.cpus

root@h960:/proc/sys/kernel# tree -d sched_domain

├── cpu0
│ ├── domain0
│ └── domain1
├── cpu1
│ ├── domain0
│ └── domain1
├── cpu2
│ ├── domain0
│ └── domain1
├── cpu3
│ ├── domain0
│ └── domain1
├── cpu4
│ ├── domain0
│ └── domain1
├── cpu5
│ ├── domain0
│ └── domain1
├── cpu6
└── cpu7

2020-03-23 17:18:42

by Valentin Schneider

Subject: Re: [PATCH v2 3/9] sched: Remove checks against SD_LOAD_BALANCE


On Mon, Mar 23 2020, Dietmar Eggemann wrote:

> On 19.03.20 13:05, Valentin Schneider wrote:
>>
>> On Thu, Mar 19 2020, Dietmar Eggemann wrote:
>>> On 11.03.20 19:15, Valentin Schneider wrote:
>
> [...]
>
>> Your comments make me realize that changelog isn't great, what about the
>> following?
>>
>> ---
>>
>> The SD_LOAD_BALANCE flag is set unconditionally for all domains in
>> sd_init(). By making the sched_domain->flags syctl interface read-only, we
>> have removed the last piece of code that could clear that flag - as such,
>> it will now be always present. Rather than to keep carrying it along, we
>> can work towards getting rid of it entirely.
>>
>> cpusets don't need it because they can make CPUs be attached to the NULL
>> domain (e.g. cpuset with sched_load_balance=0), or to a partitionned
>
> s/partitionned/partitioned
>
>> root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
>> system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
>> sched_load_balance=1).
>>
>> isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
>> the sched_domain rebuild (using housekeeping_cpumask()), so they get the
>> NULL domain treatment as well.
>>
>> Remove the checks against SD_LOAD_BALANCE.
>
> Sounds better to me:
>
> Essentially, I was referring to examples like:
>
> Hikey960 - 2x4
>
> (A) exclusive cpusets:
>
> root@h960:/sys/fs/cgroup/cpuset# mkdir cs1
> root@h960:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
> root@h960:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
> root@h960:/sys/fs/cgroup/cpuset# echo 0-2 > cs1/cpuset.cpus
> root@h960:/sys/fs/cgroup/cpuset# mkdir cs2
> root@h960:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
> root@h960:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
> root@h960:/sys/fs/cgroup/cpuset# echo 3-5 > cs2/cpuset.cpus
> root@h960:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>

AFAICT you don't even have to bother with cpuset.cpu_exclusive if you
only care about the end-result wrt sched_domains.

> root@h960:/proc/sys/kernel# tree -d sched_domain
>
> ├── cpu0
> │ └── domain0
> ├── cpu1
> │ └── domain0
> ├── cpu2
> │ └── domain0
> ├── cpu3
> │ └── domain0
> ├── cpu4
> │ ├── domain0
> │ └── domain1
> ├── cpu5
> │ ├── domain0
> │ └── domain1
> ├── cpu6
> └── cpu7
>
> (B) non-exclusive cpuset:
>
> root@h960:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>
> [ 8661.240385] CPU1 attaching NULL sched-domain.
> [ 8661.244802] CPU2 attaching NULL sched-domain.
> [ 8661.249255] CPU3 attaching NULL sched-domain.
> [ 8661.253623] CPU4 attaching NULL sched-domain.
> [ 8661.257989] CPU5 attaching NULL sched-domain.
> [ 8661.262363] CPU6 attaching NULL sched-domain.
> [ 8661.266730] CPU7 attaching NULL sched-domain.
>
> root@h960:/sys/fs/cgroup/cpuset# mkdir cs1
> root@h960:/sys/fs/cgroup/cpuset# echo 0-5 > cs1/cpuset.cpus
>
> root@h960:/proc/sys/kernel# tree -d sched_domain
>
> ├── cpu0
> │ ├── domain0
> │ └── domain1
> ├── cpu1
> │ ├── domain0
> │ └── domain1
> ├── cpu2
> │ ├── domain0
> │ └── domain1
> ├── cpu3
> │ ├── domain0
> │ └── domain1
> ├── cpu4
> │ ├── domain0
> │ └── domain1
> ├── cpu5
> │ ├── domain0
> │ └── domain1
> ├── cpu6
> └── cpu7

I think my updated changelog covers those cases, right?