2020-07-01 19:08:23

by Valentin Schneider

Subject: [PATCH v3 0/7] sched: Instrument sched domain flags

Hi,

I've repeatedly stared at an SD flag and asked myself "how should that be
set up in the domain hierarchy anyway?". I figured that if we formalize our
flags zoology a bit, we could also do some runtime assertions on them -
this is what this series is all about.

Patches
=======

The idea is to associate the flags with metaflags that describe how they
should be set in a sched domain hierarchy ("if this SD has it, all its {parents,
children} have it") or how they behave wrt degeneration - details are in the
comments and commit logs.

The good thing is that the debugging bits go away when CONFIG_SCHED_DEBUG isn't
set. The bad thing is that this replaces SD_* flags definitions with some
unsavoury macros. This is mainly because I wanted to avoid having to duplicate
work between declaring the flags and declaring their metaflags.
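
For the unfamiliar: the underlying trick is a plain X-macro. sd_flags.h
contains nothing but SD_FLAG(name, idx, metaflags) entries, and each
consumer defines SD_FLAG to extract what it needs before including the
file. A condensed sketch of what patches 1-2 end up with:

  /* include/linux/sched/sd_flags.h: the single list of flags */
  SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD)

  /* consumer: generate SD_FOO = BIT(idx) */
  #define SD_FLAG(name, idx, mflags) static const unsigned int name = BIT(idx);
  #include <linux/sched/sd_flags.h>
  #undef SD_FLAG

  /* consumer: generate the sd_flag_debug[] metadata table */
  #define SD_FLAG(_name, idx, mflags) [idx] = { .meta_flags = mflags, .name = #_name },
  #include <linux/sched/sd_flags.h>
  #undef SD_FLAG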

o Patches 1-2 set up the SD flag definition groundwork
o Patch 3 enables the hierarchy debugging
o Patch 4 is a derelict SD flag removal
o Patches 5-7 tweak the SD degeneration to leverage metaflags as well (suggested
by Peter).

Note that this builds a compile-time mask using the single most horrendous
macro I have ever presented to the public eye. It might end up being much
better to explicitly build the mask in topology.h, but to be honest I felt
like I had to show that monster around out of morbid fascination.

Revisions
=========

v2 -> v3
--------

o Reworded comment for SD_OVERLAP (it's about the groups, not the domains)

o Added more flags to the SD degeneration mask
o Added generation of an SD flag mask for the degeneration functions (Peter)

RFC -> v2
---------

o Rebased on top of tip/sched/core
o Aligned wording of comments between flags
o Rectified some flag descriptions (Morten)
o Added removal of SD_SHARE_POWERDOMAIN (Morten)

Valentin Schneider (7):
sched/topology: Split out SD_* flags declaration to its own file
sched/topology: Define and assign sched_domain flag metadata
sched/topology: Verify SD_* flags setup when sched_debug is on
arm, sched/topology: Remove SD_SHARE_POWERDOMAIN
sched/topology: Add more flags to the SD degeneration mask
sched/topology: Introduce SD metaflag for flags needing > 1 groups
sched/topology: Use prebuilt SD flag degeneration mask

arch/arm/kernel/topology.c | 2 +-
include/linux/sched/sd_flags.h | 153 +++++++++++++++++++++++++++++++++
include/linux/sched/topology.h | 36 +++++---
kernel/sched/topology.c | 40 +++++----
4 files changed, 197 insertions(+), 34 deletions(-)
create mode 100644 include/linux/sched/sd_flags.h

--
2.27.0


2020-07-01 19:08:31

by Valentin Schneider

Subject: [PATCH v3 3/7] sched/topology: Verify SD_* flags setup when sched_debug is on

Now that we have some description of what we expect the flags layout to
be, we can use that to assert at runtime that the actual layout is sane.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..b3f891db16a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -29,6 +29,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
struct cpumask *groupmask)
{
struct sched_group *group = sd->groups;
+ int flags = sd->flags;

cpumask_clear(groupmask);

@@ -43,6 +44,22 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_ERR "ERROR: domain->groups does not contain CPU%d\n", cpu);
}

+ for (; flags; flags &= flags - 1) {
+ unsigned int idx = __ffs(flags);
+ unsigned int flag = BIT(idx);
+ unsigned int meta_flags = sd_flag_debug[idx].meta_flags;
+
+ if ((meta_flags & SDF_SHARED_CHILD) && sd->child &&
+ !(sd->child->flags & flag))
+ printk(KERN_ERR "ERROR: flag %s set here but not in child\n",
+ sd_flag_debug[idx].name);
+
+ if ((meta_flags & SDF_SHARED_PARENT) && sd->parent &&
+ !(sd->parent->flags & flag))
+ printk(KERN_ERR "ERROR: flag %s set here but not in parent\n",
+ sd_flag_debug[idx].name);
+ }
+
printk(KERN_DEBUG "%*s groups:", level + 1, "");
do {
if (!group) {
--
2.27.0

2020-07-01 19:08:31

by Valentin Schneider

Subject: [PATCH v3 6/7] sched/topology: Introduce SD metaflag for flags needing > 1 groups

In preparation of cleaning up the sd_degenerate*() functions, mark flags
that are only relevant when their associated domain has more than a single
group. With this, build a compile-time mask of those SD flags.
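
To illustrate: each SD_FLAG() entry expands to either BIT(idx) or 0
depending on whether SDF_NEEDS_GROUPS is set, and the trailing 0
terminates the |-chain. Hand-expanded (and abridged), the generated
constant reads roughly as:

  static const int SD_DEGENERATE_GROUPS_MASK =
      (BIT(0)  * 1) | /* SD_BALANCE_NEWIDLE: has SDF_NEEDS_GROUPS */
      (BIT(1)  * 1) | /* SD_BALANCE_EXEC */
      /* ... */
      (BIT(8)  * 0) | /* SD_SERIALIZE: no SDF_NEEDS_GROUPS */
      /* ... */
      (BIT(12) * 0) | /* SD_NUMA */
      0;              /* evaluates to 0xaef with this patch's assignments */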

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/sd_flags.h | 60 ++++++++++++++++++++++------------
include/linux/sched/topology.h | 7 ++++
2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index c0003b252d48..408832193a94 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -8,7 +8,7 @@
#endif

/*
- * Expected flag uses
+ * Hierarchical metaflags
*
* SHARED_CHILD: These flags are meant to be set from the base domain upwards.
* If a domain has this flag set, all of its children should have it set. This
@@ -29,36 +29,50 @@
* certain level (e.g. domain starts spanning CPUs outside of the base CPU's
* socket).
*/
-#define SDF_SHARED_CHILD 0x1
-#define SDF_SHARED_PARENT 0x2
+#define SDF_SHARED_CHILD 0x1
+#define SDF_SHARED_PARENT 0x2
+
+/*
+ * Behavioural metaflags
+ *
+ * NEEDS_GROUPS: These flags are only relevant if the domain they are set on has
+ * more than one group. This is usually for balancing flags (load balancing
+ * involves equalizing a metric between groups), or for flags describing some
+ * shared resource (which would be shared between groups).
+ */
+#define SDF_NEEDS_GROUPS 0x4

/*
* Balance when about to become idle
*
* SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on exec
*
* SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on fork, clone
*
* SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on wakeup
*
* SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Consider waking task on waking CPU.
@@ -71,33 +85,36 @@ SD_FLAG(SD_WAKE_AFFINE, 4, SDF_SHARED_CHILD)
* Domain members have different CPU capacities
*
* SHARED_PARENT: Set from the topmost domain down to the first domain where
- * asymmetry is detected.
+ * asymmetry is detected.
+ * NEEDS_GROUPS: Per-CPU capacity is asymmetric between groups.
*/
-SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)
+SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

/*
* Domain members share CPU capacity (i.e. SMT)
*
* SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
- * CPU capacity.
+ * CPU capacity.
+ * NEEDS_GROUPS: Capacity is shared between groups.
*/
-SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD)
+SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Domain members share CPU package resources (i.e. caches)
*
* SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
- * the same cache(s).
+ * the same cache(s).
+ * NEEDS_GROUPS: Caches are shared between groups.
*/
-SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD)
+SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Only a single load balancing instance
*
* SHARED_PARENT: Set for all NUMA levels above NODE. Could be set from a
- * different level upwards, but it doesn't change that if a domain has this flag
- * set, then all of its parents need to have it too (otherwise the serialization
- * doesn't make sense).
+ * different level upwards, but it doesn't change that if a
+ * domain has this flag set, then all of its parents need to have
+ * it too (otherwise the serialization doesn't make sense).
*/
SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)

@@ -105,16 +122,18 @@ SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)
* Place busy tasks earlier in the domain
*
* SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- * up, but currently assumed to be set from the base domain upwards (see
- * update_top_cache_domain()).
+ * up, but currently assumed to be set from the base domain
+ * upwards (see update_top_cache_domain()).
*/
-SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD)
+SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Prefer to place tasks in a sibling domain
*
* Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
* flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ *
+ * NEEDS_GROUPS: Load balancing flag.
*/
SD_FLAG(SD_PREFER_SIBLING, 10, 0)

@@ -122,8 +141,9 @@ SD_FLAG(SD_PREFER_SIBLING, 10, 0)
* sched_groups of this level overlap
*
* SHARED_PARENT: Set for all NUMA levels above NODE.
+ * NEEDS_GROUPS: Overlaps can only exist with more than one group.
*/
-SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT)
+SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

/*
* cross-node balancing
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fce64dae09af..a6761d5d66dd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -16,6 +16,13 @@
#include <linux/sched/sd_flags.h>
#undef SD_FLAG

+/* Generate a mask of SD flags with the SDF_NEEDS_GROUPS metaflag */
+#define SD_FLAG(name, idx, mflags) (BIT(idx) * (((mflags) & SDF_NEEDS_GROUPS) / SDF_NEEDS_GROUPS)) |
+static const int SD_DEGENERATE_GROUPS_MASK =
+#include<linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
#ifdef CONFIG_SCHED_DEBUG
#define SD_FLAG(_name, idx, mflags) [idx] = {.meta_flags = mflags, .name = #_name},
static const struct {
--
2.27.0

2020-07-01 19:08:36

by Valentin Schneider

Subject: [PATCH v3 4/7] arm, sched/topology: Remove SD_SHARE_POWERDOMAIN

This flag was introduced in 2014 by commit

d77b3ed5c9f8 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")

but AFAIA it was never leveraged by the scheduler. The closest thing I can
think of is EAS caring about frequency domains, and it does that by
leveraging performance domains.

Remove the flag.

Suggested-by: Morten Rasmussen <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/arm/kernel/topology.c | 2 +-
include/linux/sched/sd_flags.h | 20 ++++++--------------
kernel/sched/topology.c | 10 +++-------
3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index b5adaf744630..353f3ee660e4 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -243,7 +243,7 @@ void store_cpu_topology(unsigned int cpuid)

static inline int cpu_corepower_flags(void)
{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+ return SD_SHARE_PKG_RESOURCES;
}

static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index b5a11df0afe4..c0003b252d48 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -83,21 +83,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)
*/
SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD)

-/*
- * Domain members share power domain
- *
- * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
- * the same power domain.
- */
-SD_FLAG(SD_SHARE_POWERDOMAIN, 7, SDF_SHARED_CHILD)
-
/*
* Domain members share CPU package resources (i.e. caches)
*
* SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
* the same cache(s).
*/
-SD_FLAG(SD_SHARE_PKG_RESOURCES, 8, SDF_SHARED_CHILD)
+SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD)

/*
* Only a single load balancing instance
@@ -107,7 +99,7 @@ SD_FLAG(SD_SHARE_PKG_RESOURCES, 8, SDF_SHARED_CHILD)
* set, then all of its parents need to have it too (otherwise the serialization
* doesn't make sense).
*/
-SD_FLAG(SD_SERIALIZE, 9, SDF_SHARED_PARENT)
+SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)

/*
* Place busy tasks earlier in the domain
@@ -116,7 +108,7 @@ SD_FLAG(SD_SERIALIZE, 9, SDF_SHARED_PARENT)
* up, but currently assumed to be set from the base domain upwards (see
* update_top_cache_domain()).
*/
-SD_FLAG(SD_ASYM_PACKING, 10, SDF_SHARED_CHILD)
+SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD)

/*
* Prefer to place tasks in a sibling domain
@@ -124,18 +116,18 @@ SD_FLAG(SD_ASYM_PACKING, 10, SDF_SHARED_CHILD)
* Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
* flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
*/
-SD_FLAG(SD_PREFER_SIBLING, 11, 0)
+SD_FLAG(SD_PREFER_SIBLING, 10, 0)

/*
* sched_groups of this level overlap
*
* SHARED_PARENT: Set for all NUMA levels above NODE.
*/
-SD_FLAG(SD_OVERLAP, 12, SDF_SHARED_PARENT)
+SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT)

/*
* cross-node balancing
*
* SHARED_PARENT: Set for all NUMA levels above NODE.
*/
-SD_FLAG(SD_NUMA, 13, SDF_SHARED_PARENT)
+SD_FLAG(SD_NUMA, 12, SDF_SHARED_PARENT)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b3f891db16a7..6047d491abe9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -165,8 +165,7 @@ static int sd_degenerate(struct sched_domain *sd)
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
SD_ASYM_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_SHARE_POWERDOMAIN)) {
+ SD_SHARE_PKG_RESOURCES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -197,8 +196,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_ASYM_CPUCAPACITY |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
- SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ SD_PREFER_SIBLING);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -1309,7 +1307,6 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
* SD_SHARE_CPUCAPACITY - describes SMT topologies
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
- * SD_SHARE_POWERDOMAIN - describes shared power domain
*
* Odd one out, which beside describing the topology has a quirk also
* prescribes the desired behaviour that goes along with it:
@@ -1320,8 +1317,7 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
(SD_SHARE_CPUCAPACITY | \
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
- SD_ASYM_PACKING | \
- SD_SHARE_POWERDOMAIN)
+ SD_ASYM_PACKING)

static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
--
2.27.0

2020-07-01 19:08:43

by Valentin Schneider

Subject: [PATCH v3 1/7] sched/topology: Split out SD_* flags declaration to its own file

To associate the SD flags with some metadata, we need some more structure
in the way they are declared.

Rather than shove that into a free-standing macro list, move the declarations
into a separate file that can be re-imported with different SD_FLAG
definitions, inspired by what we do with syscalls and unistd.h.

No change in functionality.

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/sd_flags.h | 33 +++++++++++++++++++++++++++++++++
include/linux/sched/topology.h | 18 ++++--------------
2 files changed, 37 insertions(+), 14 deletions(-)
create mode 100644 include/linux/sched/sd_flags.h

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
new file mode 100644
index 000000000000..685bbe736945
--- /dev/null
+++ b/include/linux/sched/sd_flags.h
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * sched-domains (multiprocessor balancing) flag declarations.
+ */
+
+/* Balance when about to become idle */
+SD_FLAG(SD_BALANCE_NEWIDLE, 0)
+/* Balance on exec */
+SD_FLAG(SD_BALANCE_EXEC, 1)
+/* Balance on fork, clone */
+SD_FLAG(SD_BALANCE_FORK, 2)
+/* Balance on wakeup */
+SD_FLAG(SD_BALANCE_WAKE, 3)
+/* Wake task to waking CPU */
+SD_FLAG(SD_WAKE_AFFINE, 4)
+/* Domain members have different CPU capacities */
+SD_FLAG(SD_ASYM_CPUCAPACITY, 5)
+/* Domain members share CPU capacity */
+SD_FLAG(SD_SHARE_CPUCAPACITY, 6)
+/* Domain members share power domain */
+SD_FLAG(SD_SHARE_POWERDOMAIN, 7)
+/* Domain members share CPU pkg resources */
+SD_FLAG(SD_SHARE_PKG_RESOURCES, 8)
+/* Only a single load balancing instance */
+SD_FLAG(SD_SERIALIZE, 9)
+/* Place busy groups earlier in the domain */
+SD_FLAG(SD_ASYM_PACKING, 10)
+/* Prefer to place tasks in a sibling domain */
+SD_FLAG(SD_PREFER_SIBLING, 11)
+/* sched_domains of this level overlap */
+SD_FLAG(SD_OVERLAP, 12)
+/* cross-node balancing */
+SD_FLAG(SD_NUMA, 13)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..f274eca282de 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -11,20 +11,10 @@
*/
#ifdef CONFIG_SMP

-#define SD_BALANCE_NEWIDLE 0x0001 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 0x0002 /* Balance on exec */
-#define SD_BALANCE_FORK 0x0004 /* Balance on fork, clone */
-#define SD_BALANCE_WAKE 0x0008 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0010 /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY 0x0020 /* Domain members have different CPU capacities */
-#define SD_SHARE_CPUCAPACITY 0x0040 /* Domain members share CPU capacity */
-#define SD_SHARE_POWERDOMAIN 0x0080 /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0100 /* Domain members share CPU pkg resources */
-#define SD_SERIALIZE 0x0200 /* Only a single load balancing instance */
-#define SD_ASYM_PACKING 0x0400 /* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING 0x0800 /* Prefer to place tasks in a sibling domain */
-#define SD_OVERLAP 0x1000 /* sched_domains of this level overlap */
-#define SD_NUMA 0x2000 /* cross-node balancing */
+/* Generate SD_FOO = VALUE */
+#define SD_FLAG(name, idx) static const unsigned int name = BIT(idx);
+#include <linux/sched/sd_flags.h>
+#undef SD_FLAG

#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
--
2.27.0

2020-07-01 19:10:00

by Valentin Schneider

Subject: [PATCH v3 5/7] sched/topology: Add more flags to the SD degeneration mask

I don't think it is going to change much in practice, but we were missing
those:

o SD_BALANCE_WAKE: Used just like the other SD_BALANCE_* flags, so also
needs > 1 group.
o SD_ASYM_PACKING: Hinges on load balancing (periodic / wakeup), thus needs
> 1 group to happen.
o SD_OVERLAP: Describes domains with overlapping groups; can't have
overlaps with a single group.

SD_PREFER_SIBLING is as always the odd one out: we currently consider it
in sd_parent_degenerate() but not in sd_degenerate(). It too hinges on load
balancing, and thus won't have any effect when set on a domain with a
single group. Add it to the sd_degenerate() groups mask.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6047d491abe9..fe393b295444 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -161,11 +161,15 @@ static int sd_degenerate(struct sched_domain *sd)

/* Following flags need at least 2 groups */
if (sd->flags & (SD_BALANCE_NEWIDLE |
+ SD_BALANCE_WAKE |
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
+ SD_ASYM_PACKING |
SD_SHARE_CPUCAPACITY |
SD_ASYM_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES)) {
+ SD_SHARE_PKG_RESOURCES |
+ SD_OVERLAP |
+ SD_PREFER_SIBLING)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -191,11 +195,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
/* Flags needing groups don't count if only 1 group in parent */
if (parent->groups == parent->groups->next) {
pflags &= ~(SD_BALANCE_NEWIDLE |
+ SD_BALANCE_WAKE |
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
+ SD_ASYM_PACKING |
SD_ASYM_CPUCAPACITY |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
+ SD_OVERLAP |
SD_PREFER_SIBLING);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
--
2.27.0

2020-07-01 19:10:00

by Valentin Schneider

Subject: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata

There are some expectations regarding how sched domain flags should be laid
out, but none of them are checked or asserted in
sched_domain_debug_one(). After staring at said flags for a while, I've
come to realize they all (except *one*) fall into one of two categories:

- Shared with children: those flags are set from the base CPU domain
upwards. Any domain that has it set will have it set in its children. It
hints at "some property holds true / some behaviour is enabled until this
level".

- Shared with parents: those flags are set from the topmost domain
downwards. Any domain that has it set will have it set in its parents. It
hints at "some property isn't visible / some behaviour is disabled until
this level".

The odd one out is SD_PREFER_SIBLING, which is cleared below levels with
SD_ASYM_CPUCAPACITY. The change was introduced by commit

9c63e84db29b ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")

as it could break misfit migration on some systems. In light of this, we
might want to change it back to make it fit one of the two categories and
fix the issue another way.

Tweak the sched_domain flag declaration to assign each flag an expected
layout, and include the rationale for each flag "meta type" assignment as a
comment. Consolidate the flag metadata into an array; the index of a flag's
metadata can easily be found with log2(flag), IOW __ffs(flag).
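
Concretely:

  /* SD_BALANCE_FORK == BIT(2) == 0x4, so __ffs(SD_BALANCE_FORK) == 2 */
  unsigned int idx = __ffs(SD_BALANCE_FORK);
  /* sd_flag_debug[2] == { .meta_flags = SDF_SHARED_CHILD, .name = "SD_BALANCE_FORK" } */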

Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/sd_flags.h | 164 +++++++++++++++++++++++++++------
include/linux/sched/topology.h | 13 ++-
2 files changed, 148 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 685bbe736945..b5a11df0afe4 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -3,31 +3,139 @@
* sched-domains (multiprocessor balancing) flag declarations.
*/

-/* Balance when about to become idle */
-SD_FLAG(SD_BALANCE_NEWIDLE, 0)
-/* Balance on exec */
-SD_FLAG(SD_BALANCE_EXEC, 1)
-/* Balance on fork, clone */
-SD_FLAG(SD_BALANCE_FORK, 2)
-/* Balance on wakeup */
-SD_FLAG(SD_BALANCE_WAKE, 3)
-/* Wake task to waking CPU */
-SD_FLAG(SD_WAKE_AFFINE, 4)
-/* Domain members have different CPU capacities */
-SD_FLAG(SD_ASYM_CPUCAPACITY, 5)
-/* Domain members share CPU capacity */
-SD_FLAG(SD_SHARE_CPUCAPACITY, 6)
-/* Domain members share power domain */
-SD_FLAG(SD_SHARE_POWERDOMAIN, 7)
-/* Domain members share CPU pkg resources */
-SD_FLAG(SD_SHARE_PKG_RESOURCES, 8)
-/* Only a single load balancing instance */
-SD_FLAG(SD_SERIALIZE, 9)
-/* Place busy groups earlier in the domain */
-SD_FLAG(SD_ASYM_PACKING, 10)
-/* Prefer to place tasks in a sibling domain */
-SD_FLAG(SD_PREFER_SIBLING, 11)
-/* sched_domains of this level overlap */
-SD_FLAG(SD_OVERLAP, 12)
-/* cross-node balancing */
-SD_FLAG(SD_NUMA, 13)
+#ifndef SD_FLAG
+#define SD_FLAG(x, y, z)
+#endif
+
+/*
+ * Expected flag uses
+ *
+ * SHARED_CHILD: These flags are meant to be set from the base domain upwards.
+ * If a domain has this flag set, all of its children should have it set. This
+ * is usually because the flag describes some shared resource (all CPUs in that
+ * domain share the same foobar), or because they are tied to a scheduling
+ * behaviour that we want to disable at some point in the hierarchy for
+ * scalability reasons.
+ *
+ * In those cases it doesn't make sense to have the flag set for a domain but
+ * not have it in (some of) its children: sched domains ALWAYS span their child
+ * domains, so operations done with parent domains will cover CPUs in the lower
+ * child domains.
+ *
+ *
+ * SHARED_PARENT: These flags are meant to be set from the highest domain
+ * downwards. If a domain has this flag set, all of its parents should have it
+ * set. This is usually for topology properties that start to appear above a
+ * certain level (e.g. domain starts spanning CPUs outside of the base CPU's
+ * socket).
+ */
+#define SDF_SHARED_CHILD 0x1
+#define SDF_SHARED_PARENT 0x2
+
+/*
+ * Balance when about to become idle
+ *
+ * SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ */
+SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD)
+
+/*
+ * Balance on exec
+ *
+ * SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ */
+SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD)
+
+/*
+ * Balance on fork, clone
+ *
+ * SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ */
+SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD)
+
+/*
+ * Balance on wakeup
+ *
+ * SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ */
+SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD)
+
+/*
+ * Consider waking task on waking CPU.
+ *
+ * SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ */
+SD_FLAG(SD_WAKE_AFFINE, 4, SDF_SHARED_CHILD)
+
+/*
+ * Domain members have different CPU capacities
+ *
+ * SHARED_PARENT: Set from the topmost domain down to the first domain where
+ * asymmetry is detected.
+ */
+SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)
+
+/*
+ * Domain members share CPU capacity (i.e. SMT)
+ *
+ * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
+ * CPU capacity.
+ */
+SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD)
+
+/*
+ * Domain members share power domain
+ *
+ * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
+ * the same power domain.
+ */
+SD_FLAG(SD_SHARE_POWERDOMAIN, 7, SDF_SHARED_CHILD)
+
+/*
+ * Domain members share CPU package resources (i.e. caches)
+ *
+ * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
+ * the same cache(s).
+ */
+SD_FLAG(SD_SHARE_PKG_RESOURCES, 8, SDF_SHARED_CHILD)
+
+/*
+ * Only a single load balancing instance
+ *
+ * SHARED_PARENT: Set for all NUMA levels above NODE. Could be set from a
+ * different level upwards, but it doesn't change that if a domain has this flag
+ * set, then all of its parents need to have it too (otherwise the serialization
+ * doesn't make sense).
+ */
+SD_FLAG(SD_SERIALIZE, 9, SDF_SHARED_PARENT)
+
+/*
+ * Place busy tasks earlier in the domain
+ *
+ * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
+ * up, but currently assumed to be set from the base domain upwards (see
+ * update_top_cache_domain()).
+ */
+SD_FLAG(SD_ASYM_PACKING, 10, SDF_SHARED_CHILD)
+
+/*
+ * Prefer to place tasks in a sibling domain
+ *
+ * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
+ * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ */
+SD_FLAG(SD_PREFER_SIBLING, 11, 0)
+
+/*
+ * sched_groups of this level overlap
+ *
+ * SHARED_PARENT: Set for all NUMA levels above NODE.
+ */
+SD_FLAG(SD_OVERLAP, 12, SDF_SHARED_PARENT)
+
+/*
+ * cross-node balancing
+ *
+ * SHARED_PARENT: Set for all NUMA levels above NODE.
+ */
+SD_FLAG(SD_NUMA, 13, SDF_SHARED_PARENT)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f274eca282de..fce64dae09af 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -12,10 +12,21 @@
#ifdef CONFIG_SMP

/* Generate SD_FOO = VALUE */
-#define SD_FLAG(name, idx) static const unsigned int name = BIT(idx);
+#define SD_FLAG(name, idx, mflags) static const unsigned int name = BIT(idx);
#include <linux/sched/sd_flags.h>
#undef SD_FLAG

+#ifdef CONFIG_SCHED_DEBUG
+#define SD_FLAG(_name, idx, mflags) [idx] = {.meta_flags = mflags, .name = #_name},
+static const struct {
+ unsigned int meta_flags;
+ char *name;
+} sd_flag_debug[] = {
+#include <linux/sched/sd_flags.h>
+};
+#undef SD_FLAG
+#endif
+
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
{
--
2.27.0

2020-07-01 19:10:16

by Valentin Schneider

Subject: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask

Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
sd_parent_degenerate().

Note that this changes sd_degenerate() somewhat: I'm using the negation of
SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
is equivalent to:

SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA

whereas the current mask for that is simply

SD_WAKE_AFFINE

I played with a few toy NUMA topologies on QEMU and couldn't cause a
different degeneration than what mainline does currently. If that is deemed
too risky, we can go back to using SD_WAKE_AFFINE explicitly.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 24 +++---------------------
1 file changed, 3 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index fe393b295444..a135a0c99b16 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -160,22 +160,13 @@ static int sd_degenerate(struct sched_domain *sd)
return 1;

/* Following flags need at least 2 groups */
- if (sd->flags & (SD_BALANCE_NEWIDLE |
- SD_BALANCE_WAKE |
- SD_BALANCE_FORK |
- SD_BALANCE_EXEC |
- SD_ASYM_PACKING |
- SD_SHARE_CPUCAPACITY |
- SD_ASYM_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_OVERLAP |
- SD_PREFER_SIBLING)) {
+ if (sd->flags & SD_DEGENERATE_GROUPS_MASK) {
if (sd->groups != sd->groups->next)
return 0;
}

/* Following flags don't use groups */
- if (sd->flags & (SD_WAKE_AFFINE))
+ if (sd->flags & ~SD_DEGENERATE_GROUPS_MASK)
return 0;

return 1;
@@ -194,16 +185,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)

/* Flags needing groups don't count if only 1 group in parent */
if (parent->groups == parent->groups->next) {
- pflags &= ~(SD_BALANCE_NEWIDLE |
- SD_BALANCE_WAKE |
- SD_BALANCE_FORK |
- SD_BALANCE_EXEC |
- SD_ASYM_PACKING |
- SD_ASYM_CPUCAPACITY |
- SD_SHARE_CPUCAPACITY |
- SD_SHARE_PKG_RESOURCES |
- SD_OVERLAP |
- SD_PREFER_SIBLING);
+ pflags &= ~SD_DEGENERATE_GROUPS_MASK;
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
--
2.27.0

2020-07-02 12:16:48

by Quentin Perret

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata

Hey Valentin,

On Wednesday 01 Jul 2020 at 20:06:50 (+0100), Valentin Schneider wrote:
> +/*
> + * Domain members have different CPU capacities
> + *
> + * SHARED_PARENT: Set from the topmost domain down to the first domain where
> + * asymmetry is detected.
> + */
> +SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)

Probably not a huge deal, but I don't think this is _exactly_ how
SD_ASYM_CPUCAPACITY was defined originally, nor how the topology
detection code deals with it (IIRC).

That is, IIRC Morten defined SD_ASYM_CPUCAPACITY as the _lowest_ domain
at which all CPU capacities are visible. On all real systems I can think
of, this domain also happens to be the topmost domain, so that might not
be a huge deal and we can probably change that definition to the one you
suggest. But we should perhaps make the matching changes to
asym_cpu_capacity_level()/build_sched_domains() if we're going down that
path?

Thanks,
Quentin

2020-07-02 14:24:01

by Peter Zijlstra

Subject: Re: [PATCH v3 3/7] sched/topology: Verify SD_* flags setup when sched_debug is on

On Wed, Jul 01, 2020 at 08:06:51PM +0100, Valentin Schneider wrote:

> @@ -29,6 +29,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
> struct cpumask *groupmask)
> {
> struct sched_group *group = sd->groups;
> + int flags = sd->flags;

unsigned long flags = sd->flags;

>
> cpumask_clear(groupmask);
>
> @@ -43,6 +44,22 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
> printk(KERN_ERR "ERROR: domain->groups does not contain CPU%d\n", cpu);
> }
>
> + for (; flags; flags &= flags - 1) {
> + unsigned int idx = __ffs(flags);

for_each_set_bit(idx, &flags, SD_MAX_BIT) {

Yes, it's a bit yuck, but far more readable imo.

> + unsigned int flag = BIT(idx);
> + unsigned int meta_flags = sd_flag_debug[idx].meta_flags;
> +
> + if ((meta_flags & SDF_SHARED_CHILD) && sd->child &&
> + !(sd->child->flags & flag))
> + printk(KERN_ERR "ERROR: flag %s set here but not in child\n",
> + sd_flag_debug[idx].name);
> +
> + if ((meta_flags & SDF_SHARED_PARENT) && sd->parent &&
> + !(sd->parent->flags & flag))
> + printk(KERN_ERR "ERROR: flag %s set here but not in parent\n",
> + sd_flag_debug[idx].name);
> + }
> +
> printk(KERN_DEBUG "%*s groups:", level + 1, "");
> do {
> if (!group) {
> --
> 2.27.0
>

2020-07-02 14:31:56

by Valentin Schneider

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata


Hi Quentin,

Thanks for having a look!

On 02/07/20 13:15, Quentin Perret wrote:
> Hey Valentin,
>
> On Wednesday 01 Jul 2020 at 20:06:50 (+0100), Valentin Schneider wrote:
>> +/*
>> + * Domain members have different CPU capacities
>> + *
>> + * SHARED_PARENT: Set from the topmost domain down to the first domain where
>> + * asymmetry is detected.
>> + */
>> +SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)
>
> Probably not a huge deal, but I don't think this is _exactly_ how
> SD_ASYM_CPUCAPACITY was defined originally, nor how the topology
> detection code deals with it (IIRC).
>
> That is, IIRC Morten defined SD_ASYM_CPUCAPACITY as the _lowest_ domain
> at which all CPU capacities are visible. On all real systems I can think
> of, this domain also happens to be the topmost domain, so that might not
> be a huge deal and we can probably change that definition to the one you
> suggest. But we should perhaps make the matching changes to
> asym_cpu_capacity_level()/build_sched_domains() if we're going down that
> path?
>

You're spot on, I was trying to sweep some of it under the carpet...

There an "interesting" quirk of asym_cpu_capacity_level() in that it does
something slightly different than what it says on the tin: it detects
the lowest topology level where *the biggest* CPU capacity is visible by
all CPUs. That works just fine on big.LITTLE, but there are questionable
DynamIQ topologies that could hit some issues.

Consider:

DIE [ ]
MC [ ][ ] <- sd_asym_cpucapacity
0 1 2 3 4 5
L L B B B B

asym_cpu_capacity_level() would pick MC as the asymmetric topology level,
and you can argue either way: it should be DIE, because that's where CPUs 4
and 5 can see a LITTLE, or it should be MC, at least for CPUs 0-3 because
there they see all CPU capacities.

I have a plan on how to fix that, but I haven't been made aware of any
"real" topology that would seriously break there. The moment one does, this
will surface up to the top of my todo-list.

In the meantime, we can make it match the SDF_SHARED_PARENT semantics, and
this actually fixes an issue with solo big CPU clusters (which I
anecdotally found out while first writing this series, and forgot to
include):

--->8
From: Valentin Schneider <[email protected]>
Date: Wed, 16 Oct 2019 18:12:12 +0100
Subject: [PATCH 1/1] sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards

We currently set this flag *only* on domains whose topology level exactly
matches the level where we detect asymmetry (as returned by
asym_cpu_capacity_level()). This is rather problematic.

Say there are two clusters in the system, one with a lone big CPU and the
other with a mix of big and LITTLE CPUs:

DIE [ ]
MC [ ][ ]
0 1 2 3 4
L L B B B

asym_cpu_capacity_level() will figure out that the MC level is the one
where all CPUs can see a CPU of max capacity, and we will thus set
SD_ASYM_CPUCAPACITY at MC level for all CPUs.

That lone big CPU will degenerate its MC domain, since it would be alone in
there, and will end up with just a DIE domain. Since the flag was only set
at MC, this CPU ends up not seeing any SD with the flag set, which is
broken.

Rather than clearing dflags at every topology level, clear it before
entering the topology level loop. This will properly propagate upwards
flags that are set starting from a certain level.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b5667a273bf6..549268249645 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1965,11 +1965,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
struct sched_domain_topology_level *tl;
+ int dflags = 0;

sd = NULL;
for_each_sd_topology(tl) {
- int dflags = 0;
-
if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
has_asym = true;
--
2.27.0

2020-07-02 14:34:01

by Valentin Schneider

Subject: Re: [PATCH v3 3/7] sched/topology: Verify SD_* flags setup when sched_debug is on


On 02/07/20 15:20, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 08:06:51PM +0100, Valentin Schneider wrote:
>
>> @@ -29,6 +29,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
>> struct cpumask *groupmask)
>> {
>> struct sched_group *group = sd->groups;
>> + int flags = sd->flags;
>
> unsigned long flags = sd->flags;
>
>>
>> cpumask_clear(groupmask);
>>
>> @@ -43,6 +44,22 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
>> printk(KERN_ERR "ERROR: domain->groups does not contain CPU%d\n", cpu);
>> }
>>
>> + for (; flags; flags &= flags - 1) {
>> + unsigned int idx = __ffs(flags);
>
> for_each_set_bit(idx, &flags, SD_MAX_BIT) {
>
> Yes, it's a bit yuck, but far more readable imo.
>

Much better than homegrown stuff, thanks!
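
For reference, a sketch of the reworked walk (this assumes flags
becomes an unsigned long as suggested above, and that an SD_MAX_BIT
bound gets defined; neither exists in v3 as posted):

  unsigned long flags = sd->flags;
  unsigned int idx;

  for_each_set_bit(idx, &flags, SD_MAX_BIT) {
          unsigned int flag = BIT(idx);
          unsigned int meta_flags = sd_flag_debug[idx].meta_flags;

          /* same SHARED_CHILD / SHARED_PARENT checks as below */
  }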

>> + unsigned int flag = BIT(idx);
>> + unsigned int meta_flags = sd_flag_debug[idx].meta_flags;
>> +
>> + if ((meta_flags & SDF_SHARED_CHILD) && sd->child &&
>> + !(sd->child->flags & flag))
>> + printk(KERN_ERR "ERROR: flag %s set here but not in child\n",
>> + sd_flag_debug[idx].name);
>> +
>> + if ((meta_flags & SDF_SHARED_PARENT) && sd->parent &&
>> + !(sd->parent->flags & flag))
>> + printk(KERN_ERR "ERROR: flag %s set here but not in parent\n",
>> + sd_flag_debug[idx].name);
>> + }
>> +
>> printk(KERN_DEBUG "%*s groups:", level + 1, "");
>> do {
>> if (!group) {
>> --
>> 2.27.0
>>

2020-07-02 15:47:18

by Quentin Perret

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata

On Thursday 02 Jul 2020 at 15:31:07 (+0100), Valentin Schneider wrote:
> There an "interesting" quirk of asym_cpu_capacity_level() in that it does
> something slightly different than what it says on the tin: it detects
> the lowest topology level where *the biggest* CPU capacity is visible by
> all CPUs. That works just fine on big.LITTLE, but there are questionable
> DynamIQ topologies that could hit some issues.
>
> Consider:
>
> DIE [ ]
> MC [ ][ ] <- sd_asym_cpucapacity
> 0 1 2 3 4 5
> L L B B B B
>
> asym_cpu_capacity_level() would pick MC as the asymmetric topology level,
> and you can argue either way: it should be DIE, because that's where CPUs 4
> and 5 can see a LITTLE, or it should be MC, at least for CPUs 0-3 because
> there they see all CPU capacities.

Right, I am not looking forward to these topologies...

> I have a plan on how to fix that, but I haven't been made aware of any
> "real" topology that would seriously break there. The moment one does, this
> will surface up to the top of my todo-list.
>
> In the meantime, we can make it match the SDF_SHARED_PARENT semantics, and
> this actually fixes an issue with solo big CPU clusters (which I
> anecdotally found out while first writing this series, and forgot to
> include):
>
> --->8
> From: Valentin Schneider <[email protected]>
> Date: Wed, 16 Oct 2019 18:12:12 +0100
> Subject: [PATCH 1/1] sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards
>
> We currently set this flag *only* on domains whose topology level exactly
> matches the level where we detect asymmetry (as returned by
> asym_cpu_capacity_level()). This is rather problematic.
>
> Say there are two clusters in the system, one with a lone big CPU and the
> other with a mix of big and LITTLE CPUs:
>
> DIE [ ]
> MC [ ][ ]
> 0 1 2 3 4
> L L B B B
>
> asym_cpu_capacity_level() will figure out that the MC level is the one
> where all CPUs can see a CPU of max capacity, and we will thus set
> SD_ASYM_CPUCAPACITY at MC level for all CPUs.
>
> That lone big CPU will degenerate its MC domain, since it would be alone in
> there, and will end up with just a DIE domain. Since the flag was only set
> at MC, this CPU ends up not seeing any SD with the flag set, which is
> broken.

+1

> Rather than clearing dflags at every topology level, clear it before
> entering the topology level loop. This will properly propagate upwards
> flags that are set starting from a certain level.

I'm feeling a bit nervous about that asymmetry -- in your example
select_idle_capacity() on, say, CPU3 will see fewer CPUs than on CPU4.
So, you might get fun side-effects where all tasks migrated to CPUs 0-3
will be 'stuck' there while CPU 4 stays mostly idle.

I have a few ideas to avoid that (e.g. looking at the rd span in
select_idle_capacity() instead of sd_asym_cpucapacity) but all this is
theoretical, so I'm happy to wait for a real platform to be released
before we worry too much about it.

In the meantime:

Reviewed-by: Quentin Perret <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/topology.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b5667a273bf6..549268249645 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1965,11 +1965,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> /* Set up domains for CPUs specified by the cpu_map: */
> for_each_cpu(i, cpu_map) {
> struct sched_domain_topology_level *tl;
> + int dflags = 0;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> - int dflags = 0;
> -
> if (tl == tl_asym) {
> dflags |= SD_ASYM_CPUCAPACITY;
> has_asym = true;
> --
> 2.27.0

Thanks,
Quentin

2020-07-02 16:28:12

by Valentin Schneider

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata


On 02/07/20 16:45, Quentin Perret wrote:
> On Thursday 02 Jul 2020 at 15:31:07 (+0100), Valentin Schneider wrote:
>> There an "interesting" quirk of asym_cpu_capacity_level() in that it does
>> something slightly different than what it says on the tin: it detects
>> the lowest topology level where *the biggest* CPU capacity is visible by
>> all CPUs. That works just fine on big.LITTLE, but there are questionable
>> DynamIQ topologies that could hit some issues.
>>
>> Consider:
>>
>> DIE [ ]
>> MC [ ][ ] <- sd_asym_cpucapacity
>> 0 1 2 3 4 5
>> L L B B B B
>>
>> asym_cpu_capacity_level() would pick MC as the asymmetric topology level,
>> and you can argue either way: it should be DIE, because that's where CPUs 4
>> and 5 can see a LITTLE, or it should be MC, at least for CPUs 0-3 because
>> there they see all CPU capacities.
>
> Right, I am not looking forward to these topologies...

I'll try my best to prevent those from seeing the light of day, but you
know how this works...

>> Say there are two clusters in the system, one with a lone big CPU and the
>> other with a mix of big and LITTLE CPUs:
>>
>> DIE [ ]
>> MC [ ][ ]
>> 0 1 2 3 4
>> L L B B B
>>
>> asym_cpu_capacity_level() will figure out that the MC level is the one
>> where all CPUs can see a CPU of max capacity, and we will thus set
>> SD_ASYM_CPUCAPACITY at MC level for all CPUs.
>>
>> That lone big CPU will degenerate its MC domain, since it would be alone in
>> there, and will end up with just a DIE domain. Since the flag was only set
>> at MC, this CPU ends up not seeing any SD with the flag set, which is
>> broken.
>
> +1
>
>> Rather than clearing dflags at every topology level, clear it before
>> entering the topology level loop. This will properly propagate upwards
>> flags that are set starting from a certain level.
>
> I'm feeling a bit nervous about that asymmetry -- in your example
> select_idle_capacity() on, say, CPU3 will see less CPUs than on CPU4.
> So, you might get fun side-effects where all task migrated to CPUs 0-3
> will be 'stuck' there while CPU 4 stays mostly idle.
>

It's actually pretty close to what happens with the LLC domain on SMP -
select_idle_sibling() doesn't look outside of it. The wake_affine() stuff
might steer the task towards a different LLC, but that's about it for
wakeups. We rely on load balancing (fork/exec, newidle, nohz and periodic)
to spread this further - and we would here too.

It gets "funny" for EAS when we aren't overutilized and thus can't rely on
load balancing; at least misfit ought to still work. It *is* a weird
topology, for sure.

> I have a few ideas to avoid that (e.g. looking at the rd span in
> select_idle_capacity() instead of sd_asym_cpucapacity) but all this is
> theoretical, so I'm happy to wait for a real platform to be released
> before we worry too much about it.
>
> In the meantime:
>
> Reviewed-by: Quentin Perret <[email protected]>

Thanks!

2020-07-02 16:38:37

by Quentin Perret

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata

On Thursday 02 Jul 2020 at 17:25:41 (+0100), Valentin Schneider wrote:
> It's actually pretty close to what happens with the LLC domain on SMP -
> select_idle_sibling() doesn't look outside of it. The wake_affine() stuff
> might steer the task towards a different LLC, but that's about it for
> wakeups. We rely on load balancing (fork/exec, newidle, nohz and periodic)
> to spread this further - and we would here too.

Sure, but on SMP the search space in select_idle_sibling is always
consistent -- you search within the LLC. With the fix you suggested,
CPUs 0-3 will search within their LLCs, while CPU4 searches the entire
system, which creates an imbalanced mess IMO.

For affine wake-ups, you could migrate from CPU4 -> CPU0-3, but CPU0-3
to CPU4 is not possible, so this asymmetry is almost guaranteed to
actively create imbalance. And sure, the periodic load balancer ought to
fix it, but really wake-up balance and periodic load balance should be
pushing in the same direction and not fighting against each other.

Anyways, enough bikeshedding for today, I'll try and have a look at the
rest of the series :)

Cheers,
Quentin

2020-07-02 16:45:39

by Dietmar Eggemann

Subject: Re: [PATCH v3 4/7] arm, sched/topology: Remove SD_SHARE_POWERDOMAIN

On 01/07/2020 21:06, Valentin Schneider wrote:
> This flag was introduced in 2014 by commit
>
> d77b3ed5c9f8 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")
>
> but AFAIA it was never leveraged by the scheduler. The closest thing I can
> think of is EAS caring about frequency domains, and it does that by
> leveraging performance domains.

... and even this was purely out of tree (SD_SHARE_CAP_STATES).

> Remove the flag.
>
> Suggested-by: Morten Rasmussen <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> arch/arm/kernel/topology.c | 2 +-
> include/linux/sched/sd_flags.h | 20 ++++++--------------
> kernel/sched/topology.c | 10 +++-------
> 3 files changed, 10 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index b5adaf744630..353f3ee660e4 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -243,7 +243,7 @@ void store_cpu_topology(unsigned int cpuid)
>
> static inline int cpu_corepower_flags(void)
> {
> - return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
> + return SD_SHARE_PKG_RESOURCES;
> }
>
> static struct sched_domain_topology_level arm_topology[] = {

I guess with SD_SHARE_POWERDOMAIN gone, arch arm can even use the default_topology[]:

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index b5adaf744630..87dd193165cc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -241,20 +241,6 @@ void store_cpu_topology(unsigned int cpuid)
update_siblings_masks(cpuid);
}

-static inline int cpu_corepower_flags(void)
-{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
-}
-
-static struct sched_domain_topology_level arm_topology[] = {
-#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
-#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
-};
-
/*
* init_cpu_topology is called at boot when only one cpu is running
* which prevent simultaneous write access to cpu_topology array
@@ -265,7 +251,4 @@ void __init init_cpu_topology(void)
smp_wmb();

parse_dt_topology();
-
- /* Set scheduler topology descriptor */
- set_sched_topology(arm_topology);
}

2020-07-02 16:52:26

by Valentin Schneider

Subject: Re: [PATCH v3 2/7] sched/topology: Define and assign sched_domain flag metadata


On 02/07/20 17:37, Quentin Perret wrote:
> On Thursday 02 Jul 2020 at 17:25:41 (+0100), Valentin Schneider wrote:
>> It's actually pretty close to what happens with the LLC domain on SMP -
>> select_idle_sibling() doesn't look outside of it. The wake_affine() stuff
>> might steer the task towards a different LLC, but that's about it for
>> wakeups. We rely on load balancing (fork/exec, newidle, nohz and periodic)
>> to spread this further - and we would here too.
>
> Sure, but on SMP the search space in select_idle_sibling is always
> consistent -- you search within the LLC. With the fix you suggested,
> CPUs 0-3 will search within their LLCs, while CPU4 searches the entire
> system, which creates an imbalanced mess IMO.
>

Yeah, it is a mess.

> For affine wake-ups, you could migrate from CPU4 -> CPU0-3, but CPU0-3
> to CPU4 is not possible

AIU the wake_affine bits, you get to steer the wakeup towards the waking
CPU. So if the task previously ran on CPU0-3, wake_affine can make the
target CPU4 (waking CPU), so it would become a possible candidate.

But as you say, this thing is still an ugly asymmetric mess.

> so this asymmetry is almost guaranteed to
> actively create imbalance. And sure, the periodic load balancer ought to
> fix it, but really wake-up balance and periodic load balance should be
> pushing in the same direction and not fighting against each other.
>
> Anyways, enough bikeshedding for today, I'll try and have look at the
> rest of the series :)
>

Thanks!

> Cheers,
> Quentin

2020-07-02 18:30:28

by Dietmar Eggemann

Subject: Re: [PATCH v3 5/7] sched/topology: Add more flags to the SD degeneration mask

On 01/07/2020 21:06, Valentin Schneider wrote:
> I don't think it is going to change much in practice, but we were missing
> those:
>
> o SD_BALANCE_WAKE: Used just like the other SD_BALANCE_* flags, so also
> needs > 1 group.
> o SD_ASYM_PACKING: Hinges on load balancing (periodic / wakeup), thus needs
> > 1 group to happen
> o SD_OVERLAP: Describes domains with overlapping groups; can't have
> overlaps with a single group.
>
> SD_PREFER_SIBLING is as always the odd one out: we currently consider it
> in sd_parent_degenerate() but not in sd_degenerate(). It too hinges on load
> balancing, and thus won't have any effect when set on a domain with a
> single group. Add it to the sd_degenerate() groups mask.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/topology.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6047d491abe9..fe393b295444 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -161,11 +161,15 @@ static int sd_degenerate(struct sched_domain *sd)
>
> /* Following flags need at least 2 groups */
> if (sd->flags & (SD_BALANCE_NEWIDLE |
> + SD_BALANCE_WAKE |
> SD_BALANCE_FORK |
> SD_BALANCE_EXEC |
> + SD_ASYM_PACKING |
> SD_SHARE_CPUCAPACITY |
> SD_ASYM_CPUCAPACITY |
> - SD_SHARE_PKG_RESOURCES)) {
> + SD_SHARE_PKG_RESOURCES |
> + SD_OVERLAP |
> + SD_PREFER_SIBLING)) {

Why do you add SD_PREFER_SIBLING here? It doesn't have SDF_NEEDS_GROUPS
set in [PATCH v3 6/7].

$ cat ./include/linux/sched/sd_flags.h | grep ^SD_FLAG
SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_WAKE_AFFINE, 4, SDF_SHARED_CHILD)
SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)
SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
SD_FLAG(SD_PREFER_SIBLING, 10, 0) <-- !!!
SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
SD_FLAG(SD_NUMA, 12, SDF_SHARED_PARENT)

2020-07-02 18:30:40

by Dietmar Eggemann

Subject: Re: [PATCH v3 6/7] sched/topology: Introduce SD metaflag for flags needing > 1 groups

On 01/07/2020 21:06, Valentin Schneider wrote:

[...]

> @@ -105,16 +122,18 @@ SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)
> * Place busy tasks earlier in the domain
> *
> * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
> - * up, but currently assumed to be set from the base domain upwards (see
> - * update_top_cache_domain()).
> + * up, but currently assumed to be set from the base domain
> + * upwards (see update_top_cache_domain()).
> */
> -SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD)
> +SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
>
> /*
> * Prefer to place tasks in a sibling domain
> *
> * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> + *
> + * NEEDS_GROUPS: Load balancing flag.
> */
> SD_FLAG(SD_PREFER_SIBLING, 10, 0)

Related to my comment in [PATCH v3 5/7], maybe you wanted to add
SDF_NEEDS_GROUPS for SD_PREFER_SIBLING as well? This comment
'NEEDS_GROUPS: Load balancing flag.' makes me wonder.

Currently, SD_PREFER_SIBLING isn't in SD_DEGENERATE_GROUPS_MASK=0xaef.

[...]

2020-07-02 18:47:33

by Valentin Schneider

Subject: Re: [PATCH v3 4/7] arm, sched/topology: Remove SD_SHARE_POWERDOMAIN


On 02/07/20 17:44, Dietmar Eggemann wrote:
> On 01/07/2020 21:06, Valentin Schneider wrote:
>> This flag was introduced in 2014 by commit
>>
>> d77b3ed5c9f8 ("sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain")
>>
>> but AFAIA it was never leveraged by the scheduler. The closest thing I can
>> think of is EAS caring about frequency domains, and it does that by
>> leveraging performance domains.
>
> ... and even this was purely out of tree (SD_SHARE_CAP_STATES).
>
>> Remove the flag.
>>
>> Suggested-by: Morten Rasmussen <[email protected]>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> arch/arm/kernel/topology.c | 2 +-
>> include/linux/sched/sd_flags.h | 20 ++++++--------------
>> kernel/sched/topology.c | 10 +++-------
>> 3 files changed, 10 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index b5adaf744630..353f3ee660e4 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -243,7 +243,7 @@ void store_cpu_topology(unsigned int cpuid)
>>
>> static inline int cpu_corepower_flags(void)
>> {
>> - return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
>> + return SD_SHARE_PKG_RESOURCES;
>> }
>>
>> static struct sched_domain_topology_level arm_topology[] = {
>
> I guess with SD_SHARE_POWERDOMAIN gone, arch arm can even use the default_topology[]:

That does look like it! I never noticed we declared this GMC topology
level. Given it uses the thread_sibling mask, and that no (upstream) arm DT
uses the thread topology binding, I guess that makes sense.

>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index b5adaf744630..87dd193165cc 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -241,20 +241,6 @@ void store_cpu_topology(unsigned int cpuid)
> update_siblings_masks(cpuid);
> }
>
> -static inline int cpu_corepower_flags(void)
> -{
> - return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
> -}
> -
> -static struct sched_domain_topology_level arm_topology[] = {
> -#ifdef CONFIG_SCHED_MC
> - { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> - { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> -#endif
> - { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> - { NULL, },
> -};
> -
> /*
> * init_cpu_topology is called at boot when only one cpu is running
> * which prevent simultaneous write access to cpu_topology array
> @@ -265,7 +251,4 @@ void __init init_cpu_topology(void)
> smp_wmb();
>
> parse_dt_topology();
> -
> - /* Set scheduler topology descriptor */
> - set_sched_topology(arm_topology);
> }

2020-07-02 18:48:03

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] sched/topology: Introduce SD metaflag for flags needing > 1 groups


On 02/07/20 19:29, Dietmar Eggemann wrote:
> On 01/07/2020 21:06, Valentin Schneider wrote:
>
> [...]
>
>> @@ -105,16 +122,18 @@ SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)
>> * Place busy tasks earlier in the domain
>> *
>> * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
>> - * up, but currently assumed to be set from the base domain upwards (see
>> - * update_top_cache_domain()).
>> + * up, but currently assumed to be set from the base domain
>> + * upwards (see update_top_cache_domain()).
>> */
>> -SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD)
>> +SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
>>
>> /*
>> * Prefer to place tasks in a sibling domain
>> *
>> * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
>> * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
>> + *
>> + * NEEDS_GROUPS: Load balancing flag.
>> */
>> SD_FLAG(SD_PREFER_SIBLING, 10, 0)
>
> Related to my comment in [PATCH v3 5/7], maybe you wanted to add
> SDF_NEEDS_GROUPS for SD_PREFER_SIBLING as well? The comment
> 'NEEDS_GROUPS: Load balancing flag.' makes me wonder.
>
> Currently, SD_PREFER_SIBLING isn't in SD_DEGENERATE_GROUPS_MASK=0xaef.
>

You're right, that's a fail from my end. Thanks (and sorry)!

> [...]

2020-07-13 12:43:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] sched/topology: Introduce SD metaflag for flags needing > 1 groups

On Wed, Jul 01, 2020 at 08:06:54PM +0100, Valentin Schneider wrote:
> +/* Generate a mask of SD flags with the SDF_NEEDS_GROUPS metaflag */
> +#define SD_FLAG(name, idx, mflags) (BIT(idx) * (((mflags) & SDF_NEEDS_GROUPS) / SDF_NEEDS_GROUPS)) |

#define SD_FLAG(name, idx, mflags) (!!((mflags) & SDF_NEEDS_GROUPS) * BIT(idx))

> +static const int SD_DEGENERATE_GROUPS_MASK =
> +#include <linux/sched/sd_flags.h>
> +0;
> +#undef SD_FLAG
> +
> #ifdef CONFIG_SCHED_DEBUG
> #define SD_FLAG(_name, idx, mflags) [idx] = {.meta_flags = mflags, .name = #_name},
> static const struct {
> --
> 2.27.0
>

2020-07-13 12:56:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask

On Wed, Jul 01, 2020 at 08:06:55PM +0100, Valentin Schneider wrote:
> Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
> sd_parent_degenerate().
>
> Note that this changes sd_degenerate() somewhat: I'm using the negation of
> SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
> is equivalent to:
>
> SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA
>
> whereas the current mask for that is simply
>
> SD_WAKE_AFFINE
>
> I played with a few toy NUMA topologies on QEMU and couldn't cause a
> different degeneration than what mainline does currently. If that is deemed
> too risky, we can go back to using SD_WAKE_AFFINE explicitly.

Arguably SD_SERIALIZE needs groups; note how it is only effective on
machines with at least 2 nodes. It's a bit shit how we end up there,
but IIRC that's what it ends up as.

SD_NUMA is descriptive, and not marking a domain as degenerate because
it has SD_NUMA seems a bit silly. But then, it would be the top domain
and would survive anyway?

2020-07-13 13:27:12

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] sched/topology: Introduce SD metaflag for flags needing > 1 groups


On 13/07/20 13:39, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 08:06:54PM +0100, Valentin Schneider wrote:
>> +/* Generate a mask of SD flags with the SDF_NEEDS_GROUPS metaflag */
>> +#define SD_FLAG(name, idx, mflags) (BIT(idx) * (((mflags) & SDF_NEEDS_GROUPS) / SDF_NEEDS_GROUPS)) |
>
> #define SD_FLAG(name, idx, mflags) (!!((mflags) & SDF_NEEDS_GROUPS) * BIT(idx))
>

Was paranoid about really getting a 1 or a 0, but AFAICT the standard
does agree with you!
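
For the doubtful, a standalone sketch (not from the thread) asserting that the
division-based form and the !!-based form agree for every metaflag
combination; C11 6.5.3.3 guarantees the result of ! is exactly 0 or 1:

#include <assert.h>

#define SDF_NEEDS_GROUPS 0x4
#define BIT(nr) (1UL << (nr))

/* division-based form vs. !!-based form */
#define MASK_DIV(idx, mflags) (BIT(idx) * (((mflags) & SDF_NEEDS_GROUPS) / SDF_NEEDS_GROUPS))
#define MASK_NOT(idx, mflags) (!!((mflags) & SDF_NEEDS_GROUPS) * BIT(idx))

int main(void)
{
	/* exhaustively compare both forms over all metaflag/index pairs */
	for (unsigned int mflags = 0; mflags < 8; mflags++)
		for (unsigned int idx = 0; idx < 13; idx++)
			assert(MASK_DIV(idx, mflags) == MASK_NOT(idx, mflags));
	return 0;
}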

>> +static const int SD_DEGENERATE_GROUPS_MASK =
>> +#include <linux/sched/sd_flags.h>
>> +0;
>> +#undef SD_FLAG
>> +
>> #ifdef CONFIG_SCHED_DEBUG
>> #define SD_FLAG(_name, idx, mflags) [idx] = {.meta_flags = mflags, .name = #_name},
>> static const struct {
>> --
>> 2.27.0
>>

2020-07-13 13:31:02

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask


On 13/07/20 13:55, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 08:06:55PM +0100, Valentin Schneider wrote:
>> Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
>> sd_parent_degenerate().
>>
>> Note that this changes sd_degenerate() somewhat: I'm using the negation of
>> SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
>> is equivalent to:
>>
>> SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA
>>
>> whereas the current mask for that is simply
>>
>> SD_WAKE_AFFINE
>>
>> I played with a few toy NUMA topologies on QEMU and couldn't cause a
>> different degeneration than what mainline does currently. If that is deemed
>> too risky, we can go back to using SD_WAKE_AFFINE explicitly.
>
> Arguably SD_SERIALIZE needs groups; note how it is only effective on
> machines with at least 2 nodes. It's a bit shit how we end up there,
> but IIRC that's what it ends up as.
>

Right, AFAICT we get SD_SERIALIZE wherever we have SD_NUMA, which is any
level above NODE.

> SD_NUMA is descriptive, and not marking a domain as degenerate because
> it has SD_NUMA seems a bit silly.

It does, although we can still degenerate it, see below.

> But then, it would be the top domain
> and would survive anyway?

So from what I've tested we still get rid of those via
sd_parent_degenerate(): child and parent have the same flags and same span,
so parent goes out.

That happens in the middle of the NUMA topology levels on that borked
topology with weird distances, aka

node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

which ought to look something like this (add the local distance of 10 to get the end result):

    2       10      2
1 <---> 0 <---> 2 <---> 3

We end up with the following NUMA levels (i.e. deduplicated distances)
NUMA (<= 12)
NUMA (<= 20)
NUMA (<= 22)
NUMA (<= 24)

E.g. for any CPU of node 1, NUMA(<=20) is gonna have the same span as
NUMA(<=12), so we'll degenerate it.
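
A standalone sketch of that span computation (illustrative only; nodes stand
in for CPUs and an unsigned int bitmask for the sched_domain span):

#include <stdio.h>

#define NR_NODES 4

/* the "borked" distance table from above */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/* deduplicated non-local distances, i.e. the NUMA levels */
static const int levels[] = { 12, 20, 22, 24 };

int main(void)
{
	for (int node = 0; node < NR_NODES; node++) {
		unsigned int prev = 0;

		for (int l = 0; l < 4; l++) {
			unsigned int span = 0;

			/* span of NUMA(<=levels[l]) as seen from @node */
			for (int j = 0; j < NR_NODES; j++)
				if (dist[node][j] <= levels[l])
					span |= 1U << j;

			printf("node%d NUMA(<=%d): span=0x%x%s\n",
			       node, levels[l], span, span == prev ?
			       "  <-- same as child, degenerates" : "");
			prev = span;
		}
	}
	return 0;
}

For node 1 it prints the same 0x3 span for NUMA(<=12) and NUMA(<=20), which
is exactly the degeneration described above.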

2020-07-13 13:46:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask

On Mon, Jul 13, 2020 at 02:28:29PM +0100, Valentin Schneider wrote:
>
> On 13/07/20 13:55, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2020 at 08:06:55PM +0100, Valentin Schneider wrote:
> >> Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
> >> sd_parent_degenerate().
> >>
> >> Note that this changes sd_degenerate() somewhat: I'm using the negation of
> >> SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
> >> is equivalent to:
> >>
> >> SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA
> >>
> >> whereas the current mask for that is simply
> >>
> >> SD_WAKE_AFFINE
> >>
> >> I played with a few toy NUMA topologies on QEMU and couldn't cause a
> >> different degeneration than what mainline does currently. If that is deemed
> >> too risky, we can go back to using SD_WAKE_AFFINE explicitly.
> >
> > Arguably SD_SERIALIZE needs groups; note how it is only effective on
> > machines with at least 2 nodes. It's a bit shit how we end up there,
> > but IIRC that's what it ends up as.
> >
>
> Right, AFAICT we get SD_SERIALIZE wherever we have SD_NUMA, which is any
> level above NODE.

Oh, right, I forgot we have NODE, d'oh. But in that case these lines:

if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;

are dead code, right?

> > SD_NUMA is descriptive, and not marking a domain as degenerate because
> > it has SD_NUMA seems a bit silly.
>
> It does, although we can still degenerate it, see below.
>
> > But then, it would be the top domain
> > and would survive anyway?
>
> So from what I've tested we still get rid of those via
> sd_parent_degenerate(): child and parent have the same flags and same span,
> so parent goes out.
>
> That happens in the middle of the NUMA topology levels on that borked
> topology with weird distances, aka
>
> node distances:
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
>
> which ought to look something like this (add the local distance of 10 to get the end result):
>
>     2       10      2
> 1 <---> 0 <---> 2 <---> 3
>
> We end up with the following NUMA levels (i.e. deduplicated distances)
> NUMA (<= 12)
> NUMA (<= 20)
> NUMA (<= 22)
> NUMA (<= 24)
>
> E.g. for any CPU of node 1, NUMA(<=20) is gonna have the same span as
> NUMA(<=12), so we'll degenerate it.

Man, that's horrible :-) OK, fair enough, keep it as is; we'll see what,
if anything, breaks.

2020-07-13 13:55:12

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] sched/topology: Use prebuilt SD flag degeneration mask


On 13/07/20 14:43, Peter Zijlstra wrote:
> On Mon, Jul 13, 2020 at 02:28:29PM +0100, Valentin Schneider wrote:
>>
>> On 13/07/20 13:55, Peter Zijlstra wrote:
>> > On Wed, Jul 01, 2020 at 08:06:55PM +0100, Valentin Schneider wrote:
>> >> Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and
>> >> sd_parent_degenerate().
>> >>
>> >> Note that this changes sd_degenerate() somewhat: I'm using the negation of
>> >> SD_DEGENERATE_GROUPS_MASK as the mask of flags not requiring groups, which
>> >> is equivalent to:
>> >>
>> >> SD_WAKE_AFFINE | SD_SERIALIZE | SD_NUMA
>> >>
>> >> whereas the current mask for that is simply
>> >>
>> >> SD_WAKE_AFFINE
>> >>
>> >> I played with a few toy NUMA topologies on QEMU and couldn't cause a
>> >> different degeneration than what mainline does currently. If that is deemed
>> >> too risky, we can go back to using SD_WAKE_AFFINE explicitly.
>> >
>> > Arguably SD_SERIALIZE needs groups; note how it is only effective on
>> > machines with at least 2 nodes. It's a bit shit how we end up there,
>> > but IIRC that's what it ends up as.
>> >
>>
>> Right, AFAICT we get SD_SERIALIZE wherever we have SD_NUMA, which is any
>> level above NODE.
>
> Oh, right, I forgot we have NODE, d'oh. But in that case these lines:
>
> if (nr_node_ids == 1)
> pflags &= ~SD_SERIALIZE;
>
> are dead code, right?
>

Uuuuh right, that does sound like something made obsolete by having NODE;
we didn't have it back then:

5436499e6098 ("sched: fix sd_parent_degenerate on non-numa smp machine")

I'll fold that in (and test it, just to be sure :-)).

>> > SD_NUMA is descriptive, and not marking a domain as degenerate because
>> > it has SD_NUMA seems a bit silly.
>>
>> It does, although we can still degenerate it, see below.
>>
>> > But then, it would be the top domain
>> > and would survive anyway?
>>
>> So from what I've tested we still get rid of those via
>> sd_parent_degenerate(): child and parent have the same flags and same span,
>> so parent goes out.
>>
>> That happens in the middle of the NUMA topology levels on that borked
>> topology with weird distances, aka
>>
>> node distances:
>> node   0   1   2   3
>>   0:  10  12  20  22
>>   1:  12  10  22  24
>>   2:  20  22  10  12
>>   3:  22  24  12  10
>>
>> which ought to look something like this (add the local distance of 10 to get the end result):
>>
>>     2       10      2
>> 1 <---> 0 <---> 2 <---> 3
>>
>> We end up with the following NUMA levels (i.e. deduplicated distances)
>> NUMA (<= 12)
>> NUMA (<= 20)
>> NUMA (<= 22)
>> NUMA (<= 24)
>>
>> E.g. for any CPU of node 1, NUMA(<=20) is gonna have the same span as
>> NUMA(<=12), so we'll degenerate it.
>
> Man, that's horrible :-)

It is :(

> OK, fair enough, keep it as is; we'll see what,
> if anything, breaks.