2020-07-31 12:07:50

by Valentin Schneider

Subject: [PATCH v4 00/10] sched: Instrument sched domain flags

Hi,

I've repeatedly stared at an SD flag and asked myself "how should that be
set up in the domain hierarchy anyway?". I figured that if we formalize our
flags zoology a bit, we could also do some runtime assertions on them -
this is what this series is all about.

Patches
=======

The idea is to associate the flags with metaflags that describe how they
should be set in a sched domain hierarchy ("if this SD has it, all its {parents,
children} have it") or how they behave wrt degeneration - details are in the
comments and commit logs.

The good thing is that the debugging bits go away when CONFIG_SCHED_DEBUG isn't
set. The bad thing is that this replaces SD_* flags definitions with some
unsavoury macros. This is mainly because I wanted to avoid having to duplicate
work between declaring the flags and declaring their metaflags.

o Patches 1-3 are topology cleanups / fixes
o Patches 4-6 instrument SD flags and add assertions
o Patches 7-10 leverage the instrumentation to factorize domain degeneration

Revisions
=========

v3 -> v4
--------

o Reordered the series to have fixes / cleanups first

o Added SD_ASYM_CPUCAPACITY propagation (Quentin)
o Made ARM revert back to the default sched topology (Dietmar)
o Removed SD_SERIALIZE degeneration special case (Peter)

o Made SD_NUMA and SD_SERIALIZE have SDF_NEEDS_GROUPS

As discussed on v3, I initially thought this wasn't required, but after some
more thought there are cases where it changes the current behaviour. For
instance, in the following wacky triangle:

  0\ 30
  | \
20|  2
  | /
  1/ 30

there are two unique distances thus two NUMA topology levels, however the
first one for node 2 would have the same span as its child domain and thus
should be degenerated. If we don't give SD_NUMA and SD_SERIALIZE
SDF_NEEDS_GROUPS, this domain wouldn't be degenerated since its child
*doesn't* have either SD_NUMA or SD_SERIALIZE (it's the first NUMA domain),
and we'd have this weird NUMA domain lingering with a single group.

v2 -> v3
--------

o Reworded comment for SD_OVERLAP (it's about the groups, not the domains)

o Added more flags to the SD degeneration mask
o Added generation of an SD flag mask for the degeneration functions (Peter)

RFC -> v2
---------

o Rebased on top of tip/sched/core
o Aligned wording of comments between flags
o Rectified some flag descriptions (Morten)
o Added removal of SD_SHARE_POWERDOMAIN (Morten)

Valentin Schneider (10):
ARM, sched/topology: Remove SD_SHARE_POWERDOMAIN
ARM: Revert back to default scheduler topology.
sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards
sched/topology: Split out SD_* flags declaration to its own file
sched/topology: Define and assign sched_domain flag metadata
sched/topology: Verify SD_* flags setup when sched_debug is on
sched/topology: Add more flags to the SD degeneration mask
sched/topology: Remove SD_SERIALIZE degeneration special case
sched/topology: Introduce SD metaflag for flags needing > 1 groups
sched/topology: Use prebuilt SD flag degeneration mask

arch/arm/kernel/topology.c | 26 ------
include/linux/sched/sd_flags.h | 156 +++++++++++++++++++++++++++++++++
include/linux/sched/topology.h | 36 +++++---
kernel/sched/topology.c | 54 ++++++------
4 files changed, 204 insertions(+), 68 deletions(-)
create mode 100644 include/linux/sched/sd_flags.h

--
2.27.0


2020-07-31 12:07:58

by Valentin Schneider

Subject: [PATCH v4 02/10] ARM: Revert back to default scheduler topology.

The ARM-specific GMC level is meant to be built using the thread sibling
mask, but no devicetree in arch/arm/boot/dts uses the 'thread' cpu-map
binding. With SD_SHARE_POWERDOMAIN gone, this topology level can be
removed, at which point ARM no longer benefits from having a custom defined
topology table.

Delete the GMC topology level by making ARM use the default scheduler
topology table. This essentially reverts commit

fb2aa85564f4 ("sched, ARM: Create a dedicated scheduler topology table")

Suggested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/arm/kernel/topology.c | 26 --------------------------
1 file changed, 26 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 353f3ee660e4..ef0058de432b 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -177,15 +177,6 @@ static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
#endif

-/*
- * The current assumption is that we can power gate each core independently.
- * This will be superseded by DT binding once available.
- */
-const struct cpumask *cpu_corepower_mask(int cpu)
-{
- return &cpu_topology[cpu].thread_sibling;
-}
-
/*
* store_cpu_topology is called at boot when only one cpu is running
* and with the mutex cpu_hotplug.lock locked, when several cpus have booted,
@@ -241,20 +232,6 @@ void store_cpu_topology(unsigned int cpuid)
update_siblings_masks(cpuid);
}

-static inline int cpu_corepower_flags(void)
-{
- return SD_SHARE_PKG_RESOURCES;
-}
-
-static struct sched_domain_topology_level arm_topology[] = {
-#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
-#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
-};
-
/*
* init_cpu_topology is called at boot when only one cpu is running
* which prevent simultaneous write access to cpu_topology array
@@ -265,7 +242,4 @@ void __init init_cpu_topology(void)
smp_wmb();

parse_dt_topology();
-
- /* Set scheduler topology descriptor */
- set_sched_topology(arm_topology);
}
--
2.27.0

2020-07-31 12:08:10

by Valentin Schneider

Subject: [PATCH v4 09/10] sched/topology: Introduce SD metaflag for flags needing > 1 groups

In preparation for cleaning up the sd_degenerate*() functions, mark flags
that are only relevant when their associated domain has more than a single
group. With this, build a compile-time mask of those SD flags.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched/sd_flags.h | 69 ++++++++++++++++++++++------------
include/linux/sched/topology.h | 7 ++++
2 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 9d0fb8924181..6c70af066995 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -8,7 +8,7 @@
#endif

/*
- * Expected flag uses
+ * Hierarchical metaflags
*
* SHARED_CHILD: These flags are meant to be set from the base domain upwards.
* If a domain has this flag set, all of its children should have it set. This
@@ -29,36 +29,50 @@
* certain level (e.g. domain starts spanning CPUs outside of the base CPU's
* socket).
*/
-#define SDF_SHARED_CHILD 0x1
-#define SDF_SHARED_PARENT 0x2
+#define SDF_SHARED_CHILD 0x1
+#define SDF_SHARED_PARENT 0x2
+
+/*
+ * Behavioural metaflags
+ *
+ * NEEDS_GROUPS: These flags are only relevant if the domain they are set on has
+ * more than one group. This is usually for balancing flags (load balancing
+ * involves equalizing a metric between groups), or for flags describing some
+ * shared resource (which would be shared between groups).
+ */
+#define SDF_NEEDS_GROUPS 0x4

/*
* Balance when about to become idle
*
* SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_NEWIDLE, 0, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on exec
*
* SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_EXEC, 1, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on fork, clone
*
* SHARED_CHILD: Set from the base domain up to the NUMA reclaim level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_FORK, 2, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Balance on wakeup
*
* SHARED_CHILD: Set from the base domain up to cpuset.sched_relax_domain_level.
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD)
+SD_FLAG(SD_BALANCE_WAKE, 3, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Consider waking task on waking CPU.
@@ -71,63 +85,72 @@ SD_FLAG(SD_WAKE_AFFINE, 4, SDF_SHARED_CHILD)
* Domain members have different CPU capacities
*
* SHARED_PARENT: Set from the topmost domain down to the first domain where
- * asymmetry is detected.
+ * asymmetry is detected.
+ * NEEDS_GROUPS: Per-CPU capacity is asymmetric between groups.
*/
-SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT)
+SD_FLAG(SD_ASYM_CPUCAPACITY, 5, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

/*
* Domain members share CPU capacity (i.e. SMT)
*
* SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
- * CPU capacity.
+ * CPU capacity.
+ * NEEDS_GROUPS: Capacity is shared between groups.
*/
-SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD)
+SD_FLAG(SD_SHARE_CPUCAPACITY, 6, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Domain members share CPU package resources (i.e. caches)
*
* SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
- * the same cache(s).
+ * the same cache(s).
+ * NEEDS_GROUPS: Caches are shared between groups.
*/
-SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD)
+SD_FLAG(SD_SHARE_PKG_RESOURCES, 7, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Only a single load balancing instance
*
* SHARED_PARENT: Set for all NUMA levels above NODE. Could be set from a
- * different level upwards, but it doesn't change that if a domain has this flag
- * set, then all of its parents need to have it too (otherwise the serialization
- * doesn't make sense).
+ * different level upwards, but it doesn't change that if a
+ * domain has this flag set, then all of its parents need to have
+ * it too (otherwise the serialization doesn't make sense).
+ * NEEDS_GROUPS: No point in preserving domain if it has a single group.
*/
-SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT)
+SD_FLAG(SD_SERIALIZE, 8, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

/*
* Place busy tasks earlier in the domain
*
* SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- * up, but currently assumed to be set from the base domain upwards (see
- * update_top_cache_domain()).
+ * up, but currently assumed to be set from the base domain
+ * upwards (see update_top_cache_domain()).
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD)
+SD_FLAG(SD_ASYM_PACKING, 9, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)

/*
* Prefer to place tasks in a sibling domain
*
* Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
* flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ *
+ * NEEDS_GROUPS: Load balancing flag.
*/
-SD_FLAG(SD_PREFER_SIBLING, 10, 0)
+SD_FLAG(SD_PREFER_SIBLING, 10, SDF_NEEDS_GROUPS)

/*
* sched_groups of this level overlap
*
* SHARED_PARENT: Set for all NUMA levels above NODE.
+ * NEEDS_GROUPS: Overlaps can only exist with more than one group.
*/
-SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT)
+SD_FLAG(SD_OVERLAP, 11, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)

/*
* cross-node balancing
*
* SHARED_PARENT: Set for all NUMA levels above NODE.
+ * NEEDS_GROUPS: No point in preserving domain if it has a single group.
*/
-SD_FLAG(SD_NUMA, 12, SDF_SHARED_PARENT)
+SD_FLAG(SD_NUMA, 12, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 99b16d4c03f2..74e6ec32a413 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -16,6 +16,13 @@
#include <linux/sched/sd_flags.h>
#undef SD_FLAG

+/* Generate a mask of SD flags with the SDF_NEEDS_GROUPS metaflag */
+#define SD_FLAG(name, idx, mflags) (BIT(idx) * !!((mflags) & SDF_NEEDS_GROUPS)) |
+static const unsigned int SD_DEGENERATE_GROUPS_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
#ifdef CONFIG_SCHED_DEBUG
#define SD_FLAG(_name, idx, mflags) [idx] = {.meta_flags = mflags, .name = #_name},
static const struct {
--
2.27.0

2020-07-31 12:09:09

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v4 08/10] sched/topology: Remove SD_SERIALIZE degeneration special case

If there is only a single NUMA node in the system, the only NUMA topology
level that will be generated will be NODE (identity distance), which
doesn't have SD_SERIALIZE.

This means we don't need this special case in sd_parent_degenerate(), as
having the NODE level "naturally" covers it. Thus, remove it.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/topology.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 93a7ff52335b..c6ecc395c76c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -204,8 +204,6 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_PKG_RESOURCES |
SD_OVERLAP |
SD_PREFER_SIBLING);
- if (nr_node_ids == 1)
- pflags &= ~SD_SERIALIZE;
}
if (~cflags & pflags)
return 0;
--
2.27.0

2020-08-06 16:51:58

by Dietmar Eggemann

Subject: Re: [PATCH v4 00/10] sched: Instrument sched domain flags

On 31/07/2020 13:54, Valentin Schneider wrote:
> Hi,
>
> I've repeatedly stared at an SD flag and asked myself "how should that be
> set up in the domain hierarchy anyway?". I figured that if we formalize our
> flags zoology a bit, we could also do some runtime assertions on them -
> this is what this series is all about.
>
> Patches
> =======
>
> The idea is to associate the flags with metaflags that describe how they
> should be set in a sched domain hierarchy ("if this SD has it, all its {parents,
> children} have it") or how they behave wrt degeneration - details are in the
> comments and commit logs.
>
> The good thing is that the debugging bits go away when CONFIG_SCHED_DEBUG isn't
> set. The bad thing is that this replaces SD_* flags definitions with some
> unsavoury macros. This is mainly because I wanted to avoid having to duplicate
> work between declaring the flags and declaring their metaflags.
>
> o Patches 1-3 are topology cleanups / fixes
> o Patches 4-6 instrument SD flags and add assertions
> o Patches 7-10 leverage the instrumentation to factorize domain degeneration
>
> Revisions
> =========
>
> v3 -> v4
> --------
>
> o Reordered the series to have fixes / cleanups first
>
> o Added SD_ASYM_CPUCAPACITY propagation (Quentin)
> o Made ARM revert back to the default sched topology (Dietmar)
> o Removed SD_SERIALIZE degeneration special case (Peter)
>
> o Made SD_NUMA and SD_SERIALIZE have SDF_NEEDS_GROUPS
>
> As discussed on v3, I initially thought this wasn't required, but after some
> more thought there are cases where it changes the current behaviour. For
> instance, in the following wacky triangle:
>
>   0\ 30
>   | \
> 20|  2
>   | /
>   1/ 30
>
> there are two unique distances thus two NUMA topology levels, however the
> first one for node 2 would have the same span as its child domain and thus
> should be degenerated. If we don't give SD_NUMA and SD_SERIALIZE
> SDF_NEEDS_GROUPS, this domain wouldn't be degenerated since its child
> *doesn't* have either SD_NUMA or SD_SERIALIZE (it's the first NUMA domain),
> and we'd have this weird NUMA domain lingering with a single group.

LGTM.

Tested on Arm & Arm64 dual-cluster big.LITTLE (so only
default_topology[]) with CONFIG_SCHED_MC=y for the following cases:

(1) normal bring-up
(2) CPU hp all but one CPU of one cluster
(3) CPU hp entire cluster

Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>