2017-04-13 13:56:45

by Lauro Venancio

Subject: [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology

Currently, the scheduler is not able to directly move tasks between some NUMA
nodes 2 hops apart on machines with mesh topology. This occurs because some
NUMA nodes belong to all sched groups. For more details, see the commit log of
patch 2.

This bug was reported in the paper [1] as "The Scheduling Group Construction
bug".

This patchset constructs the sched groups from each CPU's perspective, so each
NUMA node can have different groups at the last NUMA sched domain level.

SPECjbb2005 results show up to a 63% performance improvement and a large drop
in standard deviation on a machine with 8 NUMA nodes and mesh topology.

Patch 1 - just prepare the code for patch 2
Patch 2 - change the sched group construction
Patch 3 - fix an issue with different groups starting with the same CPU

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Regards,
Lauro

Lauro Ramos Venancio (3):
sched/topology: Refactor function build_overlap_sched_groups()
sched/topology: fix sched groups on NUMA machines with mesh topology
sched/topology: Different sched groups must not have the same balance
cpu

kernel/sched/topology.c | 165 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 117 insertions(+), 48 deletions(-)

--
1.8.3.1


2017-04-13 13:56:56

by Lauro Venancio

Subject: [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups()

Create functions build_group_from_child_sched_domain() and
init_overlap_sched_group(). No functional change.

Signed-off-by: Lauro Ramos Venancio <[email protected]>
---
kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++---------------
1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..d786d45 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg)
return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
}

+static struct sched_group *
+build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+{
+ struct sched_group *sg;
+ struct cpumask *sg_span;
+
+ sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
+ GFP_KERNEL, cpu_to_node(cpu));
+
+ if (!sg)
+ return NULL;
+
+ sg_span = sched_group_cpus(sg);
+ if (sd->child)
+ cpumask_copy(sg_span, sched_domain_span(sd->child));
+ else
+ cpumask_copy(sg_span, sched_domain_span(sd));
+
+ return sg;
+}
+
+static void init_overlap_sched_group(struct sched_domain *sd,
+ struct sched_group *sg, int cpu)
+{
+ struct sd_data *sdd = sd->private;
+ struct cpumask *sg_span;
+
+ sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+ if (atomic_inc_return(&sg->sgc->ref) == 1)
+ build_group_mask(sd, sg);
+
+ /*
+ * Initialize sgc->capacity such that even if we mess up the
+ * domains and no possible iteration will get us here, we won't
+ * die on a /0 trap.
+ */
+ sg_span = sched_group_cpus(sg);
+ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+}
+
static int
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
{
@@ -537,31 +578,14 @@ int group_balance_cpu(struct sched_group *sg)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;

- sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(cpu));
-
+ sg = build_group_from_child_sched_domain(sibling, cpu);
if (!sg)
goto fail;

sg_span = sched_group_cpus(sg);
- if (sibling->child)
- cpumask_copy(sg_span, sched_domain_span(sibling->child));
- else
- cpumask_set_cpu(i, sg_span);
-
cpumask_or(covered, covered, sg_span);

- sg->sgc = *per_cpu_ptr(sdd->sgc, i);
- if (atomic_inc_return(&sg->sgc->ref) == 1)
- build_group_mask(sd, sg);
-
- /*
- * Initialize sgc->capacity such that even if we mess up the
- * domains and no possible iteration will get us here, we won't
- * die on a /0 trap.
- */
- sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
- sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ init_overlap_sched_group(sd, sg, i);

/*
* Make sure the first group of this domain contains the
--
1.8.3.1

2017-04-13 13:57:01

by Lauro Venancio

Subject: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

Currently, the group balance cpu is the group's first CPU. But with
overlapping groups, two different groups can have the same first CPU.

This patch uses the group mask to mark all the CPUs that have a
particular group as its main sched group. The group balance cpu is the
first group CPU that is also in the mask.

Signed-off-by: Lauro Ramos Venancio <[email protected]>
---
kernel/sched/topology.c | 76 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d0302ad..7920bbb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -477,27 +477,31 @@ enum s_alloc {
};

/*
- * Build an iteration mask that can exclude certain CPUs from the upwards
- * domain traversal.
+ * An overlap sched group may not be present in all CPUs that compose the
+ * group. So build the mask, marking all the group CPUs where it is present.
*
* Asymmetric node setups can result in situations where the domain tree is of
* unequal depth, make sure to skip domains that already cover the entire
* range.
- *
- * In that case build_sched_domains() will have terminated the iteration early
- * and our sibling sd spans will be empty. Domains should always include the
- * CPU they're built on, so check that.
*/
static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
{
- const struct cpumask *span = sched_domain_span(sd);
+ const struct cpumask *sg_span = sched_group_cpus(sg);
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;

- for_each_cpu(i, span) {
+ for_each_cpu(i, sg_span) {
sibling = *per_cpu_ptr(sdd->sd, i);
- if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+ /*
+ * Asymmetric node setups: skip domains that are already
+ * done.
+ */
+ if (!sibling->groups)
+ continue;
+
+ if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
continue;

cpumask_set_cpu(i, sched_group_mask(sg));
@@ -513,6 +517,28 @@ int group_balance_cpu(struct sched_group *sg)
return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
}

+/*
+ * Find the group balance cpu when the group mask is not available yet.
+ */
+static int find_group_balance_cpu(struct sched_domain *sd,
+ struct sched_group *sg)
+{
+ const struct cpumask *sg_span = sched_group_cpus(sg);
+ struct sd_data *sdd = sd->private;
+ struct sched_domain *sibling;
+ int i;
+
+ for_each_cpu(i, sg_span) {
+ sibling = *per_cpu_ptr(sdd->sd, i);
+ if (cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
+ return i;
+ }
+
+ WARN(1, "group balance cpu not found.");
+ return 0;
+}
+
+
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
@@ -554,6 +580,19 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
}

+static void init_overlap_sched_groups(struct sched_domain *sd)
+{
+ struct sched_group *sg = sd->groups;
+ int cpu;
+
+ do {
+ cpu = find_group_balance_cpu(sd, sg);
+ init_overlap_sched_group(sd, sg, cpu);
+
+ sg = sg->next;
+ } while (sg != sd->groups);
+}
+
static int
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
{
@@ -568,8 +607,6 @@ static void init_overlap_sched_group(struct sched_domain *sd,
if (!sg)
return -ENOMEM;

- init_overlap_sched_group(sd, sg, cpu);
-
sd->groups = sg;
last = sg;
sg->next = sg;
@@ -584,7 +621,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,

sibling = *per_cpu_ptr(sdd->sd, i);

- /* See the comment near build_group_mask(). */
+ /*
+ * In Asymmetric node setups, build_sched_domains() will have
+ * terminated the iteration early and our sibling sd spans will
+ * be empty. Domains should always include the CPU they're
+ * built on, so check that.
+ */
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;

@@ -595,8 +637,6 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg_span = sched_group_cpus(sg);
cpumask_or(covered, covered, sg_span);

- init_overlap_sched_group(sd, sg, i);
-
last->next = sg;
last = sg;
sg->next = sd->groups;
@@ -1449,6 +1489,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
}
}

+ /* Init overlap groups */
+ for_each_cpu(i, cpu_map) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ if (sd->flags & SD_OVERLAP)
+ init_overlap_sched_groups(sd);
+ }
+ }
+
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
if (!cpumask_test_cpu(i, cpu_map))
--
1.8.3.1

2017-04-13 13:57:06

by Lauro Venancio

Subject: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

Currently, on a 4-node NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
scheduler is unable to directly move tasks between these nodes. In the
worst case, when a set of tasks is bound to nodes 1 and 3, performance
is severely impacted because just one node is used while the other node
remains idle.

This problem also affects machines with more NUMA nodes. For instance,
currently, the scheduler is unable to directly move tasks between some
node pairs 2 hops apart on an 8-node machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU's perspective. So, on
a 4-node machine with ring topology, while nodes 0 and 2 keep the same
groups as before [(3, 0, 1) (1, 2, 3)], nodes 1 and 3 get new groups
[(0, 1, 2) (2, 3, 0)]. This allows moving tasks between any pair of nodes
2 hops apart.

SPECjbb2005 results on an 8-node NUMA machine with mesh topology

Threads       before              after          %
           mean   stddev      mean    stddev
  1       22801   1950        27059   1367     +19%
  8       146008  50782       209193  826      +43%
  32      351030  105111      522445  9051     +49%
  48      365835  116571      594905  3314     +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <[email protected]>
---
kernel/sched/topology.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
static int
build_overlap_sched_groups(struct sched_domain *sd, int cpu)
{
- struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+ struct sched_group *last = NULL, *sg;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;

- cpumask_clear(covered);
+ sg = build_group_from_child_sched_domain(sd, cpu);
+ if (!sg)
+ return -ENOMEM;
+
+ init_overlap_sched_group(sd, sg, cpu);
+
+ sd->groups = sg;
+ last = sg;
+ sg->next = sg;
+
+ cpumask_copy(covered, sched_group_cpus(sg));

for_each_cpu(i, span) {
struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,

init_overlap_sched_group(sd, sg, i);

- /*
- * Make sure the first group of this domain contains the
- * canonical balance CPU. Otherwise the sched_domain iteration
- * breaks. See update_sg_lb_stats().
- */
- if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
- group_balance_cpu(sg) == cpu)
- groups = sg;
-
- if (!first)
- first = sg;
- if (last)
- last->next = sg;
+ last->next = sg;
last = sg;
- last->next = first;
+ sg->next = sd->groups;
}
- sd->groups = groups;

return 0;

fail:
- free_sched_groups(first, 0);
+ free_sched_groups(sd->groups, 0);

return -ENOMEM;
}
--
1.8.3.1

2017-04-13 14:50:44

by Rik van Riel

Subject: Re: [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups()

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> Create functions build_group_from_child_sched_domain() and
> init_overlap_sched_group(). No functional change.
>
> Signed-off-by: Lauro Ramos Venancio <[email protected]>
>
Acked-by: Rik van Riel <[email protected]>

2017-04-13 15:16:52

by Rik van Riel

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:

> SPECjbb2005 results on an 8 NUMA nodes machine with mesh topology
>
> Threads       before              after          %
>            mean   stddev      mean    stddev
>   1       22801   1950        27059   1367     +19%
>   8       146008  50782       209193  826      +43%
>   32      351030  105111      522445  9051     +49%
>   48      365835  116571      594905  3314     +63%

Impressive!

> [1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
>
> Signed-off-by: Lauro Ramos Venancio <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2017-04-13 15:28:00

by Rik van Riel

Subject: Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> Currently, the group balance cpu is the groups's first CPU. But with
> overlapping groups, two different groups can have the same first CPU.
>
> This patch uses the group mask to mark all the CPUs that have a
> particular group as its main sched group. The group balance cpu is
> the
> first group CPU that is also in the mask.
>

This is not your fault, but this code is really hard
to understand.

Your comments tell me what the code does, but not
really why. 

> +++ b/kernel/sched/topology.c
> @@ -477,27 +477,31 @@ enum s_alloc {
>  };
>  
>  /*
> - * Build an iteration mask that can exclude certain CPUs from the
> upwards
> - * domain traversal.
> + * An overlap sched group may not be present in all CPUs that
> compose the
> + * group. So build the mask, marking all the group CPUs where it is
> present.
>   *
>   * Asymmetric node setups can result in situations where the domain
> tree is of
>   * unequal depth, make sure to skip domains that already cover the
> entire
>   * range.
> - *
> - * In that case build_sched_domains() will have terminated the
> iteration early
> - * and our sibling sd spans will be empty. Domains should always
> include the
> - * CPU they're built on, so check that.
>   */

Why are we doing this?

Could the comment explain why things need to be
this way?


> - for_each_cpu(i, span) {
> + for_each_cpu(i, sg_span) {
>   sibling = *per_cpu_ptr(sdd->sd, i);
> - if (!cpumask_test_cpu(i,
> sched_domain_span(sibling)))
> +
> + /*
> +  * Asymmetric node setups: skip domains that are
> already
> +  * done.
> +  */
> + if (!sibling->groups)
> + continue;
> +

What does this mean?

I would really like it if this code was a little
better documented.  This is not something I can
put on you for most of the code, but I can at least
ask it for the code you are adding :)

The same goes for the rest of this patch.

2017-04-13 15:48:21

by Peter Zijlstra

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> Currently, on a 4 nodes NUMA machine with ring topology, two sched
> groups are generated for the last NUMA sched domain. One group has the
> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
> scheduler is unable to directly move tasks between these nodes. In the
> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
> performance is severely impacted because just one node is used while the
> other node remains idle.

I feel a picture would be ever so much clearer.

> This patch constructs the sched groups from each CPU perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> apart.

So I still have no idea what specifically goes wrong and how this fixes
it. Changelog is impenetrable.

"From each CPU's persepective" doesn't really help, there already is a
for_each_cpu() in.

Also, since I'm not sure what happend to the 4 node system, I cannot
begin to imagine what would happen on the 8 node one.

2017-04-13 20:21:08

by Lauro Venancio

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>> apart.
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.
On a 4-node machine with ring topology, the last sched domain level
contains groups with 3 NUMA nodes each. So we have four possible groups:
(0, 1, 2) (1, 2, 3) (2, 3, 0) (3, 0, 1). As we need just two groups to
fill the sched domain, currently, the groups (3, 0, 1) and (1, 2, 3) are
used for all CPUs. The problem is that nodes 1 and 3 belong to both
groups, making it impossible to move tasks between these two nodes.

This patch uses different groups depending on the CPU on which they are
installed. So the CPUs of nodes 0 and 2 keep the same groups as before,
(3, 0, 1) and (1, 2, 3), while the CPUs of nodes 1 and 3 use the new
groups (0, 1, 2) and (2, 3, 0). The first pair of groups allows movement
between nodes 0 and 2; the second pair allows movement between nodes 1 and 3.

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.
The for_each_cpu() is used to iterate across all sched domain CPUs. It
doesn't consider the CPU where the groups are being installed (the cpu
parameter of build_overlap_sched_groups()). Currently, the cpu parameter
is used just for memory allocation and for ordering the groups; it doesn't
change which groups are chosen. This patch uses the cpu parameter to
choose the first group, which, as a consequence, also changes the second group.
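
To make this concrete, here is a minimal userspace sketch (not kernel code;
plain bitmasks stand in for cpumasks and the 1-hop ring neighborhood stands in
for the child domain span) of the per-CPU group construction on the 4-node ring:

#include <stdio.h>

#define NODES 4

/* 1-hop neighborhood of node n on the ring: stands in for the child domain span. */
static unsigned int child_span(int n)
{
        return (1u << n) | (1u << ((n + 1) % NODES)) |
               (1u << ((n + NODES - 1) % NODES));
}

static void print_mask(unsigned int mask)
{
        int i;

        printf("(");
        for (i = 0; i < NODES; i++)
                if (mask & (1u << i))
                        printf(" %d", i);
        printf(" )");
}

int main(void)
{
        int cpu, step;

        for (cpu = 0; cpu < NODES; cpu++) {
                unsigned int covered = 0;

                printf("node %d groups:", cpu);
                /* Scan the domain span starting at @cpu and wrapping around. */
                for (step = 0; step < NODES; step++) {
                        int i = (cpu + step) % NODES;

                        if (covered & (1u << i))
                                continue;
                        covered |= child_span(i);
                        printf(" ");
                        print_mask(child_span(i));
                }
                printf("\n");
        }
        return 0;
}

For nodes 0 and 2 this prints (0 1 3) (1 2 3); for nodes 1 and 3 it prints
(0 1 2) (0 2 3), matching the groups above. With the current code, which always
starts the scan at the first CPU of the span, every node ends up with the same
(0 1 3) (1 2 3) pair.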
>
> Also, since I'm not sure what happend to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.


2017-04-13 21:07:16

by Lauro Venancio

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On 04/13/2017 05:21 PM, Lauro Venancio wrote:
> On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
>> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>>> groups are generated for the last NUMA sched domain. One group has the
>>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>>> scheduler is unable to directly move tasks between these nodes. In the
>>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>>> performance is severely impacted because just one node is used while the
>>> other node remains idle.
>> I feel a picture would be ever so much clearer.
>>
>>> This patch constructs the sched groups from each CPU perspective. So, on
>>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>>> apart.
>> So I still have no idea what specifically goes wrong and how this fixes
>> it. Changelog is impenetrable.
> On a 4 nodes machine with ring topology, the last sched domain level
> contains groups with 3 numa nodes each. So we have four possible groups:
> (0, 1, 2) (1, 2, 3) (2, 3, 0)(3, 0, 1). As we need just two groups to
> fill the sched domain, currently, the groups (3, 0, 1) and (1, 2, 3) are
> used for all CPUs. The problem with it is that nodes 1 and 3 belongs to
> both groups, becoming impossible to move tasks between these two nodes.
>
> This patch uses different groups depending on the CPU they are
> installed. So nodes 0 and 2 CPUs keep the same group as before: (3, 0,
> 1) and (1, 2, 3). Nodes 1 and 3 CPUs use the new groups: (0, 1, 2) and
> (2, 3, 0). So the first pair of groups allows movement between nodes 0
> and 2; and the second pair of groups allows movement between nodes 1 and 3.
>
> I will improve the changelog.
>
>> "From each CPU's persepective" doesn't really help, there already is a
>> for_each_cpu() in.
> The for_each_cpu() is used to iterate across all sched domain cpus. It
> doesn't consider the CPU where the groups are being installed (parameter
> cpu in build_overlap_sched_groups()). Currently, the parameter cpu is
> used just for memory allocation and for ordering the groups, it doesn't
> change the groups that are chosen. This patch uses the parameter cpu to
> choose the first group, changing also, as consequence, the second group.
>> Also, since I'm not sure what happend to the 4 node system, I cannot
>> begin to imagine what would happen on the 8 node one.

Just for clarification, I am sending the node distance tables for the
two most common topologies affected by this issue.

4 nodes, ring topology
node distances:
node   0   1   2   3
  0:  10  20  30  20
  1:  20  10  20  30
  2:  30  20  10  20
  3:  20  30  20  10

8 node, mesh topology
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  16  22  22  16  22  16
  2:  16  16  10  16  16  16  16  22
  3:  22  22  16  10  16  16  22  16
  4:  16  22  16  16  10  16  16  16
  5:  22  16  16  16  16  10  22  22
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  16  22  16  10

2017-04-13 23:38:10

by Rik van Riel

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Thu, 2017-04-13 at 18:06 -0300, Lauro Venancio wrote:

> Just for clarification, I am sending the nodes distance table for the
> two most common typologies affected by this issue.

What do the sched groups look like for these topologies,
before and after your patch series?

> 4 nodes, ring topology
> node distances:
> node   0   1   2   3
>   0:  10  20  30  20
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  20  30  20  10
>
> 8 node, mesh topology
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  22  16  22  16  22
>   1:  16  10  16  22  22  16  22  16
>   2:  16  16  10  16  16  16  16  22
>   3:  22  22  16  10  16  16  22  16
>   4:  16  22  16  16  10  16  16  16
>   5:  22  16  16  16  16  10  22  22
>   6:  16  22  16  22  16  22  10  16
>   7:  22  16  22  16  16  22  16  10

2017-04-14 10:48:14

by Peter Zijlstra

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Thu, Apr 13, 2017 at 07:38:05PM -0400, Rik van Riel wrote:

> What do the sched groups look like for these topologies,
> before and after your patch series?
>
> > 4 nodes, ring topology
> > node distances:
> > node   0   1   2   3
> >   0:  10  20  30  20
> >   1:  20  10  20  30
> >   2:  30  20  10  20
> >   3:  20  30  20  10

kvm -smp 4 -m 4G -display none -monitor null -serial stdio -kernel
defconfig-build/arch/x86/boot/bzImage -append "sched_debug debug
ignore_loglevel earlyprintk=serial,ttyS0,115200,keep
numa=fake=4:10,20,30,20,20,10,20,30,30,20,10,20,20,30,20,10,0"

(FWIW, that's defconfig+kvmconfig+SCHED_DEBUG=y+NUMA_EMU=y)

Gives me:

[ 0.075004] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[ 0.076767] CPU0 attaching sched-domain:
[ 0.077003] domain 0: span 0-1,3 level NUMA
[ 0.078002] groups: 0 1 3
[ 0.079002] domain 1: span 0-3 level NUMA
[ 0.080002] groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[ 0.081005] CPU1 attaching sched-domain:
[ 0.082003] domain 0: span 0-2 level NUMA
[ 0.083002] groups: 1 2 0
[ 0.084002] domain 1: span 0-3 level NUMA
[ 0.085002] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[ 0.086004] CPU2 attaching sched-domain:
[ 0.087002] domain 0: span 1-3 level NUMA
[ 0.088002] groups: 2 3 1
[ 0.089002] domain 1: span 0-3 level NUMA
[ 0.090002] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[ 0.091004] CPU3 attaching sched-domain:
[ 0.092002] domain 0: span 0,2-3 level NUMA
[ 0.093002] groups: 3 0 2
[ 0.094002] domain 1: span 0-3 level NUMA
[ 0.095002] groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[ 0.096004] span: 0-3 (max cpu_capacity = 1024)


With patches it looks like:

[ 0.080006] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[ 0.082545] CPU0 attaching sched-domain:
[ 0.083007] domain 0: span 0-1,3 level NUMA
[ 0.084004] groups: 0 1 3
[ 0.085004] domain 1: span 0-3 level NUMA
[ 0.086004] groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[ 0.087007] CPU1 attaching sched-domain:
[ 0.088004] domain 0: span 0-2 level NUMA
[ 0.089004] groups: 1 0 2
[ 0.090004] domain 1: span 0-3 level NUMA
[ 0.091003] groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
[ 0.092008] CPU2 attaching sched-domain:
[ 0.093004] domain 0: span 1-3 level NUMA
[ 0.094004] groups: 2 1 3
[ 0.095004] domain 1: span 0-3 level NUMA
[ 0.096004] groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[ 0.097007] CPU3 attaching sched-domain:
[ 0.098004] domain 0: span 0,2-3 level NUMA
[ 0.099003] groups: 3 0 2
[ 0.100004] domain 1: span 0-3 level NUMA
[ 0.101003] groups: 0,2-3 (cpu_capacity = 3072) 0-2 (cpu_capacity = 3072)
[ 0.102007] span: 0-3 (max cpu_capacity = 1024)


Now let me try and reverse engineer those patches ..

2017-04-14 11:38:23

by Peter Zijlstra

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> This patch constructs the sched groups from each CPU perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> apart.

Ah,.. so after drawing pictures I see what went wrong; duh :-(

An equivalent patch would be (if for_each_cpu_wrap() were exposed):

@@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
- int i;
+ int i, wrap;

cpumask_clear(covered);

- for_each_cpu(i, span) {
+ for_each_cpu_wrap(i, span, cpu, wrap) {
struct cpumask *sg_span;

if (cpumask_test_cpu(i, covered))


We need to start iterating at @cpu, not start at 0 every time.


2017-04-14 12:20:17

by Peter Zijlstra

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:

> An equivalent patch would be (if for_each_cpu_wrap() were exposed):

---
Subject: sched,cpumask: Export for_each_cpu_wrap()

More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/cpumask.h | 16 ++++++++++++++++
kernel/sched/fair.c | 45 ++++-----------------------------------------
lib/cpumask.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+), 41 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 96f1e88..4b87b7b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -236,6 +236,22 @@ unsigned int cpumask_local_spread(unsigned int i, int node);
(cpu) = cpumask_next_zero((cpu), (mask)), \
(cpu) < nr_cpu_ids;)

+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start) \
+ for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false); \
+ (cpu) < nr_cpumask_bits; \
+ (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
/**
* for_each_cpu_and - iterate over every cpu in both masks
* @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..d89d700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5640,43 +5640,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
}

-/*
- * Implement a for_each_cpu() variant that starts the scan at a given cpu
- * (@start), and wraps around.
- *
- * This is used to scan for idle CPUs; such that not all CPUs looking for an
- * idle CPU find the same CPU. The down-side is that tasks tend to cycle
- * through the LLC domain.
- *
- * Especially tbench is found sensitive to this.
- */
-
-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{
- int next;
-
-again:
- next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
-
- if (*wrapped) {
- if (next >= start)
- return nr_cpumask_bits;
- } else {
- if (next >= nr_cpumask_bits) {
- *wrapped = 1;
- n = -1;
- goto again;
- }
- }
-
- return next;
-}
-
-#define for_each_cpu_wrap(cpu, mask, start, wrap) \
- for ((wrap) = 0, (cpu) = (start)-1; \
- (cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)), \
- (cpu) < nr_cpumask_bits; )
-
#ifdef CONFIG_SCHED_SMT

static inline void set_idle_cores(int cpu, int val)
@@ -5736,7 +5699,7 @@ void __update_idle_core(struct rq *rq)
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu, wrap;
+ int core, cpu;

if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -5746,7 +5709,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);

- for_each_cpu_wrap(core, cpus, target, wrap) {
+ for_each_cpu_wrap(core, cpus, target) {
bool idle = true;

for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -5812,7 +5775,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle = this_rq()->avg_idle;
u64 time, cost;
s64 delta;
- int cpu, wrap;
+ int cpu;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -5829,7 +5792,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

time = local_clock();

- for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
if (idle_cpu(cpu))
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 81dedaa..4731a08 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
}
EXPORT_SYMBOL(cpumask_any_but);

+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+ int next;
+
+again:
+ next = cpumask_next(n, mask);
+
+ if (wrap && n < start && next >= start) {
+ return nr_cpumask_bits;
+
+ } else if (next >= nr_cpumask_bits) {
+ wrap = true;
+ n = -1;
+ goto again;
+ }
+
+ return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
/* These are not inline because of header tangles. */
#ifdef CONFIG_CPUMASK_OFFSTACK
/**

2017-04-14 16:49:23

by Peter Zijlstra

Subject: Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
> Currently, the group balance cpu is the groups's first CPU. But with
> overlapping groups, two different groups can have the same first CPU.
>
> This patch uses the group mask to mark all the CPUs that have a
> particular group as its main sched group. The group balance cpu is the
> first group CPU that is also in the mask.

Please give a NUMA configuration and CPU number where this goes wrong.

Because only the first group of a domain matters, and with the other
thing fixed, I'm not immediately seeing where we go wobbly.

2017-04-14 16:59:15

by Peter Zijlstra

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> > This patch constructs the sched groups from each CPU perspective. So, on
> > a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> > groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> > [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
> > apart.
>
> Ah,.. so after drawing pictures I see what went wrong; duh :-(
>
> An equivalent patch would be (if for_each_cpu_wrap() were exposed):
>
> @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> struct cpumask *covered = sched_domains_tmpmask;
> struct sd_data *sdd = sd->private;
> struct sched_domain *sibling;
> - int i;
> + int i, wrap;
>
> cpumask_clear(covered);
>
> - for_each_cpu(i, span) {
> + for_each_cpu_wrap(i, span, cpu, wrap) {
> struct cpumask *sg_span;
>
> if (cpumask_test_cpu(i, covered))
>
>
> We need to start iterating at @cpu, not start at 0 every time.
>
>

OK, please have a look here:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

2017-04-17 14:41:04

by Lauro Venancio

Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

On 04/14/2017 01:58 PM, Peter Zijlstra wrote:
> On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:
>> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>>> This patch constructs the sched groups from each CPU perspective. So, on
>>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>>> apart.
>> Ah,.. so after drawing pictures I see what went wrong; duh :-(
>>
>> An equivalent patch would be (if for_each_cpu_wrap() were exposed):
>>
>> @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
>> struct cpumask *covered = sched_domains_tmpmask;
>> struct sd_data *sdd = sd->private;
>> struct sched_domain *sibling;
>> - int i;
>> + int i, wrap;
>>
>> cpumask_clear(covered);
>>
>> - for_each_cpu(i, span) {
>> + for_each_cpu_wrap(i, span, cpu, wrap) {
>> struct cpumask *sg_span;
>>
>> if (cpumask_test_cpu(i, covered))
>>
>>
>> We need to start iterating at @cpu, not start at 0 every time.
>>
>>
> OK, please have a look here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

Looks good, but please hold these patches until patch 3 is applied.
Without it, the sched_group_capacity (sg->sgc) instance is not selected
correctly and we have a significant performance regression on all NUMA
machines.

I will continue this discussion in the other thread.

2017-04-17 15:34:12

by Lauro Venancio

Subject: Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

On 04/14/2017 01:49 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
>> Currently, the group balance cpu is the groups's first CPU. But with
>> overlapping groups, two different groups can have the same first CPU.
>>
>> This patch uses the group mask to mark all the CPUs that have a
>> particular group as its main sched group. The group balance cpu is the
>> first group CPU that is also in the mask.
> Please give a NUMA configuration and CPU number where this goes wrong.
On a 4-node machine with ring topology, the groups (0-1,3 [cpu 0]), (0-2
[cpu 1]) and (0,2-3 [cpu 3]) share the same sched_group_capacity instance
when the group's first CPU is used to select the sgc.
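
For illustration only, a tiny userspace sketch (bitmasks standing in for
cpumasks) of why selecting the sgc by the group's first CPU collapses these
three different groups onto the same slot:

#include <stdio.h>

/* The three top-level group spans mentioned above (bit n = node n). */
static const unsigned int group_span[] = {
        0x0b,   /* 0-1,3  installed on cpu 0 */
        0x07,   /* 0-2    installed on cpu 1 */
        0x0d,   /* 0,2-3  installed on cpu 3 */
};

/* cpumask_first() equivalent: index of the lowest set bit. */
static int mask_first(unsigned int mask)
{
        int i;

        for (i = 0; i < 32; i++)
                if (mask & (1u << i))
                        return i;
        return -1;
}

int main(void)
{
        int g;

        /* All three groups map to sgc[0]: one sched_group_capacity shared
         * by three different groups. */
        for (g = 0; g < 3; g++)
                printf("group 0x%02x -> sgc[%d]\n",
                       group_span[g], mask_first(group_span[g]));
        return 0;
}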
>
> Because only the first group of a domain matters, and with the other
> thing fixed, I'm not immediately seeing where we go wobbly.

Before patch 2, the group balance cpu was implicitly used to select the
sched_group_capacity instance. When two different groups had the same
balance cpu, they shared the same sched_group_capacity instance.

After patch 2, a different sched_group_capacity instance is assigned
to each group instance.


This patch ensures three things:

1) different instances of the same group share the same
sched_group_capacity instance.

2) instances of different groups don't share the same
sched_group_capacity instance.

3) the group balance cpu must be one of the cpus where the group is
installed.
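
As a rough sketch of how the balance-cpu rule from this patch satisfies these
properties on the 4-node ring (userspace, bitmasks for cpumasks; main_group()
assumes the per-CPU first groups produced by patch 2):

#include <stdio.h>

#define NODES 4

/* Main (first) group span installed on node n after patch 2:
 * its 1-hop neighborhood on the ring. */
static unsigned int main_group(int n)
{
        return (1u << n) | (1u << ((n + 1) % NODES)) |
               (1u << ((n + NODES - 1) % NODES));
}

/* Balance cpu of a group span: the first cpu in the span whose own main
 * group is exactly this span (the rule introduced by this patch). */
static int balance_cpu(unsigned int span)
{
        int i;

        for (i = 0; i < NODES; i++)
                if ((span & (1u << i)) && main_group(i) == span)
                        return i;
        return -1;
}

int main(void)
{
        int n;

        /* Each of the four distinct group spans gets a distinct balance cpu,
         * and that cpu is always a member of the group. */
        for (n = 0; n < NODES; n++)
                printf("group of node %d (mask 0x%x) -> balance cpu %d\n",
                       n, main_group(n), balance_cpu(main_group(n)));
        return 0;
}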


I am rebasing this patch on top of your patches.

2017-04-18 12:32:27

by Peter Zijlstra

Subject: Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

On Mon, Apr 17, 2017 at 12:34:05PM -0300, Lauro Venancio wrote:
> This patch ensures tree things:
>
> 1) different instances of the same group share the same
> sched_group_capacity instance.
>
> 2) instances of different groups don't share the same
> sched_group_capacity instance.
>
> 3) the group balance cpu must be one of the cpus where the group is
> installed.
>
>
> I am rebasing this patch on top of your patches.

Well, I would rather have 3 patches, each with a comprehensible
changelog.

I had already rebased the patch (trivial) so that's not the problem.
Explaining what and why it does things is.

And as per the usual rules, if it does 3 things it should be 3 patches.