2019-12-19 02:59:14

by Rik van Riel

Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

On Wed, 2019-12-18 at 15:44 +0000, Mel Gorman wrote:

> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {
> +				env->imbalance = 0;
> +			}
> +		}
> 		return;
> 	}

I can see how the 50% point is often great for HT,
but I wonder if that is also the case for SMT4 and
SMT8 systems...
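
For illustration, a rough user-space sketch (not kernel code) of where
that 50% cutoff lands as the SMT width grows; the 16-core node and the
imbalance_adj value of 2 are assumptions, not values from the patch:

#include <stdio.h>

int main(void)
{
	int cores = 16;			/* assumed cores per NUMA node */
	int imbalance_adj = 2;		/* assumed adjustment value */

	for (int smt = 2; smt <= 8; smt <<= 1) {
		int group_weight = cores * smt;	/* logical CPUs in group */
		int imbalance_max = (group_weight >> 1) - imbalance_adj;

		/*
		 * Busy siblings per core at the cutoff: ~1 on SMT2 but
		 * ~2 on SMT4 and ~4 on SMT8, so "50% of CPUs" implies
		 * very different SMT pressure per core.
		 */
		printf("SMT%d: group_weight=%3d cutoff=%3d (%.2f busy siblings/core)\n",
		       smt, group_weight, imbalance_max,
		       (double)imbalance_max / cores);
	}
	return 0;
}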

--
All Rights Reversed.


2019-12-19 08:43:44

by Mel Gorman

Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

On Wed, Dec 18, 2019 at 09:58:01PM -0500, Rik van Riel wrote:
> On Wed, 2019-12-18 at 15:44 +0000, Mel Gorman wrote:
>
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {
> > +				env->imbalance = 0;
> > +			}
> > +		}
> > 		return;
> > 	}
>
> I can see how the 50% point is often great for HT,
> but I wonder if that is also the case for SMT4 and
> SMT8 systems...
>

Maybe, maybe not, but it's not the most important concern. The highlight
in the comment was memory bandwidth; HT was simply an additional
concern. Ideally, memory bandwidth and consumption would both be taken
into account, but we know nothing about either. Even if peak memory
bandwidth were known, the reference pattern matters a *lot*, which can
be readily illustrated by running STREAM and observing the different
bandwidths for different reference patterns. Similarly, while we might
know which pages were referenced, we do not know the bandwidth consumed
without incurring additional overhead with a PMU. Hence, it makes sense
to at least hope that the active tasks have similar memory bandwidth
requirements and to load balance as normal once we are near the point
where 50% of the CPUs have active tasks. If SMT4 or SMT8 systems have
different requirements, or if the SMT width matters for memory
bandwidth, then it would need to be carefully examined by someone with
access to such hardware to determine an arch-specific, and maybe even
per-CPU-family, cutoff.
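
As a purely hypothetical sketch of what such a cutoff could look like,
something along these lines; numa_imbalance_cutoff() and its smt_width
parameter are invented names for illustration, not an existing kernel
interface:

/*
 * Cap the cutoff at roughly "one busy sibling per core" rather than
 * half of all logical CPUs. For SMT2 this is identical to the patch's
 * group_weight >> 1; for SMT4 and SMT8 it is stricter.
 */
static inline int numa_imbalance_cutoff(int group_weight, int smt_width)
{
	return group_weight / smt_width;
}

with the caller computing imbalance_max from that helper minus
imbalance_adj instead of the open-coded shift.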

In the context of this patch, what unconditionally makes sense is the
basic case: two communicating tasks should not be migrated cross-node
on wakeup and then again on load balance.
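
To make that concrete, here is a minimal standalone trace of the quoted
check for the two-task case; the imbalance_adj of 2, the 40-CPU node
and the computed imbalance of 1 are illustrative assumptions:

#include <assert.h>

int main(void)
{
	int imbalance_adj = 2;		/* assumed adjustment value */
	int group_weight = 40;		/* e.g. a 20-core SMT2 node */
	int sum_nr_running = 2;		/* the communicating pair */
	int imbalance = 1;		/* one task "too many" */
	int imbalance_max = (group_weight >> 1) - imbalance_adj;

	/*
	 * The patch's check: a small imbalance on a mostly idle node
	 * is ignored, so the pair stays on one node.
	 */
	if (imbalance <= imbalance_adj && sum_nr_running < imbalance_max)
		imbalance = 0;

	assert(imbalance == 0);
	return 0;
}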

--
Mel Gorman
SUSE Labs