For a single LLC per node, a NUMA imbalance is allowed until 25% of the
CPUs sharing a node could be active. One intent of the cut-off is to
avoid over-saturating memory channels, but there is no topological
information on active memory channels. Furthermore, there can be
differences between nodes depending on the number of populated DIMMs.

A cut-off of 25% was arbitrary but generally worked. It does have a
severe corner case though: a parallel workload using 25% of all
available CPUs can over-saturate memory channels. This can happen when
the initially forked tasks get pulled towards one node after early
wakeups (e.g. a barrier synchronisation) and the imbalance is not
quickly corrected by the load balancer. The load balancer may fail to
act quickly because the parallel tasks are considered poor migration
candidates due to locality or cache hotness.

On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
assuming all memory channels are populated, so it is used as the new
cut-off point. A minimum of 1 is specified so that a communicating
pair can remain local even on CPUs with a low core count. Modern AMD
CPUs have multiple LLCs per node and are not affected.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/topology.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 810750e62118..2740e245cb37 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2295,23 +2295,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/*
* For a single LLC per node, allow an
- * imbalance up to 25% of the node. This is an
- * arbitrary cutoff based on SMT-2 to balance
- * between memory bandwidth and avoiding
- * premature sharing of HT resources and SMT-4
- * or SMT-8 *may* benefit from a different
- * cutoff.
+ * imbalance up to 12.5% of the node. This is an
+ * arbitrary cutoff based on two factors -- SMT and
+ * memory channels. For SMT-2, the intent is to
+ * avoid premature sharing of HT resources but
+ * SMT-4 or SMT-8 *may* benefit from a different
+ * cutoff. For memory channels, this is a very
+ * rough estimate of how many channels may be
+ * active and is based on recent CPUs with
+ * many cores.
*
* For multiple LLCs, allow an imbalance
* until multiple tasks would share an LLC
* on one node while LLCs on another node
- * remain idle.
+ * remain idle. This assumes that there are
+ * enough logical CPUs per LLC to avoid SMT
+ * factors and that there is a correlation
+ * between LLCs and memory channels.
*/
nr_llcs = sd->span_weight / child->span_weight;
if (nr_llcs == 1)
- imb = sd->span_weight >> 2;
+ imb = sd->span_weight >> 3;
else
imb = nr_llcs;
+ imb = max(1U, imb);
sd->imb_numa_nr = imb;
/* Set span based on the first NUMA domain. */
--
2.34.1
On Wed, May 11, 2022 at 03:30:38PM +0100, Mel Gorman wrote:
> For a single LLC per node, a NUMA imbalance is allowed until 25% of the
> CPUs sharing a node could be active. One intent of the cut-off is to
> avoid over-saturating memory channels, but there is no topological
> information on active memory channels. Furthermore, there can be
> differences between nodes depending on the number of populated DIMMs.
>
> A cut-off of 25% was arbitrary but generally worked. It does have a
> severe corner case though: a parallel workload using 25% of all
> available CPUs can over-saturate memory channels. This can happen when
> the initially forked tasks get pulled towards one node after early
> wakeups (e.g. a barrier synchronisation) and the imbalance is not
> quickly corrected by the load balancer. The load balancer may fail to
> act quickly because the parallel tasks are considered poor migration
> candidates due to locality or cache hotness.
>
> On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
> assuming all memory channels are populated, so it is used as the new
> cut-off point. A minimum of 1 is specified so that a communicating
> pair can remain local even on CPUs with a low core count. Modern AMD
> CPUs have multiple LLCs per node and are not affected.
Can the hardware tell us about memory channels?
On Wed, May 18, 2022 at 11:41:12AM +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 03:30:38PM +0100, Mel Gorman wrote:
> > For a single LLC per node, a NUMA imbalance is allowed until 25% of the
> > CPUs sharing a node could be active. One intent of the cut-off is to
> > avoid over-saturating memory channels, but there is no topological
> > information on active memory channels. Furthermore, there can be
> > differences between nodes depending on the number of populated DIMMs.
> >
> > A cut-off of 25% was arbitrary but generally worked. It does have a
> > severe corner case though: a parallel workload using 25% of all
> > available CPUs can over-saturate memory channels. This can happen when
> > the initially forked tasks get pulled towards one node after early
> > wakeups (e.g. a barrier synchronisation) and the imbalance is not
> > quickly corrected by the load balancer. The load balancer may fail to
> > act quickly because the parallel tasks are considered poor migration
> > candidates due to locality or cache hotness.
> >
> > On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
> > assuming all memory channels are populated, so it is used as the new
> > cut-off point. A minimum of 1 is specified so that a communicating
> > pair can remain local even on CPUs with a low core count. Modern AMD
> > CPUs have multiple LLCs per node and are not affected.
>
> Can the hardware tell us about memory channels?
It's in the SMBIOS table somewhere as it's available via dmidecode. For
example, on a 2-socket machine:
$ dmidecode -t memory | grep -E "Size|Bank"
Size: 8192 MB
Bank Locator: P0_Node0_Channel0_Dimm0
Size: No Module Installed
Bank Locator: P0_Node0_Channel0_Dimm1
Size: 8192 MB
Bank Locator: P0_Node0_Channel1_Dimm0
Size: No Module Installed
Bank Locator: P0_Node0_Channel1_Dimm1
Size: 8192 MB
Bank Locator: P0_Node0_Channel2_Dimm0
Size: No Module Installed
Bank Locator: P0_Node0_Channel2_Dimm1
Size: 8192 MB
Bank Locator: P0_Node0_Channel3_Dimm0
Size: No Module Installed
Bank Locator: P0_Node0_Channel3_Dimm1
Size: 8192 MB
Bank Locator: P1_Node1_Channel0_Dimm0
Size: No Module Installed
Bank Locator: P1_Node1_Channel0_Dimm1
Size: 8192 MB
Bank Locator: P1_Node1_Channel1_Dimm0
Size: No Module Installed
Bank Locator: P1_Node1_Channel1_Dimm1
Size: 8192 MB
Bank Locator: P1_Node1_Channel2_Dimm0
Size: No Module Installed
Bank Locator: P1_Node1_Channel2_Dimm1
Size: 8192 MB
Bank Locator: P1_Node1_Channel3_Dimm0
Size: No Module Installed
Bank Locator: P1_Node1_Channel3_Dimm1
SMBIOS contains the information on the number of channels and whether
they are populated with at least one DIMM.
I'm not aware of how it can be done in-kernel on a cross-architectural
basis. Reading through the arch manual, I see that it states how many
channels a given processor family has and that the information is
available during memory check errors (apparently via the EDAC driver).
It's sometimes available via PMUs, but I couldn't find a place where
it's generically available to topology.c that would work on all x86-64
machines, let alone every other architecture.
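For illustration only (not a proposal): on machines that do export
SMBIOS, the in-kernel DMI code can walk the table once it has been
scanned. A sketch that merely counts populated DIMMs -- the function
name is made up, error handling is omitted and mapping DIMMs to
channels per node would still need the locator strings:

#include <linux/dmi.h>

/* SMBIOS type 17 is "Memory Device". Size is a WORD at offset 0x0c;
 * 0 means the socket is empty, 0xffff means the size is unknown.
 */
static void count_populated_dimms(const struct dmi_header *dh, void *priv)
{
	const u8 *d = (const u8 *)dh;
	int *populated = priv;
	u16 size;

	if (dh->type != 17 || dh->length < 0x0e)
		return;

	size = d[0x0c] | (d[0x0d] << 8);
	if (size != 0 && size != 0xffff)
		(*populated)++;
}

/* usage: int nr_dimms = 0; dmi_walk(count_populated_dimms, &nr_dimms); */

That still says nothing on architectures without SMBIOS, which is the
cross-architectural problem.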
It's not even clear, if SMBIOS was parsed in early boot, whether it's a
good idea. It could result in different imbalance thresholds for each
NUMA domain or weird corner cases where asymmetric NUMA node populations
would result in run-to-run variance that is difficult to analyse.
--
Mel Gorman
SUSE Labs
On Wed, May 18, 2022 at 12:15:39PM +0100, Mel Gorman wrote:
> I'm not aware of how it can be done in-kernel on a cross-architectural
> basis. Reading through the arch manual, I see that it states how many
> channels a given processor family has and that the information is
> available during memory check errors (apparently via the EDAC driver).
> It's sometimes available via PMUs, but I couldn't find a place where
> it's generically available to topology.c that would work on all x86-64
> machines, let alone every other architecture.
So provided it is something we want (below) we can always start an arch
interface and fill it out where needed.
> It's not even clear, if SMBIOS was parsed in early boot, whether
We can always rebuild topology / update variables slightly later in
boot.
> it's a
> good idea. It could result in different imbalance thresholds for each
> NUMA domain or weird corner cases where asymmetric NUMA node populations
> would result in run-to-run variance that is difficult to analyse.
Yeah, maybe. OTOH having a magic value that's guestimated based on
hardware of the day is something that'll go bad any moment as well.
I'm not too worried about run-to-run since people don't typically change
DIMM population over a reboot, but yes, there's always going to be
corner cases. Same with a fixed value though, that's also going to be
wrong.
On Wed, May 18, 2022 at 04:05:03PM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2022 at 12:15:39PM +0100, Mel Gorman wrote:
>
> > I'm not aware of how it can be done in-kernel on a cross-architectural
> > basis. Reading through the arch manual, I see that it states how many
> > channels a given processor family has and that the information is
> > available during memory check errors (apparently via the EDAC driver).
> > It's sometimes available via PMUs, but I couldn't find a place where
> > it's generically available to topology.c that would work on all x86-64
> > machines, let alone every other architecture.
>
> So provided it is something we want (below) we can always start an arch
> interface and fill it out where needed.
>
It could start with a function returning a fixed value that architectures
can override, but discovering and wiring it all up might be a deep rabbit
hole. The most straightforward approach would be based on CPU family and
model, but that is time-consuming to maintain, and it gets fuzzy for
something like PowerKVM where the channel details are hidden.
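As a rough illustration of that starting point (the hook name and the
fallback are made up, nothing here is a proposal):

#include <linux/compiler.h>

/*
 * Hypothetical arch hook: report the number of memory channels backing
 * a node, or -1 if the architecture has no idea.
 */
int __weak arch_node_mem_channels(int node)
{
	return -1;
}

build_sched_domains() could then prefer a reported channel count over
the fixed sd->span_weight >> 3 guess, and only the architectures that
can actually answer the question would need to override it.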
> > It's not even clear, if SMBIOS was parsed in early boot, whether
>
> We can always rebuild topology / update variables slightly later in
> boot.
>
> > it's a
> > good idea. It could result in different imbalance thresholds for each
> > NUMA domain or weird corner cases where asymmetric NUMA node populations
> > would result in run-to-run variance that is difficult to analyse.
>
> Yeah, maybe. OTOH having a magic value that's guestimated based on
> hardware of the day is something that'll go bad any moment as well.
>
> I'm not too worried about run-to-run since people don't typically change
> DIMM population over a reboot, but yes, there's always going to be
> corner cases. Same with a fixed value though, that's also going to be
> wrong.
>
By run-to-run, I mean just running the same workload in a loop and
not rebooting between runs. If there are differences in how nodes are
populated, there will be some run-to-run variance based purely on which
node the workload started on, because the nodes will have different
"allowed imbalance" thresholds.
I'm running the tests to recheck exactly how much impact this patch has
on the peak performance. It takes a few hours so I won't have anything
until tomorrow.
Initially "get peak performance" and "stabilise run-to-run variances"
were my objectives. This series only aimed at the peak performance for a
finish as allowed NUMA imbalance was not the sole cause of the problem.
I still haven't spent time figuring out why c6f886546cb8 ("sched/fair:
Trigger the update of blocked load on newly idle cpu") made such a big
difference to variability.
--
Mel Gorman
SUSE Labs
On Wed, May 18, 2022 at 06:06:25PM +0100, Mel Gorman wrote:
> I'm running the tests to recheck exactly how much impact this patch has
> on the peak performance. It takes a few hours so I won't have anything
> until tomorrow.
>
It wasn't my imagination; the last patch was worth a few percent:
v5.3 Min 95.84 Max 96.55 Range 0.71 Mean 96.16
v5.7 Min 95.44 Max 96.51 Range 1.07 Mean 96.14
v5.8 Min 96.02 Max 197.08 Range 101.06 Mean 154.70
v5.12 Min 104.45 Max 111.03 Range 6.58 Mean 105.94
v5.13 Min 104.38 Max 170.37 Range 65.99 Mean 117.35
v5.13-revert-c6f886546cb8 Min 104.40 Max 110.70 Range 6.30 Mean 105.68
v5.18rc4-baseline Min 110.78 Max 169.84 Range 59.06 Mean 131.22
v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range 3.31 Mean 114.71
v5.18rc4-shiftimb3-v2r2 Min 95.34 Max 165.75 Range 70.41 Mean 120.92
v5.18rc4-consistimb-v2r2 Min 104.02 Max 175.22 Range 71.20 Mean 116.62
v5.18rc4-consistimb-revert-c6f886546cb8 Min 104.02 Max 112.56 Range 8.54 Mean 105.52
v5.18rc4-shiftimb3-v2r2 is patches 1-4
v5.18rc4-consistimb-v2r2 is patches 1-3
The last patch is worth around 8% and is what's necessary to bring best
performance back to kernel v5.7 levels. As you can see from the range,
the result is unstable, but reverting c6f886546cb8 reduces the variance
by a lot. Not enough to be at v5.7 levels, but enough to indicate that
the allowed NUMA imbalance changes are not the sole source of the
problem.
--
Mel Gorman
SUSE Labs