2023-12-16 00:19:45

by Chris Hyser

Subject: [RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA performance?

The commentary around the initial Oracle Soft Affinity proposal [1]
recommended investigating the use of numa_preferred_nid as a better solution.
The primary driver for the original proposal (as well as now) is better NUMA
performance for important tasks accessing RDMA-pinned memory. I wanted a
fairly simple test to explore the various aspects of NUMA performance that
didn't require lots of time running TPC-C on a tuned DB as Subhra had done. I
needed something that would allow both task and memory placement, with usable
NUMA sensitivity, and I think I stumbled onto something quite useful. As the
test is only concerned with the NUMA effects of scheduler/balancer placement
decisions (no locks, no communication, no syscalls, etc. during the timed
loop), it does not represent any actual useful load, making it, I suppose, a
NUMA micro-benchmark.

A simplified description of the resulting benchmark: a probe process times an
outer loop doing a specified "counts" worth of a tight inner loop. In
sequential mode the inner loop would access every u64 in a large buffer; here
it instead performs an equivalent number of random (u64-aligned) accesses
into the buffer, each a 64-bit read followed by a 64-bit write (the code
provides seq vs rand access as well as various access patterns, but this is
the combo most interesting here). The probe's buffer memory is either allowed
to float or bound to particular NUMA nodes, while the NUMA affinity of the
process itself can also be set (using hard affinity), as can the prctl() in
patch 2 that sets a "Preferred Node Affinity". The main difference between
this and probably dozens of similar programs is that the probe isn't the
benchmark; it's just an extremely NUMA-sensitive process. If you run it by
itself on an idle system it will park on a CPU, fill up the associated
caches, and tell you absolutely nothing.

What ultimately makes this interesting is running it in the presence of load,
specifically a constant percentage of CPU-only load replicated and pinned on
each CPU. So, for example, htop would show all but one CPU at, say, 60% (what
I used in generating the results here, though the "effect" occurs even with
just a 1% load), with that lone CPU running the probe and pegged at 100%. The
result is that the load balancer really feels the need to balance, and the
NUMA awareness of those placement decisions is clearly discernible in the
probe's measured times. As well, the runtimes are sufficiently short to
enable tracing the entire life of the probe and categorizing all migrations
as 'same core', 'same node', and 'cross node'.
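The background load can be sketched as a per-CPU duty-cycle spinner, one copy
pinned to each CPU (an illustrative reconstruction, not the actual load tool;
the 10 ms period and the helper names are my choice):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <time.h>

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Run one duty cycle: busy-spin for pct% of period_ns, then sleep the rest.
 * Looping over this forever yields an average CPU utilization of ~pct%.
 * Returns the fraction of the period actually spent spinning.
 */
double load_cycle(int pct, int64_t period_ns)
{
	int64_t start = now_ns();
	int64_t busy_ns = period_ns * pct / 100;

	while (now_ns() - start < busy_ns)
		;	/* burn CPU */
	int64_t spun = now_ns() - start;

	struct timespec idle = {
		.tv_sec = 0,
		.tv_nsec = period_ns - busy_ns,
	};
	nanosleep(&idle, NULL);
	return (double)spun / period_ns;
}

/* Pin the calling thread to a single CPU, as each load replica is. */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```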

The above is a minimal description of the benchmark. I will be making it
available if people are interested (that, and when I get internal stuff
sorted), so after the holidays.

In terms of showing results, I also have test data for an AMD 8-node and an
ARM64 2-node box. I've also run tests exploring the benchmark over a range of
different migration_cost_ns values. Again, if people are interested, I have
data to share.

Test Results:
--------------
The below tests were run on an Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
box. This has two LLC-spanned memory nodes and 104 CPUs. The kernel was recent
tip:sched/core with the included patches (POC only) just to show the changes.

Key:
-----------------
NB - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, ie 'Preferred Node Affinity'.
(given 2 nodes: 0, 1, and -1 for not_set)
Mem - represents the Memory node when memory is bound, else 'F' floating,
ie not set
CPU - represents the CPUs of the node that the probe is hard-affined to, else
'F' floating, ie not set
Avg - the average time of the probe's measurements in secs

Each line below represents the average of 64 test runs with the indicated
parameters.

NumSamples: 64
Kernel: 6.7.0-rc1_ch_pna7_7+_#213 SMP PREEMPT_DYNAMIC Thu Dec 7 15:16:59 EST 2023
Load: 60
CPU_Model: IntelR XeonR Platinum 8167M CPU @ 2.00GHz
NUM_CPUS: 104
migration_cost_ns: 500000

Avg max min stdv | Test Parameters
----------------------------------------------------------------------
[00] 136.50 141.76 122.08 2.95 | PNID: -1 NB: 0 Mem: 0 CPU: 0
[01] 168.78 172.07 156.04 2.58 | PNID: -1 NB: 0 Mem: 0 CPU: 1
[02] 173.15 180.73 153.41 4.89 | PNID: -1 NB: 0 Mem: 0 CPU: F
[03] 165.95 169.17 162.13 1.57 | PNID: -1 NB: 0 Mem: 1 CPU: 0
[04] 137.23 144.28 123.75 4.97 | PNID: -1 NB: 0 Mem: 1 CPU: 1
[05] 179.90 187.21 165.90 3.73 | PNID: -1 NB: 0 Mem: 1 CPU: F
[06] 163.87 170.68 147.56 6.31 | PNID: -1 NB: 0 Mem: F CPU: 0
[07] 168.96 174.40 156.51 3.74 | PNID: -1 NB: 0 Mem: F CPU: 1
[08] 180.71 185.51 169.74 3.33 | PNID: -1 NB: 0 Mem: F CPU: F

[09] 135.68 139.28 119.92 2.93 | PNID: -1 NB: 1 Mem: 0 CPU: 0
[10] 166.60 169.82 160.05 1.76 | PNID: -1 NB: 1 Mem: 0 CPU: 1
[11] 171.97 181.91 163.94 3.70 | PNID: -1 NB: 1 Mem: 0 CPU: F
[12] 164.01 170.34 152.37 2.82 | PNID: -1 NB: 1 Mem: 1 CPU: 0
[13] 138.01 142.27 135.20 1.22 | PNID: -1 NB: 1 Mem: 1 CPU: 1
[14] 177.07 184.39 163.89 3.56 | PNID: -1 NB: 1 Mem: 1 CPU: F
[15] 165.70 171.33 154.46 2.41 | PNID: -1 NB: 1 Mem: F CPU: 0
[16] 165.18 170.83 149.12 5.99 | PNID: -1 NB: 1 Mem: F CPU: 1
[17] 148.91 163.04 134.31 5.48 | PNID: -1 NB: 1 Mem: F CPU: F

[18] 135.63 138.63 122.85 2.07 | PNID: 0 NB: 1 Mem: 0 CPU: 0
[19] 162.38 170.60 146.03 6.73 | PNID: 0 NB: 1 Mem: 0 CPU: 1
[20] 129.20 135.26 114.55 3.28 | PNID: 0 NB: 1 Mem: 0 CPU: F
[21] 161.71 168.72 144.87 5.55 | PNID: 0 NB: 1 Mem: 1 CPU: 0
[22] 135.72 140.44 123.34 3.10 | PNID: 0 NB: 1 Mem: 1 CPU: 1
[23] 155.07 162.20 138.71 4.50 | PNID: 0 NB: 1 Mem: 1 CPU: F
[24] 163.42 169.29 146.95 5.04 | PNID: 0 NB: 1 Mem: F CPU: 0
[25] 165.90 170.44 157.56 1.67 | PNID: 0 NB: 1 Mem: F CPU: 1
[26] 140.45 148.37 117.02 5.81 | PNID: 0 NB: 1 Mem: F CPU: F

[27] 135.26 140.78 123.29 2.30 | PNID: 1 NB: 1 Mem: 0 CPU: 0
[28] 166.22 169.51 148.18 2.65 | PNID: 1 NB: 1 Mem: 0 CPU: 1
[29] 157.91 165.94 153.48 2.75 | PNID: 1 NB: 1 Mem: 0 CPU: F
[30] 162.08 166.76 148.14 3.37 | PNID: 1 NB: 1 Mem: 1 CPU: 0
[31] 136.86 140.03 127.42 2.01 | PNID: 1 NB: 1 Mem: 1 CPU: 1
[32] 131.85 141.38 114.66 5.55 | PNID: 1 NB: 1 Mem: 1 CPU: F
[33] 163.64 169.48 149.35 2.74 | PNID: 1 NB: 1 Mem: F CPU: 0
[34] 165.94 170.47 156.10 2.41 | PNID: 1 NB: 1 Mem: F CPU: 1
[35] 145.28 154.64 137.84 3.60 | PNID: 1 NB: 1 Mem: F CPU: F

Observations:
---------------
First we see the expected result: memory and CPU bound/pinned on the same
node {0,4,9,13,18,22,27,31} is quite a bit faster than when bound/pinned on
different nodes {1,3,10,12,19,21,28,30}. Completely unexpected was that when
binding memory to a node but allowing the CPU to float (ie, letting the
scheduler "schedule" and the load balancer "balance"), or letting both float,
the performance is as bad as or worse than pinning CPUs and memory on
different nodes {2,5,8,11,14}. NB does help when both memory and the CPU
float.

How is that possible? I did some traces of the probe with identical
params/kernel etc. The migrations were then categorized as "same-core",
"same-node (minus same-core)", and "cross-node".

Given this platform, a reasonable hypothesis is that cross-node migrations are
thrashing the LLC and that is a big deal from a pure NUMA perspective. Is
there a general correlation between the number of cross-node migrations and
the longer completion times? The answer, I believe, is yes. (The below are
representative samples versus averages, as there is still a manual step.)

When both memory and the CPUs are pinned (same node or different) we see no
cross-node migrations (the single 1 is from when the probe started on a
different node than the one it was later hard-affined to).

CPU: 0, Mem: 0, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 846         num_migrations_samecore : 887
num_migrations_samenode : 2442        num_migrations_samenode : 2375
num_migrations_crossnode: 1           num_migrations_crossnode: 1
num_migrations: 3289                  num_migrations: 3263

CPU: 1, Mem: 1, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 822         num_migrations_samecore : 886
num_migrations_samenode : 2156        num_migrations_samenode : 1982
num_migrations_crossnode: 0           num_migrations_crossnode: 0
num_migrations: 2978                  num_migrations: 2868

CPU: 0, Mem: 1, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 1038        num_migrations_samecore : 1055
num_migrations_samenode : 2892        num_migrations_samenode : 2824
num_migrations_crossnode: 0           num_migrations_crossnode: 1
num_migrations: 3931                  num_migrations: 3879


Compared to both CPU and memory allowed to float (as well as the impact of NB
and PNID):
CPU: F, Mem: F, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 681         num_migrations_samecore : 800
num_migrations_samenode : 2306        num_migrations_samenode : 2255
num_migrations_crossnode: 1548        num_migrations_crossnode: 1503
num_migrations: 4535                  num_migrations: 4558

CPU: F, Mem: F, NB=1, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 799         num_migrations_samecore : 646
num_migrations_samenode : 3098        num_migrations_samenode : 2775
num_migrations_crossnode: 104         num_migrations_crossnode: 236
num_migrations: 4001                  num_migrations: 3657

CPU: F, Mem: F, NB=1, PNID=0
-----------------------------------------------------------------
num_migrations_samecore : 718         num_migrations_samecore : 737
num_migrations_samenode : 3148        num_migrations_samenode : 3274
num_migrations_crossnode: 2           num_migrations_crossnode: 7
num_migrations: 3868                  num_migrations: 4018

We see that NB does have a big impact (a large decrease in cross-node
migrations), confirmed by much better measured times: line {17} vs line {8}.

In terms of the primary use case, pinned RDMA memory buffers, the interesting
results are where the CPU is allowed to float with memory pinned
{2,5,8,11,14,17,20,23,26,29,32,35}. What do the migration counts look like
for those cases?

CPU: F, Mem: 0, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 788         num_migrations_samecore : 739
num_migrations_samenode : 2251        num_migrations_samenode : 2292
num_migrations_crossnode: 1738        num_migrations_crossnode: 1500
num_migrations: 4777                  num_migrations: 4531

CPU: F, Mem: 0, NB=1, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 663         num_migrations_samecore : 657
num_migrations_samenode : 2434        num_migrations_samenode : 2427
num_migrations_crossnode: 1344        num_migrations_crossnode: 1499
num_migrations: 4441                  num_migrations: 4583

CPU: F, Mem: 0, NB=1, PNID=0
-----------------------------------------------------------------
num_migrations_samecore : 653         num_migrations_samecore : 665
num_migrations_samenode : 2954        num_migrations_samenode : 2880
num_migrations_crossnode: 7           num_migrations_crossnode: 12
num_migrations: 3614                  num_migrations: 3557

From a purely NUMA perspective, accurately setting the preferred node from user
space, "Preferred Node Affinity", appears to be a substantial win as can be seen
by comparing lines {2, 11} vs line {20} and lines {5, 14} vs line {32}.

We also see that NB does not have nearly the same effect with the CPU
floating and the memory bound as it did when both were floating. The function
task_numa_work() clearly skips non-migratable VMAs. The issue is that with NB
enabled, the most important accesses of some tasks aren't tracked, while the
accesses that are tracked can lead to the wrong value for
numa_preferred_nid, and thus NB gets turned off.

On digging into this further, there was a 2014 presentation, "Automatic NUMA
Balancing" [2], which declares support for "unmovable" memory as future work
and recognizes its value in correctly setting numa_preferred_nid, but says it
is unclear whether it is worthwhile. I am currently working on enabling this
and running such tests.

As a final note, I will have a chance to validate the effects of these changes
against the DB next month.


[1] [RFC PATCH 0/3] Scheduler Soft Affinity
https://lore.kernel.org/lkml/[email protected]/

[2] "Automatic NUMA Balancing",
https://events.static.linuxfound.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf



2023-12-16 00:20:02

by Chris Hyser

Subject: [RFC/POC 2/2] sched/numa: Adds simple prctl for setting task's preferred node affinity.

EXPERIMENTAL - NOT INTENDED FOR SUBMISSION

Adds a simple prctl() interface to the preferred node affinity test code.

Signed-off-by: Chris Hyser <[email protected]>
---
 include/uapi/linux/prctl.h |  9 ++++++
 kernel/sys.c               | 66 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..6c8f6c0156d8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -293,6 +293,15 @@ struct prctl_mm_map {
 
 #define PR_GET_AUXV			0x41555856
 
+/*
+ * This is experimental and placed out of order to keep surrounding context
+ * the same in the presence of new prctls. Thus the patch should just apply.
+ */
+#define PR_PREFERRED_NID		101
+# define PR_PREFERRED_NID_GET		0
+# define PR_PREFERRED_NID_SET		1
+# define PR_PREFERRED_NID_CMD_MAX	2
+
 #define PR_SET_MEMORY_MERGE		67
 #define PR_GET_MEMORY_MERGE		68

diff --git a/kernel/sys.c b/kernel/sys.c
index 420d9cb9cc8e..6dca12da6ade 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2406,6 +2406,67 @@ static inline int prctl_set_mdwe(unsigned long bits, unsigned long arg3,
 	return 0;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+
+void sched_setnuma(struct task_struct *p, int node);
+
+static int prctl_chg_pref_nid(unsigned long cmd, int nid, pid_t pid, unsigned long uaddr)
+{
+	struct task_struct *task;
+	int err = 0;
+
+	if (cmd >= PR_PREFERRED_NID_CMD_MAX)
+		return -ERANGE;
+
+	rcu_read_lock();
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid((pid_t)pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	switch (cmd) {
+	case PR_PREFERRED_NID_GET:
+		if (uaddr & 0x3) {
+			err = -EINVAL;
+			goto out;
+		}
+		err = put_user(task->numa_preferred_nid_force, (int __user *)uaddr);
+		break;
+
+	case PR_PREFERRED_NID_SET:
+		if (!(-1 <= nid && nid < num_possible_nodes())) {
+			pr_err("prctl_chg_pref_nid: %d error\n", nid);
+			err = -EINVAL;
+			goto out;
+		}
+
+		task->numa_preferred_nid_force = nid;
+		sched_setnuma(task, nid);
+		break;
+	}
+
+out:
+	put_task_struct(task);
+	return err;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline int prctl_get_mdwe(unsigned long arg2, unsigned long arg3,
 				 unsigned long arg4, unsigned long arg5)
 {
@@ -2698,6 +2759,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SCHED_CORE:
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	case PR_PREFERRED_NID:
+		error = prctl_chg_pref_nid(arg2, arg3, arg4, arg5);
+		break;
 #endif
 	case PR_SET_MDWE:
 		error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
--
2.39.3
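For completeness, a user-space sketch of driving the prctl above (the
constants and argument order are copied from the patch; on a kernel without
the patch, prctl() simply rejects the unknown option with EINVAL):

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <errno.h>

/* Mirrors the experimental constants from include/uapi/linux/prctl.h above. */
#define PR_PREFERRED_NID	101
#define PR_PREFERRED_NID_GET	0
#define PR_PREFERRED_NID_SET	1

/*
 * Set pid's (0 == self) preferred node affinity; -1 clears it.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_preferred_nid(pid_t pid, int nid)
{
	return prctl(PR_PREFERRED_NID, PR_PREFERRED_NID_SET, nid, pid, 0);
}

/* Read the forced preferred node back into *nid (must be int-aligned). */
int get_preferred_nid(pid_t pid, int *nid)
{
	return prctl(PR_PREFERRED_NID, PR_PREFERRED_NID_GET, 0, pid,
		     (unsigned long)nid);
}
```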


2024-02-27 00:48:40

by Chris Hyser

Subject: Re: [RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA performance?

Included is additional micro-benchmark data from an AMD 128-CPU machine
(EPYC 7551 processor) concerning the effectiveness of setting a task's
numa_preferred_nid with respect to improving the NUMA awareness of the
scheduler. The details of the test procedure are identical to those
described in the original RFC, and while this and the original RFC are
obviously answers to a specific question asked by Peter, feedback on the
experimental setup as well as the data would be appreciated.

The original RFC can be found at:
[https://lore.kernel.org/lkml/[email protected]/]

Key:
-----------------
NB   - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, ie 'Preferred Node Affinity'.
       (given 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7 and -1 for not_set)
Mem  - represents the Memory node when memory is bound, else 'F' floating,
       ie not set
CPU  - represents the CPUs of the node that the probe is hard-affined to,
       else 'F' floating, ie not set
Avg  - the average time of the probe's measurements in secs

NumSamples: 36
Load: 60
CPU_Model: AMD EPYC 7551 32-Core Processor
NUM_CPUS: 128
Migration Cost: 500000

      Avg     max     min     stdv        Test Parameters
-----------------------------------------------------------------
[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[03] 301.27  311.84  280.22   8.98  |  PNID: -1 NB: 0 Mem: 1 CPU 0
[04] 213.60  221.36  190.10   6.53  |  PNID: -1 NB: 0 Mem: 1 CPU 1
[05] 396.37  418.58  376.10  10.15  |  PNID: -1 NB: 0 Mem: 1 CPU F
[06] 402.04  411.85  378.71   8.97  |  PNID: -1 NB: 0 Mem: F CPU 0
[07] 401.28  410.06  384.80   6.41  |  PNID: -1 NB: 0 Mem: F CPU 1
[08] 439.86  459.61  392.28  19.09  |  PNID: -1 NB: 0 Mem: F CPU F

[09] 214.81  225.35  199.34   5.38  |  PNID: -1 NB: 1 Mem: 0 CPU 0
[10] 299.15  314.84  274.00   8.18  |  PNID: -1 NB: 1 Mem: 0 CPU 1
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[12] 300.43  310.93  281.67   7.40  |  PNID: -1 NB: 1 Mem: 1 CPU 0
[13] 210.86  222.80  189.54   7.55  |  PNID: -1 NB: 1 Mem: 1 CPU 1
[14] 402.57  433.72  299.73  32.96  |  PNID: -1 NB: 1 Mem: 1 CPU F
[15] 390.04  410.10  370.63  10.72  |  PNID: -1 NB: 1 Mem: F CPU 0
[16] 393.32  418.43  370.52  10.71  |  PNID: -1 NB: 1 Mem: F CPU 1
[17] 370.07  424.58  255.16  43.26  |  PNID: -1 NB: 1 Mem: F CPU F

[18] 216.26  224.95  198.62   5.86  |  PNID:  0 NB: 1 Mem: 0 CPU 0
[19] 303.60  314.29  275.32   7.99  |  PNID:  0 NB: 1 Mem: 0 CPU 1
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F
[21] 301.17  315.03  283.77   8.07  |  PNID:  0 NB: 1 Mem: 1 CPU 0
[22] 209.34  218.63  187.69   9.11  |  PNID:  0 NB: 1 Mem: 1 CPU 1
[23] 342.34  369.42  311.99  12.79  |  PNID:  0 NB: 1 Mem: 1 CPU F
[24] 399.23  409.19  375.73   8.15  |  PNID:  0 NB: 1 Mem: F CPU 0
[25] 391.67  410.01  372.27  10.88  |  PNID:  0 NB: 1 Mem: F CPU 1
[26] 363.19  396.58  254.56  32.02  |  PNID:  0 NB: 1 Mem: F CPU F

[27] 215.29  224.59  193.76   8.16  |  PNID:  1 NB: 1 Mem: 0 CPU 0
[28] 300.19  312.95  280.26   9.32  |  PNID:  1 NB: 1 Mem: 0 CPU 1
[29] 340.97  362.79  323.94  10.69  |  PNID:  1 NB: 1 Mem: 0 CPU F
[30] 304.41  312.14  283.69   6.59  |  PNID:  1 NB: 1 Mem: 1 CPU 0
[31] 213.58  224.24  191.11   6.98  |  PNID:  1 NB: 1 Mem: 1 CPU 1
[32] 299.73  337.17  266.98  17.04  |  PNID:  1 NB: 1 Mem: 1 CPU F
[33] 395.56  411.33  359.70  12.24  |  PNID:  1 NB: 1 Mem: F CPU 0
[34] 398.52  409.42  377.33   7.28  |  PNID:  1 NB: 1 Mem: F CPU 1
[35] 355.64  377.61  279.13  26.71  |  PNID:  1 NB: 1 Mem: F CPU F

All data is present for completeness, however the analysis can be limited
to just comparing {00,01,02} (PNID=-1, NB=0), {09,10,11} (PNID=-1, NB=1)
and {18,19,20} (PNID=0, NB=1, mem=0, cpu=F).

{00,09,18} are all basically the same when memory and CPU are both
pinned to the same node, as expected, since neither PNID nor NB should
affect scheduling in this case. We see basically the same pattern (values
being near equal) when memory and CPU are pinned on different nodes
{01,10,19}. The interesting analysis in terms of the original problem
(pinned RDMA buffers, tasks floating) is how NB and PNID affect the
case when memory is pinned and the CPU allowed to float. The base
value {02} (PNID=-1, NB=0) is quite a bit worse than when the CPU and
memory are pinned on different nodes. This is similar to the Intel case,
where allowing the load balancer to load balance is worse than pinning
tasks and memory on different nodes. While this may simply be an
artifact of the micro-benchmark, given that the benchmark is really just
a sum of a large number of access times by the task to memory, it is
representative of the NUMA awareness of the scheduler/load balancer.

We do see that enabling NB (with the default values) does provide some
help, {11} versus {02}, and that setting PNID to the node where the memory
resides provides a significant benefit: {20} 280.36 versus {11} 395.70
versus {02} 418.78. Unlike the prior Intel results, where PNID=0, NB=1,
mem=0, cpu=F was generally faster than pinned on the same node ({20} 129.20
versus {00} 136.50), on the AMD platform we don't see nearly the same level
of improvement ({20} 280.36 versus {00} 215.78).

This can be explained by the relatively small number of CPUs in a node
(16) and the fact that said node contains two 8-CPU LLCs.

Analysis:

As mentioned in the RFC, the entire micro-benchmark can be traced and all
migrations of the benchmark task can be tabulated.  Obviously, a same-core
migration is also a same-llc migration which is also a same-node migration.
Cross-node migrations are however further broken into 'from node 0' and
'to node 0'.


    {00}            CPU: 0, Mem: 0, NB=0, PNID=-1
--------------------------------------------------------------------
    num_migrations_samecore : 1823        num_migrations_samecore : 1683
    num_migrations_same_llc : 3455        num_migrations_same_llc : 3277
    num_migrations_samenode : 914         num_migrations_samenode : 1016
    num_migrations_crossnode: 1           num_migrations_crossnode: 1
      num_migrations_to_0   : 1           num_migrations_to_0   : 1
      num_migrations_from_0 : 0           num_migrations_from_0 : 0
    num_migrations: 6193                  num_migrations: 5977

    {01}            CPU: 1, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 2453        num_migrations_samecore : 2579
    num_migrations_same_llc : 4693        num_migrations_same_llc : 4735
    num_migrations_samenode : 1429        num_migrations_samenode : 1466
    num_migrations_crossnode: 1           num_migrations_crossnode: 1
      num_migrations_to_0   : 0           num_migrations_to_0   : 0
      num_migrations_from_0 : 1           num_migrations_from_0 : 1
    num_migrations: 8576                  num_migrations: 8781

In the two cases where both the task's CPU and the memory buffer are
pinned, we see no cross-node migrations (ignoring the first one, needed to
get onto the correct node in the first place because the benchmark starts
the task on a different node). Why pinning cross-node results in more
migrations in general needs more investigation, as this seems fairly
consistent.

    {02}            CPU: F, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 1620        num_migrations_samecore : 1744
    num_migrations_same_llc : 3142        num_migrations_same_llc : 2818
    num_migrations_samenode : 853         num_migrations_samenode : 625
    num_migrations_crossnode: 6344        num_migrations_crossnode: 6778
      num_migrations_to_0   : 769         num_migrations_to_0   : 776
      num_migrations_from_0 : 769         num_migrations_from_0 : 777
    num_migrations: 11959                 num_migrations: 11965

    {11}            CPU: F, Mem: 0, NB=1, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 1966        num_migrations_samecore : 1963
    num_migrations_same_llc : 2803        num_migrations_same_llc : 3314
    num_migrations_samenode : 514         num_migrations_samenode : 721
    num_migrations_crossnode: 6833        num_migrations_crossnode: 6618
      num_migrations_to_0   : 818         num_migrations_to_0   : 630
      num_migrations_from_0 : 818         num_migrations_from_0 : 630
    num_migrations: 12116                 num_migrations: 12616

From the data table, we see that {02} is slightly slower than {11} even
though {11} shows more total migrations. Ultimately, what matters to the
total time is how much time the task spent running on node 0.

    {20}            CPU: F, Mem: 0, NB=1, PNID=0
---------------------------------------------------------------------
    num_migrations_samecore : 1706        num_migrations_samecore : 1663
    num_migrations_same_llc : 2185        num_migrations_same_llc : 2816
    num_migrations_samenode : 591         num_migrations_samenode : 980
    num_migrations_crossnode: 4621        num_migrations_crossnode: 4243
      num_migrations_to_0   : 480         num_migrations_to_0   : 419
      num_migrations_from_0 : 480         num_migrations_from_0 : 418
    num_migrations: 9103                  num_migrations: 9702

The trace results here are more representative of the observed performance
improvement: the cross-node migrations are significantly lower, and the
number of migrations away from node 0 is much smaller.

In summary, the data (relevant copied below) shows that setting a task's
numa_preferred_nid results in a sizable improvement in completion times.

[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F