Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
The series addresses two problems -- inconsistent use of scheduler
domain weights and suboptimal performance when there are many LLCs per
NUMA node.
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 26 +++++++++++++++-----------
kernel/sched/topology.c | 24 ++++++++++++++++++++++++
3 files changed, 40 insertions(+), 11 deletions(-)
--
2.31.1
find_busiest_group() uses the child domain's group weight instead of
the weight of the sched_domain that has SD_NUMA set when calculating
the allowed imbalance between NUMA nodes. This is wrong and
inconsistent with find_idlest_group(), which already uses the SD_NUMA
domain's span weight.
This patch uses the SD_NUMA weight in both places.
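As a worked example (assuming a hypothetical 2-socket machine with two
NUMA nodes of 48 CPUs each), the busiest group at the SD_NUMA level
spans one node, so busiest->group_weight is 48 while
env->sd->span_weight is 96. As allow_numa_imbalance() accepts an
imbalance while fewer than a quarter of the weight is running, the
cutoff in find_busiest_group() moves from 12 to 24 running tasks,
matching the cutoff find_idlest_group() already applies.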
Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- busiest->sum_nr_running, busiest->group_weight);
+ busiest->sum_nr_running, env->sd->span_weight);
}
return;
--
2.31.1
Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
Zen* has multiple LLCs per node with local memory channels and, due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.
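The cutoff is calculated in build_sched_domains() from the number of
LLCs in the first domain above the LLC level, i.e.
nr_groups = sd->span_weight / child->span_weight and
imb_numa_nr = nr_groups / num_online_nodes() (see the sched/topology.c
hunk below). Purely as an illustration, assuming a 2-node Zen3
configuration with 8 LLCs of 16 CPUs per node, nr_groups is
128 / 16 = 8 and the nearest SD_NUMA domain allows an imbalance of up
to 8 / 2 = 4 tasks between nodes.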
On a Zen3 machine running STREAM parallelised with OMP to have one
instance per LLC and without binding, the results are
vanilla sched-numaimb-v2r4
MB/sec copy-16 164279.50 ( 0.00%) 702962.88 ( 327.91%)
MB/sec scale-16 137487.08 ( 0.00%) 397132.98 ( 188.85%)
MB/sec add-16 157561.68 ( 0.00%) 638006.32 ( 304.92%)
MB/sec triad-16 154562.04 ( 0.00%) 641408.02 ( 314.98%)
STREAM can use directives to force the spread (e.g. a proc_bind(spread)
clause) if the OpenMP implementation is new enough but that doesn't
help if an application uses threads and it's not known in advance how
many threads will be created.
vanilla sched-numaimb-v1r2
Min Score-16 366090.84 ( 0.00%) 401505.65 ( 9.67%)
Hmean Score-16 391416.56 ( 0.00%) 452546.28 * 15.62%*
Stddev Score-16 16452.12 ( 0.00%) 31480.31 ( -91.35%)
CoeffVar Score-16 4.20 ( 0.00%) 6.92 ( -64.99%)
Max Score-16 416666.67 ( 0.00%) 483529.77 ( 16.05%)
It can also make a big difference for semi-realistic workloads like
specjbb, which can execute arbitrary numbers of threads without advance
knowledge of how they should be placed.
vanilla sched-numaimb-v2r5
Hmean tput-1 73743.05 ( 0.00%) 72517.86 ( -1.66%)
Hmean tput-8 563036.51 ( 0.00%) 619505.85 * 10.03%*
Hmean tput-16 1016590.61 ( 0.00%) 1084022.36 ( 6.63%)
Hmean tput-24 1418558.41 ( 0.00%) 1443296.06 ( 1.74%)
Hmean tput-32 1608794.22 ( 0.00%) 1869822.05 * 16.23%*
Hmean tput-40 1761338.13 ( 0.00%) 2154415.40 * 22.32%*
Hmean tput-48 2290646.54 ( 0.00%) 2561031.20 * 11.80%*
Hmean tput-56 2463345.12 ( 0.00%) 2731874.84 * 10.90%*
Hmean tput-64 2650213.53 ( 0.00%) 2867054.47 ( 8.18%)
Hmean tput-72 2497253.28 ( 0.00%) 3017637.28 * 20.84%*
Hmean tput-80 2820786.72 ( 0.00%) 3018947.39 ( 7.03%)
Hmean tput-88 2813541.68 ( 0.00%) 3008805.43 * 6.94%*
Hmean tput-96 2604158.67 ( 0.00%) 2948056.40 * 13.21%*
Hmean tput-104 2713810.62 ( 0.00%) 2952327.00 ( 8.79%)
Hmean tput-112 2558425.37 ( 0.00%) 2909089.90 * 13.71%*
Hmean tput-120 2611434.93 ( 0.00%) 2773024.11 * 6.19%*
Hmean tput-128 2706103.22 ( 0.00%) 2765678.84 ( 2.20%)
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 26 +++++++++++++++-----------
kernel/sched/topology.c | 24 ++++++++++++++++++++++++
3 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
+ unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */
int nohz_idle; /* NOHZ IDLE status */
int flags; /* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..64f211879e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
int src_cpu, src_nid;
int dst_cpu, dst_nid;
+ int imb_numa_nr;
struct numa_stats src_stats, dst_stats;
@@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
dst_running = env->dst_stats.nr_running + 1;
imbalance = max(0, dst_running - src_running);
imbalance = adjust_numa_imbalance(imbalance, dst_running,
- env->dst_stats.weight);
+ env->imb_numa_nr);
/* Use idle CPU if there is no imbalance */
if (!imbalance) {
@@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
*/
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
- if (sd)
+ if (sd) {
env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+ env.imb_numa_nr = sd->imb_numa_nr;
+ }
rcu_read_unlock();
/*
@@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
}
/*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed
+ * imbalance. This is an approximation as the number of running
+ * tasks may not be related to the number of busy CPUs due to
+ * sched_setaffinity.
*/
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
- return (dst_running < (dst_weight >> 2));
+ return dst_running < imb_numa_nr;
}
/*
@@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* a real need of migration, periodic load balance will
* take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+ if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
return NULL;
}
@@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
#define NUMA_IMBALANCE_MIN 2
static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int dst_weight)
+ int dst_running, int imb_numa_nr)
{
- if (!allow_numa_imbalance(dst_running, dst_weight))
+ if (!allow_numa_imbalance(dst_running, imb_numa_nr))
return imbalance;
/*
@@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- busiest->sum_nr_running, env->sd->span_weight);
+ busiest->sum_nr_running, env->sd->imb_numa_nr);
}
return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..9adeaa89ccb4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
}
+ /* Calculate allowed NUMA imbalance */
+ for_each_cpu(i, cpu_map) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ struct sched_domain *child = sd->child;
+
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES) &&
+ (child->flags & SD_SHARE_PKG_RESOURCES)) {
+ struct sched_domain *sd_numa = sd;
+ int imb_numa_nr, nr_groups;
+
+ nr_groups = sd->span_weight / child->span_weight;
+ imb_numa_nr = nr_groups / num_online_nodes();
+
+ while (sd_numa) {
+ if (sd_numa->flags & SD_NUMA) {
+ sd_numa->imb_numa_nr = imb_numa_nr;
+ break;
+ }
+ sd_numa = sd_numa->parent;
+ }
+ }
+ }
+ }
+
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
if (!cpumask_test_cpu(i, cpu_map))
--
2.31.1
Greeting,
FYI, we noticed a -26.3% regression of phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s due to commit:
commit: b4d95a034cffb1e4424874645549d3cac2de5c02 ("[PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 8c92606ab81086db00cbb73347d124b4eb169b7e
in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 128G memory
with following parameters:
test: tiobench-1.3.1
option_a: Random Write
option_b: 64MB
option_c: 8
cpufreq_governor: performance
ucode: 0x5003006
test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/
If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/Random Write/64MB/8/debian-x86_64-phoronix/lkp-csl-2sp8/tiobench-1.3.1/phoronix-test-suite/0x5003006
commit:
fee45dc486 ("sched/fair: Use weight of SD_NUMA domain in find_busiest_group")
b4d95a034c ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
fee45dc486dd343a b4d95a034cffb1e442487464554
---------------- ---------------------------
%stddev %change %stddev
\ | \
190841 ± 4% -26.3% 140600 ± 3% phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s
5.17 ±128% +1.3e+05% 6530 ± 64% proc-vmstat.numa_hint_faults
76503 ± 40% -25.3% 57153 ± 4% interrupts.CAL:Function_call_interrupts
4574 ± 50% -82.7% 791.14 ± 42% interrupts.CPU1.CAL:Function_call_interrupts
3.32 ± 41% +882.9% 32.65 ± 7% perf-stat.i.cpu-migrations
51246 ± 10% +104.4% 104748 ± 3% perf-stat.i.node-store-misses
1465 ± 21% -24.6% 1105 ± 13% numa-vmstat.node0.nr_active_anon
82443 ± 2% -47.6% 43196 ± 14% numa-vmstat.node0.nr_anon_pages
10866 ± 4% -8.3% 9965 ± 4% numa-vmstat.node0.nr_kernel_stack
14846 ± 15% -50.1% 7413 ± 43% numa-vmstat.node0.nr_mapped
1033 ± 2% -31.7% 706.14 ± 15% numa-vmstat.node0.nr_page_table_pages
1465 ± 21% -24.6% 1105 ± 13% numa-vmstat.node0.nr_zone_active_anon
8909 ± 26% +47.1% 13103 ± 20% numa-vmstat.node1.nr_active_file
8603 ± 15% +458.9% 48088 ± 11% numa-vmstat.node1.nr_anon_pages
8949 ± 5% +9.9% 9834 ± 4% numa-vmstat.node1.nr_kernel_stack
416.00 ± 7% +79.4% 746.14 ± 14% numa-vmstat.node1.nr_page_table_pages
8909 ± 26% +47.1% 13103 ± 20% numa-vmstat.node1.nr_zone_active_file
5844 ± 22% -24.3% 4426 ± 13% numa-meminfo.node0.Active(anon)
121357 ± 13% -45.1% 66683 ± 26% numa-meminfo.node0.AnonHugePages
329764 ± 2% -47.6% 172811 ± 14% numa-meminfo.node0.AnonPages
346450 -47.6% 181374 ± 14% numa-meminfo.node0.AnonPages.max
2050555 ± 13% -29.7% 1441806 ± 36% numa-meminfo.node0.Inactive
10866 ± 4% -8.3% 9966 ± 4% numa-meminfo.node0.KernelStack
59355 ± 15% -50.0% 29668 ± 43% numa-meminfo.node0.Mapped
2872827 ± 12% -20.3% 2288843 ± 24% numa-meminfo.node0.MemUsed
4133 ± 3% -31.6% 2829 ± 15% numa-meminfo.node0.PageTables
37735 ± 26% +47.9% 55814 ± 18% numa-meminfo.node1.Active
35639 ± 26% +47.1% 52416 ± 20% numa-meminfo.node1.Active(file)
5616 ± 27% +912.0% 56834 ± 44% numa-meminfo.node1.AnonHugePages
34408 ± 15% +459.0% 192349 ± 11% numa-meminfo.node1.AnonPages
39089 ± 19% +418.8% 202789 ± 12% numa-meminfo.node1.AnonPages.max
8950 ± 5% +9.9% 9833 ± 4% numa-meminfo.node1.KernelStack
1666 ± 6% +79.0% 2983 ± 14% numa-meminfo.node1.PageTables
4925 ± 8% -14.0% 4237 ± 8% slabinfo.kmalloc-cg-16.active_objs
4925 ± 8% -14.0% 4237 ± 8% slabinfo.kmalloc-cg-16.num_objs
3328 +11.4% 3709 ± 3% slabinfo.kmalloc-cg-192.active_objs
3328 +11.4% 3709 ± 3% slabinfo.kmalloc-cg-192.num_objs
2545 ± 3% +11.8% 2845 ± 3% slabinfo.kmalloc-cg-1k.active_objs
2545 ± 3% +11.8% 2845 ± 3% slabinfo.kmalloc-cg-1k.num_objs
1054 ± 6% +24.3% 1310 ± 3% slabinfo.kmalloc-cg-2k.active_objs
1054 ± 6% +24.3% 1310 ± 3% slabinfo.kmalloc-cg-2k.num_objs
4376 ± 5% +22.2% 5347 ± 2% slabinfo.kmalloc-cg-64.active_objs
4376 ± 5% +22.2% 5347 ± 2% slabinfo.kmalloc-cg-64.num_objs
2663 ± 7% +27.0% 3382 ± 3% slabinfo.kmalloc-cg-96.active_objs
2663 ± 7% +27.0% 3382 ± 3% slabinfo.kmalloc-cg-96.num_objs
1446 ± 9% -21.6% 1133 ± 7% slabinfo.task_group.active_objs
1446 ± 9% -21.6% 1133 ± 7% slabinfo.task_group.num_objs
14208 ± 5% -13.5% 12296 ± 3% slabinfo.vmap_area.active_objs
14213 ± 5% -13.5% 12297 ± 3% slabinfo.vmap_area.num_objs
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.read
7.96 ±124% -5.5 2.49 ±158% perf-profile.calltrace.cycles-pp.zap_pte_range.unmap_page_range.unmap_vmas.exit_mmap.mmput
6.44 ±111% -5.3 1.19 ±244% perf-profile.calltrace.cycles-pp.page_remove_rmap.zap_pte_range.unmap_page_range.unmap_vmas.exit_mmap
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.proc_reg_read_iter.new_sync_read.vfs_read.ksys_read.do_syscall_64
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.seq_read_iter.proc_reg_read_iter.new_sync_read.vfs_read.ksys_read
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.show_interrupts.seq_read_iter.proc_reg_read_iter.new_sync_read.vfs_read
5.41 ±105% -4.2 1.19 ±244% perf-profile.calltrace.cycles-pp.release_task.wait_task_zombie.do_wait.kernel_waitid.__do_sys_waitid
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.__dentry_kill.shrink_dentry_list.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.d_invalidate.proc_invalidate_siblings_dcache.release_task.wait_task_zombie.do_wait
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.proc_invalidate_siblings_dcache.release_task.wait_task_zombie.do_wait.kernel_waitid
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache.release_task.wait_task_zombie
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.shrink_dentry_list.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache.release_task
8.36 ±154% -4.0 4.36 ±179% perf-profile.calltrace.cycles-pp.mmput.begin_new_exec.load_elf_binary.exec_binprm.bprm_execve
8.36 ±154% -4.0 4.36 ±179% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.begin_new_exec.load_elf_binary.exec_binprm
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.kernel_waitid.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.do_wait.kernel_waitid.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.wait_task_zombie.do_wait.kernel_waitid.__do_sys_waitid.do_syscall_64
8.36 ±154% +0.1 8.49 ±177% perf-profile.calltrace.cycles-pp.begin_new_exec.load_elf_binary.exec_binprm.bprm_execve.do_execveat_common
9.47 ±137% -7.0 2.49 ±158% perf-profile.children.cycles-pp.unmap_vmas
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.ksys_read
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.vfs_read
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.seq_read_iter
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.read
7.96 ±124% -5.5 2.49 ±158% perf-profile.children.cycles-pp.zap_pte_range
7.96 ±124% -5.5 2.49 ±158% perf-profile.children.cycles-pp.unmap_page_range
6.44 ±111% -5.3 1.19 ±244% perf-profile.children.cycles-pp.page_remove_rmap
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.new_sync_read
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.proc_reg_read_iter
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.show_interrupts
5.41 ±105% -4.2 1.19 ±244% perf-profile.children.cycles-pp.release_task
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.__dentry_kill
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.d_invalidate
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.proc_invalidate_siblings_dcache
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.shrink_dcache_parent
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.shrink_dentry_list
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.__do_sys_waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.kernel_waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.do_wait
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.wait_task_zombie
8.36 ±154% +0.1 8.49 ±177% perf-profile.children.cycles-pp.begin_new_exec
3.82 ±101% -2.6 1.19 ±244% perf-profile.self.cycles-pp.page_remove_rmap
phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s
220000 +------------------------------------------------------------------+
| + |
210000 |-+ :: |
200000 |-+ +.++ : +. .+ + + + + +.+ |
| + : + .+ : ++ +. + .+ : : + :+ +.+ + :: : : ++ |
190000 |.+ +: ++ + + + +.+ :: +.+ +: + + : :: :+ +|
180000 |-+ + + + + +.+ + + |
| |
170000 |-+ |
160000 |-+ |
| O |
150000 |-+ OO O O OO O O O O |
140000 |-O O O O OO OO O O |
| O O O O O O O |
130000 +------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Sun, Nov 28, 2021 at 11:06:58PM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -26.3% regression of phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s due to commit:
>
>
> commit: b4d95a034cffb1e4424874645549d3cac2de5c02 ("[PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
> url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 8c92606ab81086db00cbb73347d124b4eb169b7e
>
> in testcase: phoronix-test-suite
> on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 128G memory
> with following parameters:
>
> test: tiobench-1.3.1
> option_a: Random Write
> option_b: 64MB
> option_c: 8
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
> test-url: http://www.phoronix-test-suite.com/
>
Ok, while I'm surprised there is a difference with tiobench, there
definitely is a problem with the patch and a v3 is needed. Am queueing
up a test of v3 but the diff is
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9adeaa89ccb4..fee2930745ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2244,25 +2244,21 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Calculate allowed NUMA imbalance */
for_each_cpu(i, cpu_map) {
+ int imb_numa_nr = 0;
+
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
struct sched_domain *child = sd->child;
- if (!(sd->flags & SD_SHARE_PKG_RESOURCES) &&
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
(child->flags & SD_SHARE_PKG_RESOURCES)) {
- struct sched_domain *sd_numa = sd;
- int imb_numa_nr, nr_groups;
+ int nr_groups;
nr_groups = sd->span_weight / child->span_weight;
- imb_numa_nr = nr_groups / num_online_nodes();
-
- while (sd_numa) {
- if (sd_numa->flags & SD_NUMA) {
- sd_numa->imb_numa_nr = imb_numa_nr;
- break;
- }
- sd_numa = sd_numa->parent;
- }
+ imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
+ (nr_groups * num_online_nodes()));
}
+
+ sd->imb_numa_nr = imb_numa_nr;
}
}
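For illustration, the following minimal userspace sketch shows what the
revised calculation yields. The topology values are assumptions for the
example only (2 online nodes, each with 8 LLCs of 16 CPUs), not taken
from any particular machine.
#include <stdio.h>
/*
 * Userspace sketch of the v3 imb_numa_nr calculation above. The
 * topology values are illustrative assumptions only.
 */
int main(void)
{
	unsigned int nr_nodes = 2;	/* num_online_nodes() */
	unsigned int sd_span = 128;	/* sd->span_weight above the LLC */
	unsigned int llc_span = 16;	/* child->span_weight (one LLC) */
	unsigned int nr_groups = sd_span / llc_span;
	unsigned int imb_numa_nr = (llc_span >> 1) / (nr_groups * nr_nodes);

	if (imb_numa_nr < 1)
		imb_numa_nr = 1;	/* the max(1U, ...) clamp */

	printf("imb_numa_nr = %u\n", imb_numa_nr);	/* prints 1 here */
	return 0;
}
With a single LLC per node (say a child->span_weight of 48 under a
96-CPU NUMA domain across two nodes), the same expression gives
max(1U, 24 / 4) = 6.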