Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
The series addresses two problems -- inconsistent use of scheduler
domain weights and suboptimal performance when there are many LLCs per
NUMA node.
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 26 +++++++++++++++-----------
kernel/sched/topology.c | 24 ++++++++++++++++++++++++
3 files changed, 40 insertions(+), 11 deletions(-)
--
2.31.1
find_busiest_group() uses the child domain's group weight instead of
the weight of the sched_domain that has SD_NUMA set when calculating
the allowed imbalance between NUMA nodes. This is wrong and
inconsistent with find_idlest_group(), which already uses the SD_NUMA
domain's span weight.
This patch uses the SD_NUMA weight in both places.
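As a worked example (assuming a hypothetical 2-socket machine with two
NUMA nodes of 48 CPUs each), the busiest group at the SD_NUMA level
spans one node, so busiest->group_weight is 48 while
env->sd->span_weight is 96. As allow_numa_imbalance() accepts an
imbalance while fewer than a quarter of the weight is running, the
cutoff in find_busiest_group() moves from 12 to 24 running tasks,
matching the cutoff find_idlest_group() already applies.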
Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- busiest->sum_nr_running, busiest->group_weight);
+ busiest->sum_nr_running, env->sd->span_weight);
}
return;
--
2.31.1
Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
Zen* has multiple LLCs per node with local memory channels and, due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.
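The cutoff is calculated in build_sched_domains() from the number of
LLCs in the first domain above the LLC level, i.e.
nr_groups = sd->span_weight / child->span_weight and
imb_numa_nr = nr_groups / num_online_nodes() (see the sched/topology.c
hunk below). Purely as an illustration, assuming a 2-node Zen3
configuration with 8 LLCs of 16 CPUs per node, nr_groups is
128 / 16 = 8 and the nearest SD_NUMA domain allows an imbalance of up
to 8 / 2 = 4 tasks between nodes.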
On a Zen3 machine running STREAM parallelised with OMP to have one
instance per LLC and without binding, the results are
vanilla sched-numaimb-v2r4
MB/sec copy-16 164279.50 ( 0.00%) 702962.88 ( 327.91%)
MB/sec scale-16 137487.08 ( 0.00%) 397132.98 ( 188.85%)
MB/sec add-16 157561.68 ( 0.00%) 638006.32 ( 304.92%)
MB/sec triad-16 154562.04 ( 0.00%) 641408.02 ( 314.98%)
STREAM can use directives to force the spread (e.g. a proc_bind(spread)
clause) if the OpenMP implementation is new enough but that doesn't
help if an application uses threads and it's not known in advance how
many threads will be created.
vanilla sched-numaimb-v1r2
Min Score-16 366090.84 ( 0.00%) 401505.65 ( 9.67%)
Hmean Score-16 391416.56 ( 0.00%) 452546.28 * 15.62%*
Stddev Score-16 16452.12 ( 0.00%) 31480.31 ( -91.35%)
CoeffVar Score-16 4.20 ( 0.00%) 6.92 ( -64.99%)
Max Score-16 416666.67 ( 0.00%) 483529.77 ( 16.05%)
It can also make a big difference for semi-realistic workloads like
specjbb, which can execute arbitrary numbers of threads without advance
knowledge of how they should be placed.
vanilla sched-numaimb-v2r5
Hmean tput-1 73743.05 ( 0.00%) 72517.86 ( -1.66%)
Hmean tput-8 563036.51 ( 0.00%) 619505.85 * 10.03%*
Hmean tput-16 1016590.61 ( 0.00%) 1084022.36 ( 6.63%)
Hmean tput-24 1418558.41 ( 0.00%) 1443296.06 ( 1.74%)
Hmean tput-32 1608794.22 ( 0.00%) 1869822.05 * 16.23%*
Hmean tput-40 1761338.13 ( 0.00%) 2154415.40 * 22.32%*
Hmean tput-48 2290646.54 ( 0.00%) 2561031.20 * 11.80%*
Hmean tput-56 2463345.12 ( 0.00%) 2731874.84 * 10.90%*
Hmean tput-64 2650213.53 ( 0.00%) 2867054.47 ( 8.18%)
Hmean tput-72 2497253.28 ( 0.00%) 3017637.28 * 20.84%*
Hmean tput-80 2820786.72 ( 0.00%) 3018947.39 ( 7.03%)
Hmean tput-88 2813541.68 ( 0.00%) 3008805.43 * 6.94%*
Hmean tput-96 2604158.67 ( 0.00%) 2948056.40 * 13.21%*
Hmean tput-104 2713810.62 ( 0.00%) 2952327.00 ( 8.79%)
Hmean tput-112 2558425.37 ( 0.00%) 2909089.90 * 13.71%*
Hmean tput-120 2611434.93 ( 0.00%) 2773024.11 * 6.19%*
Hmean tput-128 2706103.22 ( 0.00%) 2765678.84 ( 2.20%)
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 26 +++++++++++++++-----------
kernel/sched/topology.c | 24 ++++++++++++++++++++++++
3 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
+ unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */
int nohz_idle; /* NOHZ IDLE status */
int flags; /* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..64f211879e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
int src_cpu, src_nid;
int dst_cpu, dst_nid;
+ int imb_numa_nr;
struct numa_stats src_stats, dst_stats;
@@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
dst_running = env->dst_stats.nr_running + 1;
imbalance = max(0, dst_running - src_running);
imbalance = adjust_numa_imbalance(imbalance, dst_running,
- env->dst_stats.weight);
+ env->imb_numa_nr);
/* Use idle CPU if there is no imbalance */
if (!imbalance) {
@@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
*/
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
- if (sd)
+ if (sd) {
env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+ env.imb_numa_nr = sd->imb_numa_nr;
+ }
rcu_read_unlock();
/*
@@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
}
/*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed
+ * imbalance. This is an approximation as the number of running
+ * tasks may not be related to the number of busy CPUs due to
+ * sched_setaffinity.
*/
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
- return (dst_running < (dst_weight >> 2));
+ return dst_running < imb_numa_nr;
}
/*
@@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* a real need of migration, periodic load balance will
* take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+ if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
return NULL;
}
@@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
#define NUMA_IMBALANCE_MIN 2
static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int dst_weight)
+ int dst_running, int imb_numa_nr)
{
- if (!allow_numa_imbalance(dst_running, dst_weight))
+ if (!allow_numa_imbalance(dst_running, imb_numa_nr))
return imbalance;
/*
@@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- busiest->sum_nr_running, env->sd->span_weight);
+ busiest->sum_nr_running, env->sd->imb_numa_nr);
}
return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..9adeaa89ccb4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
}
+ /* Calculate allowed NUMA imbalance */
+ for_each_cpu(i, cpu_map) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ struct sched_domain *child = sd->child;
+
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES) &&
+ (child->flags & SD_SHARE_PKG_RESOURCES)) {
+ struct sched_domain *sd_numa = sd;
+ int imb_numa_nr, nr_groups;
+
+ nr_groups = sd->span_weight / child->span_weight;
+ imb_numa_nr = nr_groups / num_online_nodes();
+
+ while (sd_numa) {
+ if (sd_numa->flags & SD_NUMA) {
+ sd_numa->imb_numa_nr = imb_numa_nr;
+ break;
+ }
+ sd_numa = sd_numa->parent;
+ }
+ }
+ }
+ }
+
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
if (!cpumask_test_cpu(i, cpu_map))
--
2.31.1
Greeting,
FYI, we noticed a -26.3% regression of phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s due to commit:
commit: b4d95a034cffb1e4424874645549d3cac2de5c02 ("[PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 8c92606ab81086db00cbb73347d124b4eb169b7e
in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 128G memory
with following parameters:
test: tiobench-1.3.1
option_a: Random Write
option_b: 64MB
option_c: 8
cpufreq_governor: performance
ucode: 0x5003006
test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/
If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/Random Write/64MB/8/debian-x86_64-phoronix/lkp-csl-2sp8/tiobench-1.3.1/phoronix-test-suite/0x5003006
commit:
fee45dc486 ("sched/fair: Use weight of SD_NUMA domain in find_busiest_group")
b4d95a034c ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
fee45dc486dd343a b4d95a034cffb1e442487464554
---------------- ---------------------------
%stddev %change %stddev
\ | \
190841 ± 4% -26.3% 140600 ± 3% phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s
5.17 ±128% +1.3e+05% 6530 ± 64% proc-vmstat.numa_hint_faults
76503 ± 40% -25.3% 57153 ± 4% interrupts.CAL:Function_call_interrupts
4574 ± 50% -82.7% 791.14 ± 42% interrupts.CPU1.CAL:Function_call_interrupts
3.32 ± 41% +882.9% 32.65 ± 7% perf-stat.i.cpu-migrations
51246 ± 10% +104.4% 104748 ± 3% perf-stat.i.node-store-misses
1465 ± 21% -24.6% 1105 ± 13% numa-vmstat.node0.nr_active_anon
82443 ± 2% -47.6% 43196 ± 14% numa-vmstat.node0.nr_anon_pages
10866 ± 4% -8.3% 9965 ± 4% numa-vmstat.node0.nr_kernel_stack
14846 ± 15% -50.1% 7413 ± 43% numa-vmstat.node0.nr_mapped
1033 ± 2% -31.7% 706.14 ± 15% numa-vmstat.node0.nr_page_table_pages
1465 ± 21% -24.6% 1105 ± 13% numa-vmstat.node0.nr_zone_active_anon
8909 ± 26% +47.1% 13103 ± 20% numa-vmstat.node1.nr_active_file
8603 ± 15% +458.9% 48088 ± 11% numa-vmstat.node1.nr_anon_pages
8949 ± 5% +9.9% 9834 ± 4% numa-vmstat.node1.nr_kernel_stack
416.00 ± 7% +79.4% 746.14 ± 14% numa-vmstat.node1.nr_page_table_pages
8909 ± 26% +47.1% 13103 ± 20% numa-vmstat.node1.nr_zone_active_file
5844 ± 22% -24.3% 4426 ± 13% numa-meminfo.node0.Active(anon)
121357 ± 13% -45.1% 66683 ± 26% numa-meminfo.node0.AnonHugePages
329764 ± 2% -47.6% 172811 ± 14% numa-meminfo.node0.AnonPages
346450 -47.6% 181374 ± 14% numa-meminfo.node0.AnonPages.max
2050555 ± 13% -29.7% 1441806 ± 36% numa-meminfo.node0.Inactive
10866 ± 4% -8.3% 9966 ± 4% numa-meminfo.node0.KernelStack
59355 ± 15% -50.0% 29668 ± 43% numa-meminfo.node0.Mapped
2872827 ± 12% -20.3% 2288843 ± 24% numa-meminfo.node0.MemUsed
4133 ± 3% -31.6% 2829 ± 15% numa-meminfo.node0.PageTables
37735 ± 26% +47.9% 55814 ± 18% numa-meminfo.node1.Active
35639 ± 26% +47.1% 52416 ± 20% numa-meminfo.node1.Active(file)
5616 ± 27% +912.0% 56834 ± 44% numa-meminfo.node1.AnonHugePages
34408 ± 15% +459.0% 192349 ± 11% numa-meminfo.node1.AnonPages
39089 ± 19% +418.8% 202789 ± 12% numa-meminfo.node1.AnonPages.max
8950 ± 5% +9.9% 9833 ± 4% numa-meminfo.node1.KernelStack
1666 ± 6% +79.0% 2983 ± 14% numa-meminfo.node1.PageTables
4925 ± 8% -14.0% 4237 ± 8% slabinfo.kmalloc-cg-16.active_objs
4925 ± 8% -14.0% 4237 ± 8% slabinfo.kmalloc-cg-16.num_objs
3328 +11.4% 3709 ± 3% slabinfo.kmalloc-cg-192.active_objs
3328 +11.4% 3709 ± 3% slabinfo.kmalloc-cg-192.num_objs
2545 ± 3% +11.8% 2845 ± 3% slabinfo.kmalloc-cg-1k.active_objs
2545 ± 3% +11.8% 2845 ± 3% slabinfo.kmalloc-cg-1k.num_objs
1054 ± 6% +24.3% 1310 ± 3% slabinfo.kmalloc-cg-2k.active_objs
1054 ± 6% +24.3% 1310 ± 3% slabinfo.kmalloc-cg-2k.num_objs
4376 ± 5% +22.2% 5347 ± 2% slabinfo.kmalloc-cg-64.active_objs
4376 ± 5% +22.2% 5347 ± 2% slabinfo.kmalloc-cg-64.num_objs
2663 ± 7% +27.0% 3382 ± 3% slabinfo.kmalloc-cg-96.active_objs
2663 ± 7% +27.0% 3382 ± 3% slabinfo.kmalloc-cg-96.num_objs
1446 ± 9% -21.6% 1133 ± 7% slabinfo.task_group.active_objs
1446 ± 9% -21.6% 1133 ± 7% slabinfo.task_group.num_objs
14208 ± 5% -13.5% 12296 ± 3% slabinfo.vmap_area.active_objs
14213 ± 5% -13.5% 12297 ± 3% slabinfo.vmap_area.num_objs
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
8.25 ±110% -6.1 2.14 ±159% perf-profile.calltrace.cycles-pp.read
7.96 ±124% -5.5 2.49 ±158% perf-profile.calltrace.cycles-pp.zap_pte_range.unmap_page_range.unmap_vmas.exit_mmap.mmput
6.44 ±111% -5.3 1.19 ±244% perf-profile.calltrace.cycles-pp.page_remove_rmap.zap_pte_range.unmap_page_range.unmap_vmas.exit_mmap
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.proc_reg_read_iter.new_sync_read.vfs_read.ksys_read.do_syscall_64
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.seq_read_iter.proc_reg_read_iter.new_sync_read.vfs_read.ksys_read
6.40 ±108% -4.3 2.14 ±159% perf-profile.calltrace.cycles-pp.show_interrupts.seq_read_iter.proc_reg_read_iter.new_sync_read.vfs_read
5.41 ±105% -4.2 1.19 ±244% perf-profile.calltrace.cycles-pp.release_task.wait_task_zombie.do_wait.kernel_waitid.__do_sys_waitid
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.__dentry_kill.shrink_dentry_list.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.d_invalidate.proc_invalidate_siblings_dcache.release_task.wait_task_zombie.do_wait
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.proc_invalidate_siblings_dcache.release_task.wait_task_zombie.do_wait.kernel_waitid
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache.release_task.wait_task_zombie
4.22 ±101% -4.2 0.00 perf-profile.calltrace.cycles-pp.shrink_dentry_list.shrink_dcache_parent.d_invalidate.proc_invalidate_siblings_dcache.release_task
8.36 ±154% -4.0 4.36 ±179% perf-profile.calltrace.cycles-pp.mmput.begin_new_exec.load_elf_binary.exec_binprm.bprm_execve
8.36 ±154% -4.0 4.36 ±179% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.begin_new_exec.load_elf_binary.exec_binprm
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.kernel_waitid.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.do_wait.kernel_waitid.__do_sys_waitid.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.41 ±105% -3.6 1.79 ±169% perf-profile.calltrace.cycles-pp.wait_task_zombie.do_wait.kernel_waitid.__do_sys_waitid.do_syscall_64
8.36 ±154% +0.1 8.49 ±177% perf-profile.calltrace.cycles-pp.begin_new_exec.load_elf_binary.exec_binprm.bprm_execve.do_execveat_common
9.47 ±137% -7.0 2.49 ±158% perf-profile.children.cycles-pp.unmap_vmas
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.ksys_read
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.vfs_read
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.seq_read_iter
8.25 ±110% -6.1 2.14 ±159% perf-profile.children.cycles-pp.read
7.96 ±124% -5.5 2.49 ±158% perf-profile.children.cycles-pp.zap_pte_range
7.96 ±124% -5.5 2.49 ±158% perf-profile.children.cycles-pp.unmap_page_range
6.44 ±111% -5.3 1.19 ±244% perf-profile.children.cycles-pp.page_remove_rmap
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.new_sync_read
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.proc_reg_read_iter
6.40 ±108% -4.3 2.14 ±159% perf-profile.children.cycles-pp.show_interrupts
5.41 ±105% -4.2 1.19 ±244% perf-profile.children.cycles-pp.release_task
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.__dentry_kill
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.d_invalidate
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.proc_invalidate_siblings_dcache
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.shrink_dcache_parent
4.22 ±101% -4.2 0.00 perf-profile.children.cycles-pp.shrink_dentry_list
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.__do_sys_waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.kernel_waitid
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.do_wait
5.41 ±105% -3.6 1.79 ±169% perf-profile.children.cycles-pp.wait_task_zombie
8.36 ±154% +0.1 8.49 ±177% perf-profile.children.cycles-pp.begin_new_exec
3.82 ±101% -2.6 1.19 ±244% perf-profile.self.cycles-pp.page_remove_rmap
phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s
220000 +------------------------------------------------------------------+
| + |
210000 |-+ :: |
200000 |-+ +.++ : +. .+ + + + + +.+ |
| + : + .+ : ++ +. + .+ : : + :+ +.+ + :: : : ++ |
190000 |.+ +: ++ + + + +.+ :: +.+ +: + + : :: :+ +|
180000 |-+ + + + + +.+ + + |
| |
170000 |-+ |
160000 |-+ |
| O |
150000 |-+ OO O O OO O O O O |
140000 |-O O O O OO OO O O |
| O O O O O O O |
130000 +------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Sun, Nov 28, 2021 at 11:06:58PM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -26.3% regression of phoronix-test-suite.tiobench.RandomWrite.64MB.8.mb_s due to commit:
>
>
> commit: b4d95a034cffb1e4424874645549d3cac2de5c02 ("[PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
> url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 8c92606ab81086db00cbb73347d124b4eb169b7e
>
> in testcase: phoronix-test-suite
> on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 128G memory
> with following parameters:
>
> test: tiobench-1.3.1
> option_a: Random Write
> option_b: 64MB
> option_c: 8
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
> test-url: http://www.phoronix-test-suite.com/
>
Ok, while I'm surprised there is a difference with tiobench, there
definitely is a problem with the patch and a v3 is needed. Am queueing
up a test of v3 but the diff is
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9adeaa89ccb4..fee2930745ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2244,25 +2244,21 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Calculate allowed NUMA imbalance */
for_each_cpu(i, cpu_map) {
+ int imb_numa_nr = 0;
+
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
struct sched_domain *child = sd->child;
- if (!(sd->flags & SD_SHARE_PKG_RESOURCES) &&
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
(child->flags & SD_SHARE_PKG_RESOURCES)) {
- struct sched_domain *sd_numa = sd;
- int imb_numa_nr, nr_groups;
+ int nr_groups;
nr_groups = sd->span_weight / child->span_weight;
- imb_numa_nr = nr_groups / num_online_nodes();
-
- while (sd_numa) {
- if (sd_numa->flags & SD_NUMA) {
- sd_numa->imb_numa_nr = imb_numa_nr;
- break;
- }
- sd_numa = sd_numa->parent;
- }
+ imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
+ (nr_groups * num_online_nodes()));
}
+
+ sd->imb_numa_nr = imb_numa_nr;
}
}
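For illustration, the following minimal userspace sketch shows what the
revised calculation yields. The topology values are assumptions for the
example only (2 online nodes, each with 8 LLCs of 16 CPUs), not taken
from any particular machine.
#include <stdio.h>
/*
 * Userspace sketch of the v3 imb_numa_nr calculation above. The
 * topology values are illustrative assumptions only.
 */
int main(void)
{
	unsigned int nr_nodes = 2;	/* num_online_nodes() */
	unsigned int sd_span = 128;	/* sd->span_weight above the LLC */
	unsigned int llc_span = 16;	/* child->span_weight (one LLC) */
	unsigned int nr_groups = sd_span / llc_span;
	unsigned int imb_numa_nr = (llc_span >> 1) / (nr_groups * nr_nodes);

	if (imb_numa_nr < 1)
		imb_numa_nr = 1;	/* the max(1U, ...) clamp */

	printf("imb_numa_nr = %u\n", imb_numa_nr);	/* prints 1 here */
	return 0;
}
With a single LLC per node (say a child->span_weight of 48 under a
96-CPU NUMA domain across two nodes), the same expression gives
max(1U, 24 / 4) = 6.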