2022-05-11 18:51:16

by Mel Gorman

Subject: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

A problem was reported privately related to inconsistent performance of
NAS when parallelised with MPICH. The root of the problem is that the
initial placement is unpredictable and there can be a larger imbalance
than expected between NUMA nodes. As there is spare capacity and the faults
are local, the imbalance persists for a long time and performance suffers.

This is not 100% an "allowed imbalance" problem as setting the allowed
imbalance to 0 does not fix the issue, but the allowed imbalance contributes
to the performance problem. The unpredictable behaviour was most recently
introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of
blocked load on newly idle cpu").

With MPICH, mpirun forks hydra_pmi_proxy helpers that go to sleep before
execing the target workload. As the new tasks are sleeping, the potential
imbalance is not observed because idle_cpus does not reflect the tasks that
will be running in the near future. How bad the problem is depends on the
timing of the forks and whether the new tasks are running by then.
Consequently, a large initial imbalance may not be detected until the
workload is fully running. Once running, NUMA Balancing picks the preferred
node based on locality, and runtime load balancing often ignores the tasks
because can_migrate_task() fails for either locality or task_hot reasons and
instead picks unrelated tasks.

This is the min, max, range and mean of the run time for mg.D parallelised
with MPICH using ~25% of the CPUs on a 2-socket machine (80 CPUs, 16 active
for mg.D due to its limitations).

v5.3 Min 95.84 Max 96.55 Range 0.71 Mean 96.16
v5.7 Min 95.44 Max 96.51 Range 1.07 Mean 96.14
v5.8 Min 96.02 Max 197.08 Range 101.06 Mean 154.70
v5.12 Min 104.45 Max 111.03 Range 6.58 Mean 105.94
v5.13 Min 104.38 Max 170.37 Range 65.99 Mean 117.35
v5.13-revert-c6f886546cb8 Min 104.40 Max 110.70 Range 6.30 Mean 105.68
v5.18rc4-baseline Min 104.46 Max 169.04 Range 64.58 Mean 130.49
v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range 3.31 Mean 114.71
v5.18rc4-this_series Min 95.24 Max 175.33 Range 80.09 Mean 108.91
v5.18rc4-this_series+revert Min 95.24 Max 99.87 Range 4.63 Mean 96.54

This shows that performance has been unpredictable for a long time for
this workload. Instability was introduced somewhere between v5.7 and v5.8,
fixed in v5.12 and broken again since v5.13. The reverts against 5.13
and 5.18-rc4 show that c6f886546cb8 is the primary source of instability,
although the best case is still worse than 5.7.

This series addresses the allowed imbalance problems to get the peak
performance back to the 5.7 level, although only some of the time due to
the instability problem. The series plus the revert is both stable and has
slightly better peak performance and similar average performance. I'm
not convinced commit c6f886546cb8 is wrong but I haven't isolated exactly
why it is unstable, so for now I'm just noting that it has an issue.

Patch 1 initialises numa_migrate_retry. While this resolves itself
eventually, it is unpredictable early in the lifetime of
a task.

Patch 2 will not swap NUMA tasks in the same NUMA group or without
a NUMA group if there is spare capacity. Swapping is just
punishing one task to help another.

Patch 3 fixes an issue where a larger imbalance can be created at
fork time than would be allowed at run time. This behaviour
can help some workloads that are short lived and prefer
to remain local but it punishes long-lived tasks that are
memory intensive.

Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
better approximate the number of memory channels, at least
for x86-64.

kernel/sched/fair.c | 59 ++++++++++++++++++++++++++---------------
kernel/sched/topology.c | 23 ++++++++++------
2 files changed, 53 insertions(+), 29 deletions(-)

--
2.34.1



2022-05-11 19:26:46

by Mel Gorman

Subject: [PATCH 2/4] sched/numa: Do not swap tasks between nodes when spare capacity is available

If a destination node has spare capacity but there is an imbalance then
two tasks are selected for swapping. If the tasks have no NUMA group
or are within the same NUMA group, this simply shuffles tasks around
without having any impact on the compute imbalance. Instead, it just
punishes one task to help another.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867806a57119..03b1ad79d47d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1778,6 +1778,15 @@ static bool task_numa_compare(struct task_numa_env *env,
*/
cur_ng = rcu_dereference(cur->numa_group);
if (cur_ng == p_ng) {
+ /*
+ * Do not swap within a group or between tasks that have
+ * no group if there is spare capacity. Swapping does
+ * not address the load imbalance and helps one task at
+ * the cost of punishing another.
+ */
+ if (env->dst_stats.node_type == node_has_spare)
+ goto unlock;
+
imp = taskimp + task_weight(cur, env->src_nid, dist) -
task_weight(cur, env->dst_nid, dist);
/*
--
2.34.1


2022-05-11 20:18:10

by Mel Gorman

Subject: [PATCH 3/4] sched/numa: Apply imbalance limitations consistently

The imbalance limitations are applied inconsistently at fork time
and at runtime. At fork, a new task can remain local until there are
too many running tasks, even if the degree of imbalance is larger than
NUMA_IMBALANCE_MIN, which differs from the runtime behaviour. Secondly,
the imbalance figure used during load balancing is different from the
one used at NUMA placement. Load balancing uses the number of tasks
that must move to restore balance, whereas NUMA balancing uses the
total imbalance.

In combination, it is possible for a parallel workload that uses a small
number of CPUs without applying scheduler policies to have very variable
run-to-run performance.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 49 ++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03b1ad79d47d..602c05b22805 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9108,6 +9108,24 @@ static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
return running <= imb_numa_nr;
}

+#define NUMA_IMBALANCE_MIN 2
+
+static inline long adjust_numa_imbalance(int imbalance,
+ int dst_running, int imb_numa_nr)
+{
+ if (!allow_numa_imbalance(dst_running, imb_numa_nr))
+ return imbalance;
+
+ /*
+ * Allow a small imbalance based on a simple pair of communicating
+ * tasks that remain local when the destination is lightly loaded.
+ */
+ if (imbalance <= NUMA_IMBALANCE_MIN)
+ return 0;
+
+ return imbalance;
+}
+
/*
* find_idlest_group() finds and returns the least busy CPU group within the
* domain.
@@ -9245,8 +9263,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* allowed. If there is a real need of migration,
* periodic load balance will take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
+ if (!adjust_numa_imbalance(imbalance,
+ local_sgs.sum_nr_running + 1,
+ sd->imb_numa_nr)) {
return NULL;
+ }
}

/*
@@ -9334,24 +9356,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
}

-#define NUMA_IMBALANCE_MIN 2
-
-static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int imb_numa_nr)
-{
- if (!allow_numa_imbalance(dst_running, imb_numa_nr))
- return imbalance;
-
- /*
- * Allow a small imbalance based on a simple pair of communicating
- * tasks that remain local when the destination is lightly loaded.
- */
- if (imbalance <= NUMA_IMBALANCE_MIN)
- return 0;
-
- return imbalance;
-}
-
/**
* calculate_imbalance - Calculate the amount of imbalance present within the
* groups of a given sched_domain during load balance.
@@ -9436,7 +9440,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
env->migration_type = migrate_task;
lsub_positive(&nr_diff, local->sum_nr_running);
- env->imbalance = nr_diff >> 1;
+ env->imbalance = nr_diff;
} else {

/*
@@ -9445,7 +9449,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
env->migration_type = migrate_task;
env->imbalance = max_t(long, 0, (local->idle_cpus -
- busiest->idle_cpus) >> 1);
+ busiest->idle_cpus));
}

/* Consider allowing a small imbalance between NUMA groups */
@@ -9454,6 +9458,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local->sum_nr_running + 1, env->sd->imb_numa_nr);
}

+ /* Number of tasks to move to restore balance */
+ env->imbalance >>= 1;
+
return;
}

--
2.34.1


2022-05-11 23:09:36

by Mel Gorman

Subject: [PATCH 1/4] sched/numa: Initialise numa_migrate_retry

On clone, numa_migrate_retry is inherited from the parent, which means
that the first NUMA placement of a task is non-deterministic. This
affects when load balancing recognises NUMA tasks and whether to
migrate "regular", "remote" or "all" tasks between NUMA scheduler
domains.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..867806a57119 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2873,6 +2873,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
p->node_stamp = 0;
p->numa_scan_seq = mm ? mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ p->numa_migrate_retry = 0;
/* Protect against double add, see task_tick_numa and task_numa_work */
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
--
2.34.1


2022-05-18 09:30:10

by kernel test robot

Subject: [sched/numa] bb2dee337b: unixbench.score -11.2% regression



Greetings,

FYI, we noticed a -11.2% regression of unixbench.score due to commit:


commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: unixbench
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
with following parameters:

runtime: 300s
nr_task: 1
test: shell8
cpufreq_governor: performance
ucode: 0xd000331

test-description: UnixBench is the original BYTE UNIX benchmark suite, which aims to test the performance of Unix-like systems.
test-url: https://github.com/kdlucas/byte-unixbench

In addition to that, the commit also has significant impact on the following tests:

+------------------+-------------------------------------------------------------------------------------+
| testcase: change | fsmark: fsmark.files_per_sec -21.5% regression |
| test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
| test parameters | cpufreq_governor=performance |
| | disk=1SSD |
| | filesize=8K |
| | fs=f2fs |
| | iterations=8 |
| | nr_directories=16d |
| | nr_files_per_directory=256fpd |
| | nr_threads=4 |
| | sync_method=fsyncBeforeClose |
| | test_size=72G |
| | ucode=0x500320a |
+------------------+-------------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/1/debian-10.4-x86_64-20200603.cgz/300s/lkp-icl-2sp2/shell8/unixbench/0xd000331

commit:
19f9b71e9f ("sched/numa: Do not swap tasks between nodes when spare capacity is available")
bb2dee337b ("sched/numa: Apply imbalance limitations consistently")

19f9b71e9fb9c42f bb2dee337bd7d314eb7c7627e1a
---------------- ---------------------------
%stddev %change %stddev
\ | \
9456 -11.2% 8400 unixbench.score
47643 ? 2% -9.4% 43144 ? 2% unixbench.time.involuntary_context_switches
1595 ? 6% +138.4% 3802 ? 5% unixbench.time.major_page_faults
45950495 -11.1% 40848424 unixbench.time.minor_page_faults
173.45 +18.9% 206.31 unixbench.time.system_time
190.25 -16.8% 158.20 unixbench.time.user_time
1306482 -12.6% 1141414 unixbench.time.voluntary_context_switches
357454 -11.2% 317544 unixbench.workload
31.47 ? 2% +5.2% 33.10 turbostat.RAMWatt
0.14 ? 4% +0.0 0.16 mpstat.cpu.all.soft%
1.61 -0.2 1.44 mpstat.cpu.all.usr%
1149 ? 10% +32.8% 1527 ? 23% sched_debug.cfs_rq:/.runnable_avg.max
1149 ? 10% +32.8% 1527 ? 23% sched_debug.cfs_rq:/.util_avg.max
53195 -10.5% 47596 vmstat.system.cs
271972 -6.1% 255312 ? 12% vmstat.system.in
13345450 ? 12% -26.6% 9796608 ? 5% meminfo.DirectMap2M
22508 ? 2% +13.0% 25424 ? 3% meminfo.KernelStack
7966 ? 3% +151.2% 20013 ? 15% meminfo.PageTables
56391 ? 18% -57.7% 23845 ? 17% numa-vmstat.node0.nr_anon_pages
58577 ? 17% -55.3% 26191 ? 17% numa-vmstat.node0.nr_inactive_anon
11721 ? 7% +18.0% 13830 ? 5% numa-vmstat.node0.nr_kernel_stack
1110 ? 34% +126.3% 2512 ? 27% numa-vmstat.node0.nr_page_table_pages
58577 ? 17% -55.3% 26192 ? 17% numa-vmstat.node0.nr_zone_inactive_anon
16891 ? 63% +205.2% 51557 ? 6% numa-vmstat.node1.nr_anon_pages
18994 ? 53% +183.0% 53759 ? 6% numa-vmstat.node1.nr_inactive_anon
894.83 ? 41% +182.2% 2525 ? 16% numa-vmstat.node1.nr_page_table_pages
18994 ? 53% +183.0% 53759 ? 6% numa-vmstat.node1.nr_zone_inactive_anon
83590 ? 13% -73.7% 21988 ? 32% numa-meminfo.node0.AnonHugePages
225657 ? 18% -58.0% 94847 ? 18% numa-meminfo.node0.AnonPages
231652 ? 17% -55.3% 103657 ? 16% numa-meminfo.node0.AnonPages.max
234525 ? 17% -55.5% 104341 ? 18% numa-meminfo.node0.Inactive
234397 ? 17% -55.5% 104267 ? 18% numa-meminfo.node0.Inactive(anon)
11724 ? 7% +17.5% 13781 ? 5% numa-meminfo.node0.KernelStack
4472 ? 34% +117.1% 9708 ? 31% numa-meminfo.node0.PageTables
15239 ? 75% +401.2% 76387 ? 10% numa-meminfo.node1.AnonHugePages
67256 ? 63% +206.3% 205994 ? 6% numa-meminfo.node1.AnonPages
73568 ? 58% +193.1% 215644 ? 6% numa-meminfo.node1.AnonPages.max
75737 ? 53% +183.9% 215053 ? 6% numa-meminfo.node1.Inactive
75709 ? 53% +183.9% 214971 ? 6% numa-meminfo.node1.Inactive(anon)
3559 ? 42% +187.1% 10216 ? 8% numa-meminfo.node1.PageTables
73240 +2.7% 75223 proc-vmstat.nr_anon_pages
77537 +2.9% 79817 proc-vmstat.nr_inactive_anon
22505 ? 2% +12.8% 25387 ? 3% proc-vmstat.nr_kernel_stack
2003 ? 3% +148.7% 4982 ? 20% proc-vmstat.nr_page_table_pages
61769 -1.5% 60836 proc-vmstat.nr_slab_unreclaimable
77537 +2.9% 79817 proc-vmstat.nr_zone_inactive_anon
33311917 -11.1% 29603705 proc-vmstat.numa_hit
33210979 -11.2% 29488489 proc-vmstat.numa_local
2797 ? 6% +459.4% 15647 proc-vmstat.pgactivate
33306223 -11.1% 29598069 proc-vmstat.pgalloc_normal
46257005 -11.0% 41162286 proc-vmstat.pgfault
33119124 -11.2% 29411010 proc-vmstat.pgfree
2595965 -11.4% 2300851 proc-vmstat.pgreuse
1506 -9.9% 1357 proc-vmstat.thp_fault_alloc
635705 -11.2% 564736 proc-vmstat.unevictable_pgs_culled
10.72 +3.8% 11.12 perf-stat.i.MPKI
4.952e+09 -10.1% 4.454e+09 perf-stat.i.branch-instructions
1.77 +0.0 1.80 perf-stat.i.branch-miss-rate%
86802815 -8.5% 79414679 perf-stat.i.branch-misses
4.38 +10.3 14.65 perf-stat.i.cache-miss-rate%
10798607 +238.6% 36565778 perf-stat.i.cache-misses
2.638e+08 -6.7% 2.461e+08 perf-stat.i.cache-references
54996 -10.7% 49115 perf-stat.i.context-switches
0.85 +10.2% 0.93 ? 3% perf-stat.i.cpi
1071 ? 3% +91.7% 2053 perf-stat.i.cpu-migrations
1995 ? 2% -62.9% 739.38 ? 4% perf-stat.i.cycles-between-cache-misses
2713827 ? 2% -6.7% 2531773 perf-stat.i.dTLB-load-misses
6.269e+09 -10.4% 5.616e+09 perf-stat.i.dTLB-loads
3635648 -11.1% 3230371 perf-stat.i.dTLB-store-misses
3.66e+09 -11.0% 3.256e+09 perf-stat.i.dTLB-stores
2.398e+10 -10.0% 2.158e+10 perf-stat.i.instructions
1.19 -9.4% 1.08 ? 3% perf-stat.i.ipc
25.18 ? 5% +137.6% 59.84 ? 4% perf-stat.i.major-faults
102.15 +77.7% 181.53 perf-stat.i.metric.K/sec
118.29 -10.4% 106.01 perf-stat.i.metric.M/sec
709279 -11.0% 631605 perf-stat.i.minor-faults
67.04 +25.5 92.50 perf-stat.i.node-load-miss-rate%
820091 +792.0% 7315370 perf-stat.i.node-load-misses
406593 ? 4% +30.7% 531337 perf-stat.i.node-loads
5.29 ? 4% +34.5 39.84 perf-stat.i.node-store-miss-rate%
190224 ? 3% +1726.4% 3474284 perf-stat.i.node-store-misses
4299275 +19.9% 5154130 perf-stat.i.node-stores
709304 -10.9% 631665 perf-stat.i.page-faults
11.00 +3.7% 11.41 perf-stat.overall.MPKI
1.75 +0.0 1.78 perf-stat.overall.branch-miss-rate%
4.09 +10.8 14.86 perf-stat.overall.cache-miss-rate%
0.84 +10.8% 0.93 ? 2% perf-stat.overall.cpi
1855 -70.5% 546.43 ? 2% perf-stat.overall.cycles-between-cache-misses
1.20 -9.7% 1.08 ? 3% perf-stat.overall.ipc
66.87 +26.4 93.23 perf-stat.overall.node-load-miss-rate%
4.24 ? 4% +36.0 40.26 perf-stat.overall.node-store-miss-rate%
4254574 +1.1% 4300586 perf-stat.overall.path-length
4.874e+09 -10.0% 4.384e+09 perf-stat.ps.branch-instructions
85435301 -8.5% 78170086 perf-stat.ps.branch-misses
10628137 +238.7% 35992941 perf-stat.ps.cache-misses
2.597e+08 -6.7% 2.423e+08 perf-stat.ps.cache-references
54129 -10.7% 48346 perf-stat.ps.context-switches
1054 ? 3% +91.7% 2021 perf-stat.ps.cpu-migrations
2670954 ? 2% -6.7% 2492108 perf-stat.ps.dTLB-load-misses
6.17e+09 -10.4% 5.528e+09 perf-stat.ps.dTLB-loads
3578206 -11.1% 3179761 perf-stat.ps.dTLB-store-misses
3.602e+09 -11.0% 3.205e+09 perf-stat.ps.dTLB-stores
2.36e+10 -10.0% 2.124e+10 perf-stat.ps.instructions
24.78 ? 5% +137.7% 58.90 ? 4% perf-stat.ps.major-faults
698073 -10.9% 621710 perf-stat.ps.minor-faults
807155 +792.1% 7200767 perf-stat.ps.node-load-misses
400170 ? 4% +30.7% 523004 perf-stat.ps.node-loads
187229 ? 4% +1726.6% 3419862 perf-stat.ps.node-store-misses
4231370 +19.9% 5073383 perf-stat.ps.node-stores
698098 -10.9% 621769 perf-stat.ps.page-faults
1.521e+12 -10.2% 1.366e+12 perf-stat.total.instructions
13.69 ? 30% -10.9 2.76 ? 69% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.__libc_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.generic_file_write_iter.new_sync_write.vfs_write.ksys_write.do_syscall_64
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.calltrace.cycles-pp.__generic_file_write_iter.generic_file_write_iter.new_sync_write.vfs_write.ksys_write
14.53 ? 48% -10.1 4.38 ? 36% perf-profile.calltrace.cycles-pp.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.new_sync_write.vfs_write
6.10 ? 30% -4.9 1.22 ? 65% perf-profile.calltrace.cycles-pp.event_function_call.perf_event_release_kernel.perf_release.__fput.task_work_run
5.87 ? 31% -4.7 1.22 ? 65% perf-profile.calltrace.cycles-pp.smp_call_function_single.event_function_call.perf_event_release_kernel.perf_release.__fput
5.40 ? 33% -4.1 1.26 ? 81% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.fault_in_readable.fault_in_iov_iter_readable.generic_perform_write.__generic_file_write_iter
5.62 ? 53% -4.0 1.67 ? 54% perf-profile.calltrace.cycles-pp.fault_in_iov_iter_readable.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.new_sync_write
3.54 ? 50% -2.1 1.42 ? 39% perf-profile.calltrace.cycles-pp.copy_page_from_iter_atomic.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.new_sync_write
0.56 ? 75% +0.6 1.19 ? 25% perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.mmput.exit_mm.do_exit
0.51 ? 77% +0.7 1.19 ? 25% perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.mmput.exit_mm
0.28 ?143% +0.7 1.01 ? 23% perf-profile.calltrace.cycles-pp.fast_imageblit.sys_imageblit.drm_fbdev_fb_imageblit.bit_putcs.fbcon_putcs
0.28 ?143% +0.7 1.01 ? 23% perf-profile.calltrace.cycles-pp.drm_fbdev_fb_imageblit.bit_putcs.fbcon_putcs.fbcon_redraw.fbcon_scroll
0.28 ?143% +0.7 1.01 ? 23% perf-profile.calltrace.cycles-pp.sys_imageblit.drm_fbdev_fb_imageblit.bit_putcs.fbcon_putcs.fbcon_redraw
0.40 ?106% +0.9 1.32 ? 52% perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
0.40 ?106% +1.2 1.56 ? 42% perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.mmput
1.50 ?113% +1.9 3.45 ? 33% perf-profile.calltrace.cycles-pp.delay_tsc.wait_for_xmitr.serial8250_console_putchar.uart_console_write.serial8250_console_write
1.93 ? 26% +2.4 4.29 ? 46% perf-profile.calltrace.cycles-pp.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.85 ? 30% +2.4 5.21 ? 42% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_idle_do_entry.acpi_idle_enter
2.66 ? 28% +2.4 5.08 ? 44% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_idle_do_entry
4.98 ? 32% +3.5 8.50 ? 30% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state
6.91 ? 24% +3.7 10.62 ? 28% perf-profile.calltrace.cycles-pp.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
17.43 ? 25% +12.0 29.40 ? 24% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
14.65 ? 48% -10.3 4.38 ? 36% perf-profile.children.cycles-pp.generic_file_write_iter
14.65 ? 48% -10.3 4.38 ? 36% perf-profile.children.cycles-pp.__generic_file_write_iter
14.61 ? 48% -10.2 4.38 ? 36% perf-profile.children.cycles-pp.generic_perform_write
14.60 ? 48% -10.2 4.38 ? 36% perf-profile.children.cycles-pp.__libc_write
7.23 ? 30% -5.7 1.52 ? 66% perf-profile.children.cycles-pp.asm_sysvec_call_function
6.17 ? 29% -4.9 1.24 ? 67% perf-profile.children.cycles-pp.event_function_call
6.05 ? 31% -4.8 1.24 ? 67% perf-profile.children.cycles-pp.smp_call_function_single
5.62 ? 53% -3.9 1.74 ? 45% perf-profile.children.cycles-pp.fault_in_iov_iter_readable
5.60 ? 53% -3.9 1.74 ? 45% perf-profile.children.cycles-pp.fault_in_readable
3.54 ? 50% -2.1 1.42 ? 39% perf-profile.children.cycles-pp.copy_page_from_iter_atomic
2.17 ? 53% -1.7 0.46 ? 68% perf-profile.children.cycles-pp.mutex_lock
1.91 ? 54% -1.6 0.36 ? 62% perf-profile.children.cycles-pp.swevent_hlist_put_cpu
2.11 ? 15% -1.1 0.98 ? 46% perf-profile.children.cycles-pp.__alloc_pages
1.40 ? 42% -1.1 0.32 ?111% perf-profile.children.cycles-pp.shmem_write_end
1.26 ? 30% -1.0 0.22 ? 79% perf-profile.children.cycles-pp.alloc_pages_vma
1.34 ? 20% -1.0 0.32 ? 78% perf-profile.children.cycles-pp.get_page_from_freelist
1.53 ? 47% -0.9 0.64 ? 48% perf-profile.children.cycles-pp.__might_resched
0.72 ? 44% -0.5 0.23 ? 86% perf-profile.children.cycles-pp.__pagevec_lru_add
0.61 ? 40% -0.5 0.14 ?111% perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.46 ? 28% -0.3 0.16 ? 82% perf-profile.children.cycles-pp.call_rcu
0.53 ? 40% -0.2 0.30 ? 70% perf-profile.children.cycles-pp.folio_add_lru
0.03 ?223% +0.2 0.23 ? 22% perf-profile.children.cycles-pp.alloc_bprm
0.03 ?223% +0.3 0.37 ? 73% perf-profile.children.cycles-pp.tick_sched_do_timer
0.10 ?141% +0.4 0.48 ? 50% perf-profile.children.cycles-pp.unlink_file_vma
0.30 ? 87% +0.7 0.96 ? 43% perf-profile.children.cycles-pp.__open64_nocancel
1.03 ? 38% +0.8 1.84 ? 35% perf-profile.children.cycles-pp.zap_pmd_range
1.08 ? 37% +0.8 1.92 ? 39% perf-profile.children.cycles-pp.unmap_page_range
1.16 ? 41% +0.9 2.03 ? 38% perf-profile.children.cycles-pp.unmap_vmas
0.89 ? 47% +1.1 2.02 ? 33% perf-profile.children.cycles-pp.exec_mmap
0.92 ? 45% +1.2 2.16 ? 33% perf-profile.children.cycles-pp.begin_new_exec
1.20 ? 51% +1.4 2.62 ? 60% perf-profile.children.cycles-pp.ktime_get
1.70 ? 89% +1.8 3.54 ? 33% perf-profile.children.cycles-pp.delay_tsc
1.83 ? 29% +2.0 3.80 ? 39% perf-profile.children.cycles-pp.exec_binprm
1.83 ? 29% +2.0 3.80 ? 39% perf-profile.children.cycles-pp.search_binary_handler
1.73 ? 26% +2.0 3.72 ? 38% perf-profile.children.cycles-pp.load_elf_binary
3.12 ? 22% +2.3 5.46 ? 47% perf-profile.children.cycles-pp.hrtimer_interrupt
1.96 ? 25% +2.5 4.42 ? 43% perf-profile.children.cycles-pp.bprm_execve
2.57 ? 20% +2.8 5.39 ? 45% perf-profile.children.cycles-pp.__x64_sys_execve
2.57 ? 20% +2.8 5.39 ? 45% perf-profile.children.cycles-pp.do_execveat_common
5.50 ? 28% +3.4 8.95 ? 32% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
12.38 ? 24% +7.3 19.71 ? 24% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
5.84 ? 31% -4.6 1.22 ? 65% perf-profile.self.cycles-pp.smp_call_function_single
2.92 ? 50% -2.0 0.88 ? 53% perf-profile.self.cycles-pp.fault_in_readable
1.56 ? 58% -1.2 0.33 ? 70% perf-profile.self.cycles-pp.mutex_lock
1.27 ? 54% -1.0 0.23 ?114% perf-profile.self.cycles-pp.shmem_write_end
0.41 ? 32% +0.3 0.71 ? 35% perf-profile.self.cycles-pp.update_rq_clock
0.03 ?223% +0.3 0.34 ? 75% perf-profile.self.cycles-pp.tick_sched_do_timer
0.43 ? 48% +0.6 1.00 ? 39% perf-profile.self.cycles-pp.zap_pte_range
0.99 ? 55% +1.4 2.39 ? 70% perf-profile.self.cycles-pp.ktime_get
1.70 ? 89% +1.8 3.54 ? 33% perf-profile.self.cycles-pp.delay_tsc


***************************************************************************************************
lkp-csl-2ap4: 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
=========================================================================================
compiler/cpufreq_governor/disk/filesize/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase/ucode:
gcc-11/performance/1SSD/8K/f2fs/8/x86_64-rhel-8.3/16d/256fpd/4/debian-10.4-x86_64-20200603.cgz/fsyncBeforeClose/lkp-csl-2ap4/72G/fsmark/0x500320a

commit:
19f9b71e9f ("sched/numa: Do not swap tasks between nodes when spare capacity is available")
bb2dee337b ("sched/numa: Apply imbalance limitations consistently")

19f9b71e9fb9c42f bb2dee337bd7d314eb7c7627e1a
---------------- ---------------------------
%stddev %change %stddev
\ | \
4147579 +4.2% 4322142 ? 2% fsmark.app_overhead
38252 ? 4% -21.5% 30012 ? 2% fsmark.files_per_sec
309.12 ? 3% +22.6% 378.94 ? 2% fsmark.time.elapsed_time
309.12 ? 3% +22.6% 378.94 ? 2% fsmark.time.elapsed_time.max
34264 ? 19% -57.4% 14599 ? 13% fsmark.time.involuntary_context_switches
212677 ? 12% -85.7% 30395 ? 42% fsmark.time.minor_page_faults
213.00 ? 2% +10.8% 236.00 fsmark.time.percent_of_cpu_this_job_got
628.50 ? 7% +37.8% 866.05 ? 3% fsmark.time.system_time
5.771e+10 ? 3% +22.7% 7.082e+10 ? 2% cpuidle..time
1.479e+08 ? 12% +19.2% 1.764e+08 cpuidle..usage
0.16 ? 3% +0.0 0.21 ? 3% mpstat.cpu.all.iowait%
1.11 ? 4% +0.1 1.21 mpstat.cpu.all.sys%
1040563 ? 99% +598.4% 7267234 ? 15% numa-numastat.node0.local_node
1121094 ? 92% +554.2% 7334262 ? 15% numa-numastat.node0.numa_hit
361.90 ? 3% +19.4% 432.10 uptime.boot
66590 ? 3% +19.8% 79803 uptime.idle
28220 ? 56% -68.3% 8945 ? 79% numa-meminfo.node0.Shmem
917.83 ? 8% +26.1% 1157 ? 16% numa-meminfo.node2.PageTables
274322 ? 26% -41.6% 160276 ? 24% numa-meminfo.node3.AnonPages.max
270.83 -19.9% 217.00 ? 2% vmstat.io.bi
456385 ? 3% -18.0% 374348 ? 2% vmstat.io.bo
165973 -18.2% 135713 ? 4% vmstat.system.cs
13850 ? 5% +10.8% 15340 ? 4% meminfo.Active(anon)
250973 +15.1% 288881 meminfo.AnonHugePages
55246 ? 2% -11.5% 48894 ? 3% meminfo.Shmem
812.17 ? 24% +164.0% 2144 ? 30% meminfo.Writeback
2011 ? 5% +22.2% 2458 ? 6% turbostat.Bzy_MHz
1.32e+08 ? 13% +24.5% 1.643e+08 ? 2% turbostat.IRQ
0.35 ? 2% -0.1 0.28 ? 3% turbostat.POLL%
96.62 ? 7% +23.8% 119.66 ? 4% turbostat.PkgWatt
7222 ? 57% -69.0% 2236 ? 79% numa-vmstat.node0.nr_shmem
25.17 ? 78% +521.2% 156.33 ? 22% numa-vmstat.node0.nr_writeback
1120967 ? 92% +554.2% 7333927 ? 15% numa-vmstat.node0.numa_hit
1040437 ? 99% +598.4% 7266899 ? 15% numa-vmstat.node0.numa_local
44.17 ? 91% +243.0% 151.50 ? 18% numa-vmstat.node1.nr_writeback
0.20 ? 5% +13.0% 0.23 ? 5% sched_debug.cfs_rq:/.h_nr_running.stddev
0.20 ? 5% +12.4% 0.23 ? 5% sched_debug.cfs_rq:/.nr_running.stddev
156695 ? 7% +14.2% 178962 ? 6% sched_debug.cpu.avg_idle.stddev
191435 ? 7% +18.7% 227289 ? 4% sched_debug.cpu.clock.avg
191444 ? 7% +18.7% 227298 ? 4% sched_debug.cpu.clock.max
191425 ? 7% +18.7% 227279 ? 4% sched_debug.cpu.clock.min
189015 ? 7% +19.1% 225057 ? 4% sched_debug.cpu.clock_task.avg
190352 ? 7% +18.7% 225903 ? 4% sched_debug.cpu.clock_task.max
179649 ? 8% +20.1% 215848 ? 4% sched_debug.cpu.clock_task.min
1498 ? 2% +10.6% 1657 ? 4% sched_debug.cpu.curr->pid.stddev
375793 ± 9% -21.4% 295204 ± 17% sched_debug.cpu.nr_switches.stddev
191425 ± 7% +18.7% 227279 ± 4% sched_debug.cpu_clk
190413 ± 7% +18.8% 226267 ± 4% sched_debug.ktime
192909 ± 7% +18.0% 227721 ± 4% sched_debug.sched_clk
3467 ± 4% +10.5% 3831 ± 4% proc-vmstat.nr_active_anon
102749 -2.0% 100660 proc-vmstat.nr_inactive_anon
11331 -6.1% 10642 ± 2% proc-vmstat.nr_mapped
13948 ± 3% -11.8% 12308 ± 3% proc-vmstat.nr_shmem
127329 +1.9% 129714 proc-vmstat.nr_slab_unreclaimable
199.67 ± 30% +163.0% 525.17 ± 30% proc-vmstat.nr_writeback
3467 ± 4% +10.5% 3831 ± 4% proc-vmstat.nr_zone_active_anon
102750 -2.0% 100660 proc-vmstat.nr_zone_inactive_anon
16907009 ± 16% -88.8% 1888777 ±116% proc-vmstat.numa_foreign
15129017 ± 18% +99.0% 30109594 ± 7% proc-vmstat.numa_hit
649.67 ± 8% +59.0% 1032 ± 6% proc-vmstat.numa_huge_pte_updates
14859248 ± 19% +100.8% 29841270 ± 7% proc-vmstat.numa_local
16909166 ± 16% -88.8% 1888573 ±116% proc-vmstat.numa_miss
17173636 ± 16% -87.5% 2150108 ±102% proc-vmstat.numa_other
516559 ± 5% +10.8% 572525 ± 5% proc-vmstat.numa_pte_updates
679842 -5.4% 643129 proc-vmstat.pgactivate
1688502 +4.1% 1757298 proc-vmstat.pgfault
109978 ± 3% +18.0% 129757 ± 2% proc-vmstat.pgreuse
1.724e+09 -15.5% 1.457e+09 perf-stat.i.branch-instructions
23204655 ± 4% +57.8% 36615017 ± 4% perf-stat.i.cache-misses
167616 -18.4% 136728 ± 4% perf-stat.i.context-switches
1.99 ± 7% +25.3% 2.49 ± 6% perf-stat.i.cpi
218.69 +2.6% 224.39 perf-stat.i.cpu-migrations
774.80 ± 3% -32.8% 520.33 ± 4% perf-stat.i.cycles-between-cache-misses
2.125e+09 -16.7% 1.771e+09 ± 2% perf-stat.i.dTLB-loads
1.013e+09 ± 2% -16.0% 8.514e+08 ± 2% perf-stat.i.dTLB-stores
4945258 ± 2% -9.1% 4493474 ± 3% perf-stat.i.iTLB-load-misses
7.99e+09 -15.2% 6.778e+09 perf-stat.i.instructions
1686 ± 2% -6.1% 1584 ± 2% perf-stat.i.instructions-per-iTLB-miss
0.52 ± 6% -21.1% 0.41 ± 5% perf-stat.i.ipc
1.42 ± 24% -26.9% 1.04 ± 2% perf-stat.i.major-faults
25.47 -16.6% 21.24 ± 2% perf-stat.i.metric.M/sec
4971 ± 2% -14.5% 4248 perf-stat.i.minor-faults
71.55 ± 5% +17.6 89.13 perf-stat.i.node-load-miss-rate%
2873999 ± 21% +224.2% 9317253 ± 9% perf-stat.i.node-load-misses
70.57 ± 4% +8.6 79.15 perf-stat.i.node-store-miss-rate%
1582580 ± 11% +116.9% 3432781 ± 6% perf-stat.i.node-store-misses
640907 ± 18% +38.8% 889366 ± 6% perf-stat.i.node-stores
4972 ± 2% -14.5% 4249 perf-stat.i.page-faults
22.62 ± 21% +19.2 41.84 ± 22% perf-stat.overall.cache-miss-rate%
1.92 ± 6% +29.0% 2.47 ± 5% perf-stat.overall.cpi
659.79 ± 3% -30.5% 458.63 ± 6% perf-stat.overall.cycles-between-cache-misses
1616 ± 2% -6.6% 1509 ± 3% perf-stat.overall.instructions-per-iTLB-miss
0.52 ± 5% -22.6% 0.41 ± 5% perf-stat.overall.ipc
76.92 ± 3% +12.9 89.80 perf-stat.overall.node-load-miss-rate%
71.33 ± 3% +8.1 79.41 perf-stat.overall.node-store-miss-rate%
1.718e+09 -15.4% 1.453e+09 perf-stat.ps.branch-instructions
23130416 ± 4% +57.8% 36508107 ± 4% perf-stat.ps.cache-misses
166999 -18.4% 136299 ± 4% perf-stat.ps.context-switches
218.01 +2.6% 223.77 perf-stat.ps.cpu-migrations
2.118e+09 -16.6% 1.765e+09 ± 2% perf-stat.ps.dTLB-loads
1.01e+09 ± 2% -15.9% 8.491e+08 ± 2% perf-stat.ps.dTLB-stores
4927968 ± 2% -9.1% 4480218 ± 3% perf-stat.ps.iTLB-load-misses
7.963e+09 -15.1% 6.758e+09 perf-stat.ps.instructions
1.42 ± 24% -26.8% 1.04 ± 2% perf-stat.ps.major-faults
4952 ± 2% -14.5% 4234 perf-stat.ps.minor-faults
2866096 ± 21% +224.1% 9289790 ± 9% perf-stat.ps.node-load-misses
1577533 ± 11% +117.0% 3422669 ± 6% perf-stat.ps.node-store-misses
638821 ± 18% +38.8% 886869 ± 6% perf-stat.ps.node-stores
4953 ± 2% -14.5% 4235 perf-stat.ps.page-faults
2.473e+12 ± 2% +3.8% 2.568e+12 perf-stat.total.instructions
7.80 ± 10% -1.5 6.32 ± 9% perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
2.14 ± 10% -1.2 0.98 ± 10% perf-profile.calltrace.cycles-pp.pagevec_lookup_range_tag.f2fs_fsync_node_pages.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64
2.13 ± 9% -1.2 0.97 ± 10% perf-profile.calltrace.cycles-pp.find_get_pages_range_tag.pagevec_lookup_range_tag.f2fs_fsync_node_pages.f2fs_do_sync_file.__x64_sys_fsync
1.54 ± 8% -0.6 0.96 ± 37% perf-profile.calltrace.cycles-pp.perf_prepare_sample.perf_event_output_forward.__perf_event_overflow.perf_tp_event.perf_trace_sched_stat_runtime
2.05 ± 10% -0.5 1.51 ± 12% perf-profile.calltrace.cycles-pp.generic_perform_write.f2fs_buffered_write_iter.f2fs_file_write_iter.new_sync_write.vfs_write
2.06 ± 10% -0.5 1.54 ± 12% perf-profile.calltrace.cycles-pp.f2fs_buffered_write_iter.f2fs_file_write_iter.new_sync_write.vfs_write.ksys_write
2.12 ± 8% -0.5 1.61 ± 13% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.io_schedule
2.15 ± 8% -0.5 1.65 ± 13% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.io_schedule.folio_wait_bit_common
1.73 ± 8% -0.5 1.25 ± 25% perf-profile.calltrace.cycles-pp.__perf_event_overflow.perf_tp_event.perf_trace_sched_stat_runtime.update_curr.dequeue_entity
1.72 ± 8% -0.5 1.24 ± 24% perf-profile.calltrace.cycles-pp.perf_event_output_forward.__perf_event_overflow.perf_tp_event.perf_trace_sched_stat_runtime.update_curr
1.94 ± 7% -0.5 1.47 ± 14% perf-profile.calltrace.cycles-pp.schedule.io_schedule.folio_wait_bit_common.folio_wait_writeback.f2fs_wait_on_page_writeback
1.94 ± 7% -0.5 1.48 ± 14% perf-profile.calltrace.cycles-pp.io_schedule.folio_wait_bit_common.folio_wait_writeback.f2fs_wait_on_page_writeback.f2fs_wait_on_node_pages_writeback
2.03 ± 6% -0.5 1.57 ± 12% perf-profile.calltrace.cycles-pp.folio_wait_bit_common.folio_wait_writeback.f2fs_wait_on_page_writeback.f2fs_wait_on_node_pages_writeback.f2fs_do_sync_file
2.05 ± 6% -0.5 1.59 ± 12% perf-profile.calltrace.cycles-pp.folio_wait_writeback.f2fs_wait_on_page_writeback.f2fs_wait_on_node_pages_writeback.f2fs_do_sync_file.__x64_sys_fsync
1.92 ± 8% -0.5 1.46 ± 14% perf-profile.calltrace.cycles-pp.update_curr.dequeue_entity.dequeue_task_fair.__schedule.schedule
1.80 ± 8% -0.4 1.39 ± 14% perf-profile.calltrace.cycles-pp.perf_trace_sched_stat_runtime.update_curr.dequeue_entity.dequeue_task_fair.__schedule
1.77 ± 8% -0.4 1.36 ± 14% perf-profile.calltrace.cycles-pp.perf_tp_event.perf_trace_sched_stat_runtime.update_curr.dequeue_entity.dequeue_task_fair
0.67 ± 19% -0.4 0.30 ±100% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
0.86 ± 14% -0.3 0.55 ± 47% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.locked_inode_to_wb_and_lock_list.__mark_inode_dirty.f2fs_update_parent_metadata
0.89 ± 14% -0.2 0.67 ± 16% perf-profile.calltrace.cycles-pp._raw_spin_lock.locked_inode_to_wb_and_lock_list.__mark_inode_dirty.f2fs_update_parent_metadata.f2fs_add_regular_entry
0.89 ± 13% -0.2 0.69 ± 14% perf-profile.calltrace.cycles-pp.locked_inode_to_wb_and_lock_list.__mark_inode_dirty.f2fs_update_parent_metadata.f2fs_add_regular_entry.f2fs_add_dentry
0.76 ± 7% +0.2 0.95 ± 11% perf-profile.calltrace.cycles-pp.__percpu_counter_sum.f2fs_space_for_roll_forward.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64
0.80 ± 6% +0.2 1.02 ± 11% perf-profile.calltrace.cycles-pp.f2fs_space_for_roll_forward.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.08 ± 16% +0.4 1.43 ± 11% perf-profile.calltrace.cycles-pp.__softirqentry_text_start.__irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
1.60 ± 10% +0.5 2.06 ± 7% perf-profile.calltrace.cycles-pp.f2fs_new_inode_page.f2fs_init_inode_metadata.f2fs_add_regular_entry.f2fs_add_dentry.f2fs_do_add_link
1.59 ± 10% +0.5 2.06 ± 7% perf-profile.calltrace.cycles-pp.f2fs_new_node_page.f2fs_new_inode_page.f2fs_init_inode_metadata.f2fs_add_regular_entry.f2fs_add_dentry
0.29 ±100% +0.5 0.76 ± 11% perf-profile.calltrace.cycles-pp.rebalance_domains.__softirqentry_text_start.__irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
3.68 ± 6% +0.5 4.16 ± 8% perf-profile.calltrace.cycles-pp.f2fs_do_add_link.f2fs_create.lookup_open.open_last_lookups.path_openat
0.79 ± 11% +0.5 1.27 ± 11% perf-profile.calltrace.cycles-pp.f2fs_preallocate_blocks.f2fs_file_write_iter.new_sync_write.vfs_write.ksys_write
3.56 ± 7% +0.5 4.07 ± 8% perf-profile.calltrace.cycles-pp.f2fs_add_regular_entry.f2fs_add_dentry.f2fs_do_add_link.f2fs_create.lookup_open
3.56 ± 7% +0.5 4.07 ± 8% perf-profile.calltrace.cycles-pp.f2fs_add_dentry.f2fs_do_add_link.f2fs_create.lookup_open.open_last_lookups
0.86 ± 24% +0.5 1.39 ± 29% perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues
1.81 ± 11% +0.6 2.36 ± 8% perf-profile.calltrace.cycles-pp.f2fs_init_inode_metadata.f2fs_add_regular_entry.f2fs_add_dentry.f2fs_do_add_link.f2fs_create
0.18 ±141% +0.6 0.75 ± 10% perf-profile.calltrace.cycles-pp.f2fs_convert_inline_inode.f2fs_preallocate_blocks.f2fs_file_write_iter.new_sync_write.vfs_write
0.74 ± 25% +0.6 1.36 ± 8% perf-profile.calltrace.cycles-pp.f2fs_do_write_node_page.__write_node_page.f2fs_fsync_node_pages.f2fs_do_sync_file.__x64_sys_fsync
0.74 ± 25% +0.6 1.36 ± 8% perf-profile.calltrace.cycles-pp.do_write_page.f2fs_do_write_node_page.__write_node_page.f2fs_fsync_node_pages.f2fs_do_sync_file
0.00 +0.6 0.63 ± 6% perf-profile.calltrace.cycles-pp.set_node_addr.f2fs_new_node_page.f2fs_new_inode_page.f2fs_init_inode_metadata.f2fs_add_regular_entry
0.10 ±223% +0.7 0.77 ± 8% perf-profile.calltrace.cycles-pp.f2fs_allocate_data_block.do_write_page.f2fs_do_write_node_page.__write_node_page.f2fs_fsync_node_pages
0.00 +0.7 0.68 ± 8% perf-profile.calltrace.cycles-pp.f2fs_submit_page_write.do_write_page.f2fs_outplace_write_data.f2fs_do_write_data_page.f2fs_write_single_data_page
0.53 ± 52% +0.7 1.26 ± 11% perf-profile.calltrace.cycles-pp.f2fs_allocate_data_block.do_write_page.f2fs_outplace_write_data.f2fs_do_write_data_page.f2fs_write_single_data_page
1.48 ± 20% +0.8 2.32 ± 25% perf-profile.calltrace.cycles-pp.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
1.42 ± 21% +0.8 2.26 ± 26% perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt
5.87 ± 7% +1.0 6.85 ± 8% perf-profile.calltrace.cycles-pp.f2fs_create.lookup_open.open_last_lookups.path_openat.do_filp_open
1.10 ± 20% +1.0 2.10 ± 9% perf-profile.calltrace.cycles-pp.do_write_page.f2fs_outplace_write_data.f2fs_do_write_data_page.f2fs_write_single_data_page.f2fs_write_cache_pages
0.66 ± 58% +1.0 1.67 ± 35% perf-profile.calltrace.cycles-pp.timekeeping_max_deferment.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call
1.42 ± 18% +1.1 2.52 ± 9% perf-profile.calltrace.cycles-pp.f2fs_outplace_write_data.f2fs_do_write_data_page.f2fs_write_single_data_page.f2fs_write_cache_pages.__f2fs_write_data_pages
2.74 ± 22% +1.2 3.96 ± 18% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
1.46 ± 23% +1.2 2.69 ± 7% perf-profile.calltrace.cycles-pp.__write_node_page.f2fs_fsync_node_pages.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64
1.48 ± 30% +1.3 2.73 ± 30% perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
1.23 ± 34% +1.3 2.50 ± 33% perf-profile.calltrace.cycles-pp.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle
0.19 ±223% +1.3 1.53 ± 34% perf-profile.calltrace.cycles-pp.handle_edge_irq.__common_interrupt.common_interrupt.asm_common_interrupt.poll_idle
0.19 ±223% +1.3 1.53 ± 34% perf-profile.calltrace.cycles-pp.__common_interrupt.common_interrupt.asm_common_interrupt.poll_idle.cpuidle_enter_state
0.20 ±223% +1.4 1.57 ± 34% perf-profile.calltrace.cycles-pp.common_interrupt.asm_common_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter
0.20 ±223% +1.4 1.58 ± 34% perf-profile.calltrace.cycles-pp.asm_common_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
4.62 ± 20% +1.4 6.04 ± 10% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
4.52 ± 21% +1.4 5.96 ± 11% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
6.28 ± 8% +1.6 7.90 ± 10% perf-profile.calltrace.cycles-pp.file_write_and_wait_range.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.15 ± 19% +1.7 3.87 ± 9% perf-profile.calltrace.cycles-pp.f2fs_do_write_data_page.f2fs_write_single_data_page.f2fs_write_cache_pages.__f2fs_write_data_pages.do_writepages
2.38 ± 18% +1.8 4.20 ± 9% perf-profile.calltrace.cycles-pp.f2fs_write_single_data_page.f2fs_write_cache_pages.__f2fs_write_data_pages.do_writepages.filemap_fdatawrite_wbc
3.78 ± 14% +2.0 5.79 ± 10% perf-profile.calltrace.cycles-pp.do_writepages.filemap_fdatawrite_wbc.file_write_and_wait_range.f2fs_do_sync_file.__x64_sys_fsync
3.78 ± 14% +2.0 5.80 ± 10% perf-profile.calltrace.cycles-pp.filemap_fdatawrite_wbc.file_write_and_wait_range.f2fs_do_sync_file.__x64_sys_fsync.do_syscall_64
3.40 ± 16% +2.0 5.42 ± 9% perf-profile.calltrace.cycles-pp.f2fs_write_cache_pages.__f2fs_write_data_pages.do_writepages.filemap_fdatawrite_wbc.file_write_and_wait_range
3.76 ± 15% +2.0 5.78 ± 10% perf-profile.calltrace.cycles-pp.__f2fs_write_data_pages.do_writepages.filemap_fdatawrite_wbc.file_write_and_wait_range.f2fs_do_sync_file
7.22 ± 20% +2.3 9.48 ± 9% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
7.82 ± 10% -1.5 6.36 ± 9% perf-profile.children.cycles-pp.poll_idle
2.23 ± 9% -1.2 1.05 ± 9% perf-profile.children.cycles-pp.pagevec_lookup_range_tag
2.22 ± 9% -1.2 1.05 ± 9% perf-profile.children.cycles-pp.find_get_pages_range_tag
2.06 ± 10% -0.5 1.54 ± 12% perf-profile.children.cycles-pp.f2fs_buffered_write_iter
2.43 ± 10% -0.5 1.94 ± 12% perf-profile.children.cycles-pp.generic_perform_write
2.40 ± 5% -0.4 1.98 ± 12% perf-profile.children.cycles-pp.dequeue_entity
0.95 ± 16% -0.4 0.60 ± 16% perf-profile.children.cycles-pp.copy_page_from_iter_atomic
0.93 ± 16% -0.3 0.58 ± 17% perf-profile.children.cycles-pp.copyin
0.93 ± 16% -0.3 0.59 ± 16% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
0.70 ± 6% -0.3 0.44 ± 3% perf-profile.children.cycles-pp.xas_find_marked
0.52 ± 9% -0.2 0.30 ± 22% perf-profile.children.cycles-pp.__pagevec_release
0.92 ± 13% -0.2 0.70 ± 14% perf-profile.children.cycles-pp.locked_inode_to_wb_and_lock_list
0.67 ± 19% -0.2 0.51 ± 20% perf-profile.children.cycles-pp.worker_thread
0.49 ± 10% -0.1 0.35 ± 10% perf-profile.children.cycles-pp.alloc_inode
0.55 ± 12% -0.1 0.41 ± 10% perf-profile.children.cycles-pp.new_inode
0.36 ± 10% -0.1 0.23 ± 14% perf-profile.children.cycles-pp.__alloc_pages
0.59 ± 6% -0.1 0.46 ± 8% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
0.49 ± 10% -0.1 0.36 ± 12% perf-profile.children.cycles-pp.kmem_cache_alloc
0.30 ± 15% -0.1 0.17 ± 14% perf-profile.children.cycles-pp.___slab_alloc
0.28 ± 27% -0.1 0.16 ± 21% perf-profile.children.cycles-pp.release_pages
0.30 ± 10% -0.1 0.18 ± 19% perf-profile.children.cycles-pp.folio_alloc
0.30 ± 10% -0.1 0.19 ± 16% perf-profile.children.cycles-pp.get_page_from_freelist
0.23 ± 20% -0.1 0.12 ± 13% perf-profile.children.cycles-pp.allocate_slab
0.43 ± 6% -0.1 0.32 ± 9% perf-profile.children.cycles-pp.f2fs_alloc_inode
0.37 ± 12% -0.1 0.27 ± 18% perf-profile.children.cycles-pp.ttwu_do_activate
0.25 ± 6% -0.1 0.16 ± 10% perf-profile.children.cycles-pp.memcg_slab_post_alloc_hook
0.25 ± 14% -0.1 0.16 ± 23% perf-profile.children.cycles-pp.lru_add_drain_cpu
0.25 ± 14% -0.1 0.17 ± 21% perf-profile.children.cycles-pp.__pagevec_lru_add
0.27 ± 12% -0.1 0.19 ± 17% perf-profile.children.cycles-pp.xas_create
0.19 ± 23% -0.1 0.11 ± 29% perf-profile.children.cycles-pp.nvme_poll_cq
0.14 ± 27% -0.1 0.07 ± 13% perf-profile.children.cycles-pp.setup_object
0.24 ± 11% -0.1 0.17 ± 16% perf-profile.children.cycles-pp.xas_expand
0.21 ± 14% -0.1 0.14 ± 22% perf-profile.children.cycles-pp.mempool_alloc
0.20 ± 15% -0.1 0.13 ± 25% perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.21 ± 10% -0.1 0.14 ± 19% perf-profile.children.cycles-pp.rmqueue
0.36 ± 12% -0.1 0.29 ± 9% perf-profile.children.cycles-pp.update_rq_clock
0.12 ± 16% -0.1 0.04 ± 45% perf-profile.children.cycles-pp.inode_init_once
0.16 ± 10% -0.1 0.10 ± 32% perf-profile.children.cycles-pp.set_next_entity
0.24 ± 14% -0.1 0.18 ± 18% perf-profile.children.cycles-pp.enqueue_entity
0.29 ± 8% -0.1 0.23 ± 4% perf-profile.children.cycles-pp.d_alloc_parallel
0.27 ± 8% -0.1 0.22 ± 5% perf-profile.children.cycles-pp.d_alloc
0.15 ± 18% -0.0 0.11 ± 16% perf-profile.children.cycles-pp.select_task_rq
0.07 ± 14% -0.0 0.03 ±100% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.30 ± 3% -0.0 0.27 ± 8% perf-profile.children.cycles-pp.exc_page_fault
0.08 ± 8% -0.0 0.04 ± 72% perf-profile.children.cycles-pp.filp_close
0.12 ± 7% -0.0 0.09 ± 10% perf-profile.children.cycles-pp.handle_mm_fault
0.12 ± 10% -0.0 0.08 ± 8% perf-profile.children.cycles-pp.__handle_mm_fault
0.15 ± 7% -0.0 0.12 ± 5% perf-profile.children.cycles-pp.do_user_addr_fault
0.09 ± 15% -0.0 0.06 ± 19% perf-profile.children.cycles-pp.rmqueue_bulk
0.10 ± 6% -0.0 0.07 ± 17% perf-profile.children.cycles-pp.perf_output_begin_forward
0.09 ± 27% -0.0 0.06 ± 23% perf-profile.children.cycles-pp.iostat_update_and_unbind_ctx
0.12 ± 18% -0.0 0.10 ± 11% perf-profile.children.cycles-pp.select_task_rq_fair
0.16 ± 10% -0.0 0.13 ± 9% perf-profile.children.cycles-pp.walk_component
0.13 ± 13% +0.0 0.16 ± 6% perf-profile.children.cycles-pp.wbt_wait
0.12 ± 12% +0.0 0.15 ± 6% perf-profile.children.cycles-pp.rq_qos_wait
0.14 ± 14% +0.0 0.17 ± 5% perf-profile.children.cycles-pp.__rq_qos_throttle
0.07 ± 20% +0.0 0.11 ± 22% perf-profile.children.cycles-pp.xas_set_mark
0.14 ± 9% +0.0 0.18 ± 9% perf-profile.children.cycles-pp.do_dentry_open
0.07 ± 48% +0.0 0.11 ± 9% perf-profile.children.cycles-pp.wbt_rqw_done
0.05 ± 46% +0.0 0.09 ± 19% perf-profile.children.cycles-pp.security_file_open
0.07 ± 46% +0.0 0.12 ± 9% perf-profile.children.cycles-pp.wbt_done
0.07 ± 47% +0.0 0.12 ± 11% perf-profile.children.cycles-pp.__rq_qos_done
0.04 ± 73% +0.0 0.08 ± 11% perf-profile.children.cycles-pp.queue_delayed_work_on
0.10 ± 12% +0.0 0.15 ± 13% perf-profile.children.cycles-pp.f2fs_del_fsync_node_entry
0.05 ± 46% +0.0 0.10 ± 17% perf-profile.children.cycles-pp.folio_add_lru
0.08 ± 20% +0.1 0.14 ± 15% perf-profile.children.cycles-pp.__xa_clear_mark
0.02 ±142% +0.1 0.07 ± 14% perf-profile.children.cycles-pp.apparmor_file_alloc_security
0.01 ±223% +0.1 0.06 ± 14% perf-profile.children.cycles-pp.rwsem_mark_wake
0.11 ± 10% +0.1 0.16 ± 13% perf-profile.children.cycles-pp.__set_nat_cache_dirty
0.03 ±100% +0.1 0.09 ± 20% perf-profile.children.cycles-pp.apparmor_file_open
0.05 ± 48% +0.1 0.12 ± 9% perf-profile.children.cycles-pp.f2fs_convert_inline_page
0.00 +0.1 0.06 ± 14% perf-profile.children.cycles-pp.f2fs_submit_merged_ipu_write
0.18 ± 19% +0.1 0.25 ± 19% perf-profile.children.cycles-pp.__intel_pmu_enable_all
0.05 ± 46% +0.1 0.12 ± 13% perf-profile.children.cycles-pp.ttwu_queue_wakelist
0.03 ±100% +0.1 0.10 ± 16% perf-profile.children.cycles-pp.__remove_ino_entry
0.10 ± 31% +0.1 0.17 ± 27% perf-profile.children.cycles-pp.idle_cpu
0.08 ± 34% +0.1 0.15 ± 10% perf-profile.children.cycles-pp.sb_mark_inode_writeback
0.09 ± 26% +0.1 0.16 ± 11% perf-profile.children.cycles-pp.inc_valid_block_count
0.04 ± 45% +0.1 0.12 ± 11% perf-profile.children.cycles-pp.f2fs_need_inode_block_update
0.11 ± 18% +0.1 0.18 ± 13% perf-profile.children.cycles-pp.f2fs_try_to_free_nats
0.00 +0.1 0.08 ± 12% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.00 +0.1 0.08 ± 17% perf-profile.children.cycles-pp.__get_segment_type
0.20 ± 21% +0.1 0.27 ± 17% perf-profile.children.cycles-pp.f2fs_init_acl
0.07 ± 24% +0.1 0.15 ± 17% perf-profile.children.cycles-pp.__xa_set_mark
0.01 ±223% +0.1 0.09 ± 9% perf-profile.children.cycles-pp.inc_valid_node_count
0.17 ± 15% +0.1 0.25 ± 12% perf-profile.children.cycles-pp.f2fs_is_checkpointed_node
0.18 ± 24% +0.1 0.26 ± 16% perf-profile.children.cycles-pp.__f2fs_get_acl
0.05 ± 55% +0.1 0.14 ± 11% perf-profile.children.cycles-pp.sb_clear_inode_writeback
0.00 +0.1 0.08 ± 13% perf-profile.children.cycles-pp.f2fs_lookup_extent_cache
0.20 ± 12% +0.1 0.28 ± 9% perf-profile.children.cycles-pp.f2fs_update_inode
0.14 ± 19% +0.1 0.22 ± 15% perf-profile.children.cycles-pp.f2fs_reserve_new_blocks
0.17 ± 26% +0.1 0.26 ± 16% perf-profile.children.cycles-pp.f2fs_getxattr
0.16 ± 26% +0.1 0.24 ± 16% perf-profile.children.cycles-pp.lookup_all_xattrs
0.12 ± 17% +0.1 0.21 ± 5% perf-profile.children.cycles-pp.f2fs_inode_synced
0.08 ± 26% +0.1 0.16 ± 13% perf-profile.children.cycles-pp.sched_ttwu_pending
0.02 ±143% +0.1 0.11 ± 9% perf-profile.children.cycles-pp.__init_nat_entry
0.16 ± 10% +0.1 0.25 ± 22% perf-profile.children.cycles-pp.__is_cp_guaranteed
0.09 ± 15% +0.1 0.18 ± 14% perf-profile.children.cycles-pp.f2fs_alloc_nid_done
0.21 ± 18% +0.1 0.30 ± 6% perf-profile.children.cycles-pp.__grab_extent_tree
0.11 ± 19% +0.1 0.20 ± 16% perf-profile.children.cycles-pp.folio_account_dirtied
0.07 ± 23% +0.1 0.18 ± 10% perf-profile.children.cycles-pp.f2fs_alloc_nid
0.02 ±149% +0.1 0.13 ± 9% perf-profile.children.cycles-pp.f2fs_is_valid_blkaddr
0.09 ± 31% +0.1 0.20 ± 19% perf-profile.children.cycles-pp.update_segment_mtime
0.22 ± 20% +0.1 0.33 ± 8% perf-profile.children.cycles-pp.f2fs_init_extent_tree
0.13 ± 23% +0.1 0.25 ± 21% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.10 ± 45% +0.1 0.23 ± 14% perf-profile.children.cycles-pp.osq_lock
0.24 ± 18% +0.1 0.37 ± 8% perf-profile.children.cycles-pp.f2fs_mark_inode_dirty_sync
0.28 ± 21% +0.1 0.41 ± 4% perf-profile.children.cycles-pp.wake_up_q
0.15 ± 14% +0.1 0.30 ± 6% perf-profile.children.cycles-pp.f2fs_update_dirty_folio
0.27 ± 15% +0.2 0.42 ± 4% perf-profile.children.cycles-pp.mutex_lock
0.30 ± 8% +0.2 0.45 ± 6% perf-profile.children.cycles-pp.up_read
0.26 ± 20% +0.2 0.41 ± 7% perf-profile.children.cycles-pp.f2fs_inode_dirtied
0.21 ± 40% +0.2 0.38 ± 11% perf-profile.children.cycles-pp.f2fs_get_read_data_page
0.30 ± 23% +0.2 0.47 ± 3% perf-profile.children.cycles-pp.rwsem_wake
0.01 ±223% +0.2 0.18 ± 18% perf-profile.children.cycles-pp.flush_smp_call_function_queue
0.78 ± 6% +0.2 0.96 ± 11% perf-profile.children.cycles-pp.__percpu_counter_sum
0.05 ± 94% +0.2 0.24 ± 16% perf-profile.children.cycles-pp.f2fs_inode_chksum_verify
0.19 ± 37% +0.2 0.39 ± 9% perf-profile.children.cycles-pp.rwsem_spin_on_owner
0.23 ± 22% +0.2 0.44 ± 14% perf-profile.children.cycles-pp.f2fs_dirty_node_folio
0.24 ± 19% +0.2 0.45 ± 15% perf-profile.children.cycles-pp.__folio_mark_dirty
0.20 ± 26% +0.2 0.42 ± 9% perf-profile.children.cycles-pp.__lookup_nat_cache
0.18 ± 37% +0.2 0.39 ± 15% perf-profile.children.cycles-pp.update_sit_entry
0.09 ± 57% +0.2 0.30 ± 13% perf-profile.children.cycles-pp.has_not_enough_free_secs
0.29 ± 18% +0.2 0.51 ± 11% perf-profile.children.cycles-pp.f2fs_update_inode_page
0.29 ± 23% +0.2 0.50 ± 12% perf-profile.children.cycles-pp._raw_spin_trylock
0.27 ± 21% +0.2 0.49 ± 13% perf-profile.children.cycles-pp.f2fs_map_blocks
0.16 ± 33% +0.2 0.38 ± 14% perf-profile.children.cycles-pp.read_node_page
0.80 ± 6% +0.2 1.03 ± 11% perf-profile.children.cycles-pp.f2fs_space_for_roll_forward
0.68 ± 9% +0.2 0.91 ± 12% perf-profile.children.cycles-pp.xas_load
0.32 ± 19% +0.2 0.56 ± 11% perf-profile.children.cycles-pp.f2fs_write_inode
0.32 ± 24% +0.2 0.56 ± 8% perf-profile.children.cycles-pp.down_write
0.28 ± 21% +0.3 0.53 ± 6% perf-profile.children.cycles-pp.__radix_tree_lookup
0.37 ± 10% +0.3 0.63 ± 8% perf-profile.children.cycles-pp.f2fs_dirty_data_folio
0.48 ± 9% +0.3 0.76 ± 11% perf-profile.children.cycles-pp.f2fs_convert_inline_inode
0.38 ± 17% +0.3 0.68 ± 13% perf-profile.children.cycles-pp.filemap_dirty_folio
0.46 ± 16% +0.3 0.77 ± 9% perf-profile.children.cycles-pp.__folio_end_writeback
0.34 ± 22% +0.4 0.70 ± 8% perf-profile.children.cycles-pp.f2fs_get_node_info
0.69 ± 18% +0.4 1.07 ± 11% perf-profile.children.cycles-pp.f2fs_get_dnode_of_data
0.29 ± 36% +0.4 0.67 ± 12% perf-profile.children.cycles-pp.f2fs_balance_fs
0.63 ± 10% +0.4 1.04 ± 7% perf-profile.children.cycles-pp.down_read
0.40 ± 30% +0.4 0.82 ± 11% perf-profile.children.cycles-pp.rwsem_optimistic_spin
0.54 ± 17% +0.4 0.97 ± 7% perf-profile.children.cycles-pp.set_node_addr
0.42 ± 31% +0.4 0.87 ± 11% perf-profile.children.cycles-pp.rwsem_down_write_slowpath
0.86 ± 12% +0.5 1.32 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
1.60 ± 10% +0.5 2.06 ± 7% perf-profile.children.cycles-pp.f2fs_new_inode_page
1.59 ± 10% +0.5 2.06 ± 7% perf-profile.children.cycles-pp.f2fs_new_node_page
0.72 ± 17% +0.5 1.19 ± 8% perf-profile.children.cycles-pp.f2fs_submit_page_write
3.68 ± 6% +0.5 4.16 ± 8% perf-profile.children.cycles-pp.f2fs_do_add_link
0.79 ± 11% +0.5 1.27 ± 11% perf-profile.children.cycles-pp.f2fs_preallocate_blocks
3.56 ± 7% +0.5 4.07 ± 8% perf-profile.children.cycles-pp.f2fs_add_regular_entry
3.56 ± 7% +0.5 4.07 ± 8% perf-profile.children.cycles-pp.f2fs_add_dentry
1.81 ± 11% +0.6 2.36 ± 8% perf-profile.children.cycles-pp.f2fs_init_inode_metadata
0.74 ± 25% +0.6 1.36 ± 8% perf-profile.children.cycles-pp.f2fs_do_write_node_page
0.90 ± 19% +0.7 1.56 ± 11% perf-profile.children.cycles-pp.__get_node_page
0.71 ± 42% +1.0 1.68 ± 35% perf-profile.children.cycles-pp.timekeeping_max_deferment
5.88 ± 7% +1.0 6.85 ± 8% perf-profile.children.cycles-pp.f2fs_create
1.13 ± 23% +1.0 2.13 ± 9% perf-profile.children.cycles-pp.f2fs_allocate_data_block
1.60 ± 16% +1.1 2.67 ± 8% perf-profile.children.cycles-pp.f2fs_outplace_write_data
1.48 ± 23% +1.2 2.70 ± 7% perf-profile.children.cycles-pp.__write_node_page
2.77 ± 22% +1.2 4.00 ± 18% perf-profile.children.cycles-pp.menu_select
5.21 ± 16% +1.3 6.46 ± 9% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.49 ± 30% +1.3 2.76 ± 30% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
1.24 ± 33% +1.3 2.52 ± 33% perf-profile.children.cycles-pp.tick_nohz_next_event
5.12 ± 16% +1.3 6.40 ± 9% perf-profile.children.cycles-pp.hrtimer_interrupt
2.02 ± 19% +1.6 3.60 ± 8% perf-profile.children.cycles-pp.do_write_page
6.28 ± 8% +1.6 7.91 ± 10% perf-profile.children.cycles-pp.file_write_and_wait_range
2.51 ± 16% +1.7 4.16 ± 8% perf-profile.children.cycles-pp.f2fs_do_write_data_page
2.67 ± 17% +1.8 4.50 ± 8% perf-profile.children.cycles-pp.f2fs_write_single_data_page
4.35 ± 13% +1.9 6.25 ± 9% perf-profile.children.cycles-pp.do_writepages
4.26 ± 13% +1.9 6.19 ± 9% perf-profile.children.cycles-pp.filemap_fdatawrite_wbc
3.87 ± 14% +1.9 5.81 ± 9% perf-profile.children.cycles-pp.f2fs_write_cache_pages
4.24 ± 13% +1.9 6.18 ± 9% perf-profile.children.cycles-pp.__f2fs_write_data_pages
8.12 ± 16% +2.0 10.09 ± 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
7.37 ± 12% -2.7 4.65 ± 20% perf-profile.self.cycles-pp.poll_idle
1.52 ± 12% -0.9 0.62 ± 16% perf-profile.self.cycles-pp.find_get_pages_range_tag
0.90 ± 15% -0.3 0.58 ± 16% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.56 ± 21% -0.3 0.25 ± 13% perf-profile.self.cycles-pp.f2fs_new_node_page
0.69 ± 6% -0.3 0.43 ± 2% perf-profile.self.cycles-pp.xas_find_marked
0.40 ± 15% -0.2 0.19 ± 19% perf-profile.self.cycles-pp.f2fs_write_end
0.25 ± 29% -0.1 0.14 ± 21% perf-profile.self.cycles-pp.release_pages
0.11 ± 16% -0.1 0.04 ± 71% perf-profile.self.cycles-pp.inode_init_once
0.18 ± 23% -0.1 0.11 ± 26% perf-profile.self.cycles-pp.nvme_poll_cq
0.24 ± 7% -0.1 0.18 ± 20% perf-profile.self.cycles-pp.__schedule
0.15 ± 11% -0.1 0.09 ± 6% perf-profile.self.cycles-pp.memcg_slab_post_alloc_hook
0.20 ± 12% -0.1 0.14 ± 19% perf-profile.self.cycles-pp.kmem_cache_alloc
0.11 ± 16% -0.1 0.06 ± 56% perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.14 ± 11% -0.0 0.10 ± 16% perf-profile.self.cycles-pp.do_idle
0.15 ± 8% -0.0 0.11 ± 16% perf-profile.self.cycles-pp.update_rq_clock
0.07 ± 14% -0.0 0.03 ±100% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.09 ± 27% -0.0 0.06 ± 23% perf-profile.self.cycles-pp.iostat_update_and_unbind_ctx
0.10 ± 7% -0.0 0.07 ± 17% perf-profile.self.cycles-pp.perf_output_begin_forward
0.07 ± 15% +0.0 0.10 ± 15% perf-profile.self.cycles-pp.page_counter_charge
0.08 ± 20% +0.0 0.10 ± 13% perf-profile.self.cycles-pp.f2fs_do_write_data_page
0.07 ± 18% +0.0 0.10 ± 23% perf-profile.self.cycles-pp.xas_set_mark
0.06 ± 49% +0.0 0.10 ± 5% perf-profile.self.cycles-pp.rq_qos_wait
0.07 ± 48% +0.0 0.11 ± 9% perf-profile.self.cycles-pp.wbt_rqw_done
0.07 ± 26% +0.0 0.12 ± 16% perf-profile.self.cycles-pp.f2fs_write_single_data_page
0.08 ± 27% +0.0 0.12 ± 9% perf-profile.self.cycles-pp.f2fs_wait_on_node_pages_writeback
0.04 ± 73% +0.0 0.08 ± 11% perf-profile.self.cycles-pp.queue_delayed_work_on
0.11 ± 13% +0.0 0.16 ± 12% perf-profile.self.cycles-pp.f2fs_fsync_node_pages
0.09 ± 23% +0.1 0.15 ± 12% perf-profile.self.cycles-pp.__folio_start_writeback
0.01 ±223% +0.1 0.06 ± 11% perf-profile.self.cycles-pp.rwsem_mark_wake
0.08 ± 22% +0.1 0.14 ± 14% perf-profile.self.cycles-pp.read_node_page
0.08 ± 23% +0.1 0.14 ± 10% perf-profile.self.cycles-pp.f2fs_inode_dirtied
0.06 ± 17% +0.1 0.12 ± 12% perf-profile.self.cycles-pp.f2fs_create
0.01 ±223% +0.1 0.07 ± 18% perf-profile.self.cycles-pp.apparmor_file_alloc_security
0.01 ±223% +0.1 0.06 ± 14% perf-profile.self.cycles-pp.__set_nat_cache_dirty
0.00 +0.1 0.06 ± 19% perf-profile.self.cycles-pp.f2fs_write_begin
0.03 ±100% +0.1 0.09 ± 20% perf-profile.self.cycles-pp.apparmor_file_open
0.08 ± 26% +0.1 0.14 ± 13% perf-profile.self.cycles-pp.__mark_inode_dirty
0.00 +0.1 0.06 ± 17% perf-profile.self.cycles-pp.f2fs_space_for_roll_forward
0.00 +0.1 0.06 ± 17% perf-profile.self.cycles-pp.f2fs_convert_inline_inode
0.00 +0.1 0.06 ± 14% perf-profile.self.cycles-pp.list_lru_add
0.00 +0.1 0.06 ± 14% perf-profile.self.cycles-pp.f2fs_submit_merged_ipu_write
0.18 ± 19% +0.1 0.25 ± 19% perf-profile.self.cycles-pp.__intel_pmu_enable_all
0.01 ±223% +0.1 0.08 ± 12% perf-profile.self.cycles-pp.folio_add_lru
0.09 ± 13% +0.1 0.16 ± 3% perf-profile.self.cycles-pp.f2fs_update_dirty_folio
0.00 +0.1 0.08 ± 11% perf-profile.self.cycles-pp.f2fs_lookup_extent_cache
0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.f2fs_convert_inline_page
0.03 ±103% +0.1 0.11 ± 20% perf-profile.self.cycles-pp.folio_account_dirtied
0.03 ±105% +0.1 0.11 ± 17% perf-profile.self.cycles-pp.__write_node_page
0.03 ±103% +0.1 0.12 ± 15% perf-profile.self.cycles-pp.__attach_extent_node
0.10 ± 25% +0.1 0.18 ± 18% perf-profile.self.cycles-pp.__folio_end_writeback
0.10 ± 21% +0.1 0.18 ± 10% perf-profile.self.cycles-pp.f2fs_submit_page_write
0.16 ± 10% +0.1 0.25 ± 22% perf-profile.self.cycles-pp.__is_cp_guaranteed
0.04 ± 77% +0.1 0.13 ± 25% perf-profile.self.cycles-pp.f2fs_do_sync_file
0.15 ± 21% +0.1 0.24 ± 12% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.02 ±149% +0.1 0.11 ± 25% perf-profile.self.cycles-pp.__submit_merged_write_cond
0.11 ± 10% +0.1 0.20 ± 11% perf-profile.self.cycles-pp.f2fs_write_end_io
0.02 ±149% +0.1 0.13 ± 9% perf-profile.self.cycles-pp.f2fs_is_valid_blkaddr
0.02 ±223% +0.1 0.12 ± 23% perf-profile.self.cycles-pp.f2fs_get_node_page
0.08 ± 37% +0.1 0.19 ± 15% perf-profile.self.cycles-pp.update_segment_mtime
0.11 ± 22% +0.1 0.23 ± 21% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.10 ± 43% +0.1 0.22 ± 15% perf-profile.self.cycles-pp.osq_lock
0.02 ±223% +0.1 0.14 ± 21% perf-profile.self.cycles-pp.f2fs_balance_fs
0.23 ± 20% +0.1 0.37 ± 5% perf-profile.self.cycles-pp.mutex_lock
0.09 ± 42% +0.1 0.24 ± 12% perf-profile.self.cycles-pp.f2fs_get_node_info
0.29 ± 9% +0.2 0.45 ± 7% perf-profile.self.cycles-pp.up_read
0.55 ± 7% +0.2 0.74 ± 9% perf-profile.self.cycles-pp.xas_load
0.04 ±125% +0.2 0.22 ± 16% perf-profile.self.cycles-pp.f2fs_inode_chksum_verify
0.16 ± 42% +0.2 0.35 ± 9% perf-profile.self.cycles-pp.rwsem_spin_on_owner
0.17 ± 38% +0.2 0.37 ± 13% perf-profile.self.cycles-pp.update_sit_entry
0.09 ± 57% +0.2 0.30 ± 15% perf-profile.self.cycles-pp.has_not_enough_free_secs
0.28 ± 25% +0.2 0.50 ± 13% perf-profile.self.cycles-pp._raw_spin_trylock
0.27 ± 21% +0.3 0.52 ± 5% perf-profile.self.cycles-pp.__radix_tree_lookup
0.25 ± 22% +0.3 0.50 ± 9% perf-profile.self.cycles-pp.down_write
0.11 ± 46% +0.3 0.39 ± 16% perf-profile.self.cycles-pp.f2fs_allocate_data_block
0.12 ± 55% +0.3 0.40 ± 22% perf-profile.self.cycles-pp.__get_node_page
0.78 ± 9% +0.3 1.11 ± 4% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.53 ± 14% +0.4 0.94 ± 7% perf-profile.self.cycles-pp.down_read
1.41 ± 7% +0.5 1.92 ± 5% perf-profile.self.cycles-pp._raw_spin_lock
0.71 ± 43% +1.0 1.68 ± 35% perf-profile.self.cycles-pp.timekeeping_max_deferment




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (65.30 kB)
config-5.18.0-rc5-00021-gbb2dee337bd7 (165.11 kB)
job-script (8.04 kB)
job.yaml (5.40 kB)
reproduce (287.00 B)

2022-05-18 09:44:13

by Peter Zijlstra

Subject: Re: [PATCH 3/4] sched/numa: Apply imbalance limitations consistently

On Wed, May 11, 2022 at 03:30:37PM +0100, Mel Gorman wrote:

> @@ -9108,6 +9108,24 @@ static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
> return running <= imb_numa_nr;
> }
>
> +#define NUMA_IMBALANCE_MIN 2
> +
> +static inline long adjust_numa_imbalance(int imbalance,
> + int dst_running, int imb_numa_nr)
> +{
> + if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> + return imbalance;
> +
> + /*
> + * Allow a small imbalance based on a simple pair of communicating
> + * tasks that remain local when the destination is lightly loaded.
> + */
> + if (imbalance <= NUMA_IMBALANCE_MIN)
> + return 0;
> +
> + return imbalance;
> +}

> @@ -9334,24 +9356,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
> }
>
> -#define NUMA_IMBALANCE_MIN 2
> -
> -static inline long adjust_numa_imbalance(int imbalance,
> - int dst_running, int imb_numa_nr)
> -{
> - if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> - return imbalance;
> -
> - /*
> - * Allow a small imbalance based on a simple pair of communicating
> - * tasks that remain local when the destination is lightly loaded.
> - */
> - if (imbalance <= NUMA_IMBALANCE_MIN)
> - return 0;
> -
> - return imbalance;
> -}

If we're going to move that one up and remove the only other caller of
allow_numa_imbalance() we might as well move it up further still and
fold the functions.

Hmm?

(Although I do wonder about that 25% figure in the comment; that doesn't
seem to relate to any actual code anymore)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1536,8 +1536,29 @@ struct task_numa_env {

static unsigned long cpu_load(struct rq *rq);
static unsigned long cpu_runnable(struct rq *rq);
-static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int imb_numa_nr);
+
+#define NUMA_IMBALANCE_MIN 2
+
+static inline long
+adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
+{
+ /*
+ * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
+ * This is an approximation as the number of running tasks may not be
+ * related to the number of busy CPUs due to sched_setaffinity.
+ */
+ if (dst_running > imb_numa_nr)
+ return imbalance;
+
+ /*
+ * Allow a small imbalance based on a simple pair of communicating
+ * tasks that remain local when the destination is lightly loaded.
+ */
+ if (imbalance <= NUMA_IMBALANCE_MIN)
+ return 0;
+
+ return imbalance;
+}

static inline enum
numa_type numa_classify(unsigned int imbalance_pct,
@@ -9099,16 +9120,6 @@ static bool update_pick_idlest(struct sc
}

/*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
- */
-static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
-{
- return running <= imb_numa_nr;
-}
-
-/*
* find_idlest_group() finds and returns the least busy CPU group within the
* domain.
*
@@ -9245,8 +9256,12 @@ find_idlest_group(struct sched_domain *s
* allowed. If there is a real need of migration,
* periodic load balance will take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
+ if (!adjust_numa_imbalance(imbalance,
+ local_sgs.sum_nr_running + 1,
+ sd->imb_numa_nr)) {
return NULL;
+ }
}

/*
@@ -9334,24 +9349,6 @@ static inline void update_sd_lb_stats(st
}
}

-#define NUMA_IMBALANCE_MIN 2
-
-static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int imb_numa_nr)
-{
- if (!allow_numa_imbalance(dst_running, imb_numa_nr))
- return imbalance;
-
- /*
- * Allow a small imbalance based on a simple pair of communicating
- * tasks that remain local when the destination is lightly loaded.
- */
- if (imbalance <= NUMA_IMBALANCE_MIN)
- return 0;
-
- return imbalance;
-}
-
/**
* calculate_imbalance - Calculate the amount of imbalance present within the
* groups of a given sched_domain during load balance.
@@ -9436,7 +9433,7 @@ static inline void calculate_imbalance(s
*/
env->migration_type = migrate_task;
lsub_positive(&nr_diff, local->sum_nr_running);
- env->imbalance = nr_diff >> 1;
+ env->imbalance = nr_diff;
} else {

/*
@@ -9444,16 +9441,20 @@ static inline void calculate_imbalance(s
* idle cpus.
*/
env->migration_type = migrate_task;
- env->imbalance = max_t(long, 0, (local->idle_cpus -
- busiest->idle_cpus) >> 1);
+ env->imbalance = max_t(long, 0,
+ (local->idle_cpus - busiest->idle_cpus));
}

/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- local->sum_nr_running + 1, env->sd->imb_numa_nr);
+ local->sum_nr_running + 1,
+ env->sd->imb_numa_nr);
}

+ /* Number of tasks to move to restore balance */
+ env->imbalance >>= 1;
+
return;
}



2022-05-18 10:57:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/numa: Apply imbalance limitations consistently

On Wed, May 18, 2022 at 11:31:56AM +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 03:30:37PM +0100, Mel Gorman wrote:
>
> > @@ -9108,6 +9108,24 @@ static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
> > return running <= imb_numa_nr;
> > }
> >
> > +#define NUMA_IMBALANCE_MIN 2
> > +
> > +static inline long adjust_numa_imbalance(int imbalance,
> > + int dst_running, int imb_numa_nr)
> > +{
> > + if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > + return imbalance;
> > +
> > + /*
> > + * Allow a small imbalance based on a simple pair of communicating
> > + * tasks that remain local when the destination is lightly loaded.
> > + */
> > + if (imbalance <= NUMA_IMBALANCE_MIN)
> > + return 0;
> > +
> > + return imbalance;
> > +}
>
> > @@ -9334,24 +9356,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > }
> > }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> > -static inline long adjust_numa_imbalance(int imbalance,
> > - int dst_running, int imb_numa_nr)
> > -{
> > - if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > - return imbalance;
> > -
> > - /*
> > - * Allow a small imbalance based on a simple pair of communicating
> > - * tasks that remain local when the destination is lightly loaded.
> > - */
> > - if (imbalance <= NUMA_IMBALANCE_MIN)
> > - return 0;
> > -
> > - return imbalance;
> > -}
>
> If we're going to move that one up and remove the only other caller of
> allow_numa_imbalance() we might as well move it up further still and
> fold the functions.
>
> Hmm?
>

Yes, that would be fine and makes sense. I remember thinking that they
should be folded and then failed to follow through.

> (Although I do wonder about that 25% figure in the comment; that doesn't
> seem to relate to any actual code anymore)
>

You're right, by the end of the series it's completely inaccurate and
currently it's not accurate if there are multiple LLCs per node. I
adjusted the wording to "Allow a NUMA imbalance if busy CPUs is less
than the maximum threshold. Above this threshold, individual tasks may
be contending for both memory bandwidth and any shared HT resources."

Diff between v1 and v2 is now below

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 602c05b22805..51fde61ec756 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1536,8 +1536,31 @@ struct task_numa_env {

static unsigned long cpu_load(struct rq *rq);
static unsigned long cpu_runnable(struct rq *rq);
-static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int imb_numa_nr);
+
+#define NUMA_IMBALANCE_MIN 2
+
+static inline long
+adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
+{
+ /*
+ * Allow a NUMA imbalance if busy CPUs is less than the maximum
+ * threshold. Above this threshold, individual tasks may be contending
+ * for both memory bandwidth and any shared HT resources. This is an
+ * approximation as the number of running tasks may not be related to
+ * the number of busy CPUs due to sched_setaffinity.
+ */
+ if (dst_running > imb_numa_nr)
+ return imbalance;
+
+ /*
+ * Allow a small imbalance based on a simple pair of communicating
+ * tasks that remain local when the destination is lightly loaded.
+ */
+ if (imbalance <= NUMA_IMBALANCE_MIN)
+ return 0;
+
+ return imbalance;
+}

static inline enum
numa_type numa_classify(unsigned int imbalance_pct,
@@ -9098,34 +9121,6 @@ static bool update_pick_idlest(struct sched_group *idlest,
return true;
}

-/*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
- */
-static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
-{
- return running <= imb_numa_nr;
-}
-
-#define NUMA_IMBALANCE_MIN 2
-
-static inline long adjust_numa_imbalance(int imbalance,
- int dst_running, int imb_numa_nr)
-{
- if (!allow_numa_imbalance(dst_running, imb_numa_nr))
- return imbalance;
-
- /*
- * Allow a small imbalance based on a simple pair of communicating
- * tasks that remain local when the destination is lightly loaded.
- */
- if (imbalance <= NUMA_IMBALANCE_MIN)
- return 0;
-
- return imbalance;
-}
-
/*
* find_idlest_group() finds and returns the least busy CPU group within the
* domain.
@@ -9448,14 +9443,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* idle cpus.
*/
env->migration_type = migrate_task;
- env->imbalance = max_t(long, 0, (local->idle_cpus -
- busiest->idle_cpus));
+ env->imbalance = max_t(long, 0,
+ (local->idle_cpus - busiest->idle_cpus));
}

/* Consider allowing a small imbalance between NUMA groups */
if (env->sd->flags & SD_NUMA) {
env->imbalance = adjust_numa_imbalance(env->imbalance,
- local->sum_nr_running + 1, env->sd->imb_numa_nr);
+ local->sum_nr_running + 1,
+ env->sd->imb_numa_nr);
}

/* Number of tasks to move to restore balance */

2022-05-18 14:03:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/numa: Apply imbalance limitations consistently

On Wed, May 18, 2022 at 11:46:52AM +0100, Mel Gorman wrote:

> > (Although I do wonder about that 25% figure in the comment; that doesn't
> > seem to relate to any actual code anymore)
> >
>
> You're right, by the end of the series it's completely inaccurate and
> currently it's not accurate if there are multiple LLCs per node. I
> adjusted the wording to "Allow a NUMA imbalance if busy CPUs is less
> than the maximum threshold. Above this threshold, individual tasks may
> be contending for both memory bandwidth and any shared HT resources."
>

Looks good. Meanwhile I saw a 0-day complaint that this regresses
something something unixbench by a bit. Do we care enough? I suppose
this is one of those trade-off patches again, win some, lose some.

2022-05-18 15:25:29

by Mel Gorman

[permalink] [raw]
Subject: Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression

On Wed, May 18, 2022 at 05:24:14PM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -11.2% regression of unixbench.score due to commit:
>
>
> commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
> url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
> patch link: https://lore.kernel.org/lkml/[email protected]
>
> in testcase: unixbench
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> with following parameters:
>
> runtime: 300s
> nr_task: 1
> test: shell8
> cpufreq_governor: performance
> ucode: 0xd000331
>
> test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> test-url: https://github.com/kdlucas/byte-unixbench

I think what is happening for unixbench is that it prefers to run all
instances on a local node if possible. shell8 is creating 8 scripts,
each of which spawn more processes. The total number of tasks may exceed
the allowed imbalance at fork time of 16 tasks. Some spill over to a
remote node and as they are using files, some accesses are remote and it
suffers. It's not memory bandwidth bound but is sensitive to locality.
The stats somewhat support this idea

> 83590 ± 13% -73.7% 21988 ± 32% numa-meminfo.node0.AnonHugePages
> 225657 ± 18% -58.0% 94847 ± 18% numa-meminfo.node0.AnonPages
> 231652 ± 17% -55.3% 103657 ± 16% numa-meminfo.node0.AnonPages.max
> 234525 ± 17% -55.5% 104341 ± 18% numa-meminfo.node0.Inactive
> 234397 ± 17% -55.5% 104267 ± 18% numa-meminfo.node0.Inactive(anon)
> 11724 ± 7% +17.5% 13781 ± 5% numa-meminfo.node0.KernelStack
> 4472 ± 34% +117.1% 9708 ± 31% numa-meminfo.node0.PageTables
> 15239 ± 75% +401.2% 76387 ± 10% numa-meminfo.node1.AnonHugePages
> 67256 ± 63% +206.3% 205994 ± 6% numa-meminfo.node1.AnonPages
> 73568 ± 58% +193.1% 215644 ± 6% numa-meminfo.node1.AnonPages.max
> 75737 ± 53% +183.9% 215053 ± 6% numa-meminfo.node1.Inactive
> 75709 ± 53% +183.9% 214971 ± 6% numa-meminfo.node1.Inactive(anon)
> 3559 ± 42% +187.1% 10216 ± 8% numa-meminfo.node1.PageTables

There is less memory used on one node and more on the other so it's
getting split.

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | fsmark: fsmark.files_per_sec -21.5% regression |
> | test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | disk=1SSD |
> | | filesize=8K |
> | | fs=f2fs |
> | | iterations=8 |
> | | nr_directories=16d |
> | | nr_files_per_directory=256fpd |
> | | nr_threads=4 |
> | | sync_method=fsyncBeforeClose |
> | | test_size=72G |
> | | ucode=0x500320a |
> +------------------+-------------------------------------------------------------------------------------+
>

It's less clearcut for this from the stats but it's likely getting split
too and had preferred locality. It's curious that f2fs is affected but
maybe other filesystems were too.

In both cases, the workloads are not memory bandwidth limited so prefer to
stack on one node and previously, because they were cache hot, the load
balancer would avoid splitting them apart if there were other candidates
available.

This is a tradeoff between loads that want to stick on one node for
locality because they are not bandwidth limited and workloads that are
memory bandwidth limited and want to spread wide. We can't tell what
type of workload it is at fork time.

Given there is no crystal ball and it's a tradeoff, I think it's better
to be consistent and use similar logic at both fork time and runtime even
if it doesn't have universal benefit.

--
Mel Gorman
SUSE Labs

2022-05-18 15:46:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/numa: Apply imbalance limitations consistently

On Wed, May 18, 2022 at 03:59:34PM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2022 at 11:46:52AM +0100, Mel Gorman wrote:
>
> > > (Although I do wonder about that 25% figure in the comment; that doesn't
> > > seem to relate to any actual code anymore)
> > >
> >
> > You're right, by the end of the series it's completely inaccurate and
> > currently it's not accurate if there are multiple LLCs per node. I
> > adjusted the wording to "Allow a NUMA imbalance if busy CPUs is less
> > than the maximum threshold. Above this threshold, individual tasks may
> > be contending for both memory bandwidth and any shared HT resources."
> >
>
> Looks good. Meanwhile I saw a 0-day complaint that this regresses
> something something unixbench by a bit. Do we care enough? I suppose
> this is one of those trade-off patches again, win some, lose some.

I think it's a trade-off. I made a more complete response to the 0-day
people at https://lore.kernel.org/all/[email protected]/

--
Mel Gorman
SUSE Labs

2022-05-19 09:20:38

by Huang, Ying

[permalink] [raw]
Subject: Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression

Hi, Mel,

On Wed, 2022-05-18 at 16:22 +0100, Mel Gorman wrote:
> On Wed, May 18, 2022 at 05:24:14PM +0800, kernel test robot wrote:
> >
> >
> > Greeting,
> >
> > FYI, we noticed a -11.2% regression of unixbench.score due to commit:
> >
> >
> > commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
> > url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
> > patch link: https://lore.kernel.org/lkml/[email protected]
> >
> > in testcase: unixbench
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 1
> > test: shell8
> > cpufreq_governor: performance
> > ucode: 0xd000331
> >
> > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > test-url: https://github.com/kdlucas/byte-unixbench
>
> I think what is happening for unixbench is that it prefers to run all
> instances on a local node if possible. shell8 is creating 8 scripts,
> each of which spawn more processes. The total number of tasks may exceed
> the allowed imbalance at fork time of 16 tasks. Some spill over to a
> remote node and as they are using files, some accesses are remote and it
> suffers. It's not memory bandwidth bound but is sensitive to locality.
> The stats somewhat support this idea
>
> >      83590 ± 13% -73.7% 21988 ± 32% numa-meminfo.node0.AnonHugePages
> >     225657 ± 18% -58.0% 94847 ± 18% numa-meminfo.node0.AnonPages
> >     231652 ± 17% -55.3% 103657 ± 16% numa-meminfo.node0.AnonPages.max
> >     234525 ± 17% -55.5% 104341 ± 18% numa-meminfo.node0.Inactive
> >     234397 ± 17% -55.5% 104267 ± 18% numa-meminfo.node0.Inactive(anon)
> >      11724 ± 7% +17.5% 13781 ± 5% numa-meminfo.node0.KernelStack
> >       4472 ± 34% +117.1% 9708 ± 31% numa-meminfo.node0.PageTables
> >      15239 ± 75% +401.2% 76387 ± 10% numa-meminfo.node1.AnonHugePages
> >      67256 ± 63% +206.3% 205994 ± 6% numa-meminfo.node1.AnonPages
> >      73568 ± 58% +193.1% 215644 ± 6% numa-meminfo.node1.AnonPages.max
> >      75737 ± 53% +183.9% 215053 ± 6% numa-meminfo.node1.Inactive
> >      75709 ± 53% +183.9% 214971 ± 6% numa-meminfo.node1.Inactive(anon)
> >       3559 ± 42% +187.1% 10216 ± 8% numa-meminfo.node1.PageTables
>
> There is less memory used on one node and more on the other so it's
> getting split.

This makes sense. I will also check CPU utilization per node to verify
this directly.

>
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+-------------------------------------------------------------------------------------+
> > > testcase: change | fsmark: fsmark.files_per_sec -21.5% regression |
> > > test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> > > test parameters | cpufreq_governor=performance |
> > >                  | disk=1SSD |
> > >                  | filesize=8K |
> > >                  | fs=f2fs |
> > >                  | iterations=8 |
> > >                  | nr_directories=16d |
> > >                  | nr_files_per_directory=256fpd |
> > >                  | nr_threads=4 |
> > >                  | sync_method=fsyncBeforeClose |
> > >                  | test_size=72G |
> > >                  | ucode=0x500320a |
> > +------------------+-------------------------------------------------------------------------------------+
> >
>
> It's less clearcut for this from the stats but it's likely getting split
> too and had preferred locality. It's curious that f2fs is affected but
> maybe other filesystems were too.
>
> In both cases, the workloads are not memory bandwidth limited so prefer to
> stack on one node and previously, because they were cache hot, the load
> balancer would avoid splitting them apart if there were other candidates
> available.
>
> This is a tradeoff between loads that want to stick on one node for
> locality because they are not bandwidth limited and workloads that are
> memory bandwidth limited and want to spread wide. We can't tell what
> type of workload it is at fork time.
>
> Given there is no crystal ball and it's a tradeoff, I think it's better
> to be consistent and use similar logic at both fork time and runtime even
> if it doesn't have universal benefit.
>

Thanks for the detailed explanation. So some other workloads may benefit
from this patch. Can you give me some candidates so I can test them too?

Best Regards,
Huang, Ying



2022-05-21 03:27:19

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

Hello Mel,

Thank you for looking at the results.

On 5/20/2022 3:48 PM, Mel Gorman wrote:
> On Fri, May 20, 2022 at 10:28:02AM +0530, K Prateek Nayak wrote:
>> Hello Mel,
>>
>> We tested the patch series on our systems.
>>
>> tl;dr
>>
>> Results of testing:
>> - Benefits short running Stream tasks in NPS2 and NPS4 mode.
>> - Benefits seen for tbench in NPS1 mode for 8-128 worker count.
>> - Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
>> same data point suggested run to run variation on patched kernel.
>> - Regression in case of tbench with 32 and 64 workers in NPS2 mode.
>> Patched kernel however seems to report more stable value for 64
>> worker count compared to tip.
>> - Slight regression in schbench in NPS2 and NPS4 mode for large
>> worker count but we did spot some run to run variation with
>> both tip and patched kernel.
>>
>> Below are all the detailed numbers for the benchmarks.
>>
> Thanks!
>
> I looked through the results but I do not see anything that is very
> alarming. Some notes.
>
> o Hackbench with 16 groups on NPS1, that would likely be 640 tasks
> communicating unless other parameters are used. I expect it to be
> variable and it's a heavily overloaded scenario. Initial placement is
> not necessarily critical as migrations are likely to be very high.
> On NPS1, there is going to be random luck given that the latency
> to individual CPUs and the physical topology is hidden.
I agree. On rerun, the numbers are quite close so I don't think it
is a concern currently.
> o NPS2 with 128 workers. That's at the threshold where load is
> potentially evenly split between the two sockets but not perfectly
> split due to migrate-on-wakeup being a little unpredictable. Might
> be worth checking the variability there.

For schbench, following are the stats recorded for 128 workers:

Configuration: NPS2

- tip

Min           : 357.00
Max           : 407.00
Median        : 369.00
AMean         : 376.30
AMean Stddev  : 19.15
AMean CoefVar : 5.09 pct

- NUMA Bal

Min           : 384.00
Max           : 410.00
Median        : 400.50
AMean         : 400.40
AMean Stddev  : 8.36
AMean CoefVar : 2.09 pct


Configuration: NPS4

- tip

Min           : 361.00
Max           : 399.00
Median        : 377.00
AMean         : 377.00
AMean Stddev  : 10.31
AMean CoefVar : 2.73 pct

- NUMA Bal

Min           : 379.00
Max           : 394.00
Median        : 390.50
AMean         : 388.10
AMean Stddev  : 5.55
AMean CoefVar : 1.43 pct

In the above cases, the patched kernel seems to
be giving more stable results compared to the tip.
schbench is run 10 times for each worker count to
gather these statistics.

> o Same observations for tbench. I looked at my own results for NPS1
> on Zen3 and what I see is that there is a small blip there but
> the mpstat heat map indicates that the nodes are being more evenly
> used than without the patch which is expected.
I agree. The task distribution should have improved with the patch.
Following are the stats recorded for the tbench run for 32 and 64
workers.

Configuration: NPS2

o 32 workers

- tip

Min           : 10250.10
Max           : 10721.90
Median        : 10651.00
AMean         : 10541.00
AMean Stddev  : 254.41
AMean CoefVar : 2.41 pct

- NUMA Bal

Min           : 8932.03
Max           : 10065.10
Median        : 9894.89
AMean         : 9630.67
AMean Stddev  : 611.00
AMean CoefVar : 6.34 pct

o 64 workers

- tip

Min           : 16197.20
Max           : 17175.90
Median        : 16291.20
AMean         : 16554.77
AMean Stddev  : 539.97
AMean CoefVar : 3.26 pct

- NUMA Bal

Min           : 14386.80
Max           : 16625.50
Median        : 16441.10
AMean         : 15817.80
AMean Stddev  : 1242.71
AMean CoefVar : 7.86 pct

We are observing tip to be more stable in this case.
tbench is run 3 times for a given worker count
to gather these statistics.

> o STREAM is interesting in that there are large differences between
> 10 runs and 100 runs. It indicates that without pinning,
> STREAM can be a bit variable. The problem might be similar to NAS
> as reported in the leader mail with the variability due to commit
> c6f886546cb8 for unknown reasons.
There are some cases of Stream where two Stream threads are co-located
on the same LLC, which results in a performance drop. I suspect the
patch helps in such situations by achieving a better balance much earlier.
>>> kernel/sched/fair.c | 59 ++++++++++++++++++++++++++---------------
>>> kernel/sched/topology.c | 23 ++++++++++------
>>> 2 files changed, 53 insertions(+), 29 deletions(-)
>>>
>> Please let me know if you would like me to get some additional
>> data on the test system.
> Other than checking variability, the min, max and range, I don't need
> additional data. I suspect in some cases like what I observed with NAS
> that there is wide variability for reasons independent of this series.
I've inlined the data above.
> I'm of the opinion though that your results are not a barrier for
> merging. Do you agree?
The results overall look good and shouldn't be a barrier for merging.

Tested-by: K Prateek Nayak <[email protected]>

--
Thanks and Regards,
Prateek


2022-05-22 04:03:50

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression

On Thu, 2022-05-19 at 15:54 +0800, [email protected] wrote:
> Hi, Mel,
>
> On Wed, 2022-05-18 at 16:22 +0100, Mel Gorman wrote:
> > On Wed, May 18, 2022 at 05:24:14PM +0800, kernel test robot wrote:
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -11.2% regression of unixbench.score due to commit:
> > >
> > >
> > > commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
> > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
> > > patch link: https://lore.kernel.org/lkml/[email protected]
> > >
> > > in testcase: unixbench
> > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> > > with following parameters:
> > >
> > > runtime: 300s
> > > nr_task: 1
> > > test: shell8
> > > cpufreq_governor: performance
> > > ucode: 0xd000331
> > >
> > > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > > test-url: https://github.com/kdlucas/byte-unixbench
> >
> > I think what is happening for unixbench is that it prefers to run all
> > instances on a local node if possible. shell8 is creating 8 scripts,
> > each of which spawn more processes. The total number of tasks may exceed
> > the allowed imbalance at fork time of 16 tasks. Some spill over to a
> > remote node and as they are using files, some accesses are remote and it
> > suffers. It's not memory bandwidth bound but is sensitive to locality.
> > The stats somewhat support this idea
> >
> > >      83590 ± 13% -73.7% 21988 ± 32% numa-meminfo.node0.AnonHugePages
> > >     225657 ± 18% -58.0% 94847 ± 18% numa-meminfo.node0.AnonPages
> > >     231652 ± 17% -55.3% 103657 ± 16% numa-meminfo.node0.AnonPages.max
> > >     234525 ± 17% -55.5% 104341 ± 18% numa-meminfo.node0.Inactive
> > >     234397 ± 17% -55.5% 104267 ± 18% numa-meminfo.node0.Inactive(anon)
> > >      11724 ± 7% +17.5% 13781 ± 5% numa-meminfo.node0.KernelStack
> > >       4472 ± 34% +117.1% 9708 ± 31% numa-meminfo.node0.PageTables
> > >      15239 ± 75% +401.2% 76387 ± 10% numa-meminfo.node1.AnonHugePages
> > >      67256 ± 63% +206.3% 205994 ± 6% numa-meminfo.node1.AnonPages
> > >      73568 ± 58% +193.1% 215644 ± 6% numa-meminfo.node1.AnonPages.max
> > >      75737 ± 53% +183.9% 215053 ± 6% numa-meminfo.node1.Inactive
> > >      75709 ± 53% +183.9% 214971 ± 6% numa-meminfo.node1.Inactive(anon)
> > >       3559 ± 42% +187.1% 10216 ± 8% numa-meminfo.node1.PageTables
> >
> > There is less memory used on one node and more on the other so it's
> > getting split.
>
> This makes sense. I will also check CPU utilization per node to verify
> this directly.

I run this workload 3 times for the commit and its parent with mpstat
node statistics.

For the parent commit,

"mpstat.node.0.usr%": [
0.1396875,
3.0806153846153848,
0.05303030303030303
],
"mpstat.node.0.sys%": [
0.10515625,
5.597692307692308,
0.1340909090909091
],

"mpstat.node.1.usr%": [
3.1015625,
0.1306153846153846,
3.0275757575757574
],
"mpstat.node.1.sys%": [
5.66703125,
0.11676923076923076,
5.498181818181818
],

The difference between two nodes are quite large.

For the commit,

"mpstat.node.0.usr%": [
1.42109375,
1.4725,
1.5140625
],
"mpstat.node.0.sys%": [
3.00125,
3.16390625,
3.1284375
],

"mpstat.node.1.usr%": [
1.4909375,
1.41609375,
1.3740625
],
"mpstat.node.1.sys%": [
3.1671875,
3.00109375,
3.044375
],

The difference between 2 nodes reduces greatly. So this proves your
theory directly.

Best Regards,
Huang, Ying


[snip]


2022-05-23 06:05:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

On Fri, May 20, 2022 at 10:28:02AM +0530, K Prateek Nayak wrote:
> Hello Mel,
>
> We tested the patch series on our systems.
>
> tl;dr
>
> Results of testing:
> - Benefits short running Stream tasks in NPS2 and NPS4 mode.
> - Benefits seen for tbench in NPS1 mode for 8-128 worker count.
> - Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
> same data point suggested run to run variation on patched kernel.
> - Regression in case of tbench with 32 and 64 workers in NPS2 mode.
> Patched kernel however seems to report more stable value for 64
> worker count compared to tip.
> - Slight regression in schbench in NPS2 and NPS4 mode for large
> worker count but we did spot some run to run variation with
> both tip and patched kernel.
>
> Below are all the detailed numbers for the benchmarks.
>

Thanks!

I looked through the results but I do not see anything that is very
alarming. Some notes.

o Hackbench with 16 groups on NPS1, that would likely be 640 tasks
communicating unless other parameters are used. I expect it to be
variable and it's a heavily overloaded scenario. Initial placement is
not necessarily critical as migrations are likely to be very high.
On NPS1, there is going to be random luck given that the latency
to individual CPUs and the physical topology is hidden.

o NPS2 with 128 workers. That's at the threshold where load is
potentially evenly split between the two sockets but not perfectly
split due to migrate-on-wakeup being a little unpredictable. Might
be worth checking the variability there.

o Same observations for tbench. I looked at my own results for NPS1
on Zen3 and what I see is that there is a small blip there but
the mpstat heat map indicates that the nodes are being more evenly
used than without the patch which is expected.

o STREAM is interesting in that there are large differences between
10 runs and 100 runs. It indicates that without pinning,
STREAM can be a bit variable. The problem might be similar to NAS
as reported in the leader mail with the variability due to commit
c6f886546cb8 for unknown reasons.

> >
> > kernel/sched/fair.c | 59 ++++++++++++++++++++++++++---------------
> > kernel/sched/topology.c | 23 ++++++++++------
> > 2 files changed, 53 insertions(+), 29 deletions(-)
> >
>
> Please let me know if you would like me to get some additional
> data on the test system.

Other than checking variability, the min, max and range, I don't need
additional data. I suspect in some cases like what I observed with NAS
that there is wide variability for reasons independent of this series.

I'm of the opinion though that your results are not a barrier for
merging. Do you agree?

--
Mel Gorman
SUSE Labs

2022-05-23 07:03:03

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

Hello Mel,

We tested the patch series on our systems.

tl;dr

Results of testing:
- Benefits short running Stream tasks in NPS2 and NPS4 mode.
- Benefits seen for tbench in NPS1 mode for 8-128 worker count.
- Regression in Hackbench with 16 groups in NPS1 mode. A rerun for
same data point suggested run to run variation on patched kernel.
- Regression in case of tbench with 32 and 64 workers in NPS2 mode.
Patched kernel however seems to report more stable value for 64
worker count compared to tip.
- Slight regression in schbench in NPS2 and NPS4 mode for large
worker count but we did spot some run to run variation with
both tip and patched kernel.

Below are all the detailed numbers for the benchmarks.

On 5/11/2022 8:00 PM, Mel Gorman wrote:
> A problem was reported privately related to inconsistent performance of
> NAS when parallelised with MPICH. The root of the problem is that the
> initial placement is unpredictable and there can be a larger imbalance
> than expected between NUMA nodes. As there is spare capacity and the faults
> are local, the imbalance persists for a long time and performance suffers.
>
> This is not 100% an "allowed imbalance" problem as setting the allowed
> imbalance to 0 does not fix the issue but the allowed imbalance contributes
> to the performance problem. The unpredictable behaviour was most recently
> introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of
> blocked load on newly idle cpu").
>
> mpirun forks hydra_pmi_proxy helpers with MPICH that go to sleep before
> execing the target workload. As the new tasks are sleeping, the potential
> imbalance is not observed as idle_cpus does not reflect the tasks that
> will be running in the near future. How bad the problem is depends on the
> timing of when fork happens and whether the new tasks are still running.
> Consequently, a large initial imbalance may not be detected until the
> workload is fully running. Once running, NUMA Balancing picks the preferred
> node based on locality and runtime load balancing often ignores the tasks
> as can_migrate_task() fails for either locality or task_hot reasons and
> instead picks unrelated tasks.
>
> This is the min, max and range of run time for mg.D parallelised with ~25%
> of the CPUs parallelised by MPICH running on a 2-socket machine (80 CPUs,
> 16 active for mg.D due to limitations of mg.D).
>
> v5.3 Min 95.84 Max 96.55 Range 0.71 Mean 96.16
> v5.7 Min 95.44 Max 96.51 Range 1.07 Mean 96.14
> v5.8 Min 96.02 Max 197.08 Range 101.06 Mean 154.70
> v5.12 Min 104.45 Max 111.03 Range 6.58 Mean 105.94
> v5.13 Min 104.38 Max 170.37 Range 65.99 Mean 117.35
> v5.13-revert-c6f886546cb8 Min 104.40 Max 110.70 Range 6.30 Mean 105.68
> v5.18rc4-baseline Min 104.46 Max 169.04 Range 64.58 Mean 130.49
> v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range 3.31 Mean 114.71
> v5.18rc4-this_series Min 95.24 Max 175.33 Range 80.09 Mean 108.91
> v5.18rc4-this_series+revert Min 95.24 Max 99.87 Range 4.63 Mean 96.54
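The load-balancer behaviour described above, where can_migrate_task()
rejects the NAS tasks for locality or task_hot reasons and unrelated
tasks get picked instead, can be illustrated with a simplified Python
model. This is a sketch only; the dictionary fields are invented for
illustration and this is not the kernel's actual logic:

```python
# Simplified model of why runtime load balancing skips the NAS tasks:
# a candidate is rejected if moving it would degrade NUMA locality or
# if it is cache-hot, so unrelated tasks are migrated instead.
def can_migrate(task, dst_node):
    # migrate_degrades_locality()-style gate: NUMA Balancing has
    # already settled the task on its preferred node.
    if task["preferred_node"] != dst_node and task["numa_faults_local"]:
        return False
    # task_hot()-style gate: a recently-run task is considered
    # cache-hot and left alone.
    if task["recently_ran"]:
        return False
    return True

nas_task = {"preferred_node": 0, "numa_faults_local": True,
            "recently_ran": True}
idle_helper = {"preferred_node": 1, "numa_faults_local": False,
               "recently_ran": False}
print(can_migrate(nas_task, 1))     # busy NAS task is skipped
print(can_migrate(idle_helper, 1))  # an unrelated task is eligible
```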

Following are the results from testing on a dual socket Zen3 system
(2 x 64C/128T) in different NPS modes.

Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
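The node-to-CPU layout above follows a regular pattern: each node gets
a contiguous block of cores plus their SMT siblings offset by 128. A
small sketch that derives the listing for each NPS mode, assuming this
2 x 64C/128T topology:

```python
# Derive the per-node CPU ranges for a 2-socket, 64-core-per-socket
# machine with SMT siblings at cpu+128, for NPS1/NPS2/NPS4 modes.
def nps_layout(nodes_per_socket, cores_per_socket=64, sockets=2,
               smt_offset=128):
    nodes = {}
    per_node = cores_per_socket // nodes_per_socket
    node = 0
    for s in range(sockets):
        base = s * cores_per_socket
        for n in range(nodes_per_socket):
            first = base + n * per_node
            primaries = (first, first + per_node - 1)
            siblings = (first + smt_offset,
                        first + per_node - 1 + smt_offset)
            nodes[node] = (primaries, siblings)
            node += 1
    return nodes

for node, (p, s) in nps_layout(2).items():   # NPS2
    print("Node %d: %d-%d, %d-%d" % (node, p[0], p[1], s[0], s[1]))
```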

Kernel versions:
- tip: 5.18-rc1 tip sched/core
- NUMA Bal: 5.18-rc1 tip sched/core + this patch series

When we began testing, we recorded the tip at:

commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"

Following are the results from the benchmark:

Note: Results marked with * are data points of concern. A rerun
for the data point has been provided on both the tip and the
patched kernel to check for any run to run variation.

~~~~~~~~~
hackbench
~~~~~~~~~

NPS1

Test: tip NUMA Bal
1-groups: 4.64 (0.00 pct) 4.67 (-0.64 pct)
2-groups: 5.38 (0.00 pct) 5.47 (-1.67 pct)
4-groups: 6.15 (0.00 pct) 6.24 (-1.46 pct)
8-groups: 7.42 (0.00 pct) 7.45 (-0.40 pct)
16-groups: 10.70 (0.00 pct) 12.04 (-12.52 pct) *
16-groups: 10.81 (0.00 pct) 11.00 (-1.72 pct) [Verification Run]

NPS2

Test: tip NUMA Bal
1-groups: 4.70 (0.00 pct) 4.68 (0.42 pct)
2-groups: 5.45 (0.00 pct) 5.50 (-0.91 pct)
4-groups: 6.13 (0.00 pct) 6.13 (0.00 pct)
8-groups: 7.30 (0.00 pct) 7.21 (1.23 pct)
16-groups: 10.30 (0.00 pct) 10.29 (0.09 pct)

NPS4

Test: tip NUMA Bal
1-groups: 4.60 (0.00 pct) 4.55 (1.08 pct)
2-groups: 5.41 (0.00 pct) 5.37 (0.73 pct)
4-groups: 6.12 (0.00 pct) 6.20 (-1.30 pct)
8-groups: 7.22 (0.00 pct) 7.29 (-0.96 pct)
16-groups: 10.24 (0.00 pct) 10.27 (-0.29 pct)
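The "(x pct)" deltas throughout these tables appear to be
(tip - patched) / tip * 100 for lower-is-better metrics such as
hackbench, truncated (not rounded) to two decimal places; for
throughput metrics like tbench the sign convention flips to
(patched - tip) / tip. A sketch of the calculation (the truncation
is an inference from the reported values):

```python
import math

# Percentage delta relative to the tip baseline for a lower-is-better
# metric: positive means the patched kernel improved, negative means
# a regression. Values appear truncated to two decimal places.
def pct_delta(tip, patched):
    return math.trunc((tip - patched) / tip * 100 * 100) / 100

print(pct_delta(4.64, 4.67))    # hackbench 1-groups, NPS1
print(pct_delta(10.70, 12.04))  # hackbench 16-groups, NPS1
```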

~~~~~~~~
schbench
~~~~~~~~

NPS1

#workers: tip NUMA Bal
1: 29.00 (0.00 pct) 22.50 (22.41 pct)
2: 28.00 (0.00 pct) 27.00 (3.57 pct)
4: 31.50 (0.00 pct) 32.00 (-1.58 pct)
8: 42.00 (0.00 pct) 39.50 (5.95 pct)
16: 56.50 (0.00 pct) 56.50 (0.00 pct)
32: 94.50 (0.00 pct) 95.00 (-0.52 pct)
64: 176.00 (0.00 pct) 176.00 (0.00 pct)
128: 404.00 (0.00 pct) 395.50 (2.10 pct)
256: 869.00 (0.00 pct) 856.00 (1.49 pct)
512: 58432.00 (0.00 pct) 58688.00 (-0.43 pct)

NPS2

#workers: tip NUMA Bal
1: 26.50 (0.00 pct) 26.00 (1.88 pct)
2: 26.50 (0.00 pct) 24.50 (7.54 pct)
4: 34.50 (0.00 pct) 30.50 (11.59 pct)
8: 45.00 (0.00 pct) 42.00 (6.66 pct)
16: 56.50 (0.00 pct) 55.50 (1.76 pct)
32: 95.50 (0.00 pct) 95.00 (0.52 pct)
64: 179.00 (0.00 pct) 176.00 (1.67 pct)
128: 369.00 (0.00 pct) 400.50 (-8.53 pct) *
128: 380.00 (0.00 pct) 388.00 (-2.10 pct) [Verification Run]
256: 898.00 (0.00 pct) 883.00 (1.67 pct)
512: 56256.00 (0.00 pct) 58752.00 (-4.43 pct)

NPS4

#workers: tip NUMA Bal
1: 25.00 (0.00 pct) 24.00 (4.00 pct)
2: 28.00 (0.00 pct) 27.50 (1.78 pct)
4: 29.50 (0.00 pct) 29.50 (0.00 pct)
8: 41.00 (0.00 pct) 39.00 (4.87 pct)
16: 65.50 (0.00 pct) 66.00 (-0.76 pct)
32: 93.00 (0.00 pct) 94.50 (-1.61 pct)
64: 170.50 (0.00 pct) 176.50 (-3.51 pct)
128: 377.00 (0.00 pct) 390.50 (-3.58 pct)
256: 867.00 (0.00 pct) 919.00 (-5.99 pct) *
256: 890.00 (0.00 pct) 930.00 (-4.49 pct) [Verification Run]
512: 58048.00 (0.00 pct) 59520.00 (-2.53 pct)

~~~~~~
tbench
~~~~~~

NPS1

Clients: tip NUMA Bal
1 443.31 (0.00 pct) 458.77 (3.48 pct)
2 877.32 (0.00 pct) 898.76 (2.44 pct)
4 1665.11 (0.00 pct) 1658.76 (-0.38 pct)
8 3016.68 (0.00 pct) 3133.91 (3.88 pct)
16 5374.30 (0.00 pct) 5816.28 (8.22 pct)
32 8763.86 (0.00 pct) 9843.94 (12.32 pct)
64 15786.93 (0.00 pct) 17562.26 (11.24 pct)
128 26826.08 (0.00 pct) 28241.35 (5.27 pct)
256 24207.35 (0.00 pct) 22242.20 (-8.11 pct)
512 51740.58 (0.00 pct) 51678.30 (-0.12 pct)
1024 51177.82 (0.00 pct) 50699.27 (-0.93 pct)

NPS2

Clients: tip NUMA Bal
1 449.49 (0.00 pct) 467.77 (4.06 pct)
2 867.28 (0.00 pct) 876.20 (1.02 pct)
4 1643.60 (0.00 pct) 1661.94 (1.11 pct)
8 3047.35 (0.00 pct) 3040.70 (-0.21 pct)
16 5340.77 (0.00 pct) 5168.57 (-3.22 pct)
32 10536.85 (0.00 pct) 9603.93 (-8.85 pct) *
32 10424.00 (0.00 pct) 9868.67 (-5.32 pct) [Verification Run]
64 16543.23 (0.00 pct) 15749.69 (-4.79 pct) *
64 17753.50 (0.00 pct) 15599.03 (-12.13 pct) [Verification Run]
128 26400.40 (0.00 pct) 27745.52 (5.09 pct)
256 23436.75 (0.00 pct) 27978.91 (19.38 pct)
512 50902.60 (0.00 pct) 50770.42 (-0.25 pct)
1024 50216.10 (0.00 pct) 49702.00 (-1.02 pct)

NPS4

Clients: tip NUMA Bal
1 443.82 (0.00 pct) 452.63 (1.98 pct)
2 849.14 (0.00 pct) 857.86 (1.02 pct)
4 1603.26 (0.00 pct) 1635.02 (1.98 pct)
8 2972.37 (0.00 pct) 3090.09 (3.96 pct)
16 5277.13 (0.00 pct) 5524.38 (4.68 pct)
32 9744.73 (0.00 pct) 10152.62 (4.18 pct)
64 15854.80 (0.00 pct) 17442.86 (10.01 pct)
128 26116.97 (0.00 pct) 26757.21 (2.45 pct)
256 22403.25 (0.00 pct) 21178.82 (-5.46 pct)
512 48317.20 (0.00 pct) 47433.34 (-1.82 pct)
1024 50445.41 (0.00 pct) 50311.83 (-0.26 pct)

Note: tbench results for 256 workers are known to have
a great amount of run to run variation on the test
machine. Any regression seen for the data point can
be safely ignored.

~~~~~~
Stream
~~~~~~

- 10 runs

NPS1

Test: tip NUMA Bal
Copy: 189113.11 (0.00 pct) 183548.36 (-2.94 pct)
Scale: 201190.61 (0.00 pct) 199548.74 (-0.81 pct)
Add: 232654.21 (0.00 pct) 230058.79 (-1.11 pct)
Triad: 226583.57 (0.00 pct) 224761.89 (-0.80 pct)

NPS2

Test: tip NUMA Bal
Copy: 155347.14 (0.00 pct) 226212.24 (45.61 pct)
Scale: 191701.53 (0.00 pct) 212667.40 (10.93 pct)
Add: 210013.97 (0.00 pct) 257112.85 (22.42 pct)
Triad: 207602.00 (0.00 pct) 250309.89 (20.57 pct)

NPS4

Test: tip NUMA Bal
Copy: 136421.15 (0.00 pct) 159681.42 (17.05 pct)
Scale: 191217.59 (0.00 pct) 193113.39 (0.99 pct)
Add: 189229.52 (0.00 pct) 209058.15 (10.47 pct)
Triad: 188052.99 (0.00 pct) 205945.57 (9.51 pct)

- 100 runs

NPS1

Test: tip NUMA Bal
Copy: 244693.32 (0.00 pct) 233080.12 (-4.74 pct)
Scale: 221874.99 (0.00 pct) 215975.10 (-2.65 pct)
Add: 268363.89 (0.00 pct) 263649.67 (-1.75 pct)
Triad: 260945.24 (0.00 pct) 250936.80 (-3.83 pct)

NPS2

Test: tip NUMA Bal
Copy: 211262.00 (0.00 pct) 251292.59 (18.94 pct)
Scale: 222493.34 (0.00 pct) 222258.48 (-0.10 pct)
Add: 280277.17 (0.00 pct) 279649.40 (-0.22 pct)
Triad: 265860.49 (0.00 pct) 265383.54 (-0.17 pct)

NPS4

Test: tip NUMA Bal
Copy: 250171.40 (0.00 pct) 252465.44 (0.91 pct)
Scale: 222293.56 (0.00 pct) 228169.89 (2.64 pct)
Add: 279222.16 (0.00 pct) 290568.29 (4.06 pct)
Triad: 262013.92 (0.00 pct) 273825.25 (4.50 pct)

~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~

NPS1

sched-tip: 303718.33 (var: 1.31)
NUMA Bal: 299859.00 (var: 1.05) (-1.27%)

NPS2

sched-tip: 304536.33 (var: 2.46)
NUMA Bal: 302469.67 (var: 1.38) (-0.67%)

NPS4

sched-tip: 301192.33 (var: 1.81)
NUMA Bal: 300948.00 (var: 0.85) (-0.08%)
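The "var" figures above are presumably the coefficient of variation in
percent across the runs (an assumption; the throughput values below
are illustrative, not the actual samples). A sketch of that
calculation:

```python
import statistics

# Coefficient of variation in percent: population standard deviation
# relative to the mean, as a spread measure for repeated runs.
def cov_pct(samples):
    return statistics.pstdev(samples) / statistics.fmean(samples) * 100

runs = [300000.0, 304000.0, 307155.0]
print("Mean %.2f (var: %.2f)" % (statistics.fmean(runs), cov_pct(runs)))
```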

~~~~~
Notes
~~~~~

- Hackbench in NPS1 mode seems to show run to run variation with
patched kernel. I'll gather some more data to check if this happens
consistently or not.
The number reported for hackbench is the Amean of 10 runs.
- schbench seems to show some variation on both tip and the patched
kernel for the data points with regression. These are evident from
the [Verification run] done for these data points.
schbench runs are done with 1 messenger and n workers.
- tbench seems to show some regression for 32 and 64 workers
in NPS2 mode. The case with 32 workers shows consistent results;
however, the tip kernel seems to see slight run to run variation
for 64 workers.

- Stream sees great benefit in NPS2 mode and NPS4 mode for short runs.
- Great improvements seen for tbench with 8-128 workers in NPS1 mode.

>
> This shows that we've had unpredictable performance for a long time for
> this load. Instability was introduced somewhere between v5.7 and v5.8,
> fixed in v5.12 and broken again since v5.13. The revert against 5.13
> and 5.18-rc4 shows that c6f886546cb8 is the primary source of instability
> although the best case is still worse than 5.7.
>
> This series addresses the allowed imbalance problems to get the peak
> performance back to 5.7 although only some of the time due to the
> instability problem. The series plus the revert is both stable and has
> slightly better peak performance and similar average performance. I'm
> not convinced commit c6f886546cb8 is wrong but haven't isolated exactly
> why it's unstable so for now, I'm just noting it has an issue.
>
> Patch 1 initialises numa_migrate_retry. While this resolves itself
> eventually, it is unpredictable early in the lifetime of
> a task.
>
> Patch 2 will not swap NUMA tasks in the same NUMA group or without
> a NUMA group if there is spare capacity. Swapping is just
> punishing one task to help another.
>
> Patch 3 fixes an issue where a larger imbalance can be created at
> fork time than would be allowed at run time. This behaviour
> can help some workloads that are short lived and prefer
> to remain local but it punishes long-lived tasks that are
> memory intensive.
>
> Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
> better approximate the number of memory channels, at least
> for x86-64.

The entire patch series was applied as is for testing.

>
> kernel/sched/fair.c | 59 ++++++++++++++++++++++++++---------------
> kernel/sched/topology.c | 23 ++++++++++------
> 2 files changed, 53 insertions(+), 29 deletions(-)
>

Please let me know if you would like me to get some additional
data on the test system.
--
Thanks and Regards,
Prateek