Date: 2022-11-21 03:25:53
From: kernel test robot
Subject: [linus:master] [mm] 088b8aa537: vm-scalability.throughput -6.5% regression

Greetings,

FYI, we noticed a -6.5% regression of vm-scalability.throughput due to commit:

commit: 088b8aa537c2c767765f1c19b555f21ffe555786 ("mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: vm-scalability
on test machine: 88 threads, 2 sockets, Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake), with 128G of memory
with the following parameters:

thp_enabled: never
thp_defrag: never
nr_task: 1
nr_pmem: 2
priority: 1
test: swap-w-seq
cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ subsystem of the Linux kernel that are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/


Details are as follows:

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_pmem/nr_task/priority/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
gcc-11/performance/x86_64-rhel-8.3/2/1/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/swap-w-seq/vm-scalability/never/never

commit:
e7b72c48d6 ("mm/mremap_pages: save a few cycles in get_dev_pagemap()")
088b8aa537 ("mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast")

e7b72c48d677c244 088b8aa537c2c767765f1c19b55
---------------- ---------------------------
%stddev %change %stddev
\ | \
1061123 -6.5% 991719 ± 3% vm-scalability.median
1061123 -6.5% 991719 ± 3% vm-scalability.throughput
106.50 +6.7% 113.64 ± 3% vm-scalability.time.elapsed_time
106.50 +6.7% 113.64 ± 3% vm-scalability.time.elapsed_time.max
95.00 -3.9% 91.33 vm-scalability.time.percent_of_cpu_this_job_got
464.83 ± 45% +77.2% 823.50 ± 32% numa-vmstat.node1.workingset_refault_anon
41828947 ± 3% +29.1% 54019629 ± 2% turbostat.IRQ
46348 ± 3% +36.6% 63322 ± 41% turbostat.POLL
593.33 -3.5% 572.83 ± 2% vmstat.swap.si
626370 -5.7% 590394 ± 3% vmstat.swap.so
272382 ± 4% +14.9% 313058 ± 2% vmstat.system.in
6487093 ± 55% -69.9% 1954992 ± 36% proc-vmstat.compact_migrate_scanned
1184 ± 5% -24.7% 892.17 ± 20% proc-vmstat.kswapd_low_wmark_hit_quickly
96594 -5.3% 91472 ± 2% proc-vmstat.nr_dirty_background_threshold
193425 -5.3% 183168 ± 2% proc-vmstat.nr_dirty_threshold
1004718 -5.1% 953385 proc-vmstat.nr_free_pages
283.33 ± 19% -28.5% 202.67 ± 24% proc-vmstat.nr_inactive_file
282.17 ± 19% -28.4% 202.17 ± 24% proc-vmstat.nr_zone_inactive_file
1504 ± 2% -16.9% 1250 ± 10% proc-vmstat.pageoutrun
3589 ± 13% +115.7% 7743 ± 12% proc-vmstat.pgactivate
22448232 +3.3% 23184909 proc-vmstat.pgalloc_normal
9440 ± 22% +69.2% 15974 proc-vmstat.pgmajfault
19906 ± 3% +4.9% 20882 ± 2% proc-vmstat.pgreuse
9569 ± 22% +69.7% 16238 proc-vmstat.pswpin
1089 ± 5% +38.4% 1507 ± 23% proc-vmstat.workingset_refault_anon
1.77e+09 -4.6% 1.689e+09 ± 2% perf-stat.i.branch-instructions
190.04 +1.4% 192.69 perf-stat.i.cpu-migrations
25.17 ± 10% +14.9 40.02 ± 9% perf-stat.i.iTLB-load-miss-rate%
690930 ± 4% +57.4% 1087333 ± 4% perf-stat.i.iTLB-load-misses
2149984 ± 8% -24.2% 1630299 ± 13% perf-stat.i.iTLB-loads
6.979e+09 -4.4% 6.675e+09 ± 2% perf-stat.i.instructions
10160 ± 4% -36.6% 6443 ± 5% perf-stat.i.instructions-per-iTLB-miss
69.47 ± 31% +78.3% 123.87 ± 16% perf-stat.i.major-faults
223174 -6.0% 209787 ± 3% perf-stat.i.minor-faults
223243 -6.0% 209911 ± 3% perf-stat.i.page-faults
24.44 ± 8% +15.8 40.28 ± 9% perf-stat.overall.iTLB-load-miss-rate%
10119 ± 4% -39.2% 6151 ± 4% perf-stat.overall.instructions-per-iTLB-miss
0.91 ± 5% -9.1% 0.82 ± 4% perf-stat.overall.ipc
57.75 ± 6% +8.2 65.97 ± 5% perf-stat.overall.node-load-miss-rate%
1.753e+09 -4.5% 1.674e+09 ± 2% perf-stat.ps.branch-instructions
188.25 +1.5% 191.00 perf-stat.ps.cpu-migrations
684471 ± 4% +57.5% 1077795 ± 4% perf-stat.ps.iTLB-load-misses
2129593 ± 8% -24.1% 1616077 ± 13% perf-stat.ps.iTLB-loads
6.914e+09 -4.3% 6.617e+09 ± 2% perf-stat.ps.instructions
68.83 ± 31% +78.5% 122.87 ± 16% perf-stat.ps.major-faults
221092 -5.9% 207957 ± 3% perf-stat.ps.minor-faults
221161 -5.9% 208080 ± 3% perf-stat.ps.page-faults
2.62 ± 37% -2.0 0.62 ± 79% perf-profile.calltrace.cycles-pp.try_to_unmap_one.rmap_walk_anon.try_to_unmap.shrink_page_list.shrink_inactive_list
2.82 ± 36% -1.9 0.88 ± 60% perf-profile.calltrace.cycles-pp.rmap_walk_anon.try_to_unmap.shrink_page_list.shrink_inactive_list.shrink_lruvec
2.86 ± 35% -1.9 0.92 ± 60% perf-profile.calltrace.cycles-pp.try_to_unmap.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node_memcgs
0.00 +2.4 2.36 ± 44% perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list
0.00 +2.4 2.37 ± 44% perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list
0.00 +2.4 2.41 ± 44% perf-profile.calltrace.cycles-pp.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list.shrink_lruvec
0.00 +2.4 2.41 ± 44% perf-profile.calltrace.cycles-pp.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node_memcgs
2.87 ± 35% -1.9 0.97 ± 49% perf-profile.children.cycles-pp.try_to_unmap
2.63 ± 37% -1.9 0.75 ± 47% perf-profile.children.cycles-pp.try_to_unmap_one
0.23 ± 29% +0.1 0.35 ± 29% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.39 ± 18% +0.1 0.52 ± 10% perf-profile.children.cycles-pp.sync_regs
0.07 ± 20% +0.1 0.20 ± 23% perf-profile.children.cycles-pp.llist_reverse_order
0.53 ± 17% +0.1 0.67 ± 12% perf-profile.children.cycles-pp.error_entry
0.20 ± 31% +0.2 0.35 ± 13% perf-profile.children.cycles-pp.flush_tlb_func
0.00 +0.2 0.19 ± 17% perf-profile.children.cycles-pp.native_flush_tlb_local
0.35 ± 25% +0.5 0.88 ± 19% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.35 ± 26% +0.5 0.88 ± 19% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.43 ± 25% +0.6 0.99 ± 17% perf-profile.children.cycles-pp.sysvec_call_function_single
0.73 ± 23% +0.8 1.54 ± 14% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.00 +2.4 2.41 ± 44% perf-profile.children.cycles-pp.arch_tlbbatch_flush
0.00 +2.4 2.41 ± 44% perf-profile.children.cycles-pp.try_to_unmap_flush_dirty
0.95 ± 11% -0.2 0.75 ± 10% perf-profile.self.cycles-pp.shrink_page_list
0.23 ± 13% -0.0 0.18 ± 17% perf-profile.self.cycles-pp.cpuidle_idle_call
0.10 ± 29% +0.1 0.16 ± 15% perf-profile.self.cycles-pp.__handle_mm_fault
0.38 ± 18% +0.1 0.51 ± 10% perf-profile.self.cycles-pp.sync_regs
0.07 ± 20% +0.1 0.20 ± 23% perf-profile.self.cycles-pp.llist_reverse_order
0.00 +0.2 0.19 ± 18% perf-profile.self.cycles-pp.native_flush_tlb_local
0.09 ± 31% +0.2 0.33 ± 36% perf-profile.self.cycles-pp.__flush_smp_call_function_queue



If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (9.11 kB)
config-6.0.0-rc3-00139-g088b8aa537c2 (166.82 kB)
job-script (8.46 kB)
job.yaml (5.93 kB)
reproduce (1.00 kB)

Date: 2022-11-21 08:13:11
From: David Hildenbrand
Subject: Re: [linus:master] [mm] 088b8aa537: vm-scalability.throughput -6.5% regression

On 21.11.22 04:03, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -6.5% regression of vm-scalability.throughput due to commit:
>
> commit: 088b8aa537c2c767765f1c19b555f21ffe555786 ("mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: vm-scalability
> on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
> with following parameters:
>
> thp_enabled: never
> thp_defrag: never
> nr_task: 1
> nr_pmem: 2
> priority: 1
> test: swap-w-seq
> cpufreq_governor: performance
>
> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>

Yes, page_try_share_anon_rmap() might now be a bit more expensive,
making try_to_unmap_one() a bit more expensive as well. However, that
patch also changes the unconditional TLB flush into a conditional TLB
flush, so results might vary heavily between machines/architectures.

smp_mb__after_atomic() is a NOP on x86, so the smp_mb() before the
page_maybe_dma_pinned() check would have to be responsible.
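
For readers following along, here is a minimal sketch of the ordering
pattern in question (simplified and renamed; it is not the exact
mainline page_try_share_anon_rmap(), which carries longer comments and
handles additional cases):

#include <linux/mm.h>           /* page_maybe_dma_pinned() */
#include <linux/page-flags.h>   /* ClearPageAnonExclusive() */

/* Minimal sketch only; not the exact mainline implementation. */
static inline int page_try_share_anon_rmap_sketch(struct page *page)
{
        /*
         * Order the preceding PTE clear (and TLB flush, where required)
         * against the pin check below, so that a concurrent RCU GUP-fast
         * either observes the cleared PTE or we observe its pin.
         * This smp_mb() is the part that is not free on x86.
         */
        smp_mb();

        if (unlikely(page_maybe_dma_pinned(page)))
                return -EBUSY;  /* possibly pinned: keep PageAnonExclusive set */

        ClearPageAnonExclusive(page);

        /* Pairs with the read side in GUP-fast; a NOP on x86, as noted above. */
        smp_mb__after_atomic();
        return 0;
}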

While there might certainly be ways of optimizing that further (e.g.,
if the ptep_get_and_clear() already implies a smp_mb(); see the
hypothetical sketch below), the facts that:

(1) It's a swap micro-benchmark
(2) We have 3% stddev

Don't make me get active now ;)
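
Purely as an illustration of that optimization idea (hypothetical: the
ptep_clear_is_full_mb() helper below is invented for this sketch and
does not exist in mainline), it might look roughly like:

/*
 * Hypothetical sketch: ptep_clear_is_full_mb() is an invented stand-in
 * for "this architecture's ptep_get_and_clear() already acts as a full
 * memory barrier" (e.g. x86's xchg-based implementation).
 */
#ifndef ptep_clear_is_full_mb
#define ptep_clear_is_full_mb() false
#endif

static inline int page_try_share_anon_rmap_sketch2(struct page *page)
{
        /* Skip the explicit barrier when the PTE clear already provided one. */
        if (!ptep_clear_is_full_mb())
                smp_mb();

        if (unlikely(page_maybe_dma_pinned(page)))
                return -EBUSY;

        ClearPageAnonExclusive(page);
        smp_mb__after_atomic();
        return 0;
}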

--
Thanks,

David / dhildenb