2020-06-24 08:43:33

by Zhangshaokun

[permalink] [raw]
Subject: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

get_file_rcu_many(), which is called by __fget_files(), now uses
atomic_try_cmpxchg(), which reduces the number of accesses to the shared
variable and improves the performance of the atomic operation compared
with atomic_cmpxchg().

__fget_files() checks @f_mode against a mask and then performs atomic
operations on @f_count, but the two members share a cacheline. With many
CPU cores accessing files concurrently, this causes heavy contention on
@f_count's cacheline. Moving the two members to different cachelines
relieves that contention.

We have tested this on ARM64 and x86; the results are as follows.
UnixBench's syscall test was run on Huawei Kunpeng 920 with this patch applied:
24 x System Call Overhead 1

System Call Overhead 3160841.4 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 3160841.4 2107.2
========
System Benchmarks Index Score (Partial Only) 2107.2

Without this patch:
24 x System Call Overhead 1

System Call Overhead 2222456.0 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 2222456.0 1481.6
========
System Benchmarks Index Score (Partial Only) 1481.6

And on Intel 6248 platform with this patch:
40 CPUs in system; running 24 parallel copies of tests

System Call Overhead 4288509.1 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 4288509.1 2859.0
========
System Benchmarks Index Score (Partial Only) 2859.0

Without this patch:
40 CPUs in system; running 24 parallel copies of tests

System Call Overhead 3666313.0 lps (10.0 s, 1 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 3666313.0 2444.2
========
System Benchmarks Index Score (Partial Only) 2444.2

Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Yuqi Jin <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f881a892ea7..0faeab5622fb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -955,7 +955,6 @@ struct file {
*/
spinlock_t f_lock;
enum rw_hint f_write_hint;
- atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
struct mutex f_pos_lock;
@@ -979,6 +978,7 @@ struct file {
struct address_space *f_mapping;
errseq_t f_wb_err;
errseq_t f_sb_err; /* for syncfs */
+ atomic_long_t f_count;
} __randomize_layout
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */

--
2.7.4


2020-07-08 07:49:49

by Chen, Rong A

[permalink] [raw]
Subject: [fs] 936e92b615: unixbench.score 32.3% improvement

Greeting,

FYI, we noticed a 32.3% improvement of unixbench.score due to commit:


commit: 936e92b615e212d08eb74951324bef25ba564c34 ("[PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode")
url: https://github.com/0day-ci/linux/commits/Shaokun-Zhang/fs-Move-f_count-to-different-cacheline-with-f_mode/20200624-163511
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 5e857ce6eae7ca21b2055cca4885545e29228fe2

in testcase: unixbench
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with following parameters:

runtime: 300s
nr_task: 30%
test: syscall
cpufreq_governor: performance
ucode: 0x5002f01

test-description: UnixBench is the original BYTE UNIX benchmark suite, which aims to test the performance of Unix-like systems.
test-url: https://github.com/kdlucas/byte-unixbench





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-7.6/30%/debian-x86_64-20191114.cgz/300s/lkp-csl-2ap3/syscall/unixbench/0x5002f01

commit:
5e857ce6ea ("Merge branch 'hch' (maccess patches from Christoph Hellwig)")
936e92b615 ("fs: Move @f_count to different cacheline with @f_mode")

5e857ce6eae7ca21 936e92b615e212d08eb74951324
---------------- ---------------------------
%stddev %change %stddev
\ | \
2297 ± 2% +32.3% 3038 unixbench.score
171.74 +34.8% 231.55 unixbench.time.user_time
1.366e+09 +32.6% 1.812e+09 unixbench.workload
26472 ± 6% +1270.0% 362665 ±158% cpuidle.C1.usage
0.25 ± 2% +0.1 0.33 mpstat.cpu.all.usr%
8.32 ± 43% +129.7% 19.12 ± 63% sched_debug.cpu.clock.stddev
8.32 ± 43% +129.7% 19.12 ± 63% sched_debug.cpu.clock_task.stddev
2100 ± 2% -15.6% 1772 ± 9% sched_debug.cpu.nr_switches.min
373.34 ± 3% +12.4% 419.48 ± 6% sched_debug.cpu.ttwu_local.stddev
2740 ± 12% -72.3% 757.75 ±105% numa-vmstat.node0.nr_inactive_anon
3139 ± 8% -69.9% 946.25 ± 97% numa-vmstat.node0.nr_shmem
2740 ± 12% -72.3% 757.75 ±105% numa-vmstat.node0.nr_zone_inactive_anon
373.75 ± 51% +443.3% 2030 ± 26% numa-vmstat.node2.nr_inactive_anon
496.00 ± 19% +366.1% 2311 ± 29% numa-vmstat.node2.nr_shmem
373.75 ± 51% +443.3% 2030 ± 26% numa-vmstat.node2.nr_zone_inactive_anon
13728 ± 13% +148.1% 34056 ± 46% numa-vmstat.node3.nr_active_anon
78558 +11.3% 87431 ± 6% numa-vmstat.node3.nr_file_pages
9939 ± 8% +19.7% 11902 ± 13% numa-vmstat.node3.nr_shmem
13728 ± 13% +148.1% 34056 ± 46% numa-vmstat.node3.nr_zone_active_anon
11103 ± 13% -71.2% 3201 ± 99% numa-meminfo.node0.Inactive
10962 ± 12% -72.3% 3032 ±105% numa-meminfo.node0.Inactive(anon)
8551 ± 31% -29.4% 6034 ± 18% numa-meminfo.node0.Mapped
12560 ± 8% -69.9% 3786 ± 97% numa-meminfo.node0.Shmem
1596 ± 51% +415.6% 8230 ± 24% numa-meminfo.node2.Inactive
1496 ± 51% +442.8% 8122 ± 26% numa-meminfo.node2.Inactive(anon)
1984 ± 19% +366.1% 9248 ± 29% numa-meminfo.node2.Shmem
54929 ± 13% +148.0% 136212 ± 46% numa-meminfo.node3.Active
54929 ± 13% +148.0% 136206 ± 46% numa-meminfo.node3.Active(anon)
314216 +11.3% 349697 ± 6% numa-meminfo.node3.FilePages
747907 ± 2% +15.2% 861672 ± 9% numa-meminfo.node3.MemUsed
39744 ± 8% +19.7% 47580 ± 13% numa-meminfo.node3.Shmem
13.94 ± 6% -13.9 0.00 perf-profile.calltrace.cycles-pp.dnotify_flush.filp_close.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.7 0.66 ± 8% perf-profile.calltrace.cycles-pp.__x64_sys_umask.do_syscall_64.entry_SYSCALL_64_after_hwframe
31.64 ± 8% +3.4 35.08 ± 5% perf-profile.calltrace.cycles-pp.__fget_files.ksys_dup.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.82 ± 8% +5.6 12.41 ± 12% perf-profile.calltrace.cycles-pp.fput_many.filp_close.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.54 ± 58% +12.7 36.27 ± 5% perf-profile.calltrace.cycles-pp.ksys_dup.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.54 ± 58% +12.7 36.29 ± 5% perf-profile.calltrace.cycles-pp.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.98 ± 6% -14.0 0.00 perf-profile.children.cycles-pp.dnotify_flush
39.81 ± 6% -10.8 28.96 ± 9% perf-profile.children.cycles-pp.filp_close
40.13 ± 6% -10.7 29.44 ± 9% perf-profile.children.cycles-pp.__x64_sys_close
0.15 ± 10% -0.0 0.13 ± 8% perf-profile.children.cycles-pp.scheduler_tick
0.05 ± 8% +0.0 0.07 ± 6% perf-profile.children.cycles-pp.__x64_sys_getuid
0.10 ± 7% +0.0 0.12 ± 8% perf-profile.children.cycles-pp.__prepare_exit_to_usermode
0.44 ± 7% +0.1 0.56 ± 6% perf-profile.children.cycles-pp.syscall_return_via_sysret
31.78 ± 8% +3.4 35.22 ± 5% perf-profile.children.cycles-pp.__fget_files
32.52 ± 8% +3.7 36.27 ± 5% perf-profile.children.cycles-pp.ksys_dup
32.54 ± 8% +3.8 36.30 ± 5% perf-profile.children.cycles-pp.__x64_sys_dup
6.86 ± 7% +5.6 12.45 ± 12% perf-profile.children.cycles-pp.fput_many
13.91 ± 6% -13.9 0.00 perf-profile.self.cycles-pp.dnotify_flush
18.05 ± 5% -1.6 16.41 ± 7% perf-profile.self.cycles-pp.filp_close
0.06 ± 6% +0.0 0.08 ± 8% perf-profile.self.cycles-pp.__prepare_exit_to_usermode
0.09 ± 9% +0.0 0.11 ± 7% perf-profile.self.cycles-pp.do_syscall_64
0.16 ± 9% +0.0 0.20 ± 4% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.30 ± 8% +0.1 0.36 ± 7% perf-profile.self.cycles-pp.entry_SYSCALL_64
0.44 ± 7% +0.1 0.56 ± 6% perf-profile.self.cycles-pp.syscall_return_via_sysret
31.61 ± 8% +3.4 35.00 ± 5% perf-profile.self.cycles-pp.__fget_files
6.81 ± 7% +5.6 12.38 ± 12% perf-profile.self.cycles-pp.fput_many
36623 ± 3% +11.5% 40822 ± 7% softirqs.CPU100.SCHED
16499 ± 40% +27.8% 21088 ± 35% softirqs.CPU122.RCU
16758 ± 41% +30.0% 21781 ± 35% softirqs.CPU126.RCU
178.25 ± 11% +7718.2% 13936 ±168% softirqs.CPU13.NET_RX
40883 ± 4% -6.9% 38055 ± 2% softirqs.CPU132.SCHED
16029 ± 41% +35.9% 21789 ± 33% softirqs.CPU144.RCU
16220 ± 43% +32.4% 21484 ± 35% softirqs.CPU145.RCU
16393 ± 39% +29.9% 21301 ± 32% softirqs.CPU146.RCU
16217 ± 39% +29.8% 21055 ± 35% softirqs.CPU147.RCU
37011 ± 12% +12.4% 41589 ± 5% softirqs.CPU149.SCHED
16127 ± 41% +34.5% 21685 ± 34% softirqs.CPU150.RCU
16131 ± 41% +32.3% 21333 ± 35% softirqs.CPU151.RCU
16558 ± 37% +28.2% 21230 ± 34% softirqs.CPU152.RCU
15863 ± 40% +34.1% 21266 ± 32% softirqs.CPU153.RCU
16044 ± 41% +32.7% 21286 ± 34% softirqs.CPU154.RCU
16057 ± 40% +34.9% 21658 ± 33% softirqs.CPU155.RCU
16352 ± 39% +31.0% 21423 ± 33% softirqs.CPU156.RCU
16006 ± 39% +33.4% 21348 ± 32% softirqs.CPU158.RCU
16300 ± 41% +32.0% 21521 ± 34% softirqs.CPU161.RCU
37546 ± 4% +13.5% 42605 ± 3% softirqs.CPU161.SCHED
16411 ± 41% +33.4% 21894 ± 33% softirqs.CPU162.RCU
16329 ± 41% +32.9% 21704 ± 35% softirqs.CPU163.RCU
16517 ± 39% +29.8% 21441 ± 34% softirqs.CPU164.RCU
16227 ± 41% +32.3% 21471 ± 34% softirqs.CPU165.RCU
16347 ± 40% +31.4% 21481 ± 35% softirqs.CPU166.RCU
16360 ± 43% +32.2% 21631 ± 35% softirqs.CPU167.RCU
36986 +11.3% 41148 ± 6% softirqs.CPU167.SCHED
16218 ± 44% +34.7% 21843 ± 33% softirqs.CPU189.RCU
16501 ± 39% +32.0% 21783 ± 33% softirqs.CPU52.RCU
17101 ± 41% +29.4% 22121 ± 35% softirqs.CPU68.RCU
1.087e+09 +20.9% 1.314e+09 perf-stat.i.branch-instructions
19778787 +22.1% 24144895 ± 16% perf-stat.i.branch-misses
22.88 -17.7% 18.84 ± 2% perf-stat.i.cpi
1.635e+09 +23.6% 2.021e+09 perf-stat.i.dTLB-loads
20648 ± 2% +218.4% 65736 ±110% perf-stat.i.dTLB-store-misses
1.023e+09 +24.8% 1.276e+09 perf-stat.i.dTLB-stores
78.10 +1.4 79.54 perf-stat.i.iTLB-load-miss-rate%
16169669 +8.2% 17493234 perf-stat.i.iTLB-load-misses
5.364e+09 +21.3% 6.507e+09 perf-stat.i.instructions
369.33 +11.8% 413.03 ± 5% perf-stat.i.instructions-per-iTLB-miss
0.41 ± 2% +83.3% 0.76 ± 16% perf-stat.i.metric.K/sec
19.79 +23.2% 24.39 perf-stat.i.metric.M/sec
4460149 ± 2% -45.1% 2447884 ± 14% perf-stat.i.node-load-misses
241219 ± 2% -58.8% 99443 ± 47% perf-stat.i.node-loads
1679821 ± 2% -4.4% 1605611 ± 3% perf-stat.i.node-store-misses
25.91 -17.6% 21.36 perf-stat.overall.cpi
82.51 +1.7 84.17 perf-stat.overall.iTLB-load-miss-rate%
331.21 +12.2% 371.62 perf-stat.overall.instructions-per-iTLB-miss
0.04 +21.3% 0.05 perf-stat.overall.ipc
1566 -8.4% 1435 perf-stat.overall.path-length
1.089e+09 +21.0% 1.318e+09 perf-stat.ps.branch-instructions
19801099 +21.7% 24102537 ± 15% perf-stat.ps.branch-misses
1.641e+09 +23.6% 2.028e+09 perf-stat.ps.dTLB-loads
20512 ± 2% +212.7% 64142 ±109% perf-stat.ps.dTLB-store-misses
1.027e+09 +24.8% 1.282e+09 perf-stat.ps.dTLB-stores
16239916 +8.2% 17567773 perf-stat.ps.iTLB-load-misses
5.378e+09 +21.4% 6.527e+09 perf-stat.ps.instructions
4485062 ± 2% -45.2% 2458026 ± 14% perf-stat.ps.node-load-misses
242388 ± 2% -59.0% 99493 ± 47% perf-stat.ps.node-loads
1689890 ± 2% -4.5% 1614182 ± 3% perf-stat.ps.node-store-misses
2.139e+12 +21.5% 2.6e+12 perf-stat.total.instructions
288.00 ± 13% +8910.9% 25951 ±168% interrupts.34:PCI-MSI.524292-edge.eth0-TxRx-3
2042 ± 57% +190.2% 5927 ± 26% interrupts.CPU1.NMI:Non-maskable_interrupts
2042 ± 57% +190.2% 5927 ± 26% interrupts.CPU1.PMI:Performance_monitoring_interrupts
3.75 ± 34% +2373.3% 92.75 ±130% interrupts.CPU100.TLB:TLB_shootdowns
3510 ± 88% -85.1% 522.00 ±124% interrupts.CPU107.NMI:Non-maskable_interrupts
3510 ± 88% -85.1% 522.00 ±124% interrupts.CPU107.PMI:Performance_monitoring_interrupts
3813 ± 74% -73.3% 1018 ±150% interrupts.CPU110.NMI:Non-maskable_interrupts
3813 ± 74% -73.3% 1018 ±150% interrupts.CPU110.PMI:Performance_monitoring_interrupts
4536 ± 51% -97.1% 131.50 ± 8% interrupts.CPU111.NMI:Non-maskable_interrupts
4536 ± 51% -97.1% 131.50 ± 8% interrupts.CPU111.PMI:Performance_monitoring_interrupts
4476 ± 47% -97.5% 113.00 ± 19% interrupts.CPU112.NMI:Non-maskable_interrupts
4476 ± 47% -97.5% 113.00 ± 19% interrupts.CPU112.PMI:Performance_monitoring_interrupts
3522 ± 36% +92.7% 6787 ± 16% interrupts.CPU120.NMI:Non-maskable_interrupts
3522 ± 36% +92.7% 6787 ± 16% interrupts.CPU120.PMI:Performance_monitoring_interrupts
2888 ± 66% +117.5% 6283 ± 21% interrupts.CPU123.NMI:Non-maskable_interrupts
2888 ± 66% +117.5% 6283 ± 21% interrupts.CPU123.PMI:Performance_monitoring_interrupts
3109 ± 61% +132.5% 7230 ± 7% interrupts.CPU124.NMI:Non-maskable_interrupts
3109 ± 61% +132.5% 7230 ± 7% interrupts.CPU124.PMI:Performance_monitoring_interrupts
1067 ± 19% -21.6% 836.50 interrupts.CPU125.CAL:Function_call_interrupts
288.00 ± 13% +8910.9% 25951 ±168% interrupts.CPU13.34:PCI-MSI.524292-edge.eth0-TxRx-3
244.25 ± 96% -95.3% 11.50 ± 95% interrupts.CPU13.TLB:TLB_shootdowns
2056 ±117% +206.3% 6298 ± 20% interrupts.CPU130.NMI:Non-maskable_interrupts
2056 ±117% +206.3% 6298 ± 20% interrupts.CPU130.PMI:Performance_monitoring_interrupts
831.50 +21.4% 1009 ± 13% interrupts.CPU133.CAL:Function_call_interrupts
8.00 ± 29% +634.4% 58.75 ±119% interrupts.CPU133.RES:Rescheduling_interrupts
1629 ±159% +265.3% 5952 ± 29% interrupts.CPU139.NMI:Non-maskable_interrupts
1629 ±159% +265.3% 5952 ± 29% interrupts.CPU139.PMI:Performance_monitoring_interrupts
1660 ±159% +161.0% 4332 ± 61% interrupts.CPU141.NMI:Non-maskable_interrupts
1660 ±159% +161.0% 4332 ± 61% interrupts.CPU141.PMI:Performance_monitoring_interrupts
882.75 ±147% +542.5% 5671 ± 38% interrupts.CPU143.NMI:Non-maskable_interrupts
882.75 ±147% +542.5% 5671 ± 38% interrupts.CPU143.PMI:Performance_monitoring_interrupts
2600 ± 29% +68.8% 4389 ± 47% interrupts.CPU144.NMI:Non-maskable_interrupts
2600 ± 29% +68.8% 4389 ± 47% interrupts.CPU144.PMI:Performance_monitoring_interrupts
1494 ± 20% +91.3% 2859 ± 29% interrupts.CPU147.NMI:Non-maskable_interrupts
1494 ± 20% +91.3% 2859 ± 29% interrupts.CPU147.PMI:Performance_monitoring_interrupts
3657 ± 54% -96.3% 133.75 ± 8% interrupts.CPU15.NMI:Non-maskable_interrupts
3657 ± 54% -96.3% 133.75 ± 8% interrupts.CPU15.PMI:Performance_monitoring_interrupts
5165 ± 40% -97.8% 115.00 ± 26% interrupts.CPU16.NMI:Non-maskable_interrupts
5165 ± 40% -97.8% 115.00 ± 26% interrupts.CPU16.PMI:Performance_monitoring_interrupts
34.00 ±125% -84.6% 5.25 ± 49% interrupts.CPU186.RES:Rescheduling_interrupts
1033 ± 24% -19.0% 836.75 interrupts.CPU190.CAL:Function_call_interrupts
68.00 ± 28% +55.5% 105.75 ± 9% interrupts.CPU26.RES:Rescheduling_interrupts
882.25 ± 4% +6.3% 937.75 ± 7% interrupts.CPU32.CAL:Function_call_interrupts
139.25 ± 96% -74.0% 36.25 ± 72% interrupts.CPU32.TLB:TLB_shootdowns
848.25 ±130% +368.9% 3977 ± 56% interrupts.CPU35.NMI:Non-maskable_interrupts
848.25 ±130% +368.9% 3977 ± 56% interrupts.CPU35.PMI:Performance_monitoring_interrupts
958.25 ± 11% -10.6% 856.75 interrupts.CPU36.CAL:Function_call_interrupts
1903 ± 72% +127.9% 4337 ± 23% interrupts.CPU41.NMI:Non-maskable_interrupts
1903 ± 72% +127.9% 4337 ± 23% interrupts.CPU41.PMI:Performance_monitoring_interrupts
1320 ±158% +245.4% 4560 ± 32% interrupts.CPU47.NMI:Non-maskable_interrupts
1320 ±158% +245.4% 4560 ± 32% interrupts.CPU47.PMI:Performance_monitoring_interrupts
837.50 +5.2% 881.25 ± 4% interrupts.CPU61.CAL:Function_call_interrupts
1074 ± 28% -22.1% 836.50 interrupts.CPU69.CAL:Function_call_interrupts
1042 ± 12% -18.7% 847.50 ± 2% interrupts.CPU86.CAL:Function_call_interrupts



unixbench.score

3200 +--------------------------------------------------------------------+
| O O O |
3000 |-+ O O O O O O O O O |
| O O O O |
| O |
2800 |-+ |
| |
2600 |-+ |
| |
2400 |-+ |
| +.+.. .+.+..+. +..+. .+. .+. .+..+.+.+..+.+.+. .+.|
|.+.. + .+ +.+..+. + + +. + +. |
2200 |-+ + + + |
| |
2000 +--------------------------------------------------------------------+


unixbench.workload

1.9e+09 +-----------------------------------------------------------------+
| O O O O |
1.8e+09 |-+ O O O O O O O O |
| O O O O O |
1.7e+09 |-+ |
| |
1.6e+09 |-+ |
| |
1.5e+09 |-+ |
| |
1.4e+09 |-+ +.+ .+..+.+ +.+. .+.. .+. .+..+. .+. .+.. .|
|.+. .. : + + .+.+.. + + + +.+ + + +.+ |
1.3e+09 |-+ + : + + + |
| + |
1.2e+09 +-----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (19.94 kB)
config-5.8.0-rc1-00128-g936e92b615e21 (209.44 kB)
job-script (7.44 kB)
job.yaml (5.08 kB)
reproduce (303.00 B)

2020-07-13 08:50:09

by Zhangshaokun

[permalink] [raw]
Subject: Re: [fs] 936e92b615: unixbench.score 32.3% improvement

Hi maintainers,

This issue was debugged on Huawei Kunpeng 920, an ARM64 platform, and we
have also run more tests on x86.
Since Rong has reported the same improvement on x86, it seems worthwhile
to apply this change. Any comments?

Thanks,
Shaokun

On 2020/7/8 15:23, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a 32.3% improvement of unixbench.score due to commit:
>
> commit: 936e92b615e212d08eb74951324bef25ba564c34 ("[PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode")
> url: https://github.com/0day-ci/linux/commits/Shaokun-Zhang/fs-Move-f_count-to-different-cacheline-with-f_mode/20200624-163511
>
> [full report trimmed; quoted in the message above]
>

2020-08-21 16:04:36

by Will Deacon

Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
> get_file_rcu_many, which is called by __fget_files, has used
> atomic_try_cmpxchg now and it can reduce the access number of the global
> variable to improve the performance of atomic instruction compared with
> atomic_cmpxchg.
>
> __fget_files does check the @f_mode with mask variable and will do some
> atomic operations on @f_count, but both are on the same cacheline.
> Many CPU cores do file access and it will cause much conflicts on @f_count.
> If we could make the two members into different cachelines, it shall relax
> the situations.
>
> We have tested this on ARM64 and X86, the result is as follows:
> Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
> 24 x System Call Overhead 1
>
> System Call Overhead 3160841.4 lps (10.0 s, 1 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 3160841.4 2107.2
> ========
> System Benchmarks Index Score (Partial Only) 2107.2
>
> Without this patch:
> 24 x System Call Overhead 1
>
> System Call Overhead 2222456.0 lps (10.0 s, 1 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 2222456.0 1481.6
> ========
> System Benchmarks Index Score (Partial Only) 1481.6
>
> And on Intel 6248 platform with this patch:
> 40 CPUs in system; running 24 parallel copies of tests
>
> System Call Overhead 4288509.1 lps (10.0 s, 1 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 4288509.1 2859.0
> ========
> System Benchmarks Index Score (Partial Only) 2859.0
>
> Without this patch:
> 40 CPUs in system; running 24 parallel copies of tests
>
> System Call Overhead 3666313.0 lps (10.0 s, 1 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 3666313.0 2444.2
> ========
> System Benchmarks Index Score (Partial Only) 2444.2
>
> Cc: Will Deacon <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Signed-off-by: Yuqi Jin <[email protected]>
> Signed-off-by: Shaokun Zhang <[email protected]>
> ---
> include/linux/fs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
> */
> spinlock_t f_lock;
> enum rw_hint f_write_hint;
> - atomic_long_t f_count;
> unsigned int f_flags;
> fmode_t f_mode;
> struct mutex f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
> struct address_space *f_mapping;
> errseq_t f_wb_err;
> errseq_t f_sb_err; /* for syncfs */
> + atomic_long_t f_count;
> } __randomize_layout
> __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */

Hmm. So the microbenchmark numbers look lovely, but:

- What impact does it actually have for real workloads?
- How do we avoid regressing performance by innocently changing the struct
again later on?
- This thing is tagged with __randomize_layout, so it doesn't help anybody
using that crazy plugin
- What about all the other atomics and locks that share cachelines?

Will

2020-08-26 07:25:55

by Zhangshaokun

Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

Hi Will,

On 2020/8/22 0:02, Will Deacon wrote:
> On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
>> get_file_rcu_many, which is called by __fget_files, has used
>> atomic_try_cmpxchg now and it can reduce the access number of the global
>> variable to improve the performance of atomic instruction compared with
>> atomic_cmpxchg.
>>
>> __fget_files does check the @f_mode with mask variable and will do some
>> atomic operations on @f_count, but both are on the same cacheline.
>> Many CPU cores do file access and it will cause much conflicts on @f_count.
>> If we could make the two members into different cachelines, it shall relax
>> the situations.
>>
>> We have tested this on ARM64 and X86, the result is as follows:
>> Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
>> 24 x System Call Overhead 1
>>
>> System Call Overhead 3160841.4 lps (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> System Call Overhead 15000.0 3160841.4 2107.2
>> ========
>> System Benchmarks Index Score (Partial Only) 2107.2
>>
>> Without this patch:
>> 24 x System Call Overhead 1
>>
>> System Call Overhead 2222456.0 lps (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> System Call Overhead 15000.0 2222456.0 1481.6
>> ========
>> System Benchmarks Index Score (Partial Only) 1481.6
>>
>> And on Intel 6248 platform with this patch:
>> 40 CPUs in system; running 24 parallel copies of tests
>>
>> System Call Overhead 4288509.1 lps (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> System Call Overhead 15000.0 4288509.1 2859.0
>> ========
>> System Benchmarks Index Score (Partial Only) 2859.0
>>
>> Without this patch:
>> 40 CPUs in system; running 24 parallel copies of tests
>>
>> System Call Overhead 3666313.0 lps (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> System Call Overhead 15000.0 3666313.0 2444.2
>> ========
>> System Benchmarks Index Score (Partial Only) 2444.2
>>
>> Cc: Will Deacon <[email protected]>
>> Cc: Mark Rutland <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Cc: Boqun Feng <[email protected]>
>> Signed-off-by: Yuqi Jin <[email protected]>
>> Signed-off-by: Shaokun Zhang <[email protected]>
>> ---
>> include/linux/fs.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3f881a892ea7..0faeab5622fb 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -955,7 +955,6 @@ struct file {
>> */
>> spinlock_t f_lock;
>> enum rw_hint f_write_hint;
>> - atomic_long_t f_count;
>> unsigned int f_flags;
>> fmode_t f_mode;
>> struct mutex f_pos_lock;
>> @@ -979,6 +978,7 @@ struct file {
>> struct address_space *f_mapping;
>> errseq_t f_wb_err;
>> errseq_t f_sb_err; /* for syncfs */
>> + atomic_long_t f_count;
>> } __randomize_layout
>> __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
>
> Hmm. So the microbenchmark numbers look lovely, but:

Thanks,

>
> - What impact does it actually have for real workloads?

It was exposed when we ran the unixbench test. For real workloads where many
threads open and access the same file, it should help in the same way as it
does for unixbench. For other scenarios, the patch should not cause a
regression, since we only swap the position of @f_count with @f_mode.

> - How do we avoid regressing performance by innocently changing the struct
> again later on?

We could add a comment on @f_count documenting this change; I'm not sure that
is enough.

> - This thing is tagged with __randomize_layout, so it doesn't help anybody
> using that crazy plugin

This patch completely separates @f_count from @f_mode, and that separation does
not depend on the base address of the structure; or perhaps I have missed your
point.

> - What about all the other atomics and locks that share cachelines?

An interesting question. To be honest, we found this issue while profiling
unixbench and then tried to relax the contention. The same method may help in
other scenarios if profiling points at the same kind of conflict, but such
sharing can't be detected automatically.

Thanks,
Shaokun

>
> Will

2020-08-26 08:25:32

by Aleksa Sarai

Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

On 2020-08-26, Shaokun Zhang <[email protected]> wrote:
> On 2020/8/22 0:02, Will Deacon wrote:
> > - This thing is tagged with __randomize_layout, so it doesn't help anybody
> > using that crazy plugin
>
> This patch completely separates @f_count from @f_mode, and that separation
> does not depend on the base address of the structure; or perhaps I have
> missed your point.

__randomize_layout randomises the order of fields in a structure on each
kernel rebuild (to make attacks against sensitive kernel structures
theoretically harder because the offset of a field is per-build). It is
separate from ASLR and other base-address randomisation. However, it depends
on having CONFIG_GCC_PLUGIN_RANDSTRUCT=y and I believe (at least for
distribution kernels) this isn't a widely-used configuration.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-08-27 08:30:49

by Zhangshaokun

Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

Hi Aleksa,

On 2020/8/26 16:24, Aleksa Sarai wrote:
> On 2020-08-26, Shaokun Zhang <[email protected]> wrote:
>> On 2020/8/22 0:02, Will Deacon wrote:
>>> - This thing is tagged with __randomize_layout, so it doesn't help anybody
>>> using that crazy plugin
>>
>> This patch completely separates @f_count from @f_mode, and that separation
>> does not depend on the base address of the structure; or perhaps I have
>> missed your point.
>
> __randomize_layout randomises the order of fields in a structure on each
> kernel rebuild (to make attacks against sensitive kernel structures
> theoretically harder because the offset of a field is per-build). It is

My bad, I missed Will's point because of my poor understanding of the plugin.

> separate to ASLR or other base-related randomisation. However it depends
> on having CONFIG_GCC_PLUGIN_RANDSTRUCT=y and I believe (at least for
> distribution kernels) this isn't a widely-used configuration.

Thanks for the further explanation. In our tests this config is also disabled.
With CONFIG_GCC_PLUGIN_RANDSTRUCT=y, it seems this patch would lose its value.
But if that config isn't widely used, hopefully we can still do something for
the common case.

Thanks,
Shaokun

>

2020-09-08 15:41:48

by Jan Kara

Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

On Wed 24-06-20 16:32:28, Shaokun Zhang wrote:
> get_file_rcu_many, which is called by __fget_files, has used
> atomic_try_cmpxchg now and it can reduce the access number of the global
> variable to improve the performance of atomic instruction compared with
> atomic_cmpxchg.
>
> __fget_files does check the @f_mode with mask variable and will do some
> atomic operations on @f_count, but both are on the same cacheline.
> Many CPU cores do file access and it will cause much conflicts on @f_count.
> If we could make the two members into different cachelines, it shall relax
> the situations.

<snip nice unixbench results>

Thanks for the patch! The wins for your microbenchmark heavily sharing
struct file are nice but I'm not sure your change is a universal win. When
struct file is not shared (which is far more common), hot code paths like
__fget() or __fget_light() will now need to fetch two cache lines from
struct file instead of one. So I don't think that for most users the
tradeoff is really worth it...

Honza

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
> */
> spinlock_t f_lock;
> enum rw_hint f_write_hint;
> - atomic_long_t f_count;
> unsigned int f_flags;
> fmode_t f_mode;
> struct mutex f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
> struct address_space *f_mapping;
> errseq_t f_wb_err;
> errseq_t f_sb_err; /* for syncfs */
> + atomic_long_t f_count;
> } __randomize_layout
> __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
>
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR