2019-07-18 09:51:31

by Chen, Rong A

Subject: [x86/modules] f2c65fb322: will-it-scale.per_process_ops -2.9% regression

Greetings,

FYI, we noticed a -2.9% regression of will-it-scale.per_process_ops due to commit:


commit: f2c65fb3221adc6b73b0549fc7ba892022db9797 ("x86/modules: Avoid breaking W^X while loading modules")
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory
with the following parameters:

nr_task: 100%
mode: process
test: poll1
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
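For reference, the poll1 case can also be built and run directly from the upstream will-it-scale repository, outside of lkp. A rough sketch (the binary name follows the upstream <test>_processes naming; the -t flag for the task count is an assumption, so check the repository's usage output):

git clone https://github.com/antonblanchard/will-it-scale.git
cd will-it-scale
make
./poll1_processes -t 192    # assumed: -t sets the number of parallel tasks (192 = all CPUs on this box)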



Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-2019-05-14.cgz/lkp-csl-2ap1/poll1/will-it-scale

commit:
7298e24f90 ("x86/kprobes: Set instruction page as executable")
f2c65fb322 ("x86/modules: Avoid breaking W^X while loading modules")

7298e24f904224fa f2c65fb3221adc6b73b0549fc7b
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:4 50% 2:4 dmesg.WARNING:at#for_ip_interrupt_entry/0x
%stddev %change %stddev
\ | \
5017865 -2.9% 4874578 will-it-scale.per_process_ops
9.634e+08 -2.9% 9.359e+08 will-it-scale.workload
33.31 -1.0% 32.99 boot-time.dhcp
10541056 ± 7% +12.1% 11820032 meminfo.DirectMap2M
652.58 ± 7% +10.3% 719.62 ± 8% sched_debug.cfs_rq:/.util_avg.min
19388 ± 14% -17.5% 16003 ± 6% softirqs.CPU134.RCU
1140906 -2.7% 1109655 ± 2% proc-vmstat.pgalloc_normal
1069002 ± 2% -4.7% 1018671 ± 3% proc-vmstat.pgfree
2124 ± 11% -21.8% 1662 ± 4% slabinfo.pool_workqueue.active_objs
2220 ± 11% -21.6% 1742 ± 5% slabinfo.pool_workqueue.num_objs
8708 ± 3% +11.6% 9714 ± 6% numa-vmstat.node0.nr_kernel_stack
6771 ± 7% -12.5% 5924 ± 2% numa-vmstat.node3.nr_kernel_stack
5180 ± 19% -31.7% 3540 ± 13% numa-vmstat.node3.nr_slab_reclaimable
8710 ± 3% +11.5% 9714 ± 6% numa-meminfo.node0.KernelStack
20724 ± 19% -31.7% 14162 ± 13% numa-meminfo.node3.KReclaimable
6773 ± 7% -12.5% 5923 ± 2% numa-meminfo.node3.KernelStack
20724 ± 19% -31.7% 14162 ± 13% numa-meminfo.node3.SReclaimable
82391 ± 9% -12.4% 72177 ± 8% numa-meminfo.node3.Slab
171.00 ±103% -87.6% 21.25 ± 70% interrupts.CPU110.RES:Rescheduling_interrupts
8457 -37.6% 5278 ± 34% interrupts.CPU145.NMI:Non-maskable_interrupts
8457 -37.6% 5278 ± 34% interrupts.CPU145.PMI:Performance_monitoring_interrupts
8453 -37.5% 5285 ± 34% interrupts.CPU146.NMI:Non-maskable_interrupts
8453 -37.5% 5285 ± 34% interrupts.CPU146.PMI:Performance_monitoring_interrupts
8458 -37.5% 5284 ± 34% interrupts.CPU147.NMI:Non-maskable_interrupts
8458 -37.5% 5284 ± 34% interrupts.CPU147.PMI:Performance_monitoring_interrupts
8456 -37.6% 5279 ± 34% interrupts.CPU149.NMI:Non-maskable_interrupts
8456 -37.6% 5279 ± 34% interrupts.CPU149.PMI:Performance_monitoring_interrupts
8449 -37.5% 5278 ± 34% interrupts.CPU151.NMI:Non-maskable_interrupts
8449 -37.5% 5278 ± 34% interrupts.CPU151.PMI:Performance_monitoring_interrupts
14.75 ± 41% +2237.3% 344.75 ± 93% interrupts.CPU154.RES:Rescheduling_interrupts
108.75 ±112% -85.7% 15.50 ± 41% interrupts.CPU168.RES:Rescheduling_interrupts
168.00 ± 19% -41.8% 97.75 ± 40% interrupts.CPU18.RES:Rescheduling_interrupts
1882 ± 87% -94.9% 95.50 ± 50% interrupts.CPU19.RES:Rescheduling_interrupts
218.75 ± 88% +196.9% 649.50 ± 69% interrupts.CPU22.RES:Rescheduling_interrupts
8448 -37.4% 5291 ± 34% interrupts.CPU31.NMI:Non-maskable_interrupts
8448 -37.4% 5291 ± 34% interrupts.CPU31.PMI:Performance_monitoring_interrupts
79.50 ± 23% +537.4% 506.75 ± 75% interrupts.CPU33.RES:Rescheduling_interrupts
8436 -37.5% 5274 ± 34% interrupts.CPU34.NMI:Non-maskable_interrupts
8436 -37.5% 5274 ± 34% interrupts.CPU34.PMI:Performance_monitoring_interrupts
195.50 ±127% +423.1% 1022 ± 47% interrupts.CPU36.RES:Rescheduling_interrupts
43.00 ± 46% +3592.4% 1587 ±150% interrupts.CPU37.RES:Rescheduling_interrupts
82.00 ± 28% +103.4% 166.75 ± 52% interrupts.CPU41.RES:Rescheduling_interrupts
102.75 ± 29% -53.3% 48.00 ± 45% interrupts.CPU50.RES:Rescheduling_interrupts
474.00 ± 30% -73.0% 127.75 ± 75% interrupts.CPU68.RES:Rescheduling_interrupts
8042 ± 4% +6.5% 8565 ± 4% interrupts.CPU95.RES:Rescheduling_interrupts
0.02 -6.0% 0.02 ± 5% perf-stat.i.MPKI
1.111e+11 -2.8% 1.079e+11 perf-stat.i.branch-instructions
2.454e+08 -3.4% 2.371e+08 perf-stat.i.branch-misses
726252 ± 3% -7.8% 669916 ± 3% perf-stat.i.cache-misses
13163065 -8.5% 12044270 ± 5% perf-stat.i.cache-references
1.02 +3.0% 1.05 perf-stat.i.cpi
835228 +7.0% 893984 ± 4% perf-stat.i.cycles-between-cache-misses
1.419e+11 -2.8% 1.38e+11 perf-stat.i.dTLB-loads
0.00 -0.0 0.00 ± 3% perf-stat.i.dTLB-store-miss-rate%
13827 -8.8% 12606 ± 3% perf-stat.i.dTLB-store-misses
1.023e+11 -2.8% 9.937e+10 perf-stat.i.dTLB-stores
2.909e+08 -12.0% 2.56e+08 ± 2% perf-stat.i.iTLB-load-misses
5.448e+11 -2.8% 5.293e+11 perf-stat.i.instructions
1876 +10.4% 2071 ± 2% perf-stat.i.instructions-per-iTLB-miss
0.98 -2.9% 0.95 perf-stat.i.ipc
3297 -1.7% 3241 perf-stat.i.minor-faults
31602 ± 11% -21.5% 24805 ± 15% perf-stat.i.node-loads
3297 -1.7% 3241 perf-stat.i.page-faults
0.02 -5.8% 0.02 ± 5% perf-stat.overall.MPKI
1.02 +3.0% 1.05 perf-stat.overall.cpi
766162 ± 3% +8.5% 831132 ± 3% perf-stat.overall.cycles-between-cache-misses
0.00 -0.0 0.00 ± 3% perf-stat.overall.dTLB-store-miss-rate%
1872 +10.4% 2068 ± 2% perf-stat.overall.instructions-per-iTLB-miss
0.98 -2.9% 0.95 perf-stat.overall.ipc
85.08 +3.2 88.25 ± 2% perf-stat.overall.node-load-miss-rate%
1.107e+11 -2.8% 1.075e+11 perf-stat.ps.branch-instructions
2.446e+08 -3.4% 2.363e+08 perf-stat.ps.branch-misses
723856 ± 3% -7.8% 667682 ± 3% perf-stat.ps.cache-misses
13118781 -8.5% 12003520 ± 5% perf-stat.ps.cache-references
1.414e+11 -2.8% 1.375e+11 perf-stat.ps.dTLB-loads
13781 -8.8% 12564 ± 3% perf-stat.ps.dTLB-store-misses
1.019e+11 -2.8% 9.903e+10 perf-stat.ps.dTLB-stores
2.899e+08 -12.0% 2.551e+08 ± 2% perf-stat.ps.iTLB-load-misses
5.429e+11 -2.8% 5.275e+11 perf-stat.ps.instructions
3286 -1.7% 3230 perf-stat.ps.minor-faults
31502 ± 11% -21.5% 24724 ± 15% perf-stat.ps.node-loads
3286 -1.7% 3230 perf-stat.ps.page-faults
1.623e+14 -3.3% 1.57e+14 perf-stat.total.instructions
25.77 -0.7 25.07 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.poll
3.34 -0.3 3.08 perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
18.82 -0.2 18.60 perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
98.24 -0.2 98.08 perf-profile.calltrace.cycles-pp.poll
4.33 -0.2 4.17 perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
4.96 -0.1 4.86 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.poll
3.32 -0.1 3.21 perf-profile.calltrace.cycles-pp.___might_sleep.__might_fault._copy_from_user.do_sys_poll.__x64_sys_poll
1.63 +0.2 1.79 perf-profile.calltrace.cycles-pp.testcase
6.54 +0.2 6.72 perf-profile.calltrace.cycles-pp.__might_fault._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
1.85 +0.2 2.05 perf-profile.calltrace.cycles-pp.__might_sleep.__might_fault._copy_from_user.do_sys_poll.__x64_sys_poll
1.41 +0.3 1.74 ± 2% perf-profile.calltrace.cycles-pp.check_stack_object.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64
3.44 +0.5 3.91 perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.12 +0.6 1.68 perf-profile.calltrace.cycles-pp.poll_freewait.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
64.34 +0.7 65.03 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.poll
60.51 +0.7 61.25 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
49.87 +0.8 50.68 perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
56.87 +1.0 57.84 perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
17.21 -0.4 16.78 perf-profile.children.cycles-pp.entry_SYSCALL_64
13.72 -0.4 13.32 perf-profile.children.cycles-pp.syscall_return_via_sysret
3.67 -0.3 3.38 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
5.37 -0.3 5.10 perf-profile.children.cycles-pp.copy_user_generic_unrolled
18.32 -0.1 18.20 perf-profile.children.cycles-pp._copy_from_user
3.53 -0.1 3.42 perf-profile.children.cycles-pp.___might_sleep
98.40 -0.1 98.31 perf-profile.children.cycles-pp.poll
1.82 +0.1 1.96 perf-profile.children.cycles-pp.testcase
6.96 +0.2 7.15 perf-profile.children.cycles-pp.__might_fault
1.96 +0.2 2.17 perf-profile.children.cycles-pp.__might_sleep
1.55 +0.3 1.82 perf-profile.children.cycles-pp.check_stack_object
3.79 +0.6 4.40 perf-profile.children.cycles-pp.__check_object_size
1.13 +0.6 1.76 perf-profile.children.cycles-pp.poll_freewait
64.46 +0.7 65.13 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
60.89 +0.7 61.62 perf-profile.children.cycles-pp.do_syscall_64
56.65 +0.9 57.52 perf-profile.children.cycles-pp.__x64_sys_poll
51.27 +1.1 52.38 perf-profile.children.cycles-pp.do_sys_poll
13.70 -0.4 13.30 perf-profile.self.cycles-pp.syscall_return_via_sysret
4.78 -0.3 4.45 perf-profile.self.cycles-pp.__x64_sys_poll
3.40 -0.3 3.15 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
5.01 -0.2 4.78 perf-profile.self.cycles-pp.copy_user_generic_unrolled
11.45 -0.2 11.25 perf-profile.self.cycles-pp.poll
8.55 -0.2 8.39 perf-profile.self.cycles-pp.entry_SYSCALL_64
3.96 -0.1 3.85 perf-profile.self.cycles-pp.do_syscall_64
3.40 -0.1 3.28 perf-profile.self.cycles-pp.___might_sleep
3.53 -0.1 3.47 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.73 -0.0 0.71 perf-profile.self.cycles-pp.poll_select_set_timeout
2.78 +0.1 2.84 perf-profile.self.cycles-pp._copy_from_user
1.58 +0.1 1.66 perf-profile.self.cycles-pp.__might_fault
1.56 +0.1 1.65 perf-profile.self.cycles-pp.testcase
1.41 +0.2 1.57 perf-profile.self.cycles-pp.check_stack_object
1.73 +0.2 1.94 perf-profile.self.cycles-pp.__might_sleep
2.30 +0.3 2.61 perf-profile.self.cycles-pp.__check_object_size
23.36 +0.3 23.70 perf-profile.self.cycles-pp.do_sys_poll
0.92 +0.6 1.53 perf-profile.self.cycles-pp.poll_freewait



will-it-scale.per_process_ops

[ASCII trend plot per run: bisect-good (*) samples cluster around 5e+06 ops, bisect-bad (O) samples sit slightly below; two of the O samples drop to ~0]

will-it-scale.workload

[ASCII trend plot per run: bisect-good (*) samples cluster around 9.6e+08, bisect-bad (O) samples sit slightly below; two of the O samples drop to ~0]


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
config-5.1.0-rc7-00027-gf2c65fb3221adc (192.28 kB)
job-script (7.33 kB)
job.yaml (4.92 kB)
reproduce (320.00 B)

2019-07-18 18:37:55

by Nadav Amit

Subject: Re: [x86/modules] f2c65fb322: will-it-scale.per_process_ops -2.9% regression

> On Jul 18, 2019, at 2:50 AM, kernel test robot <[email protected]> wrote:
>
> Greetings,
>
> FYI, we noticed a -2.9% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: f2c65fb3221adc6b73b0549fc7ba892022db9797 ("x86/modules: Avoid breaking W^X while loading modules")
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory
> with the following parameters:
>
> nr_task: 100%
> mode: process
> test: poll1
> cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale

I don’t understand how this patch has any impact on this workload.

I ran it and set a function tracer for any function that is impacted by this
patch:

# cd /sys/kernel/debug/tracing
# echo text_poke_early > set_ftrace_filter
# echo module_alloc >> set_ftrace_filter
# echo bpf_int_jit_compile >> set_ftrace_filter
# tail -f trace

Nothing came up. Can you please check if you see any of them invoked on your
setup? Perhaps you have some bpf filters being installed, although even then
this is a one-time (small) overhead for each process invocation.
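
For reference, one quick way to check whether any BPF programs are present on the test machine (and would therefore go through bpf_int_jit_compile), assuming bpftool is available in the test image:

# bpftool prog show
# cat /proc/sys/net/core/bpf_jit_enable

If bpftool prog show prints nothing and no modules are loaded during the run, none of the paths touched by this commit should be exercised by the benchmark itself.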

2019-07-19 17:35:26

by Dave Hansen

Subject: Re: [x86/modules] f2c65fb322: will-it-scale.per_process_ops -2.9% regression

On 7/18/19 10:59 AM, Nadav Amit wrote:
> I don’t understand how this patch has any impact on this workload.
>
> I ran it and set a function tracer for any function that is impacted by this
> patch:
>
> # cd /sys/kernel/debug/tracing
> # echo text_poke_early > set_ftrace_filter
> # echo module_alloc >> set_ftrace_filter
> # echo bpf_int_jit_compile >> set_ftrace_filter
> # tail -f trace
>
> Nothing came up. Can you please check if you see any of them invoked on your
> setup? Perhaps you have some bpf filters being installed, although even then
> this is a one-time (small) overhead for each process invocation.

I think the direct map's structure changed. I noticed the following in the
0day data:

> 7298e24f904224fa f2c65fb3221adc6b73b0549fc7b
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :4 50% 2:4 dmesg.WARNING:at#for_ip_interrupt_entry/0x
> %stddev %change %stddev
> \ | \
> 10541056 ± 7% +12.1% 11820032 meminfo.DirectMap2M
> 2.909e+08 -12.0% 2.56e+08 ± 2% perf-stat.i.iTLB-load-misses
> 1876 +10.4% 2071 ± 2% perf-stat.i.instructions-per-iTLB-miss
> 1872 +10.4% 2068 ± 2% perf-stat.overall.instructions-per-iTLB-miss
> 2.899e+08 -12.0% 2.551e+08 ± 2% perf-stat.ps.iTLB-load-misses

So something with the direct map changed, as did iTLB miss behavior.
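
For what it's worth, a quick way to compare the two kernels on this point would be to look at the direct-map breakdown and the iTLB counters directly while the benchmark runs (a sketch; the generic event names below depend on the perf build and CPU):

grep DirectMap /proc/meminfo                          # more DirectMap4k / less DirectMap2M means large direct-map pages were split
perf stat -a -e iTLB-loads,iTLB-load-misses sleep 10  # system-wide iTLB counters over a 10s window

The 0day data above shows DirectMap2M going up and iTLB-load-misses going down on the patched kernel, which is the change in direct-map and iTLB behavior being pointed at here.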