2018-02-25 15:08:04

by kernel test robot

Subject: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression


Greetings,

FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:


commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: aim7
on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
with following parameters:

disk: 4BRD_12G
md: RAID0
fs: xfs
test: disk_src
load: 3000
cpufreq_governor: performance

test-description: AIM7 is a traditional UNIX system-level benchmark suite used to test and measure the performance of multiuser systems.
test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/



Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7

commit:
3da90b159b (" f2fs-for-4.16-rc1")
c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")

3da90b159b146672 c0cef30e4ff0dc025f4a1660b8
---------------- --------------------------
%stddev %change %stddev
\ | \
     40183           -18.0%      32964        aim7.jobs-per-min
    448.60           +21.9%     546.68        aim7.time.elapsed_time
    448.60           +21.9%     546.68        aim7.time.elapsed_time.max
      5615 ±  5%     +33.4%       7489 ±  4%  aim7.time.involuntary_context_switches
      3086           +14.0%       3518        aim7.time.system_time
  19439782            -5.6%   18359474        aim7.time.voluntary_context_switches
    199333           +14.3%     227794 ±  2%  interrupts.CAL:Function_call_interrupts
      0.59             -0.1       0.50        mpstat.cpu.usr%
   2839401           +16.0%    3293688        softirqs.SCHED
   7600068           +15.1%    8747820        softirqs.TIMER
    118.00 ± 43%     +98.7%     234.50 ± 15%  vmstat.io.bo
     87840           -22.4%      68154        vmstat.system.cs
    552798 ±  6%     +15.8%     640107 ±  4%  numa-numastat.node0.local_node
    557345 ±  6%     +15.7%     644666 ±  4%  numa-numastat.node0.numa_hit
    528341 ±  7%     +21.7%     642933 ±  4%  numa-numastat.node1.local_node
    531604 ±  7%     +21.6%     646209 ±  4%  numa-numastat.node1.numa_hit
 2.147e+09           -12.4%   1.88e+09        cpuidle.C1.time
  13702041           -14.7%   11683737        cpuidle.C1.usage
 2.082e+08 ±  4%     +28.1%  2.667e+08 ±  5%  cpuidle.C1E.time
 4.719e+08 ±  2%     +23.1%  5.807e+08 ±  4%  cpuidle.C3.time
 1.141e+10           +31.0%  1.496e+10        cpuidle.C6.time
  15672622           +27.8%   20031028        cpuidle.C6.usage
  13520572 ±  3%     +29.5%   17514398 ±  9%  cpuidle.POLL.time
    278.25 ±  5%     -46.0%     150.25 ± 73%  numa-vmstat.node0.nr_dirtied
      3200 ± 14%     -20.6%       2542 ± 19%  numa-vmstat.node0.nr_mapped
    277.75 ±  5%     -46.2%     149.50 ± 73%  numa-vmstat.node0.nr_written
     28.50 ± 52%    +448.2%     156.25 ± 70%  numa-vmstat.node1.nr_dirtied
      2577 ± 19%     +26.3%       3255 ± 15%  numa-vmstat.node1.nr_mapped
    634338 ±  4%      +7.8%     683959 ±  4%  numa-vmstat.node1.numa_hit
    457411 ±  6%     +10.8%     506800 ±  5%  numa-vmstat.node1.numa_local
      3734 ±  8%     -11.5%       3306 ±  6%  proc-vmstat.numa_hint_faults_local
   1114538           +18.3%    1318978        proc-vmstat.numa_hit
   1106722           +18.5%    1311136        proc-vmstat.numa_local
     22100            +7.5%      23753 ±  4%  proc-vmstat.numa_pages_migrated
   1174556           +18.0%    1386359        proc-vmstat.pgalloc_normal
   1241445           +18.1%    1466086        proc-vmstat.pgfault
   1138310           +19.3%    1358132        proc-vmstat.pgfree
     22100            +7.5%      23753 ±  4%  proc-vmstat.pgmigrate_success
     53332 ± 43%    +143.0%     129617 ± 14%  proc-vmstat.pgpgout
      1.42 ±  2%      +1.7        3.07        perf-stat.branch-miss-rate%
 1.064e+10          +123.3%  2.375e+10        perf-stat.branch-misses
     10.79            +0.6       11.43        perf-stat.cache-miss-rate%
 5.583e+09            +5.9%  5.915e+09        perf-stat.cache-misses
  39652092            -5.0%   37662545        perf-stat.context-switches
      1.29           +11.7%       1.44        perf-stat.cpi
 4.637e+12           +12.8%   5.23e+12        perf-stat.cpu-cycles
 8.653e+11            +9.8%  9.498e+11 ±  2%  perf-stat.dTLB-loads
 3.654e+11           +12.4%  4.109e+11        perf-stat.dTLB-stores
      0.78           -10.5%       0.70        perf-stat.ipc
   1214932           +17.9%    1432266        perf-stat.minor-faults
 1.334e+09            -1.8%   1.31e+09        perf-stat.node-store-misses
 1.651e+09            -1.8%   1.62e+09        perf-stat.node-stores
   1214954           +17.9%    1432313        perf-stat.page-faults
    256.75          -100.0%       0.00        turbostat.Avg_MHz
     21.39           -21.4        0.00        turbostat.Busy%
      1200          -100.0%       0.00        turbostat.Bzy_MHz
  13695007          -100.0%       0.00        turbostat.C1
     11.92           -11.9        0.00        turbostat.C1%
   2116683 ±  2%    -100.0%       0.00        turbostat.C1E
      1.16 ±  4%      -1.2        0.00        turbostat.C1E%
   3112269          -100.0%       0.00        turbostat.C3
      2.62 ±  2%      -2.6        0.00        turbostat.C3%
  15671277          -100.0%       0.00        turbostat.C6
     63.38           -63.4        0.00        turbostat.C6%
     49.46          -100.0%       0.00        turbostat.CPU%c1
      1.42 ±  2%    -100.0%       0.00        turbostat.CPU%c3
     27.73          -100.0%       0.00        turbostat.CPU%c6
     31.41          -100.0%       0.00        turbostat.CorWatt
     63.25          -100.0%       0.00        turbostat.CoreTmp
  18919351          -100.0%       0.00        turbostat.IRQ
      1.21 ± 18%    -100.0%       0.00        turbostat.Pkg%pc2
      0.67 ± 31%    -100.0%       0.00        turbostat.Pkg%pc6
     63.25          -100.0%       0.00        turbostat.PkgTmp
     57.63          -100.0%       0.00        turbostat.PkgWatt
     30.73          -100.0%       0.00        turbostat.RAMWatt
     36030          -100.0%       0.00        turbostat.SMI
      3000          -100.0%       0.00        turbostat.TSC_MHz


aim7.jobs-per-min

41000 +-+-----------------------------------------------------------------+
| ..+....+.... ..+....+....+.... ..+....+....+...+.... |
40000 +-+ +. +.. + |
39000 +-+ |
| |
38000 +-+ |
37000 +-+ |
| |
36000 +-+ |
35000 +-+ |
| |
34000 +-+ |
33000 +-+ O O |
O O O O O O O O O O O O O
32000 +-+-----------------------------------------------------------------+



[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Xiaolong


Attachments:
(No filename) (8.93 kB)
config-4.15.0-02342-gc0cef30 (167.39 kB)
job-script (7.67 kB)
job.yaml (5.36 kB)
reproduce (1.07 kB)

2018-02-25 15:42:43

by Jeff Layton

Subject: Re: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>
>
> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: aim7
> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
> with following parameters:
>
> disk: 4BRD_12G
> md: RAID0
> fs: xfs
> test: disk_src
> load: 3000
> cpufreq_governor: performance
>
> test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>
>

I'm a bit suspicious of this result.

This patch only changes inode_cmp_iversion{+raw} (since renamed to
inode_eq_iversion{+raw}), and neither of those should ever be called from
xfs. The patch is fairly trivial too, and I wouldn't expect a big
performance hit.
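
For reference, the change under discussion is roughly this shape. This is a
standalone userspace model, not the actual kernel code; cmp_iversion and
eq_iversion here are simplified stand-ins for the inode helpers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Before the patch: the helper returned a signed 64-bit difference,
 * even though callers only ever tested the result against zero. */
static int64_t cmp_iversion(uint64_t cur, uint64_t old)
{
    return (int64_t)(cur - old);
}

/* After the patch: the helper answers the only question callers
 * actually ask -- "are these versions equal?" -- as a bool. */
static bool eq_iversion(uint64_t cur, uint64_t old)
{
    return cur == old;
}
```

Both forms agree on "has the version changed?"; the bool form just lets the
compiler emit a plain compare-and-branch, which is why a large performance
swing from this change alone would be surprising.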

Is IMA involved here at all? I didn't see any evidence of it, but the
kernel config did have it enabled.


> [...]
--
Jeff Layton <[email protected]>

2018-02-25 21:20:08

by Linus Torvalds

Subject: Re: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Sun, Feb 25, 2018 at 7:05 AM, kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>
> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")

That looks very unlikely. Can you double-check that? Because it
shouldn't even change anything apart from trivial compiler instruction
selection.

Linus

2018-02-26 08:41:01

by kernel test robot

Subject: Re: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On 02/25, Jeff Layton wrote:
>On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>>
>>
>> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: aim7
>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
>> with following parameters:
>>
>> disk: 4BRD_12G
>> md: RAID0
>> fs: xfs
>> test: disk_src
>> load: 3000
>> cpufreq_governor: performance
>>
>> test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>
>>
>
>I'm a bit suspicious of this result.
>
>This patch only changes inode_cmp_iversion{+raw} (since renamed to
>inode_eq_iversion{+raw}), and that neither should ever be called from
>xfs. The patch is fairly trivial too, and I wouldn't expect a big
>performance hit.

I queued the test 4 more times for both commit c0cef30e4f and its parent;
the results seem quite stable.

c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
"aim7.jobs-per-min": [
32964.01,
32938.68,
33068.18,
32886.32,
32843.72,
32798.83,
32898.34,
32952.55
],

3da90b159b146672f830bcd2489dd3a1f4e9e089:
"aim7.jobs-per-min": [
40239.65,
40163.33,
40353.32,
39976.9,
40185.75,
40411.3,
40213.58,
39900.69
],

Any other test data you may need?

>
>Is IMA involved here at all? I didn't see any evidence of it, but the
>kernel config did have it enabled.
>

Sorry, I'm not quite familiar with IMA; could you tell me more about how to check it?

Thanks,
Xiaolong
>> [...]

2018-02-26 11:45:25

by Jeff Layton

Subject: Re: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Mon, 2018-02-26 at 16:38 +0800, Ye Xiaolong wrote:
> On 02/25, Jeff Layton wrote:
> > On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
> > >
> > >
> > > commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: aim7
> > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
> > > with following parameters:
> > >
> > > disk: 4BRD_12G
> > > md: RAID0
> > > fs: xfs
> > > test: disk_src
> > > load: 3000
> > > cpufreq_governor: performance
> > >
> > > test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
> > > test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
> > >
> > >
> >
> > I'm a bit suspicious of this result.
> >
> > This patch only changes inode_cmp_iversion{+raw} (since renamed to
> > inode_eq_iversion{+raw}), and that neither should ever be called from
> > xfs. The patch is fairly trivial too, and I wouldn't expect a big
> > performance hit.
>
> I tried to queue 4 more times test for both commit c0cef30e4f and its parent,
> the result seems quite stable.
>
> c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
> "aim7.jobs-per-min": [
> 32964.01,
> 32938.68,
> 33068.18,
> 32886.32,
> 32843.72,
> 32798.83,
> 32898.34,
> 32952.55
> ],
>
> 3da90b159b146672f830bcd2489dd3a1f4e9e089:
> "aim7.jobs-per-min": [
> 40239.65,
> 40163.33,
> 40353.32,
> 39976.9,
> 40185.75,
> 40411.3,
> 40213.58,
> 39900.69
> ],
>
> Any other test data you may need?
>
> >
> > Is IMA involved here at all? I didn't see any evidence of it, but the
> > kernel config did have it enabled.
> >
>
> Sorry, not quite familiar with IMA, could you tell more about how to check it?
>

Thanks for retesting it, but I'm at a loss as to why we're seeing this:

IMA is the integrity measurement subsystem. It uses the i_version
field to decide whether files need to be remeasured. It looks like the
kernel config has it enabled, but it doesn't look like it's in use,
based on the info in the initial report.

This patch only affects two inlined functions, inode_cmp_iversion and
inode_cmp_iversion_raw. The patch is pretty trivial (as Linus points
out). These functions are only called from IMA and fs-specific code
(usually in readdir implementations to detect directory changes).

XFS does not call either of these functions however, so I'm a little
unclear on how this patch could slow anything down on this test. The
only thing I can think to do here would be to profile this and see what
stands out.
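
The readdir-style caller pattern mentioned above looks roughly like this.
This is a hypothetical userspace sketch, not kernel API; names such as
dir_cache_still_valid are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached-directory state; cached_version plays the role
 * of the i_version value sampled when the cache was last filled. */
struct dir_cache {
    uint64_t cached_version;
    bool     valid;
};

/* Analogue of an inode_eq_iversion()-style check: the cache is usable
 * only if the directory's version has not moved since we filled it. */
static bool dir_cache_still_valid(const struct dir_cache *c, uint64_t cur_version)
{
    return c->valid && c->cached_version == cur_version;
}

/* Record the version seen when (re)filling the cache. */
static void dir_cache_refresh(struct dir_cache *c, uint64_t cur_version)
{
    c->cached_version = cur_version;
    c->valid = true;
}
```

Since XFS doesn't use this pattern through these helpers, a profile is the
obvious next step to see where the extra cycles are actually going.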

Note that we do need to keep this in perspective too. This 18%
regression on this test follows a ~230% improvement that occurred
when we merged the bulk of these patches. It should still be quite a
bit faster than v4.15 in this regard.

Still, it'd be good to understand what's going on here.


> Thanks,
> Xiaolong
> > [...]
--
Jeff Layton <[email protected]>

2018-02-26 12:36:24

by Jeff Layton

Subject: Re: [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Mon, 2018-02-26 at 06:43 -0500, Jeff Layton wrote:
> On Mon, 2018-02-26 at 16:38 +0800, Ye Xiaolong wrote:
> > On 02/25, Jeff Layton wrote:
> > > On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
> > > >
> > > >
> > > > commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > in testcase: aim7
> > > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
> > > > with following parameters:
> > > >
> > > > disk: 4BRD_12G
> > > > md: RAID0
> > > > fs: xfs
> > > > test: disk_src
> > > > load: 3000
> > > > cpufreq_governor: performance
> > > >
> > > > test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
> > > > test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
> > > >
> > > >
> > >
> > > I'm a bit suspicious of this result.
> > >
> > > This patch only changes inode_cmp_iversion{+raw} (since renamed to
> > > inode_eq_iversion{+raw}), and that neither should ever be called from
> > > xfs. The patch is fairly trivial too, and I wouldn't expect a big
> > > performance hit.
> >
> > I tried to queue 4 more times test for both commit c0cef30e4f and its parent,
> > the result seems quite stable.
> >
> > c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
> > "aim7.jobs-per-min": [
> > 32964.01,
> > 32938.68,
> > 33068.18,
> > 32886.32,
> > 32843.72,
> > 32798.83,
> > 32898.34,
> > 32952.55
> > ],
> >
> > 3da90b159b146672f830bcd2489dd3a1f4e9e089:
> > "aim7.jobs-per-min": [
> > 40239.65,
> > 40163.33,
> > 40353.32,
> > 39976.9,
> > 40185.75,
> > 40411.3,
> > 40213.58,
> > 39900.69
> > ],
> >
> > Any other test data you may need?
> >
> > >
> > > Is IMA involved here at all? I didn't see any evidence of it, but the
> > > kernel config did have it enabled.
> > >
> >
> > Sorry, not quite familiar with IMA, could you tell more about how to check it?
> >
>
> Thanks for retesting it, but I'm at a loss for why we're seeing this:
>
> IMA is the integrity management subsystem. It will use the iversion
> field to determine whether files need to be remeasured. It
> looks like the kernel config has it enabled, but it doesn't look like
> it's in use, based on the info in the initial report.
>
> This patch only affects two inlined functions inode_cmp_iversion and
> inode_cmp_iversion_raw. The patch is pretty trivial (as Linus points
> out). These functions are only called from IMA and fs-specific code
> (usually in readdir implementations to detect directory changes).
>
> XFS does not call either of these functions however, so I'm a little
> unclear on how this patch could slow anything down on this test. The
> only thing I can think to do here would be to profile this and see what
> stands out.
>
> Note that we do need to keep this in perspective too. This 18%
> regression on this test follows around a ~230% improvement that occurred
> when we merged the bulk of these patches. It should still be quite a
> bit faster than v4.15 in this regard.
>
> Still, it'd be good to understand what's going on here.
>
>

Could we see the dmesg from this boot? It'd be good to confirm that IMA
is not involved here, as that's the only place I can see that would
call into this code at all.

Thanks,
Jeff


> > Thanks,
> > Xiaolong
> > >
> > > >
> > > > Details are as below:
> > > > -------------------------------------------------------------------------------------------------->
> > > >
> > > >
> > > > To reproduce:
> > > >
> > > > git clone https://github.com/intel/lkp-tests.git
> > > > cd lkp-tests
> > > > bin/lkp install job.yaml # job file is attached in this email
> > > > bin/lkp run job.yaml
> > > >
> > > > =========================================================================================
> > > > compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
> > > > gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7
> > > >
> > > > commit:
> > > > 3da90b159b (" f2fs-for-4.16-rc1")
> > > > c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
> > > >
> > > > 3da90b159b146672 c0cef30e4ff0dc025f4a1660b8
> > > > ---------------- --------------------------
> > > > %stddev %change %stddev
> > > > \ | \
> > > > 40183 -18.0% 32964 aim7.jobs-per-min
> > > > 448.60 +21.9% 546.68 aim7.time.elapsed_time
> > > > 448.60 +21.9% 546.68 aim7.time.elapsed_time.max
> > > > 5615 ± 5% +33.4% 7489 ± 4% aim7.time.involuntary_context_switches
> > > > 3086 +14.0% 3518 aim7.time.system_time
> > > > 19439782 -5.6% 18359474 aim7.time.voluntary_context_switches
> > > > 199333 +14.3% 227794 ± 2% interrupts.CAL:Function_call_interrupts
> > > > 0.59 -0.1 0.50 mpstat.cpu.usr%
> > > > 2839401 +16.0% 3293688 softirqs.SCHED
> > > > 7600068 +15.1% 8747820 softirqs.TIMER
> > > > 118.00 ± 43% +98.7% 234.50 ± 15% vmstat.io.bo
> > > > 87840 -22.4% 68154 vmstat.system.cs
> > > > 552798 ± 6% +15.8% 640107 ± 4% numa-numastat.node0.local_node
> > > > 557345 ± 6% +15.7% 644666 ± 4% numa-numastat.node0.numa_hit
> > > > 528341 ± 7% +21.7% 642933 ± 4% numa-numastat.node1.local_node
> > > > 531604 ± 7% +21.6% 646209 ± 4% numa-numastat.node1.numa_hit
> > > > 2.147e+09 -12.4% 1.88e+09 cpuidle.C1.time
> > > > 13702041 -14.7% 11683737 cpuidle.C1.usage
> > > > 2.082e+08 ± 4% +28.1% 2.667e+08 ± 5% cpuidle.C1E.time
> > > > 4.719e+08 ± 2% +23.1% 5.807e+08 ± 4% cpuidle.C3.time
> > > > 1.141e+10 +31.0% 1.496e+10 cpuidle.C6.time
> > > > 15672622 +27.8% 20031028 cpuidle.C6.usage
> > > > 13520572 ± 3% +29.5% 17514398 ± 9% cpuidle.POLL.time
> > > > 278.25 ± 5% -46.0% 150.25 ± 73% numa-vmstat.node0.nr_dirtied
> > > > 3200 ± 14% -20.6% 2542 ± 19% numa-vmstat.node0.nr_mapped
> > > > 277.75 ± 5% -46.2% 149.50 ± 73% numa-vmstat.node0.nr_written
> > > > 28.50 ± 52% +448.2% 156.25 ± 70% numa-vmstat.node1.nr_dirtied
> > > > 2577 ± 19% +26.3% 3255 ± 15% numa-vmstat.node1.nr_mapped
> > > > 634338 ± 4% +7.8% 683959 ± 4% numa-vmstat.node1.numa_hit
> > > > 457411 ± 6% +10.8% 506800 ± 5% numa-vmstat.node1.numa_local
> > > > 3734 ± 8% -11.5% 3306 ± 6% proc-vmstat.numa_hint_faults_local
> > > > 1114538 +18.3% 1318978 proc-vmstat.numa_hit
> > > > 1106722 +18.5% 1311136 proc-vmstat.numa_local
> > > > 22100 +7.5% 23753 ± 4% proc-vmstat.numa_pages_migrated
> > > > 1174556 +18.0% 1386359 proc-vmstat.pgalloc_normal
> > > > 1241445 +18.1% 1466086 proc-vmstat.pgfault
> > > > 1138310 +19.3% 1358132 proc-vmstat.pgfree
> > > > 22100 +7.5% 23753 ± 4% proc-vmstat.pgmigrate_success
> > > > 53332 ± 43% +143.0% 129617 ± 14% proc-vmstat.pgpgout
> > > > 1.42 ± 2% +1.7 3.07 perf-stat.branch-miss-rate%
> > > > 1.064e+10 +123.3% 2.375e+10 perf-stat.branch-misses
> > > > 10.79 +0.6 11.43 perf-stat.cache-miss-rate%
> > > > 5.583e+09 +5.9% 5.915e+09 perf-stat.cache-misses
> > > > 39652092 -5.0% 37662545 perf-stat.context-switches
> > > > 1.29 +11.7% 1.44 perf-stat.cpi
> > > > 4.637e+12 +12.8% 5.23e+12 perf-stat.cpu-cycles
> > > > 8.653e+11 +9.8% 9.498e+11 ± 2% perf-stat.dTLB-loads
> > > > 3.654e+11 +12.4% 4.109e+11 perf-stat.dTLB-stores
> > > > 0.78 -10.5% 0.70 perf-stat.ipc
> > > > 1214932 +17.9% 1432266 perf-stat.minor-faults
> > > > 1.334e+09 -1.8% 1.31e+09 perf-stat.node-store-misses
> > > > 1.651e+09 -1.8% 1.62e+09 perf-stat.node-stores
> > > > 1214954 +17.9% 1432313 perf-stat.page-faults
> > > > 256.75 -100.0% 0.00 turbostat.Avg_MHz
> > > > 21.39 -21.4 0.00 turbostat.Busy%
> > > > 1200 -100.0% 0.00 turbostat.Bzy_MHz
> > > > 13695007 -100.0% 0.00 turbostat.C1
> > > > 11.92 -11.9 0.00 turbostat.C1%
> > > > 2116683 ± 2% -100.0% 0.00 turbostat.C1E
> > > > 1.16 ± 4% -1.2 0.00 turbostat.C1E%
> > > > 3112269 -100.0% 0.00 turbostat.C3
> > > > 2.62 ± 2% -2.6 0.00 turbostat.C3%
> > > > 15671277 -100.0% 0.00 turbostat.C6
> > > > 63.38 -63.4 0.00 turbostat.C6%
> > > > 49.46 -100.0% 0.00 turbostat.CPU%c1
> > > > 1.42 ± 2% -100.0% 0.00 turbostat.CPU%c3
> > > > 27.73 -100.0% 0.00 turbostat.CPU%c6
> > > > 31.41 -100.0% 0.00 turbostat.CorWatt
> > > > 63.25 -100.0% 0.00 turbostat.CoreTmp
> > > > 18919351 -100.0% 0.00 turbostat.IRQ
> > > > 1.21 ± 18% -100.0% 0.00 turbostat.Pkg%pc2
> > > > 0.67 ± 31% -100.0% 0.00 turbostat.Pkg%pc6
> > > > 63.25 -100.0% 0.00 turbostat.PkgTmp
> > > > 57.63 -100.0% 0.00 turbostat.PkgWatt
> > > > 30.73 -100.0% 0.00 turbostat.RAMWatt
> > > > 36030 -100.0% 0.00 turbostat.SMI
> > > > 3000 -100.0% 0.00 turbostat.TSC_MHz
> > > >
> > > >
> > > > aim7.jobs-per-min
> > > >
> > > > 41000 +-+-----------------------------------------------------------------+
> > > > | ..+....+.... ..+....+....+.... ..+....+....+...+.... |
> > > > 40000 +-+ +. +.. + |
> > > > 39000 +-+ |
> > > > | |
> > > > 38000 +-+ |
> > > > 37000 +-+ |
> > > > | |
> > > > 36000 +-+ |
> > > > 35000 +-+ |
> > > > | |
> > > > 34000 +-+ |
> > > > 33000 +-+ O O |
> > > > O O O O O O O O O O O O O
> > > > 32000 +-+-----------------------------------------------------------------+
> > > >
> > > >
> > > >
> > > > [*] bisect-good sample
> > > > [O] bisect-bad sample
> > > >
> > > >
> > > >
> > > > Disclaimer:
> > > > Results have been estimated based on internal Intel analysis and are provided
> > > > for informational purposes only. Any difference in system hardware or software
> > > > design or configuration may affect actual performance.
> > > >
> > > >
> > > > Thanks,
> > > > Xiaolong
> > >
> > > --
> > > Jeff Layton <[email protected]>
>
>

--
Jeff Layton <[email protected]>

2018-02-27 07:48:12

by Kemi Wang

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression



On 2018-02-26 20:33, Jeff Layton wrote:
> On Mon, 2018-02-26 at 06:43 -0500, Jeff Layton wrote:
>> On Mon, 2018-02-26 at 16:38 +0800, Ye Xiaolong wrote:
>>> On 02/25, Jeff Layton wrote:
>>>> On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
>>>>> Greeting,
>>>>>
>>>>> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>>>>>
>>>>>
>>>>> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> in testcase: aim7
>>>>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
>>>>> with following parameters:
>>>>>
>>>>> disk: 4BRD_12G
>>>>> md: RAID0
>>>>> fs: xfs
>>>>> test: disk_src
>>>>> load: 3000
>>>>> cpufreq_governor: performance
>>>>>
>>>>> test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
>>>>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>>>>
>>>>>
>>>>
>>>> I'm a bit suspicious of this result.
>>>>
>>>> This patch only changes inode_cmp_iversion{+raw} (since renamed to
>>>> inode_eq_iversion{+raw}), and neither should ever be called from
>>>> xfs. The patch is fairly trivial too, and I wouldn't expect a big
>>>> performance hit.
>>>
>>> I tried to queue the test 4 more times for both commit c0cef30e4f and its parent;
>>> the results seem quite stable.
>>>
>>> c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
>>> "aim7.jobs-per-min": [
>>> 32964.01,
>>> 32938.68,
>>> 33068.18,
>>> 32886.32,
>>> 32843.72,
>>> 32798.83,
>>> 32898.34,
>>> 32952.55
>>> ],
>>>
>>> 3da90b159b146672f830bcd2489dd3a1f4e9e089:
>>> "aim7.jobs-per-min": [
>>> 40239.65,
>>> 40163.33,
>>> 40353.32,
>>> 39976.9,
>>> 40185.75,
>>> 40411.3,
>>> 40213.58,
>>> 39900.69
>>> ],
>>>
>>> Any other test data you may need?
>>>
>>>>
>>>> Is IMA involved here at all? I didn't see any evidence of it, but the
>>>> kernel config did have it enabled.
>>>>
>>>
>>> Sorry, I'm not quite familiar with IMA; could you tell me more about how to check it?
>>>
>>
>> Thanks for retesting it, but I'm at a loss for why we're seeing this:
>>
>> IMA is the integrity management subsystem. It will use the iversion
>> field to determine whether files need to be remeasured. It
>> looks like the kernel config has it enabled, but it doesn't look like
>> it's in use, based on the info in the initial report.
>>
>> This patch only affects two inlined functions inode_cmp_iversion and
>> inode_cmp_iversion_raw. The patch is pretty trivial (as Linus points
>> out). These functions are only called from IMA and fs-specific code
>> (usually in readdir implementations to detect directory changes).
>>
>> XFS does not call either of these functions however, so I'm a little
>> unclear on how this patch could slow anything down on this test. The
>> only thing I can think to do here would be to profile this and see what
>> stands out.
>>
>> Note that we do need to keep this in perspective too. This 18%
>> regression on this test follows around a ~230% improvement that occurred
>> when we merged the bulk of these patches. It should still be quite a
>> bit faster than v4.15 in this regard.
>>
>> Still, it'd be good to understand what's going on here.
>>
>>
>
> Could we see the dmesg from this boot? It'd be good to confirm that IMA
> is not involved here, as that's the only place I can see that would
> call into this code at all.
>

See attachment for info on dmesg/perf-profile/compare_result.
Feel free to let Xiaolong or me know if there is anything else you would like to check.

> Thanks,
> Jeff
>
>
>>> Thanks,
>>> Xiaolong
>>>>
>>>>>
>>>>> Details are as below:
>>>>> -------------------------------------------------------------------------------------------------->
>>>>>
>>>>>
>>>>> To reproduce:
>>>>>
>>>>> git clone https://github.com/intel/lkp-tests.git
>>>>> cd lkp-tests
>>>>> bin/lkp install job.yaml # job file is attached in this email
>>>>> bin/lkp run job.yaml
>>>>>
>>>>> =========================================================================================
>>>>> compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
>>>>> gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7
>>>>>
>>>>> commit:
>>>>> 3da90b159b (" f2fs-for-4.16-rc1")
>>>>> c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
>>>>>
>>>>> 3da90b159b146672 c0cef30e4ff0dc025f4a1660b8
>>>>> ---------------- --------------------------
>>>>> %stddev %change %stddev
>>>>> \ | \
>>>>> 40183 -18.0% 32964 aim7.jobs-per-min
>>>>> 448.60 +21.9% 546.68 aim7.time.elapsed_time
>>>>> 448.60 +21.9% 546.68 aim7.time.elapsed_time.max
>>>>> 5615 ± 5% +33.4% 7489 ± 4% aim7.time.involuntary_context_switches
>>>>> 3086 +14.0% 3518 aim7.time.system_time
>>>>> 19439782 -5.6% 18359474 aim7.time.voluntary_context_switches
>>>>> 199333 +14.3% 227794 ± 2% interrupts.CAL:Function_call_interrupts
>>>>> 0.59 -0.1 0.50 mpstat.cpu.usr%
>>>>> 2839401 +16.0% 3293688 softirqs.SCHED
>>>>> 7600068 +15.1% 8747820 softirqs.TIMER
>>>>> 118.00 ± 43% +98.7% 234.50 ± 15% vmstat.io.bo
>>>>> 87840 -22.4% 68154 vmstat.system.cs
>>>>> 552798 ± 6% +15.8% 640107 ± 4% numa-numastat.node0.local_node
>>>>> 557345 ± 6% +15.7% 644666 ± 4% numa-numastat.node0.numa_hit
>>>>> 528341 ± 7% +21.7% 642933 ± 4% numa-numastat.node1.local_node
>>>>> 531604 ± 7% +21.6% 646209 ± 4% numa-numastat.node1.numa_hit
>>>>> 2.147e+09 -12.4% 1.88e+09 cpuidle.C1.time
>>>>> 13702041 -14.7% 11683737 cpuidle.C1.usage
>>>>> 2.082e+08 ± 4% +28.1% 2.667e+08 ± 5% cpuidle.C1E.time
>>>>> 4.719e+08 ± 2% +23.1% 5.807e+08 ± 4% cpuidle.C3.time
>>>>> 1.141e+10 +31.0% 1.496e+10 cpuidle.C6.time
>>>>> 15672622 +27.8% 20031028 cpuidle.C6.usage
>>>>> 13520572 ± 3% +29.5% 17514398 ± 9% cpuidle.POLL.time
>>>>> 278.25 ± 5% -46.0% 150.25 ± 73% numa-vmstat.node0.nr_dirtied
>>>>> 3200 ± 14% -20.6% 2542 ± 19% numa-vmstat.node0.nr_mapped
>>>>> 277.75 ± 5% -46.2% 149.50 ± 73% numa-vmstat.node0.nr_written
>>>>> 28.50 ± 52% +448.2% 156.25 ± 70% numa-vmstat.node1.nr_dirtied
>>>>> 2577 ± 19% +26.3% 3255 ± 15% numa-vmstat.node1.nr_mapped
>>>>> 634338 ± 4% +7.8% 683959 ± 4% numa-vmstat.node1.numa_hit
>>>>> 457411 ± 6% +10.8% 506800 ± 5% numa-vmstat.node1.numa_local
>>>>> 3734 ± 8% -11.5% 3306 ± 6% proc-vmstat.numa_hint_faults_local
>>>>> 1114538 +18.3% 1318978 proc-vmstat.numa_hit
>>>>> 1106722 +18.5% 1311136 proc-vmstat.numa_local
>>>>> 22100 +7.5% 23753 ± 4% proc-vmstat.numa_pages_migrated
>>>>> 1174556 +18.0% 1386359 proc-vmstat.pgalloc_normal
>>>>> 1241445 +18.1% 1466086 proc-vmstat.pgfault
>>>>> 1138310 +19.3% 1358132 proc-vmstat.pgfree
>>>>> 22100 +7.5% 23753 ± 4% proc-vmstat.pgmigrate_success
>>>>> 53332 ± 43% +143.0% 129617 ± 14% proc-vmstat.pgpgout
>>>>> 1.42 ± 2% +1.7 3.07 perf-stat.branch-miss-rate%
>>>>> 1.064e+10 +123.3% 2.375e+10 perf-stat.branch-misses
>>>>> 10.79 +0.6 11.43 perf-stat.cache-miss-rate%
>>>>> 5.583e+09 +5.9% 5.915e+09 perf-stat.cache-misses
>>>>> 39652092 -5.0% 37662545 perf-stat.context-switches
>>>>> 1.29 +11.7% 1.44 perf-stat.cpi
>>>>> 4.637e+12 +12.8% 5.23e+12 perf-stat.cpu-cycles
>>>>> 8.653e+11 +9.8% 9.498e+11 ± 2% perf-stat.dTLB-loads
>>>>> 3.654e+11 +12.4% 4.109e+11 perf-stat.dTLB-stores
>>>>> 0.78 -10.5% 0.70 perf-stat.ipc
>>>>> 1214932 +17.9% 1432266 perf-stat.minor-faults
>>>>> 1.334e+09 -1.8% 1.31e+09 perf-stat.node-store-misses
>>>>> 1.651e+09 -1.8% 1.62e+09 perf-stat.node-stores
>>>>> 1214954 +17.9% 1432313 perf-stat.page-faults
>>>>> 256.75 -100.0% 0.00 turbostat.Avg_MHz
>>>>> 21.39 -21.4 0.00 turbostat.Busy%
>>>>> 1200 -100.0% 0.00 turbostat.Bzy_MHz
>>>>> 13695007 -100.0% 0.00 turbostat.C1
>>>>> 11.92 -11.9 0.00 turbostat.C1%
>>>>> 2116683 ± 2% -100.0% 0.00 turbostat.C1E
>>>>> 1.16 ± 4% -1.2 0.00 turbostat.C1E%
>>>>> 3112269 -100.0% 0.00 turbostat.C3
>>>>> 2.62 ± 2% -2.6 0.00 turbostat.C3%
>>>>> 15671277 -100.0% 0.00 turbostat.C6
>>>>> 63.38 -63.4 0.00 turbostat.C6%
>>>>> 49.46 -100.0% 0.00 turbostat.CPU%c1
>>>>> 1.42 ± 2% -100.0% 0.00 turbostat.CPU%c3
>>>>> 27.73 -100.0% 0.00 turbostat.CPU%c6
>>>>> 31.41 -100.0% 0.00 turbostat.CorWatt
>>>>> 63.25 -100.0% 0.00 turbostat.CoreTmp
>>>>> 18919351 -100.0% 0.00 turbostat.IRQ
>>>>> 1.21 ± 18% -100.0% 0.00 turbostat.Pkg%pc2
>>>>> 0.67 ± 31% -100.0% 0.00 turbostat.Pkg%pc6
>>>>> 63.25 -100.0% 0.00 turbostat.PkgTmp
>>>>> 57.63 -100.0% 0.00 turbostat.PkgWatt
>>>>> 30.73 -100.0% 0.00 turbostat.RAMWatt
>>>>> 36030 -100.0% 0.00 turbostat.SMI
>>>>> 3000 -100.0% 0.00 turbostat.TSC_MHz
>>>>>
>>>>>
>>>>> aim7.jobs-per-min
>>>>>
>>>>> 41000 +-+-----------------------------------------------------------------+
>>>>> | ..+....+.... ..+....+....+.... ..+....+....+...+.... |
>>>>> 40000 +-+ +. +.. + |
>>>>> 39000 +-+ |
>>>>> | |
>>>>> 38000 +-+ |
>>>>> 37000 +-+ |
>>>>> | |
>>>>> 36000 +-+ |
>>>>> 35000 +-+ |
>>>>> | |
>>>>> 34000 +-+ |
>>>>> 33000 +-+ O O |
>>>>> O O O O O O O O O O O O O
>>>>> 32000 +-+-----------------------------------------------------------------+
>>>>>
>>>>>
>>>>>
>>>>> [*] bisect-good sample
>>>>> [O] bisect-bad sample
>>>>>
>>>>>
>>>>>
>>>>> Disclaimer:
>>>>> Results have been estimated based on internal Intel analysis and are provided
>>>>> for informational purposes only. Any difference in system hardware or software
>>>>> design or configuration may affect actual performance.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Xiaolong
>>>>
>>>> --
>>>> Jeff Layton <[email protected]>
>>
>>
>


Attachments:
compare_result.txt (22.06 kB)
3da90b159b146672f830bcd2489dd3a1f4e9e089_dmesg.xz (21.53 kB)
3da90b159b146672f830bcd2489dd3a1f4e9e089_perf-profile.gz (13.32 kB)
c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e_dmesg.xz (21.58 kB)
c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e_perf-profile.gz (12.31 kB)

2018-02-27 13:30:30

by Jeff Layton

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Tue, 2018-02-27 at 15:42 +0800, kemi wrote:
>
> On 2018-02-26 20:33, Jeff Layton wrote:
> > On Mon, 2018-02-26 at 06:43 -0500, Jeff Layton wrote:
> > > On Mon, 2018-02-26 at 16:38 +0800, Ye Xiaolong wrote:
> > > > On 02/25, Jeff Layton wrote:
> > > > > On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
> > > > > > Greeting,
> > > > > >
> > > > > > FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
> > > > > >
> > > > > >
> > > > > > commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > >
> > > > > > in testcase: aim7
> > > > > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
> > > > > > with following parameters:
> > > > > >
> > > > > > disk: 4BRD_12G
> > > > > > md: RAID0
> > > > > > fs: xfs
> > > > > > test: disk_src
> > > > > > load: 3000
> > > > > > cpufreq_governor: performance
> > > > > >
> > > > > > test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
> > > > > > test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
> > > > > >
> > > > > >
> > > > >
> > > > > I'm a bit suspicious of this result.
> > > > >
> > > > > This patch only changes inode_cmp_iversion{+raw} (since renamed to
> > > > > inode_eq_iversion{+raw}), and neither should ever be called from
> > > > > xfs. The patch is fairly trivial too, and I wouldn't expect a big
> > > > > performance hit.
> > > >
> > > > I tried to queue the test 4 more times for both commit c0cef30e4f and its parent;
> > > > the results seem quite stable.
> > > >
> > > > c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
> > > > "aim7.jobs-per-min": [
> > > > 32964.01,
> > > > 32938.68,
> > > > 33068.18,
> > > > 32886.32,
> > > > 32843.72,
> > > > 32798.83,
> > > > 32898.34,
> > > > 32952.55
> > > > ],
> > > >
> > > > 3da90b159b146672f830bcd2489dd3a1f4e9e089:
> > > > "aim7.jobs-per-min": [
> > > > 40239.65,
> > > > 40163.33,
> > > > 40353.32,
> > > > 39976.9,
> > > > 40185.75,
> > > > 40411.3,
> > > > 40213.58,
> > > > 39900.69
> > > > ],
> > > >
> > > > Any other test data you may need?
> > > >
> > > > >
> > > > > Is IMA involved here at all? I didn't see any evidence of it, but the
> > > > > kernel config did have it enabled.
> > > > >
> > > >
> > > > Sorry, I'm not quite familiar with IMA; could you tell me more about how to check it?
> > > >
> > >
> > > Thanks for retesting it, but I'm at a loss for why we're seeing this:
> > >
> > > IMA is the integrity management subsystem. It will use the iversion
> > > field to determine whether files need to be remeasured. It
> > > looks like the kernel config has it enabled, but it doesn't look like
> > > it's in use, based on the info in the initial report.
> > >
> > > This patch only affects two inlined functions inode_cmp_iversion and
> > > inode_cmp_iversion_raw. The patch is pretty trivial (as Linus points
> > > out). These functions are only called from IMA and fs-specific code
> > > (usually in readdir implementations to detect directory changes).
> > >
> > > XFS does not call either of these functions however, so I'm a little
> > > unclear on how this patch could slow anything down on this test. The
> > > only thing I can think to do here would be to profile this and see what
> > > stands out.
> > >
> > > Note that we do need to keep this in perspective too. This 18%
> > > regression on this test follows around a ~230% improvement that occurred
> > > when we merged the bulk of these patches. It should still be quite a
> > > bit faster than v4.15 in this regard.
> > >
> > > Still, it'd be good to understand what's going on here.
> > >
> > >
> >
> > Could we see the dmesg from this boot? It'd be good to confirm that IMA
> > is not involved here, as that's the only place I can see that would
> > call into this code at all.
> >
>
> See attachment for info on dmesg/perf-profile/compare_result.
> Feel free to let Xiaolong or me know if there is anything else you would like to check.
>

Many thanks,

Only one caller of the functions touched by this patch shows up in the
profiles: ima_file_free. That calls ima_check_last_writer, which calls
inode_cmp_iversion. The lines from the profiles show:

3da90b159b146672f830bcd2489dd3a1f4e9e089:
0.00% 0.00% [kernel.kallsyms] [k] ima_file_free

c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:
0.01% 0.01% [kernel.kallsyms] [k] ima_file_free

Seems pretty insignificant, but perhaps that is somehow accounting for
the difference. This is called when a file is freed, so there could be an
effect, I guess, if we're closing a lot of files that were open for write.

Looking at the disassembly from the builds on my box there is some
slight difference, since we did alter the implementation.
inode->i_version is at offset 0x150 in the struct on my builds:

3da90b159b146672f830bcd2489dd3a1f4e9e089:

0xffffffff813ae858 <+136>: je 0xffffffff813ae871 <ima_file_free+161>
0xffffffff813ae85a <+138>: mov 0x150(%rbp),%rsi
0xffffffff813ae861 <+145>: mov 0x20(%rax),%rcx
0xffffffff813ae865 <+149>: and $0xfffffffffffffffe,%rsi
0xffffffff813ae869 <+153>: add %rcx,%rcx
0xffffffff813ae86c <+156>: cmp %rcx,%rsi
0xffffffff813ae86f <+159>: je 0xffffffff813ae899 <ima_file_free+201>

c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e:

0xffffffff813ae828 <+136>: je 0xffffffff813ae83a <ima_file_free+154>
0xffffffff813ae82a <+138>: mov 0x150(%rbp),%rcx
0xffffffff813ae831 <+145>: shr %rcx
0xffffffff813ae834 <+148>: cmp %rcx,0x20(%rax)
0xffffffff813ae838 <+152>: je 0xffffffff813ae862 <ima_file_free+194>

The patched one looks like it ought to be more efficient. It's certainly
fewer instructions and it doesn't touch %rsi now. Are shifts more
expensive?

In any case, what might be good at this point as a sanity check is to
turn off CONFIG_IMA and recheck this. That should tell us whether we're
on the right track here. With that disabled, this patch should have no
effect on anything at all.

Thanks,
--
Jeff Layton <[email protected]>

2018-02-27 13:44:16

by David Howells

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

Jeff Layton <[email protected]> wrote:

> 0xffffffff813ae828 <+136>: je 0xffffffff813ae83a <ima_file_free+154>
> 0xffffffff813ae82a <+138>: mov 0x150(%rbp),%rcx
> 0xffffffff813ae831 <+145>: shr %rcx
> 0xffffffff813ae834 <+148>: cmp %rcx,0x20(%rax)
> 0xffffffff813ae838 <+152>: je 0xffffffff813ae862 <ima_file_free+194>

Is it possible there's a stall between the load of RCX and the subsequent
instructions because they all have to wait for RCX to become available?

The interleaving between operating on RSI and RCX in the older code might
alleviate that.

In addition, the load of the 0x20(%rax) value is now done in the CMP instruction
rather than earlier, so it might not get speculatively loaded in time, whereas
the earlier code explicitly loads it up front.

David

2018-02-27 16:38:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Tue, 2018-02-27 at 13:43 +0000, David Howells wrote:
> Jeff Layton <[email protected]> wrote:
>
> > 0xffffffff813ae828 <+136>: je 0xffffffff813ae83a <ima_file_free+154>
> > 0xffffffff813ae82a <+138>: mov 0x150(%rbp),%rcx
> > 0xffffffff813ae831 <+145>: shr %rcx
> > 0xffffffff813ae834 <+148>: cmp %rcx,0x20(%rax)
> > 0xffffffff813ae838 <+152>: je 0xffffffff813ae862 <ima_file_free+194>
>
> Is it possible there's a stall between the load of RCX and the subsequent
> instructions because they all have to wait for RCX to become available?
>
> The interleaving between operating on RSI and RCX in the older code might
> alleviate that.
>
> In addition, the load of the 0x20(%rax) value is now done in the CMP instruction
> rather than earlier, so it might not get speculatively loaded in time, whereas
> the earlier code explicitly loads it up front.
>

Thanks David, that makes sense.

At this point, I think we ought to wait and see what the results look
like without IMA compiled in at all.

It's possible we're misunderstanding this completely. At most, we'll be
hitting this once on every close of a file. It doesn't seem like that
ought to be causing something this noticeable though.
--
Jeff Layton <[email protected]>

2018-02-27 17:05:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Tue, Feb 27, 2018 at 5:43 AM, David Howells <[email protected]> wrote:
> Is it possible there's a stall between the load of RCX and the subsequent
> instructions because they all have to wait for RCX to become available?

No. Modern Intel big-core CPU's simply aren't that fragile. All these
instructions should do OoO fine for trivial sequences like this, and
as far as I can tell, the new code sequence should be better.

And even if it were worse for some odd reason, it would be worse by a cycle.

This kind of 18% change is something else; it is definitely not about
instruction scheduling.

Now, if the change to inode_cmp_iversion() causes some actual
_behavioral_ changes, and we get more IO, that's more like it. But the
code really does seem to be equivalent. In both cases it is simply
comparing 63 bits: the high 63 bits of 0x150(%rbp) (inode->i_version)
with the low 63 bits of 0x20(%rax) (iint->version).

The only issue would be if the high bit of 0x20(%rax) was somehow set.
The new code doesn't shift that bit away any more, but it should never
be set since it comes from

i_version = inode_query_iversion(inode);
...
iint->version = i_version;

and that inode_query_iversion() will have done the version shift.

> The interleaving between operating on RSI and RCX in the older code might
> alleviate that.
>
> In addition, the load of the 0x20(%rax) value is now done in the CMP instruction
> rather than earlier, so it might not get speculatively loaded in time, whereas
> the earlier code explicitly loads it up front.

No again, OoO cores will generally hide details like that.

You can see effects of it, but it's hard, and it can go both ways.

Anyway, I think the _real_ change has nothing to with instruction
scheduling, and everything to do with this:

107.62 ± 37% +139.1% 257.38 ± 16% vmstat.io.bo
48740 ± 36% +191.4% 142047 ± 16% proc-vmstat.pgpgout

(There's fairly big variation in those numbers, but the changes are
even bigger) or this:

258.12 -100.0% 0.00 turbostat.Avg_MHz
21.48 -21.5 0.00 turbostat.Busy%

or this:

27397 ±194% +43598.3% 11972338 ±139%
latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
27942 ±189% +96489.5% 26989044 ±139%
latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath

but those all sound like something changed in the setup, not in the kernel.

Odd.

Linus

2018-03-02 05:57:58

by Kemi Wang

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression



On 2018-02-28 01:04, Linus Torvalds wrote:
> On Tue, Feb 27, 2018 at 5:43 AM, David Howells <[email protected]> wrote:
>> Is it possible there's a stall between the load of RCX and the subsequent
>> instructions because they all have to wait for RCX to become available?
>
> No. Modern Intel big-core CPU's simply aren't that fragile. All these
> instructions should do OoO fine for trivial sequences like this, and
> as far as I can tell, the new code sequence should be better.
>
> And even if it were worse for some odd reason, it would be worse by a cycle.
>
> This kind of 18% change is something else; it is definitely not about
> instruction scheduling.
>
> Now, if the change to inode_cmp_iversion() causes some actual
> _behavioral_ changes, and we get more IO, that's more like it. But the
> code really does seem to be equivalent. In both cases it is simply
> comparing 63 bits: the high 63 bits of 0x150(%rbp) (inode->i_version)
> with the low 63 bits of 0x20(%rax) (iint->version).
>
> The only issue would be if the high bit of 0x20(%rax) was somehow set.
> The new code doesn't shift that bit away any more, but it should never
> be set since it comes from
>
> i_version = inode_query_iversion(inode);
> ...
> iint->version = i_version;
>
> and that inode_query_iversion() will have done the version shift.
>
>> The interleaving between operating on RSI and RCX in the older code might
>> alleviate that.
>>
>> In addition, the load of the 0x20(%rax) value is now done in the CMP instruction
>> rather than earlier, so it might not get speculatively loaded in time, whereas
>> the earlier code explicitly loads it up front.
>
> No again, OoO cores will generally hide details like that.
>
> You can see effects of it, but it's hard, and it can go both ways.
>
> Anyway, I think the _real_ change has nothing to with instruction
> scheduling, and everything to do with this:
>
> 107.62 ± 37% +139.1% 257.38 ± 16% vmstat.io.bo
> 48740 ± 36% +191.4% 142047 ± 16% proc-vmstat.pgpgout
>
> (There's fairly big variation in those numbers, but the changes are
> even bigger) or this:
>
> 258.12 -100.0% 0.00 turbostat.Avg_MHz
> 21.48 -21.5 0.00 turbostat.Busy%
>

This is caused by a limitation in the current turbostat parse script of lkp. It
treats a string containing a wildcard character (e.g. 30.**) in the output of the
turbostat monitor as an error and sets all the stats values to 0.

The turbostat monitor runs successfully during these tests.

> or this:
>
> 27397 ±194% +43598.3% 11972338 ±139%
> latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
> 27942 ±189% +96489.5% 26989044 ±139%
> latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>
> but those all sound like something changed in the setup, not in the kernel.
>
> Odd.
>
> Linus
>

2018-03-15 07:37:16

by Kemi Wang

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

Hi, Jeff
Today I deleted the previously built kernel images for commit 3da90b159b146672f830bcd2489dd3a1f4e9e089
and commit c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e, and re-ran the same aim7
jobs three times for each commit. The aim7 scores for the two commits show no obvious difference.

Perhaps something weird happened when compiling the kernel. Please ignore this report; apologies for the bother.


On 2018-02-25 23:41, Jeff Layton wrote:
> On Sun, 2018-02-25 at 23:05 +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -18.0% regression of aim7.jobs-per-min due to commit:
>>
>>
>> commit: c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: aim7
>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
>> with following parameters:
>>
>> disk: 4BRD_12G
>> md: RAID0
>> fs: xfs
>> test: disk_src
>> load: 3000
>> cpufreq_governor: performance
>>
>> test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>
>>
>
> I'm a bit suspicious of this result.
>
> This patch only changes inode_cmp_iversion{+raw} (since renamed to
> inode_eq_iversion{+raw}), and that neither should ever be called from
> xfs. The patch is fairly trivial too, and I wouldn't expect a big
> performance hit.
>
> Is IMA involved here at all? I didn't see any evidence of it, but the
> kernel config did have it enabled.
>
>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml # job file is attached in this email
>> bin/lkp run job.yaml
>>
>> =========================================================================================
>> compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
>> gcc-7/performance/4BRD_12G/xfs/x86_64-rhel-7.2/3000/RAID0/debian-x86_64-2016-08-31.cgz/lkp-ivb-ep01/disk_src/aim7
>>
>> commit:
>> 3da90b159b (" f2fs-for-4.16-rc1")
>> c0cef30e4f ("iversion: make inode_cmp_iversion{+raw} return bool instead of s64")
>>
>> 3da90b159b146672 c0cef30e4ff0dc025f4a1660b8
>> ---------------- --------------------------
>> %stddev %change %stddev
>> \ | \
>> 40183 -18.0% 32964 aim7.jobs-per-min
>> 448.60 +21.9% 546.68 aim7.time.elapsed_time
>> 448.60 +21.9% 546.68 aim7.time.elapsed_time.max
>> 5615 ± 5% +33.4% 7489 ± 4% aim7.time.involuntary_context_switches
>> 3086 +14.0% 3518 aim7.time.system_time
>> 19439782 -5.6% 18359474 aim7.time.voluntary_context_switches
>> 199333 +14.3% 227794 ± 2% interrupts.CAL:Function_call_interrupts
>> 0.59 -0.1 0.50 mpstat.cpu.usr%
>> 2839401 +16.0% 3293688 softirqs.SCHED
>> 7600068 +15.1% 8747820 softirqs.TIMER
>> 118.00 ± 43% +98.7% 234.50 ± 15% vmstat.io.bo
>> 87840 -22.4% 68154 vmstat.system.cs
>> 552798 ± 6% +15.8% 640107 ± 4% numa-numastat.node0.local_node
>> 557345 ± 6% +15.7% 644666 ± 4% numa-numastat.node0.numa_hit
>> 528341 ± 7% +21.7% 642933 ± 4% numa-numastat.node1.local_node
>> 531604 ± 7% +21.6% 646209 ± 4% numa-numastat.node1.numa_hit
>> 2.147e+09 -12.4% 1.88e+09 cpuidle.C1.time
>> 13702041 -14.7% 11683737 cpuidle.C1.usage
>> 2.082e+08 ± 4% +28.1% 2.667e+08 ± 5% cpuidle.C1E.time
>> 4.719e+08 ± 2% +23.1% 5.807e+08 ± 4% cpuidle.C3.time
>> 1.141e+10 +31.0% 1.496e+10 cpuidle.C6.time
>> 15672622 +27.8% 20031028 cpuidle.C6.usage
>> 13520572 ± 3% +29.5% 17514398 ± 9% cpuidle.POLL.time
>> 278.25 ± 5% -46.0% 150.25 ± 73% numa-vmstat.node0.nr_dirtied
>> 3200 ± 14% -20.6% 2542 ± 19% numa-vmstat.node0.nr_mapped
>> 277.75 ± 5% -46.2% 149.50 ± 73% numa-vmstat.node0.nr_written
>> 28.50 ± 52% +448.2% 156.25 ± 70% numa-vmstat.node1.nr_dirtied
>> 2577 ± 19% +26.3% 3255 ± 15% numa-vmstat.node1.nr_mapped
>> 634338 ± 4% +7.8% 683959 ± 4% numa-vmstat.node1.numa_hit
>> 457411 ± 6% +10.8% 506800 ± 5% numa-vmstat.node1.numa_local
>> 3734 ± 8% -11.5% 3306 ± 6% proc-vmstat.numa_hint_faults_local
>> 1114538 +18.3% 1318978 proc-vmstat.numa_hit
>> 1106722 +18.5% 1311136 proc-vmstat.numa_local
>> 22100 +7.5% 23753 ± 4% proc-vmstat.numa_pages_migrated
>> 1174556 +18.0% 1386359 proc-vmstat.pgalloc_normal
>> 1241445 +18.1% 1466086 proc-vmstat.pgfault
>> 1138310 +19.3% 1358132 proc-vmstat.pgfree
>> 22100 +7.5% 23753 ± 4% proc-vmstat.pgmigrate_success
>> 53332 ± 43% +143.0% 129617 ± 14% proc-vmstat.pgpgout
>> 1.42 ± 2% +1.7 3.07 perf-stat.branch-miss-rate%
>> 1.064e+10 +123.3% 2.375e+10 perf-stat.branch-misses
>> 10.79 +0.6 11.43 perf-stat.cache-miss-rate%
>> 5.583e+09 +5.9% 5.915e+09 perf-stat.cache-misses
>> 39652092 -5.0% 37662545 perf-stat.context-switches
>> 1.29 +11.7% 1.44 perf-stat.cpi
>> 4.637e+12 +12.8% 5.23e+12 perf-stat.cpu-cycles
>> 8.653e+11 +9.8% 9.498e+11 ± 2% perf-stat.dTLB-loads
>> 3.654e+11 +12.4% 4.109e+11 perf-stat.dTLB-stores
>> 0.78 -10.5% 0.70 perf-stat.ipc
>> 1214932 +17.9% 1432266 perf-stat.minor-faults
>> 1.334e+09 -1.8% 1.31e+09 perf-stat.node-store-misses
>> 1.651e+09 -1.8% 1.62e+09 perf-stat.node-stores
>> 1214954 +17.9% 1432313 perf-stat.page-faults
>> 256.75 -100.0% 0.00 turbostat.Avg_MHz
>> 21.39 -21.4 0.00 turbostat.Busy%
>> 1200 -100.0% 0.00 turbostat.Bzy_MHz
>> 13695007 -100.0% 0.00 turbostat.C1
>> 11.92 -11.9 0.00 turbostat.C1%
>> 2116683 ± 2% -100.0% 0.00 turbostat.C1E
>> 1.16 ± 4% -1.2 0.00 turbostat.C1E%
>> 3112269 -100.0% 0.00 turbostat.C3
>> 2.62 ± 2% -2.6 0.00 turbostat.C3%
>> 15671277 -100.0% 0.00 turbostat.C6
>> 63.38 -63.4 0.00 turbostat.C6%
>> 49.46 -100.0% 0.00 turbostat.CPU%c1
>> 1.42 ± 2% -100.0% 0.00 turbostat.CPU%c3
>> 27.73 -100.0% 0.00 turbostat.CPU%c6
>> 31.41 -100.0% 0.00 turbostat.CorWatt
>> 63.25 -100.0% 0.00 turbostat.CoreTmp
>> 18919351 -100.0% 0.00 turbostat.IRQ
>> 1.21 ± 18% -100.0% 0.00 turbostat.Pkg%pc2
>> 0.67 ± 31% -100.0% 0.00 turbostat.Pkg%pc6
>> 63.25 -100.0% 0.00 turbostat.PkgTmp
>> 57.63 -100.0% 0.00 turbostat.PkgWatt
>> 30.73 -100.0% 0.00 turbostat.RAMWatt
>> 36030 -100.0% 0.00 turbostat.SMI
>> 3000 -100.0% 0.00 turbostat.TSC_MHz
>>
>>
>> aim7.jobs-per-min
>>
>> 41000 +-+-----------------------------------------------------------------+
>> | ..+....+.... ..+....+....+.... ..+....+....+...+.... |
>> 40000 +-+ +. +.. + |
>> 39000 +-+ |
>> | |
>> 38000 +-+ |
>> 37000 +-+ |
>> | |
>> 36000 +-+ |
>> 35000 +-+ |
>> | |
>> 34000 +-+ |
>> 33000 +-+ O O |
>> O O O O O O O O O O O O O
>> 32000 +-+-----------------------------------------------------------------+
>>
>>
>>
>> [*] bisect-good sample
>> [O] bisect-bad sample
>>
>>
>>
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>>
>>
>> Thanks,
>> Xiaolong

2018-03-15 17:48:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] [lkp-robot] [iversion] c0cef30e4f: aim7.jobs-per-min -18.0% regression

On Thu, Mar 15, 2018 at 12:33 AM, kemi <[email protected]> wrote:
> Hi, Jeff
> Today I deleted the previously built kernel images for commit 3da90b159b146672f830bcd2489dd3a1f4e9e089
> and commit c0cef30e4ff0dc025f4a1660b8f0ba43ed58426e, and re-ran the same aim7
> jobs three times for each commit. The aim7 scores for the two commits show no obvious difference.
>
> Perhaps something weird happened when compiling the kernel. Please ignore this report; apologies for the bother.

Ok, I'm not entirely happy with how random this was, but that commit
never looked like a remotely likely culprit, so I think we'll just
chalk it up to bad luck and unknown effects.

Linus