2021-04-20 02:53:20

by kernel test robot

Subject: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression



Greetings,

FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:


commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core


in testcase: will-it-scale
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with the following parameters:

nr_task: 100%
mode: thread
test: futex3
cpufreq_governor: performance
ucode: 0x5003006

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex3/will-it-scale/0x5003006

commit:
69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")

69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
6630224 -3.3% 6409738 will-it-scale.per_thread_ops
1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
1638 ± 3% -7.8% 1510 ± 5% sched_debug.cfs_rq:/.runnable_avg.max
297.83 ± 68% +1747.6% 5502 ±152% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
297.83 ± 68% +1747.6% 5502 ±152% interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2
8200 -33.4% 5459 ± 35% interrupts.CPU27.NMI:Non-maskable_interrupts
8200 -33.4% 5459 ± 35% interrupts.CPU27.PMI:Performance_monitoring_interrupts
8199 -33.4% 5459 ± 35% interrupts.CPU28.NMI:Non-maskable_interrupts
8199 -33.4% 5459 ± 35% interrupts.CPU28.PMI:Performance_monitoring_interrupts
6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.NMI:Non-maskable_interrupts
6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.PMI:Performance_monitoring_interrupts
4287 ± 8% +33.6% 5730 ± 15% interrupts.CPU49.CAL:Function_call_interrupts
6356 ± 19% +49.6% 9509 ± 19% interrupts.CPU97.CAL:Function_call_interrupts
9.163e+10 -3.3% 8.857e+10 perf-stat.i.branch-instructions
3.211e+08 -2.9% 3.118e+08 perf-stat.i.branch-misses
0.94 +3.2% 0.97 perf-stat.i.cpi
407730 ± 8% +37.5% 560565 ± 7% perf-stat.i.dTLB-load-misses
1.551e+11 -3.3% 1.499e+11 perf-stat.i.dTLB-loads
274320 -8.4% 251354 ± 18% perf-stat.i.dTLB-store-misses
1.169e+11 -3.3% 1.13e+11 perf-stat.i.dTLB-stores
5.952e+11 -3.3% 5.754e+11 perf-stat.i.instructions
1900 -4.9% 1807 perf-stat.i.instructions-per-iTLB-miss
1.07 -3.2% 1.03 perf-stat.i.ipc
1893 -3.3% 1830 perf-stat.i.metric.M/sec
0.93 +3.3% 0.97 perf-stat.overall.cpi
0.00 ± 8% +0.0 0.00 ± 7% perf-stat.overall.dTLB-load-miss-rate%
1896 -5.1% 1800 perf-stat.overall.instructions-per-iTLB-miss
1.07 -3.2% 1.04 perf-stat.overall.ipc
9.131e+10 -3.3% 8.827e+10 perf-stat.ps.branch-instructions
3.2e+08 -2.9% 3.107e+08 perf-stat.ps.branch-misses
415959 ± 8% +40.4% 583928 ± 7% perf-stat.ps.dTLB-load-misses
1.545e+11 -3.3% 1.494e+11 perf-stat.ps.dTLB-loads
274020 -8.4% 250940 ± 18% perf-stat.ps.dTLB-store-misses
1.165e+11 -3.3% 1.126e+11 perf-stat.ps.dTLB-stores
5.932e+11 -3.3% 5.734e+11 perf-stat.ps.instructions
1.793e+14 -3.3% 1.733e+14 perf-stat.total.instructions
32.73 -1.0 31.71 perf-profile.calltrace.cycles-pp.__entry_text_start.syscall
8.37 -0.2 8.20 perf-profile.calltrace.cycles-pp.hash_futex.futex_wake.do_futex.__x64_sys_futex.do_syscall_64
1.52 -0.1 1.38 perf-profile.calltrace.cycles-pp.rcu_nocb_flush_deferred_wakeup.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
2.27 -0.1 2.17 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.syscall
2.17 -0.1 2.08 perf-profile.calltrace.cycles-pp.syscall_enter_from_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
1.32 -0.1 1.26 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
5.45 +0.3 5.71 perf-profile.calltrace.cycles-pp.get_futex_key.futex_wake.do_futex.__x64_sys_futex.do_syscall_64
7.55 +0.4 7.98 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
5.07 +0.5 5.58 perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
28.26 +0.9 29.19 perf-profile.calltrace.cycles-pp.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
37.41 +1.1 38.50 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.56 +1.2 34.78 perf-profile.calltrace.cycles-pp.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
52.14 +1.3 53.40 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
23.03 +1.4 24.44 perf-profile.calltrace.cycles-pp.futex_wake.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
21.10 -0.7 20.44 perf-profile.children.cycles-pp.__entry_text_start
17.77 -0.5 17.31 perf-profile.children.cycles-pp.syscall_return_via_sysret
8.48 -0.2 8.28 perf-profile.children.cycles-pp.hash_futex
1.58 -0.1 1.44 perf-profile.children.cycles-pp.rcu_nocb_flush_deferred_wakeup
2.43 -0.1 2.33 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
2.20 -0.1 2.11 perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.42 ± 6% -0.1 0.36 ± 2% perf-profile.children.cycles-pp.tick_sched_handle
0.42 ± 6% -0.1 0.36 ± 2% perf-profile.children.cycles-pp.update_process_times
1.34 -0.1 1.29 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.52 ± 2% -0.0 0.48 ± 2% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.47 ± 2% -0.0 0.43 ± 2% perf-profile.children.cycles-pp.tick_sched_timer
0.23 ± 4% -0.0 0.20 ± 2% perf-profile.children.cycles-pp.update_curr
0.18 ± 4% -0.0 0.16 ± 3% perf-profile.children.cycles-pp.perf_prepare_sample
5.60 +0.3 5.89 perf-profile.children.cycles-pp.get_futex_key
8.20 +0.4 8.59 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
5.36 +0.5 5.86 perf-profile.children.cycles-pp.exit_to_user_mode_prepare
23.57 +0.8 24.36 perf-profile.children.cycles-pp.futex_wake
37.58 +1.1 38.68 perf-profile.children.cycles-pp.do_syscall_64
52.56 +1.2 53.80 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
33.87 +1.2 35.11 perf-profile.children.cycles-pp.__x64_sys_futex
28.60 +1.3 29.89 perf-profile.children.cycles-pp.do_futex
17.64 -0.4 17.20 perf-profile.self.cycles-pp.syscall_return_via_sysret
9.47 -0.3 9.15 ± 2% perf-profile.self.cycles-pp.__entry_text_start
6.88 -0.3 6.61 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
8.18 -0.2 7.98 perf-profile.self.cycles-pp.hash_futex
1.33 -0.1 1.22 perf-profile.self.cycles-pp.rcu_nocb_flush_deferred_wakeup
2.42 -0.1 2.32 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
1.85 -0.1 1.77 perf-profile.self.cycles-pp.syscall_exit_to_user_mode
1.88 -0.1 1.81 perf-profile.self.cycles-pp.syscall_enter_from_user_mode
1.25 -0.0 1.21 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
1.27 -0.0 1.23 perf-profile.self.cycles-pp.do_syscall_64
1.69 +0.0 1.71 perf-profile.self.cycles-pp.testcase
5.26 +0.2 5.48 perf-profile.self.cycles-pp.get_futex_key
9.61 +0.4 10.02 perf-profile.self.cycles-pp.futex_wake
3.74 +0.6 4.37 perf-profile.self.cycles-pp.exit_to_user_mode_prepare
5.08 +0.8 5.90 perf-profile.self.cycles-pp.do_futex
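
As a quick sanity check on the headline rows of the table above, the following minimal Python sketch recomputes the %change column from the two per_thread_ops values. It also checks two relationships that the numbers appear consistent with but that the report does not state explicitly: that will-it-scale.workload equals per_thread_ops times the 192 threads, and that cpi is the reciprocal of ipc. Both are assumptions here, not documented definitions, and the helper name pct_change is made up for illustration.

```python
# Values copied from the comparison table in this report.
THREADS = 192
PER_THREAD_OPS = {"69995ebbb9": 6630224, "4bad58ebc8": 6409738}
IPC = {"69995ebbb9": 1.07, "4bad58ebc8": 1.03}  # perf-stat.i.ipc


def pct_change(base, new):
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


change = pct_change(PER_THREAD_OPS["69995ebbb9"], PER_THREAD_OPS["4bad58ebc8"])
print(f"per_thread_ops change: {change:+.1f}%")

for commit, ops in PER_THREAD_OPS.items():
    # Assumption: workload == per_thread_ops * number of threads.
    workload = ops * THREADS
    # Assumption: cpi == 1 / ipc (cycles per instruction vs. instructions per cycle).
    cpi = 1.0 / IPC[commit]
    print(f"{commit}: workload ~ {workload:.3e}, cpi ~ {cpi:.2f}")
```

If these assumptions hold, the -3.3% on the three will-it-scale rows is a single measurement shown three ways, and the cpi/ipc rows move by about the same amount in the opposite direction.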



will-it-scale.per_thread_ops

[ASCII plot omitted: column alignment was lost in extraction. The y-axis
runs from 6.25e+06 to 6.65e+06; the bisect-bad "O" markers sit in the lower
part of that range (roughly 6.25e+06 to 6.45e+06), below the connected line
of bisect-good samples.]


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (12.43 kB)
config-5.12.0-rc2-00046-g4bad58ebc8bc (175.55 kB)
job-script (8.01 kB)
job.yaml (5.29 kB)
reproduce (348.00 B)

2021-04-20 18:37:17

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

On Tue, Apr 20 2021 at 11:08, kernel test robot wrote:
> FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
>
> commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with the following parameters:
>
> nr_task: 100%
> mode: thread
> test: futex3
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
> commit:
> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> 1638 ± 3% -7.8% 1510 ± 5% sched_debug.cfs_rq:/.runnable_avg.max
> 297.83 ± 68% +1747.6% 5502 ±152% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
> 297.83 ± 68% +1747.6% 5502 ±152% interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2

This change is definitely not causing more network traffic

> 8200 -33.4% 5459 ± 35% interrupts.CPU27.NMI:Non-maskable_interrupts
> 8200 -33.4% 5459 ± 35% interrupts.CPU27.PMI:Performance_monitoring_interrupts
> 8199 -33.4% 5459 ± 35% interrupts.CPU28.NMI:Non-maskable_interrupts
> 8199 -33.4% 5459 ± 35% interrupts.CPU28.PMI:Performance_monitoring_interrupts
> 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.NMI:Non-maskable_interrupts
> 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.PMI:Performance_monitoring_interrupts
> 4287 ± 8% +33.6% 5730 ± 15% interrupts.CPU49.CAL:Function_call_interrupts
> 6356 ± 19% +49.6% 9509 ± 19% interrupts.CPU97.CAL:Function_call_interrupts

Neither does it increase the number of function calls

> 407730 ± 8% +37.5% 560565 ± 7% perf-stat.i.dTLB-load-misses
> 415959 ± 8% +40.4% 583928 ± 7% perf-stat.ps.dTLB-load-misses

And this massive increase does not make sense either.

Confused.

Thanks,

tglx

2021-04-22 05:48:03

by kernel test robot

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

hi, Thomas Gleixner,

On Tue, Apr 20, 2021 at 08:35:06PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 20 2021 at 11:08, kernel test robot wrote:
> > FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
> >
> > commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
> >
> > in testcase: will-it-scale
> > on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> > with the following parameters:
> >
> > nr_task: 100%
> > mode: thread
> > test: futex3
> > cpufreq_governor: performance
> > ucode: 0x5003006
> >
> > test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> > test-url: https://github.com/antonblanchard/will-it-scale
> > commit:
> > 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> > 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
> >
> > 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> > 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> > 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> > 1638 ± 3% -7.8% 1510 ± 5% sched_debug.cfs_rq:/.runnable_avg.max
> > 297.83 ± 68% +1747.6% 5502 ±152% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
> > 297.83 ± 68% +1747.6% 5502 ±152% interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2
>
> This change is definitely not causing more network traffic
>
> > 8200 -33.4% 5459 ± 35% interrupts.CPU27.NMI:Non-maskable_interrupts
> > 8200 -33.4% 5459 ± 35% interrupts.CPU27.PMI:Performance_monitoring_interrupts
> > 8199 -33.4% 5459 ± 35% interrupts.CPU28.NMI:Non-maskable_interrupts
> > 8199 -33.4% 5459 ± 35% interrupts.CPU28.PMI:Performance_monitoring_interrupts
> > 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.NMI:Non-maskable_interrupts
> > 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.PMI:Performance_monitoring_interrupts
> > 4287 ± 8% +33.6% 5730 ± 15% interrupts.CPU49.CAL:Function_call_interrupts
> > 6356 ± 19% +49.6% 9509 ± 19% interrupts.CPU97.CAL:Function_call_interrupts
>
> Neither does it increase the number of function calls
>
> > 407730 ± 8% +37.5% 560565 ± 7% perf-stat.i.dTLB-load-misses
> > 415959 ± 8% +40.4% 583928 ± 7% perf-stat.ps.dTLB-load-misses
>
> And this massive increase does not make sense either.
>
> Confused.

FYI.
We re-tested this and confirmed that the regression persists:

69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.271e+09 -3.3% 1.229e+09 will-it-scale.192.threads
6620228 -3.3% 6401749 will-it-scale.per_thread_ops
1.271e+09 -3.3% 1.229e+09 will-it-scale.workload

Both the fbc (first bad commit) and its parent use an identical config, as attached to the original report.

data for 4bad58ebc8bc4f20d89cff95417:
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json: "will-it-scale.per_thread_ops": [
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6404491,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6421116,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6402763,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6403483,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6412066,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6414511,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6395917, <------ new 14 runs
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6396872,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6400830,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6408883,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6403844,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6405911,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6390766,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394523,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394594,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6399547,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6402487,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394673,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6400717,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6386997

data for parent (69995ebbb9d37173):
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json: "will-it-scale.per_thread_ops": [
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6640509,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6630326,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6633025,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6625355,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6623274,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6628858,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6614380, <----- new 14 runs
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6607324,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6613340,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6610083,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6616290,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6616934,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6618978,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6627108,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6609973,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6618440,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6617191,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6615858,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6615761,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6621558
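
For reference, the summary figures can be reproduced from the per-run values listed above. The sketch below is a minimal Python check, not part of the LKP tooling: it averages the 20 runs for each commit, computes the relative change, and compares the gap between the two groups with their run-to-run spread. The variable names are illustrative only.

```python
import statistics

# Per-run will-it-scale.per_thread_ops values, copied from the matrix.json
# excerpts above (20 runs per commit).
bad = [  # 4bad58ebc8bc4f20d89cff95417c9b4674769709
    6404491, 6421116, 6402763, 6403483, 6412066, 6414511, 6395917,
    6396872, 6400830, 6408883, 6403844, 6405911, 6390766, 6394523,
    6394594, 6399547, 6402487, 6394673, 6400717, 6386997,
]
parent = [  # 69995ebbb9d3717306a165db88a1292b63f77a37
    6640509, 6630326, 6633025, 6625355, 6623274, 6628858, 6614380,
    6607324, 6613340, 6610083, 6616290, 6616934, 6618978, 6627108,
    6609973, 6618440, 6617191, 6615858, 6615761, 6621558,
]

mean_bad = statistics.mean(bad)
mean_parent = statistics.mean(parent)
change = (mean_bad - mean_parent) / mean_parent * 100.0

print(f"parent mean: {mean_parent:.2f} (stdev {statistics.stdev(parent):.0f})")
print(f"bad mean:    {mean_bad:.2f} (stdev {statistics.stdev(bad):.0f})")
print(f"change:      {change:+.1f}%")

# The two groups do not overlap at all: the best run on the bad commit is
# still slower than the worst run on the parent.
print("max(bad) < min(parent):", max(bad) < min(parent))
```

After dropping the fractional part, the two means match the 6620228 and 6401749 shown in the summary table, and the groups are fully separated, which is consistent with the regression being stable rather than noise.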

>
> Thanks,
>
> tglx

2021-04-22 15:39:53

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Oliver,

On Thu, Apr 22 2021 at 14:02, Oliver Sang wrote:
> On Tue, Apr 20, 2021 at 08:35:06PM +0200, Thomas Gleixner wrote:
>> Confused.
>
> FYI.
> We re-tested this and confirmed that the regression persists:
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.271e+09 -3.3% 1.229e+09 will-it-scale.192.threads
> 6620228 -3.3% 6401749 will-it-scale.per_thread_ops
> 1.271e+09 -3.3% 1.229e+09 will-it-scale.workload

I'll have a look.

2021-04-30 08:17:33

by Feng Tang

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Hi Thomas,

On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
>
>
> Greetings,
>
> FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
>
>
> commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with the following parameters:
>
> nr_task: 100%
> mode: thread
> test: futex3
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
>
> If you fix the issue, kindly add the following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
> gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex3/will-it-scale/0x5003006
>
> commit:
> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload

We've double-checked this, and it seems to be another case of a
regression caused by a code alignment change, just like the other
case we debugged, "[genirq] cbe16f35be:
will-it-scale.per_thread_ops -5.2% regression":

https://lore.kernel.org/lkml/[email protected]/

With the same debug patch that forces function addresses to be
64-byte aligned, commit 4bad58ebc8 brings no change for this case.

Commit 09c60546f04f ("./Makefile: add debug option to enable function
aligned on 32 bytes") only forced 32-byte alignment, on the thinking
that 64B alignment would occupy more code space and affect the iTLB
more. Maybe we should just extend it to 64B alignment, as it is for
debug only anyway.

Thanks,
Feng

2021-04-30 08:58:55

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Feng,

On Fri, Apr 30 2021 at 16:13, Feng Tang wrote:
> On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
>> commit:
>> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
>> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>>
>> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
>> ---------------- ---------------------------
>> %stddev %change %stddev
>> \ | \
>> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
>> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
>> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
>
> We've double-checked this, and it seems to be another case of a
> regression caused by a code alignment change, just like the other
> case we debugged, "[genirq] cbe16f35be:
> will-it-scale.per_thread_ops -5.2% regression":
>
> https://lore.kernel.org/lkml/[email protected]/
>
> With the same debug patch that forces function addresses to be
> 64-byte aligned, commit 4bad58ebc8 brings no change for this case.
>
> Commit 09c60546f04f ("./Makefile: add debug option to enable function
> aligned on 32 bytes") only forced 32-byte alignment, on the thinking
> that 64B alignment would occupy more code space and affect the iTLB
> more. Maybe we should just extend it to 64B alignment, as it is for
> debug only anyway.

Thanks for the heads up!

But why is this restricted to debug mode?

The fact that adding a few bytes of text causes regressions in unrelated
code is not restricted to debug, or am I missing something here?
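
To illustrate how a few added bytes can disturb unrelated code, here is a toy Python model of text layout. It is a sketch under stated assumptions, not a description of the real linker: the function sizes are hypothetical, functions are simply placed one after another at the next multiple of the alignment, the default alignment is assumed to be 16 bytes, and a cache line is assumed to be 64 bytes.

```python
LINE = 64  # assumed cache-line size in bytes


def layout(sizes, align):
    """Start address of each function when placed at the next multiple of `align`."""
    starts, cursor = [], 0
    for size in sizes:
        start = (cursor + align - 1) // align * align  # round up to alignment
        starts.append(start)
        cursor = start + size
    return starts


def line_offsets(sizes, align):
    """Offset of each function's entry point within its cache line."""
    return [start % LINE for start in layout(sizes, align)]


# Hypothetical function sizes in bytes; in `after`, only the first function
# grows (by 24 bytes), the others are untouched.
before = [40, 100, 72, 30, 200]
after = [64] + before[1:]

for align in (16, 64):
    print(f"align={align}: before={line_offsets(before, align)} "
          f"after={line_offsets(after, align)}")
```

In this model, with 16-byte alignment every function after the modified one lands at a different offset within its cache line, even though its own code did not change. With 64-byte alignment all entry points stay at offset 0 before and after the change.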

Thanks,

tglx

2021-05-01 09:50:23

by Feng Tang

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Hi Thomas,

On Fri, Apr 30, 2021 at 10:57:20AM +0200, Thomas Gleixner wrote:
> Feng,
>
> On Fri, Apr 30 2021 at 16:13, Feng Tang wrote:
> > On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
> >> commit:
> >> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> >> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
> >>
> >> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> >> ---------------- ---------------------------
> >> %stddev %change %stddev
> >> \ | \
> >> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> >> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> >> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> >
> > We've double-checked this, and it seems to be another case of a
> > regression caused by a code alignment change, just like the other
> > case we debugged, "[genirq] cbe16f35be:
> > will-it-scale.per_thread_ops -5.2% regression":
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > With the same debug patch that forces function addresses to be
> > 64-byte aligned, commit 4bad58ebc8 brings no change for this case.
> >
> > Commit 09c60546f04f ("./Makefile: add debug option to enable function
> > aligned on 32 bytes") only forced 32-byte alignment, on the thinking
> > that 64B alignment would occupy more code space and affect the iTLB
> > more. Maybe we should just extend it to 64B alignment, as it is for
> > debug only anyway.
>
> Thanks for the heads up!
>
> But why is this restricted to debug mode?
>
> The fact that adding a few bytes of text causes regressions in unrelated
> code is not restricted to debug, or am I missing something here?

With the default 0day kernel config, the 64B force-aligned kernel is 11%
bigger than the 32B force-aligned kernel (both the vmlinux and its text
size), and benchmarks also show performance drops with the 64B
force-aligned kernel (presumably related to the larger i-cache and i-TLB
footprint).

Also, we are still looking for other ways to get the same effect without
increasing the kernel text so much. So for now we are keeping it under
the debug options.
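
The size trade-off can be sketched with a toy Python calculation. This is only an illustration under assumptions: the function sizes below are hypothetical, functions are packed back to back with each start rounded up to the alignment, and the resulting growth is not expected to match the 11% measured on the real kernel.

```python
def padded_size(sizes, align):
    """Total text size when every function starts at a multiple of `align`."""
    cursor = 0
    for size in sizes:
        cursor = (cursor + align - 1) // align * align + size
    return cursor


# Hypothetical function sizes in bytes (not taken from any real kernel).
sizes = [40, 100, 72, 30, 200, 18, 350, 64, 90, 12]

raw = padded_size(sizes, 1)  # no padding at all
for align in (32, 64):
    total = padded_size(sizes, align)
    print(f"align={align}: {total} bytes "
          f"({(total - raw) / raw * 100.0:.1f}% padding over {raw} raw bytes)")
```

The smaller the functions are relative to the alignment, the more of the text is padding, which is why going from 32B to 64B alignment costs noticeably more space than the step to 32B did.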

Thanks,
Feng

> Thanks,
>
> tglx