LinuxLists.cc - [signal] 4bad58ebc8: will-it-scale.per_thread

2021-04-20 02:53:20

Subject: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Greeting,

FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:

commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core

in testcase: will-it-scale
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with following parameters:

    nr_task: 100%
    mode: thread
    test: futex3
    cpufreq_governor: performance
    ucode: 0x5003006

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

Details are as below:
-------------------------------------------------------------------------------------------------->

To reproduce:

    git clone https://github.com/intel/lkp-tests.git
    cd lkp-tests
    bin/lkp install job.yaml  # job file is attached in this email
    bin/lkp split-job --compatible job.yaml
    bin/lkp run compatible-job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex3/will-it-scale/0x5003006

commit:
  69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
  4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")

69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
 1.273e+09            -3.3%   1.231e+09        will-it-scale.192.threads
   6630224            -3.3%     6409738        will-it-scale.per_thread_ops
 1.273e+09            -3.3%   1.231e+09        will-it-scale.workload
      1638 ±  3%      -7.8%        1510 ±  5%  sched_debug.cfs_rq:/.runnable_avg.max
    297.83 ± 68%   +1747.6%        5502 ±152%  interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
    297.83 ± 68%   +1747.6%        5502 ±152%  interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2
      8200           -33.4%        5459 ± 35%  interrupts.CPU27.NMI:Non-maskable_interrupts
      8200           -33.4%        5459 ± 35%  interrupts.CPU27.PMI:Performance_monitoring_interrupts
      8199           -33.4%        5459 ± 35%  interrupts.CPU28.NMI:Non-maskable_interrupts
      8199           -33.4%        5459 ± 35%  interrupts.CPU28.PMI:Performance_monitoring_interrupts
      6148 ± 33%     -11.2%        5459 ± 35%  interrupts.CPU29.NMI:Non-maskable_interrupts
      6148 ± 33%     -11.2%        5459 ± 35%  interrupts.CPU29.PMI:Performance_monitoring_interrupts
      4287 ±  8%     +33.6%        5730 ± 15%  interrupts.CPU49.CAL:Function_call_interrupts
      6356 ± 19%     +49.6%        9509 ± 19%  interrupts.CPU97.CAL:Function_call_interrupts
 9.163e+10            -3.3%   8.857e+10        perf-stat.i.branch-instructions
 3.211e+08            -2.9%   3.118e+08        perf-stat.i.branch-misses
      0.94            +3.2%        0.97        perf-stat.i.cpi
    407730 ±  8%     +37.5%      560565 ±  7%  perf-stat.i.dTLB-load-misses
 1.551e+11            -3.3%   1.499e+11        perf-stat.i.dTLB-loads
    274320            -8.4%      251354 ± 18%  perf-stat.i.dTLB-store-misses
 1.169e+11            -3.3%    1.13e+11        perf-stat.i.dTLB-stores
 5.952e+11            -3.3%   5.754e+11        perf-stat.i.instructions
      1900            -4.9%        1807        perf-stat.i.instructions-per-iTLB-miss
      1.07            -3.2%        1.03        perf-stat.i.ipc
      1893            -3.3%        1830        perf-stat.i.metric.M/sec
      0.93            +3.3%        0.97        perf-stat.overall.cpi
      0.00 ±  8%       +0.0        0.00 ±  7%  perf-stat.overall.dTLB-load-miss-rate%
      1896            -5.1%        1800        perf-stat.overall.instructions-per-iTLB-miss
      1.07            -3.2%        1.04        perf-stat.overall.ipc
 9.131e+10            -3.3%   8.827e+10        perf-stat.ps.branch-instructions
   3.2e+08            -2.9%   3.107e+08        perf-stat.ps.branch-misses
    415959 ±  8%     +40.4%      583928 ±  7%  perf-stat.ps.dTLB-load-misses
 1.545e+11            -3.3%   1.494e+11        perf-stat.ps.dTLB-loads
    274020            -8.4%      250940 ± 18%  perf-stat.ps.dTLB-store-misses
 1.165e+11            -3.3%   1.126e+11        perf-stat.ps.dTLB-stores
 5.932e+11            -3.3%   5.734e+11        perf-stat.ps.instructions
 1.793e+14            -3.3%   1.733e+14        perf-stat.total.instructions
     32.73             -1.0       31.71        perf-profile.calltrace.cycles-pp.__entry_text_start.syscall
      8.37             -0.2        8.20        perf-profile.calltrace.cycles-pp.hash_futex.futex_wake.do_futex.__x64_sys_futex.do_syscall_64
      1.52             -0.1        1.38        perf-profile.calltrace.cycles-pp.rcu_nocb_flush_deferred_wakeup.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
      2.27             -0.1        2.17        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.syscall
      2.17             -0.1        2.08        perf-profile.calltrace.cycles-pp.syscall_enter_from_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
      1.32             -0.1        1.26        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
      5.45             +0.3        5.71        perf-profile.calltrace.cycles-pp.get_futex_key.futex_wake.do_futex.__x64_sys_futex.do_syscall_64
      7.55             +0.4        7.98        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
      5.07             +0.5        5.58        perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall
     28.26             +0.9       29.19        perf-profile.calltrace.cycles-pp.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     37.41             +1.1       38.50        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     33.56             +1.2       34.78        perf-profile.calltrace.cycles-pp.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     52.14             +1.3       53.40        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
     23.03             +1.4       24.44        perf-profile.calltrace.cycles-pp.futex_wake.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
     21.10             -0.7       20.44        perf-profile.children.cycles-pp.__entry_text_start
     17.77             -0.5       17.31        perf-profile.children.cycles-pp.syscall_return_via_sysret
      8.48             -0.2        8.28        perf-profile.children.cycles-pp.hash_futex
      1.58             -0.1        1.44        perf-profile.children.cycles-pp.rcu_nocb_flush_deferred_wakeup
      2.43             -0.1        2.33        perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      2.20             -0.1        2.11        perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.42 ±  6%       -0.1        0.36 ±  2%  perf-profile.children.cycles-pp.tick_sched_handle
      0.42 ±  6%       -0.1        0.36 ±  2%  perf-profile.children.cycles-pp.update_process_times
      1.34             -0.1        1.29        perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
      0.52 ±  2%       -0.0        0.48 ±  2%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.47 ±  2%       -0.0        0.43 ±  2%  perf-profile.children.cycles-pp.tick_sched_timer
      0.23 ±  4%       -0.0        0.20 ±  2%  perf-profile.children.cycles-pp.update_curr
      0.18 ±  4%       -0.0        0.16 ±  3%  perf-profile.children.cycles-pp.perf_prepare_sample
      5.60             +0.3        5.89        perf-profile.children.cycles-pp.get_futex_key
      8.20             +0.4        8.59        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      5.36             +0.5        5.86        perf-profile.children.cycles-pp.exit_to_user_mode_prepare
     23.57             +0.8       24.36        perf-profile.children.cycles-pp.futex_wake
     37.58             +1.1       38.68        perf-profile.children.cycles-pp.do_syscall_64
     52.56             +1.2       53.80        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     33.87             +1.2       35.11        perf-profile.children.cycles-pp.__x64_sys_futex
     28.60             +1.3       29.89        perf-profile.children.cycles-pp.do_futex
     17.64             -0.4       17.20        perf-profile.self.cycles-pp.syscall_return_via_sysret
      9.47             -0.3        9.15 ±  2%  perf-profile.self.cycles-pp.__entry_text_start
      6.88             -0.3        6.61        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      8.18             -0.2        7.98        perf-profile.self.cycles-pp.hash_futex
      1.33             -0.1        1.22        perf-profile.self.cycles-pp.rcu_nocb_flush_deferred_wakeup
      2.42             -0.1        2.32        perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      1.85             -0.1        1.77        perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      1.88             -0.1        1.81        perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      1.25             -0.0        1.21        perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
      1.27             -0.0        1.23        perf-profile.self.cycles-pp.do_syscall_64
      1.69             +0.0        1.71        perf-profile.self.cycles-pp.testcase
      5.26             +0.2        5.48        perf-profile.self.cycles-pp.get_futex_key
      9.61             +0.4       10.02        perf-profile.self.cycles-pp.futex_wake
      3.74             +0.6        4.37        perf-profile.self.cycles-pp.exit_to_user_mode_prepare
      5.08             +0.8        5.90        perf-profile.self.cycles-pp.do_futex

[ASCII plot of will-it-scale.per_thread_ops not recoverable; y-axis range 6.25e+06 to 6.65e+06]

[*] bisect-good sample
[O] bisect-bad sample

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

---
0DAY/LKP+ Test Infrastructure                   Open Source Technology Center
https://lists.01.org/hyperkitty/list/lkp@lists.01.org       Intel Corporation

Attachments:

(No filename) (12.43 kB)
config-5.12.0-rc2-00046-g4bad58ebc8bc (175.55 kB)
job-script (8.01 kB)
job.yaml (5.29 kB)
reproduce (348.00 B)

2021-04-20 18:37:17

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

On Tue, Apr 20 2021 at 11:08, kernel test robot wrote:
> FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
>
> commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with following parameters:
>
> nr_task: 100%
> mode: thread
> test: futex3
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
> commit:
> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> 1638 ± 3% -7.8% 1510 ± 5% sched_debug.cfs_rq:/.runnable_avg.max
> 297.83 ± 68% +1747.6% 5502 ±152% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
> 297.83 ± 68% +1747.6% 5502 ±152% interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2

This change is definitely not causing more network traffic

> 8200 -33.4% 5459 ± 35% interrupts.CPU27.NMI:Non-maskable_interrupts
> 8200 -33.4% 5459 ± 35% interrupts.CPU27.PMI:Performance_monitoring_interrupts
> 8199 -33.4% 5459 ± 35% interrupts.CPU28.NMI:Non-maskable_interrupts
> 8199 -33.4% 5459 ± 35% interrupts.CPU28.PMI:Performance_monitoring_interrupts
> 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.NMI:Non-maskable_interrupts
> 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.PMI:Performance_monitoring_interrupts
> 4287 ± 8% +33.6% 5730 ± 15% interrupts.CPU49.CAL:Function_call_interrupts
> 6356 ± 19% +49.6% 9509 ± 19% interrupts.CPU97.CAL:Function_call_interrupts

Neither does it increase the number of function calls

> 407730 ± 8% +37.5% 560565 ± 7% perf-stat.i.dTLB-load-misses
> 415959 ± 8% +40.4% 583928 ± 7% perf-stat.ps.dTLB-load-misses

And this massive increase does not make sense either.

Confused.

Thanks,

tglx

2021-04-22 05:48:03

by kernel test robot

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

hi, Thomas Gleixner,

On Tue, Apr 20, 2021 at 08:35:06PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 20 2021 at 11:08, kernel test robot wrote:
> > FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
> >
> > commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
> >
> > in testcase: will-it-scale
> > on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> > with following parameters:
> >
> > nr_task: 100%
> > mode: thread
> > test: futex3
> > cpufreq_governor: performance
> > ucode: 0x5003006
> >
> > test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> > test-url: https://github.com/antonblanchard/will-it-scale
> > commit:
> > 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> > 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
> >
> > 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> > 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> > 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> > 1638 ± 3% -7.8% 1510 ± 5% sched_debug.cfs_rq:/.runnable_avg.max
> > 297.83 ± 68% +1747.6% 5502 ±152% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
> > 297.83 ± 68% +1747.6% 5502 ±152% interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2
>
> This change is definitely not causing more network traffic
>
> > 8200 -33.4% 5459 ± 35% interrupts.CPU27.NMI:Non-maskable_interrupts
> > 8200 -33.4% 5459 ± 35% interrupts.CPU27.PMI:Performance_monitoring_interrupts
> > 8199 -33.4% 5459 ± 35% interrupts.CPU28.NMI:Non-maskable_interrupts
> > 8199 -33.4% 5459 ± 35% interrupts.CPU28.PMI:Performance_monitoring_interrupts
> > 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.NMI:Non-maskable_interrupts
> > 6148 ± 33% -11.2% 5459 ± 35% interrupts.CPU29.PMI:Performance_monitoring_interrupts
> > 4287 ± 8% +33.6% 5730 ± 15% interrupts.CPU49.CAL:Function_call_interrupts
> > 6356 ± 19% +49.6% 9509 ± 19% interrupts.CPU97.CAL:Function_call_interrupts
>
> Neither does it increase the number of function calls
>
> > 407730 ± 8% +37.5% 560565 ± 7% perf-stat.i.dTLB-load-misses
> > 415959 ± 8% +40.4% 583928 ± 7% perf-stat.ps.dTLB-load-misses
>
> And this massive increase does not make sense either.
>
> Confused.

FYI.
We re-tested this and confirmed that the regression persists. Still:

69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.271e+09 -3.3% 1.229e+09 will-it-scale.192.threads
6620228 -3.3% 6401749 will-it-scale.per_thread_ops
1.271e+09 -3.3% 1.229e+09 will-it-scale.workload

Both the fbc (first bad commit) and its parent use an identical config, as attached in the original report.

data for 4bad58ebc8bc4f20d89cff95417:
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json: "will-it-scale.per_thread_ops": [
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6404491,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6421116,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6402763,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6403483,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6412066,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6414511,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6395917, <------ new 14 runs
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6396872,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6400830,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6408883,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6403844,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6405911,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6390766,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394523,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394594,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6399547,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6402487,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6394673,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6400717,
4bad58ebc8bc4f20d89cff95417c9b4674769709/matrix.json- 6386997

data for parent (69995ebbb9d37173):
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json: "will-it-scale.per_thread_ops": [
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6640509,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6630326,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6633025,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6625355,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6623274,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6628858,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6614380, <----- new 14 runs
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6607324,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6613340,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6610083,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6616290,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6616934,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6618978,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6627108,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6609973,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6618440,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6617191,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6615858,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6615761,
69995ebbb9d3717306a165db88a1292b63f77a37/matrix.json- 6621558

>
> Thanks,
>
> tglx
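
The reported means and the -3.3% delta can be re-derived from the per-run samples listed in the message above. A minimal Python sketch, assuming nothing beyond those numbers: the sample values are copied from the two matrix.json listings, while the variable and function names are illustrative and not part of the LKP tooling.

```python
from statistics import mean, stdev

# will-it-scale.per_thread_ops samples copied from the matrix.json listings
fbc = [
    6404491, 6421116, 6402763, 6403483, 6412066, 6414511, 6395917,
    6396872, 6400830, 6408883, 6403844, 6405911, 6390766, 6394523,
    6394594, 6399547, 6402487, 6394673, 6400717, 6386997,
]
parent = [
    6640509, 6630326, 6633025, 6625355, 6623274, 6628858, 6614380,
    6607324, 6613340, 6610083, 6616290, 6616934, 6618978, 6627108,
    6609973, 6618440, 6617191, 6615858, 6615761, 6621558,
]


def summarize(samples):
    """Return (mean, relative standard deviation in percent)."""
    m = mean(samples)
    return m, 100.0 * stdev(samples) / m


fbc_mean, fbc_rsd = summarize(fbc)
parent_mean, parent_rsd = summarize(parent)

# Relative change of the first bad commit against its parent, in percent
change = 100.0 * (fbc_mean - parent_mean) / parent_mean

print(f"parent: {parent_mean:.0f} (stddev {parent_rsd:.2f}%)")
print(f"fbc:    {fbc_mean:.0f} (stddev {fbc_rsd:.2f}%)")
print(f"change: {change:+.1f}%")
```

Every sample for 4bad58ebc8 lies below every sample for the parent, so the two groups do not overlap.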

2021-04-22 15:39:53

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Oliver,

On Thu, Apr 22 2021 at 14:02, Oliver Sang wrote:
> On Tue, Apr 20, 2021 at 08:35:06PM +0200, Thomas Gleixner wrote:
>> Confused.
>
> FYI.
> We re-tested this and confirmed that the regression persists. Still:
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.271e+09 -3.3% 1.229e+09 will-it-scale.192.threads
> 6620228 -3.3% 6401749 will-it-scale.per_thread_ops
> 1.271e+09 -3.3% 1.229e+09 will-it-scale.workload

I'll have a look.

2021-04-30 08:17:33

by Feng Tang

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Hi Thomas,

On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to commit:
>
>
> commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to cache one sigqueue struct")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with following parameters:
>
> nr_task: 100%
> mode: thread
> test: futex3
> cpufreq_governor: performance
> ucode: 0x5003006
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
> gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex3/will-it-scale/0x5003006
>
> commit:
> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>
> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload

We've double-checked this, and it seems to be another case of a
regression caused by a code alignment change, just like the other
case we debugged, "[genirq] cbe16f35be:
will-it-scale.per_thread_ops -5.2% regression":

https://lore.kernel.org/lkml/[email protected]/

With the same debug patch that forces function addresses to be
64-byte aligned, commit 4bad58ebc8 causes no change in this case.

Commit 09c60546f04f ("./Makefile: add debug option to enable function
aligned on 32 bytes") only forced 32-byte alignment, on the thinking
that 64B alignment would occupy more code space and affect the iTLB
more. Maybe we should just extend it to 64B alignment, as it is for
debug only anyway.

Thanks,
Feng
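
To illustrate how a few added bytes of text can affect unrelated functions, here is a rough Python sketch. It is a toy model only: the function sizes are hypothetical, a 64-byte cache line is assumed, and the 16-byte figure used for the "default" alignment is an assumption for illustration, not something stated in this thread.

```python
CACHE_LINE = 64  # assumed cache line size in bytes


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


def layout(sizes, alignment):
    """Place functions back to back, rounding each start up to alignment."""
    addr = 0
    starts = []
    for size in sizes:
        addr = align_up(addr, alignment)
        starts.append(addr)
        addr += size
    return starts


def lines_touched(start, size, line=CACHE_LINE):
    """Number of cache lines overlapped by the byte range [start, start + size)."""
    return (start + size - 1) // line - start // line + 1


def report(sizes, alignment):
    """Cache lines touched by each function for a given start alignment."""
    starts = layout(sizes, alignment)
    return [lines_touched(st, sz) for st, sz in zip(starts, sizes)]


before = [120, 64, 200]  # hypothetical function sizes in bytes
after = [130, 64, 200]   # same functions, the first one grew by 10 bytes

for alignment in (16, 64):
    print(alignment, report(before, alignment), report(after, alignment))
```

In this model, with 16-byte alignment the unchanged second function goes from one cache line to two after the first function grows; with 64-byte alignment its start stays on a line boundary and its count does not change.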

2021-04-30 08:58:55

by Thomas Gleixner

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Feng,

On Fri, Apr 30 2021 at 16:13, Feng Tang wrote:
> On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
>> commit:
>> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
>> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>>
>> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
>> ---------------- ---------------------------
>> %stddev %change %stddev
>> \ | \
>> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
>> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
>> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
>
> We've double-checked this, and it seems to be another case of a
> regression caused by a code alignment change, just like the other
> case we debugged, "[genirq] cbe16f35be:
> will-it-scale.per_thread_ops -5.2% regression":
>
> https://lore.kernel.org/lkml/[email protected]/
>
> With the same debug patch that forces function addresses to be
> 64-byte aligned, commit 4bad58ebc8 causes no change in this case.
>
> Commit 09c60546f04f ("./Makefile: add debug option to enable function
> aligned on 32 bytes") only forced 32-byte alignment, on the thinking
> that 64B alignment would occupy more code space and affect the iTLB
> more. Maybe we should just extend it to 64B alignment, as it is for
> debug only anyway.

Thanks for the heads up!

But why is this restricted to debug mode?

The fact that adding a few bytes of text causes regressions in unrelated
code is not restricted to debug, or am I missing something here?

Thanks,

tglx

2021-05-01 09:50:23

by Feng Tang

Subject: Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

Hi Thomas,

On Fri, Apr 30, 2021 at 10:57:20AM +0200, Thomas Gleixner wrote:
> Feng,
>
> On Fri, Apr 30 2021 at 16:13, Feng Tang wrote:
> > On Tue, Apr 20, 2021 at 11:08:37AM +0800, kernel test robot wrote:
> >> commit:
> >> 69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
> >> 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
> >>
> >> 69995ebbb9d37173 4bad58ebc8bc4f20d89cff95417
> >> ---------------- ---------------------------
> >> %stddev %change %stddev
> >> \ | \
> >> 1.273e+09 -3.3% 1.231e+09 will-it-scale.192.threads
> >> 6630224 -3.3% 6409738 will-it-scale.per_thread_ops
> >> 1.273e+09 -3.3% 1.231e+09 will-it-scale.workload
> >
> > We've double-checked this, and it seems to be another case of a
> > regression caused by a code alignment change, just like the other
> > case we debugged, "[genirq] cbe16f35be:
> > will-it-scale.per_thread_ops -5.2% regression":
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > With the same debug patch that forces function addresses to be
> > 64-byte aligned, commit 4bad58ebc8 causes no change in this case.
> >
> > Commit 09c60546f04f ("./Makefile: add debug option to enable function
> > aligned on 32 bytes") only forced 32-byte alignment, on the thinking
> > that 64B alignment would occupy more code space and affect the iTLB
> > more. Maybe we should just extend it to 64B alignment, as it is for
> > debug only anyway.
>
> Thanks for the heads up!
>
> But why is this restricted to debug mode?
>
> The fact that adding a few bytes of text causes regressions in unrelated
> code is not restricted to debug, or am I missing something here?

With the default 0day kernel config, the 64B_force_aligned kernel is 11%
bigger than the 32B_force_aligned kernel (both the vmlinux and its text
size), and benchmarks also show performance drops with the
64B_force_aligned kernel (probably related to the larger i-cache and
i-TLB footprint).

We are also still looking for other ways to get the same effect without
increasing the kernel text so much, so for now we are keeping it under
the debug options.

Thanks,
Feng

> Thanks,
>
> tglx
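
The size cost of stricter function alignment mentioned above comes from padding inserted before each function. A minimal Python sketch of that effect follows; the list of function sizes is made up for illustration and is not meant to reproduce the 11% figure from the thread.

```python
def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


def text_size(sizes, alignment):
    """Bytes occupied when every function start is rounded up to alignment."""
    addr = 0
    for size in sizes:
        addr = align_up(addr, alignment) + size
    return addr


# Hypothetical function sizes in bytes
sizes = [24, 40, 72, 100, 150, 33, 260, 18]

size32 = text_size(sizes, 32)
size64 = text_size(sizes, 64)
growth = 100.0 * (size64 - size32) / size32

print(f"32B aligned: {size32} bytes")
print(f"64B aligned: {size64} bytes")
print(f"growth:      {growth:.1f}%")
```

Small functions contribute most of the padding in this model, since each one can waste up to one alignment unit minus one byte.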