2020-02-05 12:34:57

by Chen, Rong A

[permalink] [raw]
Subject: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Greeting,

FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:


commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with following parameters:

nr_task: 100%
mode: process
test: signal1
cpufreq_governor: performance
ucode: 0x500002c

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-20191114.cgz/lkp-csl-2ap4/signal1/will-it-scale/0x500002c

commit:
v5.0-rc6
81ec3f3c4c ("perf/x86: Add check_period PMU callback")

v5.0-rc6 81ec3f3c4c4d78f2d3b6689c981
---------------- ---------------------------
%stddev %change %stddev
\ | \
17987 -5.5% 16997 will-it-scale.per_process_ops
3453717 -5.5% 3263556 will-it-scale.workload
3.032e+08 ± 22% -56.2% 1.329e+08 ± 71% cpuidle.C6.time
435366 ± 25% -51.4% 211628 ± 35% cpuidle.C6.usage
5620 ± 50% -33.6% 3731 ± 4% softirqs.CPU187.RCU
7972 ± 42% -44.7% 4407 ± 3% softirqs.CPU51.RCU
381824 ± 27% -66.5% 128076 ± 91% turbostat.C6
0.48 ± 24% -0.3 0.18 ± 93% turbostat.C6%
640.83 ± 6% +14.9% 736.62 ± 6% sched_debug.cfs_rq:/.util_avg.min
56437 ± 9% -17.3% 46662 ± 7% sched_debug.cpu.nr_switches.max
5224 ± 5% -7.1% 4852 ± 6% sched_debug.cpu.nr_switches.stddev
54643 ± 9% -17.9% 44853 ± 7% sched_debug.cpu.sched_count.max
26160 ± 10% -17.6% 21557 ± 7% sched_debug.cpu.ttwu_count.max
25875 ± 9% -17.3% 21398 ± 7% sched_debug.cpu.ttwu_local.max
745.75 ± 16% -46.8% 396.75 ± 38% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
952.50 ±108% -93.2% 65.00 ± 50% interrupts.CPU1.RES:Rescheduling_interrupts
745.75 ± 16% -46.8% 396.75 ± 38% interrupts.CPU11.33:PCI-MSI.524291-edge.eth0-TxRx-2
740.50 ±171% -100.0% 0.25 ±173% interrupts.CPU166.RES:Rescheduling_interrupts
3207 ± 6% +27.7% 4095 ± 5% interrupts.CPU185.CAL:Function_call_interrupts
152.75 ±168% -99.2% 1.25 ± 34% interrupts.CPU50.RES:Rescheduling_interrupts
698.25 ±166% -98.7% 9.25 ±135% interrupts.CPU70.RES:Rescheduling_interrupts
3367 ± 2% +13.5% 3821 ± 6% interrupts.CPU89.CAL:Function_call_interrupts
202.75 ±117% -85.1% 30.25 ± 83% interrupts.CPU96.RES:Rescheduling_interrupts
71307 ± 3% -11.0% 63441 numa-vmstat.node2.nr_file_pages
7081 ± 3% -19.6% 5696 numa-vmstat.node2.nr_kernel_stack
1854 ± 7% -40.2% 1109 numa-vmstat.node2.nr_mapped
2524 ± 6% -18.1% 2068 ± 2% numa-vmstat.node2.nr_page_table_pages
5132 ± 15% -38.5% 3159 ± 10% numa-vmstat.node2.nr_slab_reclaimable
15668 ± 9% -22.2% 12192 ± 5% numa-vmstat.node2.nr_slab_unreclaimable
70254 ± 2% -9.9% 63317 numa-vmstat.node2.nr_unevictable
70254 ± 2% -9.9% 63317 numa-vmstat.node2.nr_zone_unevictable
361503 ± 20% -23.9% 274980 ± 6% numa-vmstat.node2.numa_hit
275672 ± 26% -31.3% 189397 ± 10% numa-vmstat.node2.numa_local
285230 ± 3% -11.0% 253766 numa-meminfo.node2.FilePages
1707 ± 11% -73.2% 457.50 ±118% numa-meminfo.node2.Inactive
20532 ± 15% -38.4% 12638 ± 10% numa-meminfo.node2.KReclaimable
7082 ± 3% -19.6% 5696 numa-meminfo.node2.KernelStack
7112 ± 5% -37.6% 4436 numa-meminfo.node2.Mapped
590073 ± 6% -15.9% 496353 ± 9% numa-meminfo.node2.MemUsed
10101 ± 6% -18.0% 8282 ± 2% numa-meminfo.node2.PageTables
20532 ± 15% -38.4% 12638 ± 10% numa-meminfo.node2.SReclaimable
62680 ± 9% -22.2% 48775 ± 5% numa-meminfo.node2.SUnreclaim
83213 ± 8% -26.2% 61414 ± 6% numa-meminfo.node2.Slab
281021 ± 2% -9.9% 253271 numa-meminfo.node2.Unevictable
3.322e+09 -5.1% 3.152e+09 perf-stat.i.branch-instructions
1.17 +0.0 1.18 perf-stat.i.branch-miss-rate%
38492834 -4.2% 36888923 perf-stat.i.branch-misses
42.83 -0.7 42.11 perf-stat.i.cache-miss-rate%
43547238 -6.0% 40916641 ± 2% perf-stat.i.cache-misses
1.014e+08 -4.5% 96860566 perf-stat.i.cache-references
34.91 +5.3% 36.76 perf-stat.i.cpi
13610 +6.2% 14457 ± 2% perf-stat.i.cycles-between-cache-misses
5.049e+09 -5.4% 4.777e+09 perf-stat.i.dTLB-loads
0.00 ± 8% +0.0 0.00 ± 2% perf-stat.i.dTLB-store-miss-rate%
17697 ± 5% +90.1% 33643 ± 3% perf-stat.i.dTLB-store-misses
3.162e+09 -5.3% 2.994e+09 perf-stat.i.dTLB-stores
26478258 -8.6% 24200892 perf-stat.i.iTLB-load-misses
1.682e+10 -5.1% 1.596e+10 perf-stat.i.instructions
640.74 +3.6% 664.01 ± 2% perf-stat.i.instructions-per-iTLB-miss
3611851 -4.9% 3435033 perf-stat.i.node-load-misses
7022617 -3.0% 6811102 perf-stat.i.node-store-misses
1.16 +0.0 1.17 perf-stat.overall.branch-miss-rate%
42.96 -0.7 42.24 perf-stat.overall.cache-miss-rate%
34.96 +5.3% 36.81 perf-stat.overall.cpi
13501 +6.4% 14364 ± 2% perf-stat.overall.cycles-between-cache-misses
0.00 ± 5% +0.0 0.00 ± 2% perf-stat.overall.dTLB-store-miss-rate%
635.29 +3.8% 659.66 perf-stat.overall.instructions-per-iTLB-miss
0.03 -5.0% 0.03 perf-stat.overall.ipc
3.308e+09 -5.1% 3.139e+09 perf-stat.ps.branch-instructions
38324306 -4.1% 36739072 perf-stat.ps.branch-misses
43365638 -6.0% 40745255 ± 2% perf-stat.ps.cache-misses
1.01e+08 -4.5% 96451003 perf-stat.ps.cache-references
5.028e+09 -5.4% 4.757e+09 perf-stat.ps.dTLB-loads
17634 ± 5% +90.1% 33526 ± 3% perf-stat.ps.dTLB-store-misses
3.149e+09 -5.3% 2.982e+09 perf-stat.ps.dTLB-stores
26369184 -8.6% 24103019 perf-stat.ps.iTLB-load-misses
1.675e+10 -5.1% 1.59e+10 perf-stat.ps.instructions
3597149 -4.9% 3420963 perf-stat.ps.node-load-misses
6994250 -3.0% 6784176 perf-stat.ps.node-store-misses
5.199e+12 -5.2% 4.931e+12 perf-stat.total.instructions



will-it-scale.per_process_ops

19500 +-+-----------------------------------------------------------------+
|..+.+..+. |
19000 +-+ +.. |
| |
18500 +-+ +..+. .+.+..+.+..+..+ |
| +..+. + |
18000 +-+ +..+.+..+..+.+..+..+ |
| |
17500 +-+ O O |
| O |
17000 +-+ O O O O O O O O O
| O O |
16500 O-+O O O O O O O O O |
| O O O |
16000 +-+-----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (9.89 kB)
config-5.0.0-rc6-00001-g81ec3f3c4c4d7 (190.62 kB)
job-script (7.61 kB)
job.yaml (5.23 kB)
reproduce (322.00 B)
Download all attachments

2020-02-05 12:59:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>

I'm fairly sure this bisect/result is bogus.

2020-02-06 03:05:33

by Philip Li

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

> Subject: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5%
> regression
>
> On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to
> commit:
> >
> >
> > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add
> check_period PMU callback")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> I'm fairly sure this bisect/result is bogus.
Hi Peter, thanks for the feedback, we will investigate this as soon as possible.


2020-02-21 08:04:51

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression


On Wed, Feb 05, 2020 at 01:58:04PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
> >
> >
> > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> I'm fairly sure this bisect/result is bogus.


Hi Peter,

Some updates:

We checked more on this. We ran the test 14 times, and the
results consistently show the 5.5% degradation. We also ran
the same test on several other platforms; their results are
likewise consistent, though no such -5.5% drop is seen there.

We are also curious because the commit seems completely
unrelated to this signal scalability test, which starts a task
for each online CPU, keeps calling raise(), and counts the
completed runs.
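
[ For reference, a minimal userspace sketch of that pattern -- this is not
the actual will-it-scale signal1.c, just an illustration of "one worker per
CPU, each looping on raise() and counting operations"; the 5-second run
time and the SIGALRM-based stop flag are arbitrary choices for the sketch: ]

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static volatile sig_atomic_t stop;

static void on_alarm(int sig) { stop = 1; }
static void on_usr1(int sig)  { /* just take the delivery */ }

static void worker(unsigned int seconds)
{
	unsigned long long ops = 0;

	signal(SIGUSR1, on_usr1);
	signal(SIGALRM, on_alarm);
	alarm(seconds);
	while (!stop) {
		raise(SIGUSR1);		/* queue + deliver a signal to ourselves */
		ops++;
	}
	printf("%llu ops\n", ops);
}

int main(int argc, char **argv)
{
	/* one worker process per online CPU by default */
	int i, nr = argc > 1 ? atoi(argv[1]) : (int)sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < nr; i++) {
		if (fork() == 0) {
			worker(5);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}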

One experiment we did is checking which part of the commit
really affects the test, and it turned out to be the change to
"struct pmu". Effectively, applying just this hunk on top of
5.0-rc6 triggers the same regression:

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1d5c551..e1a0517 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -447,6 +447,11 @@ struct pmu {
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match) (struct perf_event *event); /* optional */
+
+	/*
+	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
+	 */
+	int (*check_period) (struct perf_event *event, u64 value); /* optional */
 };

So most likely this commit changes the layout of the kernel
text and data, which may trigger some cacheline-level change.
Comparing the system maps of the 2 kernels, a big chunk of
symbols following the global "pmu" have their addresses shifted:

5.0-rc6-systemap:

ffffffff8221d000 d pmu
ffffffff8221d100 d pmc_reserve_mutex
ffffffff8221d120 d amd_f15_PMC53
ffffffff8221d160 d amd_f15_PMC50

5.0-rc6+pmu-change-systemap:

ffffffff8221d000 d pmu
ffffffff8221d120 d pmc_reserve_mutex
ffffffff8221d140 d amd_f15_PMC53
ffffffff8221d180 d amd_f15_PMC50

But we can hardly identify which exact symbol is responsible
for the change, as too many symbols are offset.

btw, we've seen a similar case where a seemingly irrelevant
commit changed a benchmark, like a hugetlb patch improving the
pagefault test on a platform that never uses hugetlb:
https://lkml.org/lkml/2020/1/14/150

Thanks,
Feng


2020-02-21 10:59:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Fri, Feb 21, 2020 at 04:03:25PM +0800, Feng Tang wrote:

> We checked more on this. We run 14 times test for it, and the
> results are consistent about the 5.5% degradation, and we
> run the same test on several other platforms, whose test results
> are also consistent, though there are no such -5.5% seen.

> So likely, this commit changes the layout of the kernel text
> and data, which may trigger some cacheline level change. From
> the system map of the 2 kernels, a big trunk of symbol's address
> changes which follow the global "pmu",
>
> 5.0-rc6-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d100 d pmc_reserve_mutex
> ffffffff8221d120 d amd_f15_PMC53
> ffffffff8221d160 d amd_f15_PMC50
>
> 5.0-rc6+pmu-change-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d120 d pmc_reserve_mutex
> ffffffff8221d140 d amd_f15_PMC53
> ffffffff8221d180 d amd_f15_PMC50
>
> But we can hardly identify which exact symbol is responsible
> for the change, as too many symbols are offseted.

*groan*, that's horrible.

Thanks for digging into this.

2020-02-21 13:22:49

by Jiri Olsa

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Fri, Feb 21, 2020 at 04:03:25PM +0800, Feng Tang wrote:
>
> On Wed, Feb 05, 2020 at 01:58:04PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
> > >
> > >
> > > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> >
> > I'm fairly sure this bisect/result is bogus.
>
>
> Hi Peter,
>
> Some updates:
>
> We checked more on this. We run 14 times test for it, and the
> results are consistent about the 5.5% degradation, and we
> run the same test on several other platforms, whose test results
> are also consistent, though there are no such -5.5% seen.
>
> We are also curious that the commit seems to be completely not
> relative to this scalability test of signal, which starts a task
> for each online CPU, and keeps calling raise(), and calculating
> the run numbers.
>
> One experiment we did is checking which part of the commit
> really affects the test, and it turned out to be the change of
> "struct pmu". Effectively, applying this patch upon 5.0-rc6
> which triggers the same regression.
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1d5c551..e1a0517 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -447,6 +447,11 @@ struct pmu {
>  	 * Filter events for PMU-specific reasons.
>  	 */
>  	int (*filter_match) (struct perf_event *event); /* optional */
> +
> +	/*
> +	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> +	 */
> +	int (*check_period) (struct perf_event *event, u64 value); /* optional */
>  };
>
> So likely, this commit changes the layout of the kernel text
> and data, which may trigger some cacheline level change. From
> the system map of the 2 kernels, a big trunk of symbol's address
> changes which follow the global "pmu",

nice, I wonder if we could see that in the perf c2c output ;-)
I'll try to run it and check

thanks,
jirka

>
> 5.0-rc6-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d100 d pmc_reserve_mutex
> ffffffff8221d120 d amd_f15_PMC53
> ffffffff8221d160 d amd_f15_PMC50
>
> 5.0-rc6+pmu-change-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d120 d pmc_reserve_mutex
> ffffffff8221d140 d amd_f15_PMC53
> ffffffff8221d180 d amd_f15_PMC50
>
> But we can hardly identify which exact symbol is responsible
> for the change, as too many symbols are offseted.
>
> btw, we've seen similar case that an irrelevant commit changes
> the benchmark, like a hugetlb patch improves pagefault test on
> a platform that never uses hugetlb https://lkml.org/lkml/2020/1/14/150
>
> Thanks,
> Feng
>

2020-02-21 18:05:27

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression



>So likely, this commit changes the layout of the kernel text
>and data,

It should be only data here. text changes all the time anyways,
but data tends to be more stable.

> which may trigger some cacheline level change. From
>the system map of the 2 kernels, a big trunk of symbol's address
>changes which follow the global "pmu",

I wonder if it's the effect Andrew predicted a long time ago from
using __read_mostly. If all the __read_mostlies are moved somewhere
else the remaining read/write variables will get more sensitive to false sharing.

A simple experiment would be to add a __cacheline_aligned to align it,
and then add

____cacheline_aligned char dummy[0];

at the end to pad it to 64 bytes.

Or hopefully Jiri can figure it out from the C2C data.
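
[ A rough sketch of what that padding experiment could look like on top of
5.0-rc6 -- illustration only; the "dummy" member is just the placeholder
name from the suggestion above, not anything that exists upstream: ]

 struct pmu {
 	...
 	int (*check_period) (struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Pad/align the struct to a full cacheline, so the data placed
+	 * after the global 'pmu' keeps its old cacheline offsets.
+	 */
+	____cacheline_aligned char dummy[0];
 };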

>btw, we've seen similar case that an irrelevant commit changes
>the benchmark, like a hugetlb patch improves pagefault test on
>a platform that never uses hugetlb https://lkml.org/lkml/2020/1/14/150

Yes we've had similar problems with the data segment before.

-Andi

2020-02-22 12:44:51

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Andi,

On Sat, Feb 22, 2020 at 02:05:02AM +0800, Kleen, Andi wrote:
>
>
> >So likely, this commit changes the layout of the kernel text
> >and data,
>
> It should be only data here. text changes all the time anyways,
> but data tends to be more stable.

Yes, I also did an experiment modifying the gcc options to
align all function addresses to 32 or 64 bytes, and the 5.5%
gap still exists for the 2 commits.

> > which may trigger some cacheline level change. From
> >the system map of the 2 kernels, a big trunk of symbol's address
> >changes which follow the global "pmu",
>
> I wonder if it's the effect Andrew predicted a long time ago from
> using __read_mostly. If all the __read_mostlies are moved somewhere
> else the remaining read/write variables will get more sensitive to false sharing.
>
> A simple experiment would be to add a __cacheline_aligned to align it,
> and then add
>
> ____cacheline_aligned char dummy[0];
>
> at the end to pad it to 64bytes.

Thanks for the suggestion, I tried this and the 5.5% regression is gone!
This also confirms that the offset of the bulk of stuff following "pmu"
causes the performance drop.

>
> Or hopefully Jiri can figure it out from the C2C data.

I'm also trying to debug it following Jiri's "perf c2c" suggestion.

Thanks,
Feng

2020-02-22 17:09:11

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

>Thanks for the suggestion, I tried this and the 5.5 regrssion is gone!
>which also confirms the offset for the bulk of stuff following "pmu"
>causes the performance drop.

Okay that confirms that it is false sharing. Looking at c2c is the right way to
go then.

Peter, could we adopt the padding as a workaround for now?

-Andi

2020-02-23 14:12:28

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Jiri,

On Fri, Feb 21, 2020 at 02:20:48PM +0100, Jiri Olsa wrote:

> > We are also curious that the commit seems to be completely not
> > relative to this scalability test of signal, which starts a task
> > for each online CPU, and keeps calling raise(), and calculating
> > the run numbers.
> >
> > One experiment we did is checking which part of the commit
> > really affects the test, and it turned out to be the change of
> > "struct pmu". Effectively, applying this patch upon 5.0-rc6
> > which triggers the same regression.
> > So likely, this commit changes the layout of the kernel text
> > and data, which may trigger some cacheline level change. From
> > the system map of the 2 kernels, a big trunk of symbol's address
> > changes which follow the global "pmu",
>
> nice, I wonder we could see that in perf c2c output ;-)
> I'll try to run and check

Thanks for the "perf c2c" suggestion.

I tried to use perf-c2c on one platform (not the one that shows
the 5.5% regression), and found the main "hitm" points to the
"root_user" global data, as there is a task for each CPU doing
the signal stress test, and both __sigqueue_alloc() and
__sigqueue_free() will call get_uid() and free_uid() to inc/dec
this root_user's refcount.
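
[ For context, the 5.0-era hot path does two atomics on the shared
root_user object for every queued/dequeued signal -- simplified from
kernel/signal.c, matching the hunks quoted later in this thread: ]

	/* __sigqueue_alloc(), simplified */
	rcu_read_lock();
	user = get_uid(__task_cred(t)->user);	/* atomic refcount inc on root_user */
	atomic_inc(&user->sigpending);		/* second atomic, same cacheline */
	rcu_read_unlock();

	/* __sigqueue_free(), simplified */
	atomic_dec(&q->user->sigpending);
	free_uid(q->user);			/* atomic refcount dec */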

Then I added some alignment inside struct "user_struct" (for
"root_user"), and the -5.5% is gone, with a +2.6% instead.

One c2c report log is attached.

One thing I don't understand is that this -5.5% only happens on
one 2-socket, 96C/192T Cascade Lake platform, while we've run
the same test on several different platforms. In theory, the
false sharing should also take effect there?

Thanks,
Feng


Attachments:
(No filename) (1.64 kB)
c2c_wis_sig_32T.log (173.83 kB)
Download all attachments

2020-02-23 17:37:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 6:11 AM Feng Tang <[email protected]> wrote:
>
> I tried to use perf-c2c on one platform (not the one that show
> the 5.5% regression), and found the main "hitm" points to the
> "root_user" global data, as there is a task for each CPU doing
> the signal stress test, and both __sigqueue_alloc() and
> __sigqueue_free() will call get_user() and free_uid() to inc/dec
> this root_user's refcount.

What's around it for you?

There might be that 'uidhash_lock' spinlock right next to it, and
maybe that exacerbates the issue?

> Then I added some alignement inside struct "user_struct" (for
> "root_user"), then the -5.5% is gone, with a +2.6% instead.

Do you actually need to align things inside the struct, or is it
sufficient to just align the structure itself?

IOW, is the cache conflicts _within_ the user_struct itself, or is it
with some nearby data (like that uidhash_lock or whatever?)

> One thing I don't understand is, this -5.5% only happens in
> one 2 sockets, 96C/192T Cascadelake platform, as we've run
> the same test on several different platforms. In therory,
> the false sharing may also take effect?

Is that the biggest machine you have access to?

Maybe it just isn't noticeable with smaller core counts. A lot of
conflict loads tend to have "exponential" behavior - when things get
overloaded, performance plummets because it just makes things worse as
everybody gets slower at that contention point and now it gets even
more contended...

Linus

2020-02-23 19:37:30

by Jiri Olsa

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 10:11:47PM +0800, Feng Tang wrote:
> Hi Jiri,

hi,

>
> On Fri, Feb 21, 2020 at 02:20:48PM +0100, Jiri Olsa wrote:
>
> > > We are also curious that the commit seems to be completely not
> > > relative to this scalability test of signal, which starts a task
> > > for each online CPU, and keeps calling raise(), and calculating
> > > the run numbers.
> > >
> > > One experiment we did is checking which part of the commit
> > > really affects the test, and it turned out to be the change of
> > > "struct pmu". Effectively, applying this patch upon 5.0-rc6
> > > which triggers the same regression.
> > > So likely, this commit changes the layout of the kernel text
> > > and data, which may trigger some cacheline level change. From
> > > the system map of the 2 kernels, a big trunk of symbol's address
> > > changes which follow the global "pmu",
> >
> > nice, I wonder we could see that in perf c2c output ;-)
> > I'll try to run and check
>
> Thanks for the "perf c2c" suggestion.

I'm fighting with lkp tests.. looks like it's not fedora friendly ;-)

which specific test is doing this? perhaps I can dig it out and run
without the script machinery..

>
> I tried to use perf-c2c on one platform (not the one that show
> the 5.5% regression), and found the main "hitm" points to the
> "root_user" global data, as there is a task for each CPU doing
> the signal stress test, and both __sigqueue_alloc() and
> __sigqueue_free() will call get_user() and free_uid() to inc/dec
> this root_user's refcount.
>
> Then I added some alignement inside struct "user_struct" (for
> "root_user"), then the -5.5% is gone, with a +2.6% instead.

could you share the change?

>
> One c2c report log is attached.

could you also post one (for the same data) without the callchains?

# perf c2c report --stdio --call-graph none

it should show the read/write/offset for cachelines in a more
readable way

I'd also be interested to see the data if you can share it (no
worries if not) ... I'd need the perf.data and the bz2 file from
'perf archive' run on the perf.data

>
> One thing I don't understand is, this -5.5% only happens in
> one 2 sockets, 96C/192T Cascadelake platform, as we've run
> the same test on several different platforms. In therory,
> the false sharing may also take effect?

I don't have access to a Cascade Lake, but AFAICT the bigger the
machine, the bigger the issues with false sharing ;-)

jirka

2020-02-24 00:34:01

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Linus,

On Sun, Feb 23, 2020 at 09:37:06AM -0800, Linus Torvalds wrote:
> On Sun, Feb 23, 2020 at 6:11 AM Feng Tang <[email protected]> wrote:
> >
> > I tried to use perf-c2c on one platform (not the one that show
> > the 5.5% regression), and found the main "hitm" points to the
> > "root_user" global data, as there is a task for each CPU doing
> > the signal stress test, and both __sigqueue_alloc() and
> > __sigqueue_free() will call get_user() and free_uid() to inc/dec
> > this root_user's refcount.
>
> What's around it for you?
>
> There might be that 'uidhash_lock' spinlock right next to it, and
> maybe that exacerbates the issue?

The system map shows:

ffffffff8225b520 d __syscall_meta__ptrace
ffffffff8225b560 d args__ptrace
ffffffff8225b580 d types__ptrace
ffffffff8225b5c0 D root_user
ffffffff8225b680 D init_user_ns
ffffffff8225b880 d ratelimit_state.56624
ffffffff8225b8c0 d event_exit__sigsuspend

I also searched for the uidhash_lock:
ffffffff82b04c80 b uidhash_lock

> > Then I added some alignement inside struct "user_struct" (for
> > "root_user"), then the -5.5% is gone, with a +2.6% instead.
>
> Do you actually need to align things inside the struct, or is it
> sufficient to just align the structure itself?

Initially I just added the ____cacheline_aligned after the definition
of struct 'user_struct', which only makes the following 'init_user_ns'
aligned, and the test result doesn't show much change.

Then I added

struct user_struct {
+ char dummy[0] ____cacheline_aligned;

to make 'root_user' itself aligned.


> IOW, is the cache conflicts _within_ the user_struct itself, or is it
> with some nearby data (like that uidhash_lock or whatever?)

From the perf c2c data and from checking the source code, the
conflicts only happen on root_user.__count and root_user.sigpending,
as all running tasks are accessing this global data for get/put and
other operations.

> > One thing I don't understand is, this -5.5% only happens in
> > one 2 sockets, 96C/192T Cascadelake platform, as we've run
> > the same test on several different platforms. In therory,
> > the false sharing may also take effect?
>
> Is that the biggest machine you have access to?
>
> Maybe it just isn't noticeable with smaller core counts. A lot of
> conflict loads tend to have "exponential" behavior - when things get
> overloaded, performance plummets because it just makes things worse as
> everybody gets slower at that contention point and now it gets even
> more contended...

No, it's not the biggest; I tried another machine, 'Xeon Phi(TM) CPU 7295',
which has 72C/288T, and the regression is not seen there. This is the part
confusing me :)

Also, I've tried to limit the concurrent task number from 192 to 96/48/24/6/1,
and the regression did get smaller as the task count decreased.

Thanks,
Feng

> Linus

2020-02-24 01:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 4:33 PM Feng Tang <[email protected]> wrote:
>
> From the perf c2c data, and the source code checking, the conflicts
> only happens for root_user.__count, and root_user.sigpending, as
> all running tasks are accessing this global data for get/put and
> other operations.

That's odd.

Why? Because those two would be guaranteed to be in the same cacheline
_after_ you've aligned that user_struct.

So if it were a false sharing issue between those two, it would
actually get _worse_ with alignment. Those two fields are basically
next to each other.

But maybe it was straddling a cacheline before, and it caused two
cache accesses each time?

I find this as confusing as you do.

If it's sigpending vs the __refcount, then we almost always change
them together. sigpending gets incremented by __sigqueue_alloc() -
which also does a "get_uid()", and then we decrement it in
__sigqueue_free() - which also does a "free_uid()".

That said, exactly *because* they get incremented and decremented
together, maybe we could do something clever: make the "sigpending" be
a separate user counter, kind of how we do mm->user vs mm->count.

And we'd only increment __refcount as the sigpending goes from zero to
non-zero, and decrement it as sigpending goes back to zero. Avoiding
the double atomics for the case of "lots of signals".
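
[ In code terms, the idea is roughly the following -- just a sketch of the
pattern; the actual patch is quoted in full later in the thread: ]

	/* take the uid reference only on the 0 -> 1 transition of the
	 * per-user pending-signal count ... */
	if (atomic_inc_return(&user->sigpending) == 1)
		get_uid(user);

	/* ... and drop it only on the 1 -> 0 transition */
	if (atomic_dec_and_test(&user->sigpending))
		free_uid(user);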

> ffffffff8225b580 d types__ptrace
> ffffffff8225b5c0 D root_user
> ffffffff8225b680 D init_user_ns

I'm assuming this is after the alignment patch (since that's 64-byte
aligned there).

What was it without the alignment?

> No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> which has 72C/288T, and the regression is not seen. This is the part
> confusing me :)

Hmm.

Humor me - what happens if you turn off SMT on that Cascade Lake
system? Maybe it's about the thread ID bit in the L1? Although again,
I'd have expected things to get _worse_ if it's the two fields that
are now in the same cacheline thanks to alignment.

The Xeon Phi is the small-core setup, right? They may be slow enough
to not show the issue as clearly despite having more cores. And it
wouldn't show effects of some out-of-order speculative cache accesses.

Linus

2020-02-24 01:59:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> On Sun, Feb 23, 2020 at 4:33 PM Feng Tang <[email protected]> wrote:
>>
>> From the perf c2c data, and the source code checking, the conflicts
>> only happens for root_user.__count, and root_user.sigpending, as
>> all running tasks are accessing this global data for get/put and
>> other operations.
>
> That's odd.
>
> Why? Because those two would be guaranteed to be in the same cacheline
> _after_ you've aligned that user_struct.
>
> So if it were a false sharing issue between those two, it would
> actually get _worse_ with alignment. Those two fields are basically
> next to each other.
>
> But maybe it was straddling a cacheline before, and it caused two
> cache accesses each time?
>
> I find this as confusing as you do.
>
> If it's sigpending vs the __refcount, then we almost always change
> them together. sigpending gets incremented by __sigqueue_alloc() -
> which also does a "get_uid()", and then we decrement it in
> __sigqueue_free() - which also does a "free_uid().
>

One way to verify this is to change the layout of user_struct (or
root_user) to put the __count and sigpending fields in 2 separate
cache lines explicitly.
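
[ A sketch of that layout experiment -- illustration only, with the other
fields omitted; the annotations force __count and sigpending onto
different cachelines: ]

	struct user_struct {
		refcount_t	__count ____cacheline_aligned;	  /* reference count */
		atomic_t	sigpending ____cacheline_aligned; /* pending signals */
		...
	};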

Best Regards,
Huang, Ying

2020-02-24 02:20:03

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 05:06:33PM -0800, Linus Torvalds wrote:

> > ffffffff8225b580 d types__ptrace
> > ffffffff8225b5c0 D root_user
> > ffffffff8225b680 D init_user_ns
>
> I'm assuming this is after the alignment patch (since that's 64-byte
> aligned there).
>
> What was it without the alignment?

For 5.0-rc6:
ffffffff8225b4c0 d types__ptrace
ffffffff8225b4e0 D root_user
ffffffff8225b580 D init_user_ns

For 5.0-rc6 + 81ec3f3c4c4
ffffffff8225b580 d types__ptrace
ffffffff8225b5a0 D root_user
ffffffff8225b640 D init_user_ns

The sigpending and __count are in the same cacheline.

>
> > No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> > which has 72C/288T, and the regression is not seen. This is the part
> > confusing me :)
>
> Hmm.
>
> Humor me - what happens if you turn off SMT on that Cascade Lake
> system? Maybe it's about the thread ID bit in the L1? Although again,
> I'd have expected things to get _worse_ if it's the two fields that
> are now in the same cachline thanks to alignment.

I'll try it and report back.

> The Xeon Phi is the small-core setup, right? They may be slow enough
> to not show the issue as clearly despite having more cores. And it
> wouldn't show effects of some out-of-order speculative cache accesses.

Yes, it seems the Xeon Phi is using 72 Silvermont cores. And the
next smaller platform I tested was a 2-socket 48C/96T Cascade Lake
platform, which doesn't reproduce the regression.

Thanks,
Feng

2020-02-24 13:21:57

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 10:19:15AM +0800, Feng Tang wrote:
> >
> > > No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> > > which has 72C/288T, and the regression is not seen. This is the part
> > > confusing me :)
> >
> > Hmm.
> >
> > Humor me - what happens if you turn off SMT on that Cascade Lake
> > system? Maybe it's about the thread ID bit in the L1? Although again,
> > I'd have expected things to get _worse_ if it's the two fields that
> > are now in the same cachline thanks to alignment.
>
> I'll try it and report back.

I added "nosmt=force" on the 2S 4 nodes 96C/192T machine, and tested
both 96 and 192 processes, and the regression still exists.

Also, Ying's suggestion of separating 'sigpending' into a different
cache line from '__refcount' cannot heal the regression either.

Thanks,
Feng

2020-02-24 19:25:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 6:19 PM Feng Tang <[email protected]> wrote:
>
> > What was it without the alignment?
>
> For 5.0-rc6:
> ffffffff8225b4c0 d types__ptrace
> ffffffff8225b4e0 D root_user
> ffffffff8225b580 D init_user_ns
>
> For 5.0-rc6 + 81ec3f3c4c4
> ffffffff8225b580 d types__ptrace
> ffffffff8225b5a0 D root_user
> ffffffff8225b640 D init_user_ns
>
> The sigpending and __count are in the same cachline.

Ok, so they used to be 32-byte aligned, and making it 64-byte aligned
changed something.

None of it makes any sense, though, since as you say, the two fields
you see having cache movement are still in the same cacheline.

The only difference ends up being whether they are in the first or
second half of the cacheline.

I thought that Cascade Lake ends up having full-cacheline transfers at
all caching levels, though, so even that shouldn't matter.

That said, it's a 2-socket system, so maybe there's something in the
cache transfer between sockets that cares which half of the cacheline
goes first.

Or maybe some critical-word-first logic that is simply buggy (or just
has unfortunate interactions).

I did try your little micro-benchmark on my desktop (small
8-core/16-thread 9900K CPU), just to verify the hotspots.

It does show that the fact that we have *two* atomics is a big deal:
the profiles show __sigqueue_alloc as having about half the cost being
that initial "lock xadd" for the refcount update, and a quarter being
the "lock inc" for the sigpending update.

The sigpending update is cheaper, because clearly the cacheline is
generally on the same core (since we just got it for the refcount).

The dequeuing isn't quite as clear in the profiles, because the "lock
decl" is in __dequeue_signal(), and then we have that free_uid ->
refcount_dec_and_lock_irqsave() chain to the 'lock cmpxchg' which is
the combined lock and decrement (it's basically
refcount_dec_not_one()).

The rest is xsave/restore and the userspace return (which is very
expensive due to the Intel CPU bugs - 30% of all CPU cycles are on
that stupid 'verw').

I'm guessing this might be another thing that makes Cascade Lake show
things: maybe Intel fixed the CPU bug, and thus the contention is much
more visible because it's not being hidden by the overhead?

ANYWAY.

Considering that none of the numbers make any sense at all, I think
that what's going on is (WARNING: wild handwaving commences) that this
is just extremely timing-sensitive for just _when_ the cacheline
transfer happens, and depending on pure bad luck you can get into a
situation where a transfer is likely to happen between the two
locked accesses (particularly maybe on the dequeuing path where they
aren't right next to each other), so instead of doing both accesses
with the same cacheline ownership, you get a bounce in between them.

And maybe there is some data transfer path where the cacheline is
transferred as two 32-byte transfers, and if the two words are in the
"high" 32 bytes, it takes longer to get them initially, and then it's
also likelier that you end up losing it again between accesses.

Yeah, if we could harness the energy from that handwaving, we could
probably power a small village.

I don't know. This does not seem to be a particularly serious load.
But it does feel like it should be possible to combine the two atomic
accesses into one, where you don't need to do the refcount thing
except for the case where sigcount goes from zero to non-zero (and
back to zero again).

But is it worth spending resources on it?

It might be a good idea to ask a hardware person why that 32-byte
cacheline placement might matter on that platform.

Does anybody else have any ideas?

Linus

2020-02-24 19:43:08

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

>Does anybody else have any ideas?

We've had problems with atomics having strange ordering constraints before.

But it might be c2c missing the right cache line. There are some known cases where that can happen.

Feng, can you double check with

perf record -a -d -e mem_load_l3_hit_retired.xsnp_hitm:pp,mem_load_l3_miss_retired.remote_hitm:pp ...

Does it report the same hot IPs as c2c?
What's the breakdown between the first and the second event for the hot IPs? The first is inside the socket, the second between sockets.

-Andi

2020-02-24 20:18:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 11:24 AM Linus Torvalds
<[email protected]> wrote:
>
> I don't know. This does not seem to be a particularly serious load.
> But it does feel like it should be possible to combine the two atomic
> accesses into one, where you don't need to do the refcount thing
> except for the case where sigcount goes from zero to non-zero (and
> back to zero again).

Ok, that looks just as simple as I thought it would be.

TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
but it _looks_ trivial enough. Just make the rule be that "if we have
any user->sigpending cases, we'll get a ref to the user for the first
one, and drop it only when getting rid of the last one".

So it might be worth testing this. But again: I have NOT done so.

There might be some silly reason why this doesn't work because I just
did the tests wrong or missed some case.

Or there might be some subtle reason why it doesn't work because I
didn't think this through properly.

But it _looks_ obvious and simple enough. And it compiles for me. So
maybe it works.

Linus


Attachments:
patch.diff (1.40 kB)

2020-02-24 20:57:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

[ Adding a few more people that tend to be involved in signal
handling. Just in case - even if they probably don't care ]

On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
<[email protected]> wrote:
>
> TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
> but it _looks_ trivial enough.

I've tested it, and the profiles on the silly microbenchmark look much
nicer. Now it's just the sigpending update that shows up; the refcount
case clearly still occasionally happens, but it's now in the noise.

I made slight changes to the __sigqueue_alloc() case to generate
better code: since we now use that atomic_inc_return() anyway, we
might as well then use the value that is returned for the
RLIMIT_SIGPENDING check too, instead of reading it again.

That might avoid another potential cacheline bounce, plus the
generated code just looks better.

Updated (and now slightly tested!) patch attached.

It would be interesting if this is noticeable on your benchmark
numbers. I didn't actually _time_ anything, I just looked at profiles.

But my setup clearly isn't going to see the horrible contention case
anyway, so my timing numbers wouldn't be all that interesting.

Linus


Attachments:
patch.diff (1.66 kB)

2020-02-24 21:22:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> [ Adding a few more people that tend to be involved in signal
> handling. Just in case - even if they probably don't care ]
>
> On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
>> but it _looks_ trivial enough.
>
> I've tested it, and the profiles on the silly microbenchmark look much
> nicer. Now it's just the sigpending update shows up, the refcount case
> clearly still occasionally happens, but it's now in the noise.
>
> I made slight changes to the __sigqueue_alloc() case to generate
> better code: since we now use that atomic_inc_return() anyway, we
> might as well then use the value that is returned for the
> RLIMIT_SIGPENDING check too, instead of reading it again.
>
> That might avoid another potential cacheline bounce, plus the
> generated code just looks better.
>
> Updated (and now slightly tested!) patch attached.
>
> It would be interesting if this is noticeable on your benchmark
> numbers. I didn't actually _time_ anything, I just looked at profiles.
>
> But my setup clearly isn't going to see the horrible contention case
> anyway, so my timing numbers wouldn't be all that interesting.
>
> Linus

I keep looking at your patch and wondering if there isn't a way
to remove the uid refcount entirely on this path.

Linus I might be wrong but I have this sense that your change will only
help when signal delivery is backed up. I expect in the common case
there won't be any pending signals outstanding for the user.

Not that I see anything bad jumping out at me from your patch.

Eric



> kernel/signal.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5b2396350dd1 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -413,27 +413,32 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>  {
>  	struct sigqueue *q = NULL;
>  	struct user_struct *user;
> +	int sigpending;
>
>  	/*
>  	 * Protect access to @t credentials. This can go away when all
>  	 * callers hold rcu read lock.
> +	 *
> +	 * NOTE! A pending signal will hold on to the user refcount,
> +	 * and we get/put the refcount only when the sigpending count
> +	 * changes from/to zero.
>  	 */
>  	rcu_read_lock();
> -	user = get_uid(__task_cred(t)->user);
> -	atomic_inc(&user->sigpending);
> +	user = __task_cred(t)->user;
> +	sigpending = atomic_inc_return(&user->sigpending);
> +	if (sigpending == 1)
> +		get_uid(user);
>  	rcu_read_unlock();
>
> -	if (override_rlimit ||
> -	    atomic_read(&user->sigpending) <=
> -			task_rlimit(t, RLIMIT_SIGPENDING)) {
> +	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>  		q = kmem_cache_alloc(sigqueue_cachep, flags);
>  	} else {
>  		print_dropped_signal(sig);
>  	}
>
>  	if (unlikely(q == NULL)) {
> -		atomic_dec(&user->sigpending);
> -		free_uid(user);
> +		if (atomic_dec_and_test(&user->sigpending))
> +			free_uid(user);
>  	} else {
>  		INIT_LIST_HEAD(&q->list);
>  		q->flags = 0;
> @@ -447,8 +452,8 @@ static void __sigqueue_free(struct sigqueue *q)
>  {
>  	if (q->flags & SIGQUEUE_PREALLOC)
>  		return;
> -	atomic_dec(&q->user->sigpending);
> -	free_uid(q->user);
> +	if (atomic_dec_and_test(&q->user->sigpending))
> +		free_uid(q->user);
>  	kmem_cache_free(sigqueue_cachep, q);
>  }
>

2020-02-24 21:52:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 1:22 PM Eric W. Biederman <[email protected]> wrote:
>
> I keep looking at your patch and wondering if there isn't a way
> to remove the uid refcount entirely on this path.

I agree. I tried to come up with something, but couldn't.

> Linus I might be wrong but I have this sense that your change will only
> help when signal delivery is backed up. I expect in the common case
> there won't be any pending signals outstanding for the user.

Again, 100% agreed.

HOWEVER.

The normal case where there's only one signal pending is also the case
where we don't care about the extra atomic RMW access. By definition
that's not ever going to show up as a performance issue or as
cacheline contention.

So the only case that matters from a performance standpoint is the
"lots of signals" case, in which case you'll see that sigqueue become
backed up.

But as I said in the original thread (before you got added to the list):

"I don't know. This does not seem to be a particularly serious load."

I'm not convinced this will show up outside of this kind of
signal-sending microbenchmark.

That said, I don't really see any downside to the patch either, so...

Linus

2020-02-24 22:03:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> On Mon, Feb 24, 2020 at 1:22 PM Eric W. Biederman <[email protected]> wrote:
>>
>> I keep looking at your patch and wondering if there isn't a way
>> to remove the uid refcount entirely on this path.
>
> I agree. I tried to come up with something, but couldn't.
>
>> Linus I might be wrong but I have this sense that your change will only
>> help when signal delivery is backed up. I expect in the common case
>> there won't be any pending signals outstanding for the user.
>
> Again, 100% agreed.
>
> HOWEVER.
>
> The normal case where there's one only signal pending is also the case
> where we don't care about the extra atomic RMW access. By definition
> that's not going to ever going to show up as a performance issue or
> for cacheline contention.
>
> So the only case that matters from a performance standpoint is the
> "lots of signals" case, in which case you'll see that sigqueue become
> backed up.
>
> But as I said in the original thread (before you got added to the list):
>
> "I don't know. This does not seem to be a particularly serious load."
>
> I'm not convinced this will show up outside of this kind of
> signal-sending microbenchmark.
>
> That said, I don't really see any downside to the patch either, so...

Other than scratching my head about why we are optimizing this, neither do I.

It would help to have a comment somewhere in the code or the commit
message that says the issue is contention under load. So the next time
someone goes through the code and wonders why we aren't doing the stupid
and simple thing, it will be clear.

Eric

2020-02-24 22:20:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 2:02 PM Eric W. Biederman <[email protected]> wrote:
>
> Other than scratching my head about why are we optimizing neither do I.

You can see the background on lore

https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/

and the thread about the largely unexplained regression there. I had a
wild handwaving theory on what's going on in

https://lore.kernel.org/lkml/CAHk-=wjkSb1OkiCSn_fzf2v7A=K0bNsUEeQa+06XMhTO+oQUaA@mail.gmail.com/

but yes, the contention only happens once you have a lot of cores.

That said, I suspect it actually improves performance on that
microbenchmark even without the contention - just not as noticeably.
I'm running a kernel with the patch right now, but I wasn't going to
boot back into an old kernel just to test that. I was hoping that the
kernel test robot people would just check it out.

> It would help to have a comment somewhere in the code or the commit
> message that says the issue is contetion under load.

Note that even without the contention, on that "send a lot of signals"
case it does avoid the second atomic op, and the profile really does
look better.

That profile improvement I can see even on my own machine, and I see
how the nasty CPU bug avoidance (the "verw" on the system call exit
path) goes from 30% to 31% cost.

And that increase in the relative cost of the "verw" on the profile
must mean that the actual real code just improved in performance (even
if I didn't actually time it).

With the contention, you get that added odd extra regression that
seems to depend on exact cacheline placement.

So I think the patch improves performance (for this "lots of queued
signals" case) in general, and I hope it will also then get rid of
that contention regression.

Linus

2020-02-25 02:59:19

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 12:47:14PM -0800, Linus Torvalds wrote:
> [ Adding a few more people that tend to be involved in signal
> handling. Just in case - even if they probably don't care ]
>
> On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
>
> I've tested it, and the profiles on the silly microbenchmark look much
> nicer. Now it's just the sigpending update shows up, the refcount case
> clearly still occasionally happens, but it's now in the noise.
>
> I made slight changes to the __sigqueue_alloc() case to generate
> better code: since we now use that atomic_inc_return() anyway, we
> might as well then use the value that is returned for the
> RLIMIT_SIGPENDING check too, instead of reading it again.
>
> That might avoid another potential cacheline bounce, plus the
> generated code just looks better.
>
> Updated (and now slightly tested!) patch attached.
> It would be interesting if this is noticeable on your benchmark
> numbers. I didn't actually _time_ anything, I just looked at profiles.
>
> But my setup clearly isn't going to see the horrible contention case
> anyway, so my timing numbers wouldn't be all that interesting.

Thanks for the optimization patch for signal!

It makes a big difference: the performance score is tripled, bumping
from the original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%.

The test I ran inserts your patch right before 5.0-rc6, then runs
the test for both 5.0-rc6 and 5.0-rc6+Jiri's patch. Sorry it took
quite some time, as the test platform is not local but inside
0day's framework, which takes some time for scheduling, kbuilding
and testing.

Thanks,
Feng


> Linus

> kernel/signal.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5b2396350dd1 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -413,27 +413,32 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>  {
>  	struct sigqueue *q = NULL;
>  	struct user_struct *user;
> +	int sigpending;
>
>  	/*
>  	 * Protect access to @t credentials. This can go away when all
>  	 * callers hold rcu read lock.
> +	 *
> +	 * NOTE! A pending signal will hold on to the user refcount,
> +	 * and we get/put the refcount only when the sigpending count
> +	 * changes from/to zero.
>  	 */
>  	rcu_read_lock();
> -	user = get_uid(__task_cred(t)->user);
> -	atomic_inc(&user->sigpending);
> +	user = __task_cred(t)->user;
> +	sigpending = atomic_inc_return(&user->sigpending);
> +	if (sigpending == 1)
> +		get_uid(user);
>  	rcu_read_unlock();
>
> -	if (override_rlimit ||
> -	    atomic_read(&user->sigpending) <=
> -			task_rlimit(t, RLIMIT_SIGPENDING)) {
> +	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>  		q = kmem_cache_alloc(sigqueue_cachep, flags);
>  	} else {
>  		print_dropped_signal(sig);
>  	}
>
>  	if (unlikely(q == NULL)) {
> -		atomic_dec(&user->sigpending);
> -		free_uid(user);
> +		if (atomic_dec_and_test(&user->sigpending))
> +			free_uid(user);
>  	} else {
>  		INIT_LIST_HEAD(&q->list);
>  		q->flags = 0;
> @@ -447,8 +452,8 @@ static void __sigqueue_free(struct sigqueue *q)
>  {
>  	if (q->flags & SIGQUEUE_PREALLOC)
>  		return;
> -	atomic_dec(&q->user->sigpending);
> -	free_uid(q->user);
> +	if (atomic_dec_and_test(&q->user->sigpending))
> +		free_uid(q->user);
>  	kmem_cache_free(sigqueue_cachep, q);
>  }
>

2020-02-25 03:15:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 6:57 PM Feng Tang <[email protected]> wrote:
>
> Thanks for the optimization patch for signal!
>
> It makes a big difference, that the performance score is tripled!
> bump from original 17000 to 54000. Also the gap between 5.0-rc6 and
> 5.0-rc6+Jiri's patch is reduced to around 2%.

Ok, so what I think is happening is that the exact same issue still
exists, but now with less contention it's not quite as noticeable.

Can you find some Intel CPU hardware person who could spend a moment
on that odd 32-byte sub-block issue?

Considering that this effect apparently doesn't happen on any other
platform you've tested, and this Cascade Lake platform is the newly
released current Intel server platform, I think it's worth looking at.

That microbenchmark is not important on its own, but the odd timing
behaviour it has would be good to have explained.

And while the signal sending microbenchmark is not likely to be very
relevant to much anything else, I guess I'll apply the patch. Even if
it's just a microbenchmark, it's not like we haven't used those before
to pinpoint some very specific behavior. We used lmbench (and whatever
that odd page cache benchmark was) to do some fairly fundamental
optimizations back in the days.

If you fix the details on all the microbenchmarks you find, eventually
you probably do well on real loads too..

Linus

2020-02-25 04:54:45

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Linus,

On Mon, Feb 24, 2020 at 07:15:15PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2020 at 6:57 PM Feng Tang <[email protected]> wrote:
> >
> > Thanks for the optimization patch for signal!
> >
> > It makes a big difference, that the performance score is tripled!
> > bump from original 17000 to 54000. Also the gap between 5.0-rc6 and
> > 5.0-rc6+Jiri's patch is reduced to around 2%.
>
> Ok, so what I think is happening is that the exact same issue still
> exists, but now with less contention it's not quite as noticeable.

I thought that too.

Since we have the reproducible platform, we will keep an eye on it,
and report back if anything is found.

You've mentioned the patch's effect on small systems in another mail;
I ran the benchmark on a 4-core Skylake desktop, and it only brought
a 2% performance gain, as expected.

>
> Can you find some Intel CPU hardware person who could spend a moment
> on that odd 32-byte sub-block issue?
>
> Considering that this effect apparently doesn't happen on any other
> platform you've tested, and this Cascade Lake platform is the newly
> released current Intel server platform, I think it's worth looking at.

I'll try to reach some silicon people, and get back if I find anything.


> That microbenchmark is not important on its own, but the odd timing
> behaviour it has would be good to have explained.
>
> And while the signal sending microbenchmark is not likely to be very
> relevant to much anything else, I guess I'll apply the patch. Even if
> it's just a microbenchmark, it's not like we haven't used those before
> to pinpoint some very specific behavior. We used lmbench (and whatever
> that odd page cache benchmark was) to do some fairly fundamental
> optimizations back in the days.

Thanks again for the patch.

- Feng
>
> If you fix the details on all the microbenchmarks you find, eventually
> you probably do well on real loads too..
>
> Linus