2015-06-22 12:25:02

by Peter Zijlstra

[permalink] [raw]
Subject: [RFC][PATCH 00/13] percpu rwsem -v2

This is a derived work of the cpu hotplug lock rework I did in 2013 which never
really went anywhere because Linus didn't like it.

This applies those same optimizations to the percpu-rwsem. Seeing how we did
all the work it seemed a waste to not use it at all. Linus still didn't like it
because there was only a single user; there are two now:

- uprobes
- cgroups

This series converts the cpu hotplug lock into a percpu-rwsem to provide a 3rd
user.
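
For reference, that conversion boils down to roughly the following (a minimal
sketch, not the actual patch; the semaphore name and the init details are
assumptions):

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore cpu_hotplug_rwsem; /* percpu_init_rwsem()'d at boot */

void get_online_cpus(void)
{
	/* Common case: a per-cpu increment, no shared cacheline bounced. */
	percpu_down_read(&cpu_hotplug_rwsem);
}

void put_online_cpus(void)
{
	percpu_up_read(&cpu_hotplug_rwsem);
}

/* The actual hotplug path is the (rare) writer and excludes all readers. */
void cpu_hotplug_begin(void)
{
	percpu_down_write(&cpu_hotplug_rwsem);
}

void cpu_hotplug_done(void)
{
	percpu_up_write(&cpu_hotplug_rwsem);
}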

Also, since Linus thinks lglocks are a failed locking primitive (which I
wholeheartedly agree with; their preempt-disable latencies are an abomination),
it also converts the global part of fs/locks's usage of lglock over to a
percpu-rwsem and uses a per-cpu spinlock for the local part. This both provides
another (4th) percpu-rwsem user and removes an lglock user.
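
The resulting locking pattern in fs/locks looks roughly like this (sketch only;
struct layout and names are assumptions, not the actual patch):

#include <linux/fs.h>
#include <linux/percpu.h>
#include <linux/percpu-rwsem.h>
#include <linux/spinlock.h>

struct file_lock_bucket {
	spinlock_t		lock;
	struct hlist_head	list;
};
static DEFINE_PER_CPU(struct file_lock_bucket, file_lock_list);
static struct percpu_rw_semaphore file_rwsem;	/* percpu_init_rwsem()'d at init */

/* Common, local path: global rwsem for read + this CPU's spinlock. */
static void locks_insert_global_locks(struct file_lock *fl)
{
	struct file_lock_bucket *b;

	percpu_down_read(&file_rwsem);
	b = this_cpu_ptr(&file_lock_list);
	spin_lock(&b->lock);
	hlist_add_head(&fl->fl_link, &b->list);
	spin_unlock(&b->lock);
	percpu_up_read(&file_rwsem);
}

/*
 * Rare, global path (e.g. the /proc/locks walk): exclude all local paths at
 * once instead of taking every per-cpu spinlock like lg_global_lock() did.
 */
static void locks_walk_all(void)
{
	int cpu;

	percpu_down_write(&file_rwsem);
	for_each_possible_cpu(cpu) {
		/* ... walk per_cpu_ptr(&file_lock_list, cpu)->list ... */
	}
	percpu_up_write(&file_rwsem);
}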

It further removes the stop_machine lglock usage, and with it kills lglocks.

Changes since -v1:

- Added the missing smp_load_acquire()/smp_store_release() as spotted by Oleg
- Added percpu_down_read_trylock()
- Converted the cpu hotplug lock
- Converted fs/locks
- Removed lglock from stop_machine
- Removed lglock

---
Documentation/locking/lglock.txt | 166 -------------------------
fs/Kconfig | 1 +
fs/file_table.c | 1 -
fs/locks.c | 65 +++++++---
include/linux/cpu.h | 6 +
include/linux/lglock.h | 81 -------------
include/linux/percpu-rwsem.h | 96 +++++++++++++--
include/linux/sched.h | 9 +-
init/main.c | 1 +
kernel/cpu.c | 130 ++++++--------------
kernel/fork.c | 2 +
kernel/locking/Makefile | 1 -
kernel/locking/lglock.c | 111 -----------------
kernel/locking/percpu-rwsem.c | 255 +++++++++++++++++++++------------------
kernel/rcu/Makefile | 2 +-
kernel/stop_machine.c | 52 ++++----
lib/Kconfig | 10 ++
17 files changed, 371 insertions(+), 618 deletions(-)


2015-06-22 12:36:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2


I forgot to re-instate "From: Oleg Nesterov" on the first 4 patches.

Sorry about that. I'll take more care with the next posting.

2015-06-22 18:11:28

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On 06/22/2015 02:16 PM, Peter Zijlstra wrote:
> Also, since Linus thinks lglocks are a failed locking primitive (which I
> wholeheartedly agree with; their preempt-disable latencies are an abomination),
> it also converts the global part of fs/locks's usage of lglock over to a
> percpu-rwsem and uses a per-cpu spinlock for the local part. This both provides
> another (4th) percpu-rwsem user and removes an lglock user.

I did a quick lockperf run with these patches on a 4-socket E5-4610 machine.
These microbenchmarks exercise the fs/locks code a bit.

I suspect I got the wrong tree. The patches did not apply cleanly. The resulting
kernel boots fine and doesn't explode... so far...

The results aren't looking too bad, though building a kernel with 'make -j200'
was extremely slow. I'll look into it tomorrow.

https://git.samba.org/jlayton/linux.git/?p=jlayton/lockperf.git;a=summary

flock01
mean variance sigma max min
4.1.0 11.7075 816.3341 28.5716 125.6552 0.0021
percpu-rwsem 11.4614 760.1345 27.5705 132.5030 0.0026


flock02
mean variance sigma max min
4.1.0 7.0197 1.1812 1.0868 10.6188 5.1706
percpu-rwsem 9.3194 1.3443 1.1594 11.5902 6.6138


lease01
mean variance sigma max min
4.1.0 41.8361 23.8462 4.8833 51.3493 28.5859
percpu-rwsem 40.2738 20.8323 4.5642 49.6037 28.0704


lease02
mean variance sigma max min
4.1.0 71.2159 12.7763 3.5744 77.8432 58.0390
percpu-rwsem 71.4312 14.7688 3.8430 76.5036 57.8615


posix01
mean variance sigma max min
4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073


posix02
mean variance sigma max min
4.1.0 12.7461 3.1802 1.7833 15.5411 8.1018
percpu-rwsem 16.2341 4.3038 2.0746 19.3271 11.1751


posix03
mean variance sigma max min
4.1.0 0.9121 0.0000 0.0000 0.9121 0.9121
percpu-rwsem 0.9379 0.0000 0.0000 0.9379 0.9379


posix04
mean variance sigma max min
4.1.0 0.0703 0.0044 0.0664 0.6764 0.0437
percpu-rwsem 0.0675 0.0007 0.0267 0.3236 0.0491


cheers,
daniel

2015-06-22 19:06:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Mon, Jun 22, 2015 at 08:11:14PM +0200, Daniel Wagner wrote:
> On 06/22/2015 02:16 PM, Peter Zijlstra wrote:
> > Also, since Linus thinks lglocks are a failed locking primitive (which I
> > wholeheartedly agree with; their preempt-disable latencies are an abomination),
> > it also converts the global part of fs/locks's usage of lglock over to a
> > percpu-rwsem and uses a per-cpu spinlock for the local part. This both provides
> > another (4th) percpu-rwsem user and removes an lglock user.
>
> I did a quick lockperf run with these patches on a 4-socket E5-4610 machine.
> These microbenchmarks exercise the fs/locks code a bit.
>
> I suspect I got the wrong tree. The patches did not apply cleanly. The resulting
> kernel boots fine and doesn't explode... so far...

It's against tip/master, although I expect the locking/core bits that
were sent to Linus earlier today to be the biggest missing piece.

All I really did was build a kernel with lockdep enabled and boot +
build a kernel to see it didn't go belly up.

> The results aren't looking too bad, though building a kernel with 'make -j200'
> was extremely slow. I'll look into it tomorrow.
>
> https://git.samba.org/jlayton/linux.git/?p=jlayton/lockperf.git;a=summary

Sweet, I wasn't aware these existed. I'll go have a play.

> posix01
> mean variance sigma max min
> 4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
> percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073
>
>
> posix02
> mean variance sigma max min
> 4.1.0 12.7461 3.1802 1.7833 15.5411 8.1018
> percpu-rwsem 16.2341 4.3038 2.0746 19.3271 11.1751
>

These two seem to hurt, lemme go look at what they do.

2015-06-22 20:06:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Mon, Jun 22, 2015 at 5:16 AM, Peter Zijlstra <[email protected]> wrote:
>
> It further removes the stop_machine lglock usage, and with it kills lglocks.

Ok. With all the conversions, and removal of lglock, my dislike of
this goes away.

I'm somewhat worried about Daniel's report about "building a kernel
with 'make -j200' was extremely slow", but that may be due to something
else (does the machine have enough memory for "make -j200"? The kernel
compile parallelizes so well, and gcc uses so much memory, that you
need a *lot* of memory to use things like "-j200").

But assuming that gets sorted out, and somebody looks at the few file
locking performance issues, I have no objections to this series any
more.

Linus

2015-06-23 09:35:42

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On 06/22/2015 09:05 PM, Peter Zijlstra wrote:
> On Mon, Jun 22, 2015 at 08:11:14PM +0200, Daniel Wagner wrote:
>> On 06/22/2015 02:16 PM, Peter Zijlstra wrote:
>>> Also, since Linus thinks lglocks are a failed locking primitive (which I
>>> wholeheartedly agree with; their preempt-disable latencies are an abomination),
>>> it also converts the global part of fs/locks's usage of lglock over to a
>>> percpu-rwsem and uses a per-cpu spinlock for the local part. This both provides
>>> another (4th) percpu-rwsem user and removes an lglock user.
>>
>> I did a quick lockperf run with these patches on a 4-socket E5-4610 machine.
>> These microbenchmarks exercise the fs/locks code a bit.
>>
>> I suspect I got the wrong tree. The patches did not apply cleanly. The resulting
>> kernel boots fine and doesn't explode... so far...
>
> It's against tip/master, although I expect the locking/core bits that
> were sent to Linus earlier today to be the biggest missing piece.
>
> All I really did was build a kernel with lockdep enabled and boot +
> build a kernel to see it didn't go belly up.
>
>> The results aren't looking too bad, though building a kernel with 'make -j200'
>> was extremely slow. I'll look into it tomorrow.

So this turns out to be a false alarm. I had icecream installed/activated
and that interfered with gcc. Stupid me.

The machine has 0.5TB memory and doesn't seem to be really concerned about
'make -j200'

make clean && time make -j200

mainline 4.1.0
2nd run
real 1m7.595s
user 28m43.125s
sys 3m48.189s


tip v4.1-2756-ge3d06bd
2nd run
real 1m6.871s
user 28m50.803s
sys 3m50.223s
3rd run
real 1m6.974s
user 28m52.093s
sys 3m50.259s


tip v4.1-2769-g6ce2591 (percpu-rwsem)
2nd run
real 1m7.847s
user 29m0.439s
sys 3m51.181s
3rd run
real 1m7.113s
user 29m3.127s
sys 3m51.516s



Compared to 'make -j64' on tip v4.1-2756-ge3d06bd
2nd run
real 1m7.605s
user 28m3.121s
sys 3m52.541s

>> https://git.samba.org/jlayton/linux.git/?p=jlayton/lockperf.git;a=summary
>
> Sweet, I wasn't aware these existed. I'll go have a play.
>
>> posix01
>> mean variance sigma max min
>> 4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
>> percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073
>>
>>
>> posix02
>> mean variance sigma max min
>> 4.1.0 12.7461 3.1802 1.7833 15.5411 8.1018
>> percpu-rwsem 16.2341 4.3038 2.0746 19.3271 11.1751
>>
>
> These two seem to hurt, lemme go look at what they do.

Now here are the same tests with tip and tip+percpu-rwsem. The patches
applied cleanly :)

I put all the raw data here[1] in case someone is interested. Some of the
tests behave a bit strangely, running extremely fast compared to the other
runs. That is probably the result of me trying to reduce the run time to the
minimum.


flock01
mean variance sigma max min
4.1.0 11.7075 816.3341 28.5716 125.6552 0.0021
4.1.0+percpu-rwsem 11.4614 760.1345 27.5705 132.5030 0.0026
tip 6.8390 329.3037 18.1467 81.0373 0.0021
tip+percpu-rwsem 10.0870 546.7435 23.3825 106.2396 0.0026


flock02
mean variance sigma max min
4.1.0 7.0197 1.1812 1.0868 10.6188 5.1706
4.1.0+percpu-rwsem 9.3194 1.3443 1.1594 11.5902 6.6138
tip 7.1057 1.6719 1.2930 11.2362 5.1434
tip+percpu-rwsem 9.0357 1.9874 1.4097 14.0254 6.4380


lease01
mean variance sigma max min
4.1.0 41.8361 23.8462 4.8833 51.3493 28.5859
4.1.0+percpu-rwsem 40.2738 20.8323 4.5642 49.6037 28.0704
tip 30.2617 13.0900 3.6180 36.6398 20.2085
tip+percpu-rwsem 31.2730 17.9787 4.2401 37.8981 19.2944


lease02
mean variance sigma max min
4.1.0 71.2159 12.7763 3.5744 77.8432 58.0390
4.1.0+percpu-rwsem 71.4312 14.7688 3.8430 76.5036 57.8615
tip 20.2019 5.2042 2.2813 23.1071 13.4647
tip+percpu-rwsem 20.8305 6.6631 2.5813 23.8034 11.2815


posix01
mean variance sigma max min
4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
4.1.0+percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073
tip 129.2736 23752.7122 154.1191 474.0604 0.0063
tip+percpu-rwsem 142.6474 24732.1571 157.2646 468.7478 0.0072


posix02
mean variance sigma max min
4.1.0 12.7461 3.1802 1.7833 15.5411 8.1018
4.1.0+percpu-rwsem 16.2341 4.3038 2.0746 19.3271 11.1751
tip 13.2810 5.3958 2.3229 20.1243 8.9361
tip+percpu-rwsem 15.6802 4.7514 2.1798 21.5704 9.4074


posix03
mean variance sigma max min
4.1.0 0.9121 0.0000 0.0000 0.9121 0.9121
4.1.0+percpu-rwsem 0.9379 0.0000 0.0000 0.9379 0.9379
tip 0.8647 0.0009 0.0297 0.9274 0.7995
tip+percpu-rwsem 0.8147 0.0003 0.0161 0.8530 0.7824


posix04
mean variance sigma max min
4.1.0 0.0703 0.0044 0.0664 0.6764 0.0437
4.1.0+percpu-rwsem 0.0675 0.0007 0.0267 0.3236 0.0491
tip 0.0618 0.0027 0.0521 0.5642 0.0453
tip+percpu-rwsem 0.0658 0.0003 0.0175 0.1793 0.0493


cheers,
daniel

[1] http://monom.org/percpu-rwsem/

2015-06-23 10:01:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2


* Daniel Wagner <[email protected]> wrote:

> The machine has 0.5TB memory and doesn't seem to be really concerned about
> 'make -j200'
>
> make clean && time make -j200
>
> mainline 4.1.0
> 2nd run
> real 1m7.595s
> user 28m43.125s
> sys 3m48.189s
>
>
> tip v4.1-2756-ge3d06bd
> 2nd run
> real 1m6.871s
> user 28m50.803s
> sys 3m50.223s
> 3rd run
> real 1m6.974s
> user 28m52.093s
> sys 3m50.259s
>
>
> tip v4.1-2769-g6ce2591 (percpu-rwsem)
> 2nd run
> real 1m7.847s
> user 29m0.439s
> sys 3m51.181s
> 3rd run
> real 1m7.113s
> user 29m3.127s
> sys 3m51.516s
>
>
>
> Compared to 'make -j64' on tip v4.1-2756-ge3d06bd
> 2nd run
> real 1m7.605s
> user 28m3.121s
> sys 3m52.541s

Btw., instead of just listing the raw runs, you can get automatic average and
stddev numbers with this:

$ perf stat --null --repeat 5 --pre 'make clean' --post 'sync' make -j200

Performance counter stats for 'make -j200' (3 runs):

29.068162979 seconds time elapsed ( +- 0.27% )

Thanks,

Ingo

2015-06-23 14:34:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Tue, Jun 23, 2015 at 11:35:24AM +0200, Daniel Wagner wrote:
> flock01
> mean variance sigma max min
> 4.1.0 11.7075 816.3341 28.5716 125.6552 0.0021
> 4.1.0+percpu-rwsem 11.4614 760.1345 27.5705 132.5030 0.0026
> tip 6.8390 329.3037 18.1467 81.0373 0.0021
> tip+percpu-rwsem 10.0870 546.7435 23.3825 106.2396 0.0026

> posix01
> mean variance sigma max min
> 4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
> 4.1.0+percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073
> tip 129.2736 23752.7122 154.1191 474.0604 0.0063
> tip+percpu-rwsem 142.6474 24732.1571 157.2646 468.7478 0.0072

Both these tests are incredibly unstable for me (as well as for you it
appears). Variance is through the roof on them.

I get runtimes like:

root@ivb-ex:/usr/local/src/lockperf# ./flock01 -n 240 -l 32 /tmp/a
0.266157011
root@ivb-ex:/usr/local/src/lockperf# ./flock01 -n 240 -l 32 /tmp/a
139.303399960

That's not really inspiring. If I use bigger loop counts it more or less
settles, but then the EX is unusable because it ends up running 3000
seconds per test.

In any case, on a smaller box (ivb-ep) I got the below results:

posix01
mean variance sigma max min
data-4.1.0-02756-ge3d06bd 250.7032 40.4864 6.3629 263.7736 238.5192
data-4.1.0-02756-ge3d06bd-dirty 252.6847 35.8953 5.9913 270.1679 233.0215

Which looks better, but the difference is still well within the variance
and thus not significant.

Lemme continue playing with this a bit more.

2015-06-23 14:56:50

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On 06/23/2015 04:34 PM, Peter Zijlstra wrote:
> On Tue, Jun 23, 2015 at 11:35:24AM +0200, Daniel Wagner wrote:
>> flock01
>> mean variance sigma max min
>> 4.1.0 11.7075 816.3341 28.5716 125.6552 0.0021
>> 4.1.0+percpu-rwsem 11.4614 760.1345 27.5705 132.5030 0.0026
>> tip 6.8390 329.3037 18.1467 81.0373 0.0021
>> tip+percpu-rwsem 10.0870 546.7435 23.3825 106.2396 0.0026
>
>> posix01
>> mean variance sigma max min
>> 4.1.0 121.9020 27882.5260 166.9806 603.5509 0.0063
>> 4.1.0+percpu-rwsem 185.3981 38474.3836 196.1489 580.6532 0.0073
>> tip 129.2736 23752.7122 154.1191 474.0604 0.0063
>> tip+percpu-rwsem 142.6474 24732.1571 157.2646 468.7478 0.0072
>
> Both these tests are incredibly unstable for me (as well as for you it
> appears). Variance is through the roof on them.

Since on my test machine not all 4 sockets are directly interconnected, I pinned the
tests down to one socket to see if that reduces the variance.

Except for flock01 and posix01, the tests now show really low variances (3 runs):

[...]
flock02
mean variance sigma max min
tip-1 11.8994 0.5874 0.7664 13.2022 8.6324
tip-2 11.7394 0.5252 0.7247 13.2540 9.7513
tip-3 11.8155 0.5288 0.7272 13.2700 9.9480
tip+percpu-rswem-1 15.3601 0.8981 0.9477 16.8116 12.6910
tip+percpu-rswem-2 15.2558 0.8442 0.9188 17.0199 12.9586
tip+percpu-rswem-3 15.5297 0.6386 0.7991 17.4392 12.7992


lease01
mean variance sigma max min
tip-1 0.3424 0.0001 0.0110 0.3644 0.3088
tip-2 0.3627 0.0003 0.0185 0.4140 0.3312
tip-3 0.3446 0.0002 0.0125 0.3851 0.3155
tip+percpu-rswem-1 0.3464 0.0001 0.0116 0.3781 0.3113
tip+percpu-rswem-2 0.3597 0.0003 0.0162 0.3978 0.3250
tip+percpu-rswem-3 0.3513 0.0002 0.0151 0.3933 0.3122
[...]

So with this setup we can start to compare the numbers.

> I get runtimes like:
>
> root@ivb-ex:/usr/local/src/lockperf# ./flock01 -n 240 -l 32 /tmp/a
> 0.266157011
> root@ivb-ex:/usr/local/src/lockperf# ./flock01 -n 240 -l 32 /tmp/a
> 139.303399960

Same here:

flock01
mean variance sigma max min
tip-1 242.6147 3632.6201 60.2712 313.3081 86.3743
tip-2 233.1934 3850.1995 62.0500 318.2716 101.2738
tip-3 223.0392 3944.5220 62.8054 318.1932 110.8155
tip+percpu-rswem-1 276.5913 2145.0510 46.3147 317.5385 156.1318
tip+percpu-rswem-2 270.7089 2735.7635 52.3045 318.9418 154.5902
tip+percpu-rswem-3 267.8207 3028.3557 55.0305 320.2987 150.9659

posix01
mean variance sigma max min
tip-1 18.8729 151.2810 12.2996 37.3563 0.0060
tip-2 17.6894 140.9982 11.8743 37.2080 0.0060
tip-3 18.7785 145.1217 12.0466 35.5001 0.0060
tip+percpu-rswem-1 18.9970 163.8856 12.8018 35.8795 0.0069
tip+percpu-rswem-2 18.9594 147.3197 12.1375 35.4404 0.0069
tip+percpu-rswem-3 18.8366 126.5831 11.2509 35.9014 0.0069

2015-06-23 16:10:20

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Mon, 2015-06-22 at 14:16 +0200, Peter Zijlstra wrote:
> This series converts the cpu hotplug lock into a percpu-rwsem to provide a 3rd
> user.

Curious, why not also mem hotplug? It seems to use the exact same
locking mayhem as cpu.

Thanks,
Davidlohr

2015-06-23 16:22:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Tue, Jun 23, 2015 at 09:10:03AM -0700, Davidlohr Bueso wrote:
> On Mon, 2015-06-22 at 14:16 +0200, Peter Zijlstra wrote:
> > This series converts the cpu hotplug lock into a percpu-rwsem to provide a 3rd
> > user.
>
> Curious, why not also mem hotplug? It seems to use the exact same
> locking mayhem as cpu.

Because it looks like they 'forgot' to copy the notifiers and therefore
I suspect we could simplify things. We might not need the recursive
nonsense.

But I've not yet actually looked at it much.

I was indeed greatly saddened that these people copied cpu hotplug;
clearly they had not gotten the memo that cpu hotplug is a trainwreck.

2015-06-23 17:50:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Tue, Jun 23, 2015 at 04:56:39PM +0200, Daniel Wagner wrote:
> flock02
> mean variance sigma max min
> tip-1 11.8994 0.5874 0.7664 13.2022 8.6324
> tip-2 11.7394 0.5252 0.7247 13.2540 9.7513
> tip-3 11.8155 0.5288 0.7272 13.2700 9.9480
> tip+percpu-rswem-1 15.3601 0.8981 0.9477 16.8116 12.6910
> tip+percpu-rswem-2 15.2558 0.8442 0.9188 17.0199 12.9586
> tip+percpu-rswem-3 15.5297 0.6386 0.7991 17.4392 12.7992

I did indeed manage to get flock02 down to a usable level and found:

3.20 : ffffffff811ecbdf: incl %gs:0x7ee1de72(%rip) # aa58 <__preempt_count>
0.27 : ffffffff811ecbe6: mov 0xa98553(%rip),%rax # ffffffff81c85140 <file_rwsem>
10.78 : ffffffff811ecbed: incl %gs:(%rax)
0.19 : ffffffff811ecbf0: mov 0xa9855a(%rip),%edx # ffffffff81c85150 <file_rwsem+0x10>
0.00 : ffffffff811ecbf6: test %edx,%edx
0.00 : ffffffff811ecbf8: jne ffffffff811ecdd1 <flock_lock_file+0x261>
3.47 : ffffffff811ecbfe: decl %gs:0x7ee1de53(%rip) # aa58 <__preempt_count>
0.00 : ffffffff811ecc05: je ffffffff811eccec <flock_lock_file+0x17c>

Which is percpu_down_read(). Now aside from the fact that I run a
PREEMPT=y kernel, it looks like the sem->refcount increment stalls
because of the dependent load.

Manually hoisting the load very slightly improves things:

0.24 : ffffffff811ecbdf: mov 0xa9855a(%rip),%rax # ffffffff81c85140 <file_rwsem>
5.88 : ffffffff811ecbe6: incl %gs:0x7ee1de6b(%rip) # aa58 <__preempt_count>
7.94 : ffffffff811ecbed: incl %gs:(%rax)
0.30 : ffffffff811ecbf0: mov 0xa9855a(%rip),%edx # ffffffff81c85150 <file_rwsem+0x10>
0.00 : ffffffff811ecbf6: test %edx,%edx
0.00 : ffffffff811ecbf8: jne ffffffff811ecdd1 <flock_lock_file+0x261>
3.70 : ffffffff811ecbfe: decl %gs:0x7ee1de53(%rip) # aa58 <__preempt_count>
0.00 : ffffffff811ecc05: je ffffffff811eccec <flock_lock_file+0x17c>

But it's not much :/

Using DEFINE_STATIC_PERCPU_RWSEM(file_rwsem) would allow GCC to omit the
sem->refcount load entirely, but it's not smart enough to see that it can
(tested 4.9 and 5.1).
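
Something like this is what I mean; with a static definition the per-cpu
refcount symbol is known at build time (rough sketch only, field names per the
patch below, remaining initializers elided):

#define DEFINE_STATIC_PERCPU_RWSEM(name)				\
static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_refcount_##name);	\
static struct percpu_rw_semaphore name = {				\
	.refcount = &__percpu_rwsem_refcount_##name,			\
	/* .rss etc. as percpu_init_rwsem() would set them up */	\
}

so that __this_cpu_inc(*sem->refcount) on a statically known sem could in
principle collapse into a single %gs-relative increment, without the pointer
load shown above.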

---
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -35,6 +35,8 @@ extern void __percpu_up_read(struct perc
 
 static inline void _percpu_down_read(struct percpu_rw_semaphore *sem)
 {
+	unsigned int __percpu *refcount = sem->refcount;
+
 	might_sleep();
 
 	preempt_disable();
@@ -47,7 +49,7 @@ static inline void _percpu_down_read(str
 	 * writer will see anything we did within this RCU-sched read-side
 	 * critical section.
 	 */
-	__this_cpu_inc(*sem->refcount);
+	__this_cpu_inc(*refcount);
 	if (unlikely(!rcu_sync_is_idle(&sem->rss)))
 		__percpu_down_read(sem); /* Unconditional memory barrier. */
 	preempt_enable();
@@ -81,6 +83,8 @@ static inline bool percpu_down_read_tryl
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 {
+	unsigned int __percpu *refcount = sem->refcount;
+
 	/*
 	 * The barrier() in preempt_disable() prevents the compiler from
 	 * bleeding the critical section out.
@@ -90,7 +94,7 @@ static inline void percpu_up_read(struct
 	 * Same as in percpu_down_read().
 	 */
	if (likely(rcu_sync_is_idle(&sem->rss)))
-		__this_cpu_dec(*sem->refcount);
+		__this_cpu_dec(*refcount);
 	else
 		__percpu_up_read(sem); /* Unconditional memory barrier. */
 	preempt_enable();

2015-06-23 19:36:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Tue, Jun 23, 2015 at 07:50:12PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2015 at 04:56:39PM +0200, Daniel Wagner wrote:
> > flock02
> > mean variance sigma max min
> > tip-1 11.8994 0.5874 0.7664 13.2022 8.6324
> > tip-2 11.7394 0.5252 0.7247 13.2540 9.7513
> > tip-3 11.8155 0.5288 0.7272 13.2700 9.9480
> > tip+percpu-rswem-1 15.3601 0.8981 0.9477 16.8116 12.6910
> > tip+percpu-rswem-2 15.2558 0.8442 0.9188 17.0199 12.9586
> > tip+percpu-rswem-3 15.5297 0.6386 0.7991 17.4392 12.7992
>
> I did indeed manage to get flock02 down to a usable level and found:

Aside from the flock_lock_file function moving up, we also get an
increase in _raw_spin_lock.

Before:

5.17% 5.17% flock02 [kernel.vmlinux] [k] _raw_spin_lock
|
---_raw_spin_lock
|
|--99.75%-- flock_lock_file_wait
| sys_flock
| entry_SYSCALL_64_fastpath
| flock
--0.25%-- [...]


After:

7.20% 7.20% flock02 [kernel.vmlinux] [k] _raw_spin_lock
|
---_raw_spin_lock
|
|--52.23%-- flock_lock_file_wait
| sys_flock
| entry_SYSCALL_64_fastpath
| flock
|
|--25.92%-- flock_lock_file
| flock_lock_file_wait
| sys_flock
| entry_SYSCALL_64_fastpath
| flock
|
|--21.42%-- locks_delete_lock_ctx
| flock_lock_file
| flock_lock_file_wait
| sys_flock
| entry_SYSCALL_64_fastpath
| flock
--0.43%-- [...]


And it's not at all clear to me why this would be. It looks like
FILE_LOCK_DEFERRED is happening, but I've not yet figured out why that
would be.

2015-06-24 08:46:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Jun 23, 2015 at 07:50:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2015 at 04:56:39PM +0200, Daniel Wagner wrote:
> > > flock02
> > > mean variance sigma max min
> > > tip-1 11.8994 0.5874 0.7664 13.2022 8.6324
> > > tip-2 11.7394 0.5252 0.7247 13.2540 9.7513
> > > tip-3 11.8155 0.5288 0.7272 13.2700 9.9480
> > > tip+percpu-rswem-1 15.3601 0.8981 0.9477 16.8116 12.6910
> > > tip+percpu-rswem-2 15.2558 0.8442 0.9188 17.0199 12.9586
> > > tip+percpu-rswem-3 15.5297 0.6386 0.7991 17.4392 12.7992
> >
> > I did indeed manage to get flock02 down to a usable level and found:
>
> Aside from the flock_lock_file function moving up, we also get an
> increase in _raw_spin_lock.
>
> Before:
>
> 5.17% 5.17% flock02 [kernel.vmlinux] [k] _raw_spin_lock
> |
> ---_raw_spin_lock
> |
> |--99.75%-- flock_lock_file_wait
> | sys_flock
> | entry_SYSCALL_64_fastpath
> | flock
> --0.25%-- [...]
>
>
> After:
>
> 7.20% 7.20% flock02 [kernel.vmlinux] [k] _raw_spin_lock
> |
> ---_raw_spin_lock
> |
> |--52.23%-- flock_lock_file_wait
> | sys_flock
> | entry_SYSCALL_64_fastpath
> | flock
> |
> |--25.92%-- flock_lock_file
> | flock_lock_file_wait
> | sys_flock
> | entry_SYSCALL_64_fastpath
> | flock
> |
> |--21.42%-- locks_delete_lock_ctx
> | flock_lock_file
> | flock_lock_file_wait
> | sys_flock
> | entry_SYSCALL_64_fastpath
> | flock
> --0.43%-- [...]
>
>
> And it's not at all clear to me why this would be. It looks like
> FILE_LOCK_DEFERRED is happening, but I've not yet figured out why that
> would be.

So I'd suggest first comparing preemption behavior: does the workload
context-switch heavily, is the context-switching rate exactly the same, and are
the points of preemption the same as well between the two kernels?

[ Such high variance is often caused by (dynamically) unstable load balancing and
the workload never finding a good equilibrium. Any observable locking overhead
is usually just a second order concern or a symptom. Assuming the workload
context switches heavily. ]

Thanks,

Ingo

2015-06-24 09:02:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Wed, Jun 24, 2015 at 10:46:48AM +0200, Ingo Molnar wrote:
> > > > flock02
> > > > mean variance sigma max min
> > > > tip-1 11.8994 0.5874 0.7664 13.2022 8.6324
> > > > tip-2 11.7394 0.5252 0.7247 13.2540 9.7513
> > > > tip-3 11.8155 0.5288 0.7272 13.2700 9.9480
> > > > tip+percpu-rswem-1 15.3601 0.8981 0.9477 16.8116 12.6910
> > > > tip+percpu-rswem-2 15.2558 0.8442 0.9188 17.0199 12.9586
> > > > tip+percpu-rswem-3 15.5297 0.6386 0.7991 17.4392 12.7992

> [ Such high variance is often caused by (dynamically) unstable load balancing and
> the workload never finding a good equilibrium. Any observable locking overhead
> is usually just a second order concern or a symptom. Assuming the workload
> context switches heavily. ]

flock02 is a relatively stable benchmark -- unlike some of the others
where the variance is orders of magnitude higher than the avg.

But yes, I'll go poke at it more. I just need to hunt down unrelated
fail before continuing with this.

2015-06-24 09:18:50

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On 06/24/2015 10:46 AM, Ingo Molnar wrote:
> So I'd suggest first comparing preemption behavior: does the workload
> context-switch heavily, is the context-switching rate exactly the same, and are
> the points of preemption the same as well between the two kernels?

If I read this correctly, the answer is yes.

First the 'stable' flock02 test:

perf stat --repeat 5 --pre 'rm -rf /tmp/a' ~/src/lockperf/flock02 -n 128 -l 64 /tmp/a
0.008793148
0.008784990
0.008587804
0.008693641
0.008776946

Performance counter stats for '/home/wagi/src/lockperf/flock02 -n 128 -l 64 /tmp/a' (5 runs):

76.509634 task-clock (msec) # 3.312 CPUs utilized ( +- 0.67% )
2 context-switches # 0.029 K/sec ( +- 26.50% )
128 cpu-migrations # 0.002 M/sec ( +- 0.31% )
5,295 page-faults # 0.069 M/sec ( +- 0.49% )
89,944,154 cycles # 1.176 GHz ( +- 0.66% )
58,670,259 stalled-cycles-frontend # 65.23% frontend cycles idle ( +- 0.88% )
0 stalled-cycles-backend # 0.00% backend cycles idle
76,991,414 instructions # 0.86 insns per cycle
# 0.76 stalled cycles per insn ( +- 0.19% )
15,239,720 branches # 199.187 M/sec ( +- 0.20% )
103,418 branch-misses # 0.68% of all branches ( +- 6.68% )

0.023102895 seconds time elapsed ( +- 1.09% )


And here is posix01, which shows high variance:

perf stat --repeat 5 --pre 'rm -rf /tmp/a' ~/src/lockperf/posix01 -n 128 -l 64 /tmp/a
0.006020402
32.510838421
55.516466069
46.794470223
5.097701438

Performance counter stats for '/home/wagi/src/lockperf/posix01 -n 128 -l 64 /tmp/a' (5 runs):

4177.932106 task-clock (msec) # 14.162 CPUs utilized ( +- 34.59% )
70,646 context-switches # 0.017 M/sec ( +- 31.56% )
28,009 cpu-migrations # 0.007 M/sec ( +- 33.55% )
4,834 page-faults # 0.001 M/sec ( +- 0.98% )
7,291,160,968 cycles # 1.745 GHz ( +- 32.17% )
5,216,204,262 stalled-cycles-frontend # 71.54% frontend cycles idle ( +- 32.13% )
0 stalled-cycles-backend # 0.00% backend cycles idle
1,901,289,780 instructions # 0.26 insns per cycle
# 2.74 stalled cycles per insn ( +- 30.80% )
440,415,914 branches # 105.415 M/sec ( +- 31.06% )
1,347,021 branch-misses # 0.31% of all branches ( +- 29.17% )

0.295016987 seconds time elapsed ( +- 32.01% )


BTW, thanks for the perf stat tip. Really handy!

cheers,
daniel

2015-07-01 06:03:47

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

Hi,

I did a sweep over the parameters for posix01. The parameters are the number
of processes and the number of locks taken per process. In contrast to the
other tests, it looks like there is no parameter set which yields a nice stable
result (read: low variance). I have tried several things, including
pinning all processes to CPUs to avoid migration. The results
improved slightly but there was still a high variance.

Anyway, I have collected some data and I'd like to share it. Maybe it is
still useful. All numbers here are without the above-mentioned pinning.
There are some runs missing (I don't know the reason yet) and I didn't let
it run till the end, so take these numbers with a grain of salt.

The test script and raw data can be found here:

http://monom.org/posix01/

The tables read:
nproc: number of processes started
columns: number of locks taken per process

Hardware:
4x E5-4610; for this test all processes are scheduled on one socket

First the numbers for tip 4.1.0-02756-ge3d06bd.

nproc 8
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.075449 0.210547 0.340658 0.464083 0.590400
std 0.015550 0.024989 0.032080 0.043803 0.055003
min 0.021643 0.067456 0.211779 0.279643 0.327628
25% 0.065337 0.195664 0.318114 0.430040 0.546488
50% 0.075345 0.209411 0.338512 0.461397 0.591433
75% 0.084725 0.226517 0.364190 0.494638 0.626532
max 0.127050 0.281836 0.454558 0.607559 0.762149


nproc 16
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 1.023660 2.463384 3.891954 5.312716 6.752857
std 0.105065 0.124916 0.136476 0.172906 0.207449
min 0.351199 1.527379 3.106403 4.157478 5.519601
25% 0.961098 2.397597 3.807098 5.201875 6.633034
50% 1.031460 2.467317 3.895824 5.321227 6.757502
75% 1.093412 2.539284 3.985122 5.432336 6.889859
max 1.278603 2.785901 4.369434 5.798982 7.324263


nproc 24
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 3.460166 7.942193 11.898540 11.150066 11.060036
std 0.191564 0.232989 0.612868 0.680323 0.465967
min 2.748545 6.575510 9.977165 9.209685 8.937682
25% 3.325521 7.806847 11.440580 10.774070 10.912302
50% 3.493138 7.951859 11.852556 11.163595 11.074910
75% 3.596927 8.088036 12.443429 11.365197 11.243125
max 3.974884 8.589840 13.079780 16.341043 14.244954


nproc 32
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 6.797286 13.943421 14.373278 15.857103 20.047039
std 0.366013 0.417859 0.625967 0.377463 0.302939
min 3.323312 12.266006 12.492706 14.451931 17.496059
25% 6.649401 13.719397 14.186790 15.738348 19.958001
50% 6.868362 13.862458 14.312992 15.870438 20.083564
75% 6.995801 14.027167 14.429383 15.984881 20.215722
max 7.369007 15.631300 21.587450 19.364991 20.755793


nproc 40
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 11.156514 16.936808 18.930412 25.605206 32.334239
std 0.613158 0.614545 0.485336 0.344226 0.398747
min 5.609261 13.147398 16.930261 23.448985 28.992899
25% 10.999876 16.740775 18.788180 25.481274 32.188020
50% 11.251502 16.883100 18.946506 25.648879 32.369347
75% 11.439205 17.032133 19.105678 25.806715 32.565019
max 12.155905 24.116348 26.152117 26.502637 33.263763


nproc 48
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 16.523705 18.214558 27.877811 37.703763 47.655792
std 0.974732 1.118383 0.357481 0.435081 0.472945
min 7.909358 16.279568 25.989797 35.308061 45.279940
25% 16.385582 17.960832 27.729399 37.555420 47.458123
50% 16.692900 18.137635 27.920459 37.767064 47.679325
75% 16.927355 18.311502 28.092018 37.950782 47.926311
max 17.720374 35.810409 28.721941 38.746273 49.333097


nproc 56
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 11.567668 25.100333 38.603884 52.135564 65.716669
std 0.320771 0.369833 0.554834 0.534120 0.612844
min 10.123811 22.598875 35.668780 49.182148 62.504962
25% 11.394438 24.925338 38.389200 51.885988 65.441492
50% 11.593920 25.135043 38.641839 52.206010 65.771692
75% 11.789101 25.328558 38.895343 52.451819 66.068270
max 12.319346 25.948404 46.458428 53.605888 67.270679


nproc 64
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 15.421295 33.254418 51.073912 68.936111 86.919074
std 0.398493 0.411222 0.551629 0.690891 0.694183
min 13.269859 30.900978 48.174802 65.549282 83.099271
25% 15.203732 33.037478 50.821702 68.619365 86.579749
50% 15.467885 33.279869 51.130972 69.001664 86.953804
75% 15.694466 33.514712 51.380860 69.361632 87.341084
max 16.347321 34.475095 52.507292 70.884752 88.807083


nproc 72
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 19.762286 42.488827 65.167763 87.903430 110.666679
std 0.483660 0.480269 0.689872 0.828354 0.892759
min 15.506067 39.937453 61.196633 84.227403 107.014850
25% 19.519194 42.261548 64.834133 87.515837 110.225142
50% 19.809986 42.541263 65.265768 87.974049 110.747980
75% 20.083315 42.792858 65.603762 88.392599 111.223192
max 20.913434 43.830009 66.791452 90.184550 113.062344


nproc 80
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 24.782285 52.853068 80.902314 109.112294 137.441640
std 0.523731 0.639160 0.799033 0.952619 1.091478
min 20.126615 47.813274 77.357915 104.033857 131.978443
25% 24.498501 52.547855 80.509926 108.606293 136.877050
50% 24.835766 52.918841 80.950773 109.197236 137.498470
75% 25.137887 53.244013 81.376380 109.723791 138.101133
max 26.161997 54.372957 83.266046 111.709888 140.419400


nproc 88
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 30.196867 64.467080 98.710365 133.024282 167.330900
std 0.749476 0.691460 0.863908 1.033780 1.240237
min 16.647491 60.034797 94.053510 128.281171 161.778166
25% 29.896764 64.121607 98.290368 132.484092 166.711172
50% 30.271808 64.514222 98.742714 133.089852 167.429483
75% 30.627200 64.903154 99.262584 133.706735 168.086624
max 31.806051 66.343856 101.077264 136.143873 170.449596


nproc 96
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 36.304100 77.194851 117.958001 158.820159 199.868940
std 0.712442 0.718565 1.009163 1.220813 1.462219
min 31.128111 73.850226 112.075970 152.910227 192.977453
25% 35.928427 76.811233 117.466922 158.151278 199.058411
50% 36.378220 77.209148 117.998878 158.879704 199.861157
75% 36.761744 77.636286 118.615380 159.583272 200.701769
max 38.069263 79.445286 120.878239 162.826438 206.826424


nproc 104
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 42.731401 90.887253 138.815476 186.824953 235.055458
std 1.045572 0.742232 0.999065 1.298818 1.554890
min 23.734733 87.384048 133.462821 180.971966 227.475939
25% 42.353032 90.441055 138.213962 186.109237 234.169575
50% 42.861112 90.900274 138.836083 186.835884 235.084204
75% 43.236527 91.382487 139.460129 187.694247 236.011148
max 44.600281 93.394394 141.959512 190.171221 239.491909


nproc 112
100 200 300 400
count 460.000000 460.000000 460.000000 460.000000
mean 49.782729 105.468739 161.416099 217.385757
std 0.904312 1.011980 1.222772 1.475225
min 45.334285 100.711113 156.087707 210.639527
25% 49.394518 104.971028 160.743875 216.590612
50% 49.906665 105.604756 161.528712 217.437408
75% 50.363428 106.088852 162.187166 218.286111
max 51.800116 108.372299 164.614385 221.788613


And now the same tests for tip+percpu_rwsem:

nproc 8
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.285784 0.639623 0.935062 1.165287 1.457565
std 0.040458 0.089317 0.112704 0.094596 0.110337
min 0.118961 0.253775 0.351943 0.869095 1.026194
25% 0.263250 0.600806 0.858630 1.100281 1.376566
50% 0.287019 0.649395 0.930437 1.167166 1.461235
75% 0.312601 0.692013 1.013786 1.228887 1.533511
max 0.407264 0.860837 1.298671 1.460842 1.927867


nproc 16
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 2.338683 5.219408 8.117279 11.050641 14.035433
std 0.146102 0.270400 0.392875 0.510692 0.576044
min 1.836110 4.179970 6.491748 8.998336 11.442838
25% 2.239374 5.042915 7.860587 10.728740 13.667630
50% 2.335801 5.217732 8.125243 11.052183 14.010561
75% 2.443152 5.404223 8.396037 11.404375 14.417740
max 2.798029 5.927344 9.172875 12.203548 15.444552


nproc 24
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 6.399927 13.673487 20.729554 27.316864 34.125202
std 0.558388 1.157996 1.647191 2.066864 2.487975
min 4.961608 10.767524 17.145018 22.441426 28.566438
25% 5.987118 12.849801 19.555979 25.943463 32.399122
50% 6.388215 13.583983 20.533054 27.122120 33.959403
75% 6.915310 14.786835 22.252796 29.187176 36.308254
max 7.405319 15.823960 23.858206 31.754922 38.997955


nproc 32
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 11.973832 24.885823 36.705614 48.036525 57.418669
std 1.270516 2.604583 3.963139 5.283237 6.441122
min 9.395066 19.958662 27.768684 38.247046 46.265231
25% 10.955417 22.708953 33.510437 43.613011 51.901209
50% 11.801515 24.556642 35.805816 47.315635 55.933447
75% 13.294692 27.520679 40.689642 53.139912 63.860584
max 14.217272 29.968337 44.409489 58.246754 71.045867


nproc 40
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 19.307414 39.204462 55.768040 70.808627 83.830246
std 2.189803 3.982241 5.467692 6.737372 8.124025
min 14.450258 30.606836 44.342114 55.520218 64.704178
25% 17.418113 35.968251 51.341042 65.352697 77.744806
50% 19.067713 39.023460 55.548934 70.282785 83.374667
75% 21.479466 42.666118 60.379906 76.604241 91.158904
max 23.687483 47.019928 67.143361 85.084045 100.957011


nproc 48
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 28.386773 55.462523 77.886706 92.579064 104.319703
std 3.231688 6.142373 8.633285 10.950222 12.510504
min 21.703659 42.486864 56.904221 66.605689 76.529646
25% 25.635256 50.575642 71.306694 82.931995 94.222776
50% 28.136694 55.235674 77.298409 91.993559 104.909015
75% 31.484979 60.645302 85.693462 102.195018 114.141212
max 35.713537 68.342796 96.065304 115.926497 130.916876


nproc 56
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 39.037206 74.470404 97.900979 111.320283 135.943281
std 4.594741 8.940246 11.715321 13.823450 16.032080
min 29.532559 55.193557 65.590273 79.580482 98.565733
25% 35.212004 66.990273 88.066459 100.643871 122.864654
50% 38.796902 73.928176 96.771490 110.669216 136.199617
75% 43.154846 82.041731 108.937264 120.727216 147.769269
max 49.215714 92.181542 125.188702 141.113117 170.961264


nproc 64
100 200 300 400 500
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 51.099012 93.028015 114.649700 145.944300 178.043572
std 6.310777 12.719401 14.675830 18.019135 21.084448
min 36.770938 54.620852 80.837116 98.765936 126.207980
25% 45.955694 84.078285 103.452854 132.127548 160.746493
50% 50.275929 93.031565 114.333533 144.951788 177.105994
75% 56.955477 104.656181 128.418118 163.865640 197.275452
max 63.369715 120.360706 146.542148 182.482159 218.814651


nproc 72
100 200 300 400 500
count 506.000000 506.000000 506.000000 506.000000 506.000000
mean 64.905270 108.760098 138.811285 179.277895 222.584001
std 8.784532 16.293281 18.160401 21.203767 25.904456
min 43.035451 64.762288 96.401934 127.995159 162.341026
25% 58.658290 98.438247 126.035692 162.944645 202.228444
50% 64.756854 109.608197 139.190635 181.413255 223.359111
75% 72.488483 123.608470 152.745541 195.549278 245.454358
max 83.424516 139.214509 172.538610 218.677815 270.799895


nproc 80
100 200 300 400 500
count 61.000000 61.000000 61.000000 61.000000 61.000000
mean 76.727789 124.438489 174.095378 225.855798 272.416390
std 9.757928 18.034325 20.216132 24.868596 29.384832
min 55.988043 83.842137 130.842940 173.596051 208.508169
25% 69.218268 116.679810 162.149179 207.015727 252.194955
50% 75.392969 125.378519 173.117425 225.071270 276.188038
75% 83.748328 136.689138 192.392097 245.019530 296.407232
max 97.004966 165.172805 206.391629 266.751069 318.089290


nproc 88
100
count 157.000000
mean 90.337638
std 15.239911
min 53.393662
25% 79.648088
50% 91.075065
75% 103.530939
max 120.680507


And an attempt at visualization:

http://monom.org/posix01/sweep-4.1.0-02756-ge3d06bd.png
http://monom.org/posix01/sweep-4.1.0-02769-g6ce2591.png


Let me know if these numbers help or not. I'm starting to get better at
running these tests, though they take quite some time to finish. So if they
are useless I'll sleep well without doing this :)

cheers,
daniel

2015-07-01 21:55:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Tue, Jun 30, 2015 at 10:57 PM, Daniel Wagner <[email protected]> wrote:
>
> And an attempt at visualization:
>
> http://monom.org/posix01/sweep-4.1.0-02756-ge3d06bd.png
> http://monom.org/posix01/sweep-4.1.0-02769-g6ce2591.png

Ugh. The old numbers look (mostly) fairly tight, and then the new ones
are all over the map, and usually much worse.

We've seen this behavior before when switching from a non-sleeping
lock to a sleeping one. The sleeping locks have absolutely horrible
behavior when they get contended, and spend tons of CPU time on the
sleep/wakeup management, based on almost random timing noise. And it
can get orders of magnitude worse if there are any nested locks that
basically trigger trains of that kind of behavior.

In general, sleeping locks are just horribly horribly bad for things
that do small simple operations. Which is what fs/locks.c does.

I'm not convinced it's fixable. Maybe the new rwsem just isn't a good idea.

Linus

2015-07-02 09:42:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Wed, Jul 01, 2015 at 02:54:59PM -0700, Linus Torvalds wrote:
> On Tue, Jun 30, 2015 at 10:57 PM, Daniel Wagner <[email protected]> wrote:
> >
> > And an attempt at visualization:
> >
> > http://monom.org/posix01/sweep-4.1.0-02756-ge3d06bd.png
> > http://monom.org/posix01/sweep-4.1.0-02769-g6ce2591.png
>
> Ugh. The old numbers look (mostly) fairly tight, and then the new ones
> are all over the map, and usually much worse.
>
> We've seen this behavior before when switching from a non-sleeping
> lock to a sleeping one. The sleeping locks have absolutely horrible
> behavior when they get contended, and spend tons of CPU time on the
> sleep/wakeup management,

Right, I'm just not seeing how any of that would happen here :/ The read
side would only ever block on reading /proc/$something and I'm fairly
sure that benchmark doesn't actually touch that file.

In any case, I will look into this; I've just not had time yet.

2015-07-20 05:53:17

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On 07/02/2015 11:41 AM, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 02:54:59PM -0700, Linus Torvalds wrote:
>> On Tue, Jun 30, 2015 at 10:57 PM, Daniel Wagner <[email protected]> wrote:
>>>
>>> And an attempt at visualization:
>>>
>>> http://monom.org/posix01/sweep-4.1.0-02756-ge3d06bd.png
>>> http://monom.org/posix01/sweep-4.1.0-02769-g6ce2591.png
>>
>> Ugh. The old numbers look (mostly) fairly tight, and then the new ones
>> are all over the map, and usually much worse.
>>
>> We've seen this behavior before when switching from a non-sleeping
>> lock to a sleeping one. The sleeping locks have absolutely horrible
>> behavior when they get contended, and spend tons of CPU time on the
>> sleep/wakeup management,
>
> Right, I'm just not seeing how any of that would happen here :/ The read
> side would only ever block on reading /proc/$something and I'm fairly
> sure that benchmark doesn't actually touch that file.
>
> In any case, I will look into this, I've just not had time yet..

I did some more testing and found out that the slow path of percpu_down_read()
is never taken (as expected). The only change left is the switch from per-cpu
arch_spinlock_t spinlocks to per-cpu spinlock_t spinlocks.

Turning them back into arch_spinlock_t gives almost the same numbers as
with spinlock_t.

Then Peter suggested changing the code to

preempt_disable();
spin_unlock();
preempt_enable_no_resched();

to verify if arch_spin_lock() is buggy and does not disable preemption
and we see lock-holder preemption on non-virt setups.

Here all the numbers and plots:

- base line
http://monom.org/posix01-4/tip-4.1.0-02756-ge3d06bd.png
http://monom.org/posix01-4/tip-4.1.0-02756-ge3d06bd.txt

- arch_spinlock_t
http://monom.org/posix01-4/arch_spintlock_t-4.1.0-02769-g6ce2591-dirty.png
http://monom.org/posix01-4/arch_spintlock_t-4.1.0-02769-g6ce2591-dirty.txt
http://monom.org/posix01-4/arch_spintlock_t-4.1.0-02769-g6ce2591-dirty.patch

- no resched
http://monom.org/posix01-4/no_resched-4.1.0-02770-g4d518cf.png
http://monom.org/posix01-4/no_resched-4.1.0-02770-g4d518cf.txt
http://monom.org/posix01-4/no_resched-4.1.0-02770-g4d518cf.patch

cheers,
daniel

2015-07-20 18:44:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/13] percpu rwsem -v2

On Sun, Jul 19, 2015 at 10:53 PM, Daniel Wagner <[email protected]> wrote:
>
> Turning them back into arch_spinlock_t gives almost the same numbers as
> with spinlock_t.
>
> Then Peter suggested changing the code to
>
> preempt_disable();
> spin_unlock();
> preempt_enable_no_resched();
>
> to verify if arch_spin_lock() is buggy and does not disable preemption
> and we see lock-holder preemption on non-virt setups.

Hmm. "arch_spin_lock()" isn't _supposed_ to disable preemption. The
caller should do that (possibly by disabling interrupts). See
include/linux/spinlock_api_smp.h for details.

But yes, that's a *very* subtle difference between "arch_spin_lock()"
and "spin_lock()". The former doesn't do lockdep or other debugging
and it doesn't disable preemption. So they are not interchangeable.

The current lglock code uses arch_spin_lock exactly because it does not
*want* lockdep tracking (it does its own) and because it does its own
preemption handling.
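
In other words, the two usage patterns look roughly like this (a sketch with
made-up names, not actual kernel code):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock);				/* made-up names */
static arch_spinlock_t alock = __ARCH_SPIN_LOCK_UNLOCKED;

static void with_spin_lock(void)
{
	spin_lock(&lock);	/* preempt_disable() + lockdep + acquire */
	/* ... critical section ... */
	spin_unlock(&lock);	/* release + lockdep + preempt_enable() */
}

static void with_arch_spin_lock(void)
{
	preempt_disable();		/* lglock-style code brackets the raw lock itself */
	arch_spin_lock(&alock);		/* raw acquire: no lockdep, no preemption handling */
	/* ... critical section ... */
	arch_spin_unlock(&alock);
	preempt_enable();
}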

So saying "verify if arch_spin_lock() is buggy and does not disable
preemption" is complete BS. If arch_spin_lock() were to disable
preemption, _that_ would be a bug.

Linus