2013-06-01 19:03:05

by Manfred Spraul

Subject: sem_otime thrashing

Hi Rik,

I finally managed to get EFI boot, i.e. I'm now able to test on my i3
(2core+HT).

With semscale (i.e.: just overhead, perform semop=0 operations), the
scalability from 1 to 2 cores is good, but not linear:
# semscale 10 | grep "interleave 2"
> Cpus 1, interleave 2 delay 0: 35502103 in 10 secs
> Cpus 2, interleave 2 delay 0: 53990954 in 10 secs
---
+53% when adding the 2nd core
(interleave 2 to force to use different cores)
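
(To be explicit: a "semop=0" operation here is meant as a semop() call
that doesn't change any semaphore value, i.e. roughly the sketch below,
so only the syscall and locking overhead is measured:)

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* illustration only: sem_op = 0 means "wait for zero", which succeeds
 * immediately on a semaphore that is already 0 - no state is changed */
static void noop_semop(int semid)
{
	struct sembuf sop = { .sem_num = 0, .sem_op = 0, .sem_flg = 0 };

	semop(semid, &sop, 1);
}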

Did you consider moving sem_otime into the individual semaphores?
I did that (gross patch attached), and the performance is significantly
better:

# semscale 10 | grep "interleave 2"
Cpus 1, interleave 2 delay 0: 35585634 in 10 secs
Cpus 2, interleave 2 delay 0: 70410230 in 10 secs
---
+99% scalability when adding the 2nd core
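
The core of the idea, as a rough sketch only (the attached WIP patch is
the authoritative version): instead of one sem_otime in struct sem_array
that every semop() on every CPU writes, and which therefore bounces
between the cores' caches, each struct sem gets its own otime and the
array-wide value is computed on demand:

struct sem {
	int	semval;		/* current value */
	int	sempid;		/* pid of last operation */
	time_t	sem_otime;	/* new: per-semaphore last-semop time */
	/* ... */
};

static time_t get_semotime(struct sem_array *sma)
{
	time_t res = sma->sem_base[0].sem_otime;
	int i;

	/* semctl(IPC_STAT) etc. report the most recent otime of any semaphore */
	for (i = 1; i < sma->sem_nsems; i++) {
		time_t to = sma->sem_base[i].sem_otime;

		if (to > res)
			res = to;
	}
	return res;
}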

Unfortunately I won't be able to read my mails next week, but the effect
was too significant not to share it immediately.

--
Manfred


Attachments:
patch-sem_otime_WIP (3.30 kB)

2013-06-02 05:53:27

by Mike Galbraith

Subject: Re: sem_otime thrashing

On Sat, 2013-06-01 at 21:02 +0200, Manfred Spraul wrote:
> Hi Rik,
>
> I finally managed to get EFI boot, i.e. I'm now able to test on my i3
> (2core+HT).
>
> With semscale (i.e.: just overhead, perform semop=0 operations), the
> scalability from 1 to 2 cores is good, but not linear:
> # semscale 10 | grep "interleave 2"
> > Cpus 1, interleave 2 delay 0: 35502103 in 10 secs
> > Cpus 2, interleave 2 delay 0: 53990954 in 10 secs
> ---
> +53% when adding the 2nd core
> (interleave 2 to force to use different cores)
>
> Did you consider moving sem_otime into the individual semaphores?
> I did that (gross patch attached), and the performance is significantly
> better:
>
> # semscale 10 | grep "interleave 2"
> Cpus 1, interleave 2 delay 0: 35585634 in 10 secs
> Cpus 2, interleave 2 delay 0: 70410230 in 10 secs
> ---
> +99% scalability when adding the 2nd core
>
> Unfortunately I won't be able to read my mails next week, but the effect
> was too significant not to share it immediately.

64-core box.

Previous numbers:
vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rtm
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 33553800, ops/sec 1118460

New numbers:
vogelweide:/abuild/mike/:[0]# !./semop-multi
./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 129474934, ops/sec 4315831

But, the box RCU-stalled on me. It's looking like the scalability patches
are a bit racy RCU-wise in an -rt kernel (oh dear). So, I rebuilt as plain
old PREEMPT again to eliminate the -rt funnies.



Previous numbers:
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 22053968, ops/sec 735132

vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.858765 seconds for 1000192 loops
per loop execution time: 1.858 usec

New numbers:
vogelweide:/abuild/mike/:[0]# !./semop
./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 45521478, ops/sec 1517382
vogelweide:/abuild/mike/:[0]# !./osim
./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.350682 seconds for 1000192 loops
per loop execution time: 0.350 usec

(1.8->0.3?.. box, you ain't a race horse, you're a plow horse)

vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.276405 seconds for 1000192 loops
per loop execution time: 0.276 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.370041 seconds for 1000192 loops
per loop execution time: 0.369 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 1000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.502396 seconds for 1000192 loops
per loop execution time: 0.502 usec

(runtime)

vogelweide:/abuild/mike/:[0]# ./osim 64 256 10000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 39063 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 3.354423 seconds for 10000128 loops
per loop execution time: 0.335 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 100000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 390625 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 41.180479 seconds for 100000000 loops
per loop execution time: 0.411 usec

Box likes your idea.