2001-04-22 00:29:00

by D.W.Howells

Subject: [PATCH] rw_semaphores, optimisations

This patch (made against linux-2.4.4-pre6) makes a number of changes to the
rwsem implementation:

(1) Fixes a subtle contention bug between up_write and the down_* functions.

(2) Optimises the i386 fastpath implementation and changes the slowpath
implementation to aid it.
- arch/i386/lib/rwsem.c is now gone.
- Better inline asm constraints have been applied.

(3) Changes the sparc64 fastpath implementation to use the revised slowpath
interface.
[Dave Miller: can you check this please]

(4) Makes the generic spinlock implementation non-inline.
- lib/rwsem.c has been duplicated to lib/rwsem-spinlock.c and a
slightly different algorithm has been created. This one is simpler
since it does not need atomic operations on the counters, as all
accesses to them are governed by a blanket spinlock (see the sketch
below).
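
To illustrate point (4), here is a minimal sketch of the idea (the names are
illustrative only, not the code from the patch): because every access to the
counters happens under the semaphore's own spinlock, a plain non-atomic
increment is enough.

#include <linux/spinlock.h>
#include <linux/list.h>

struct rw_semaphore_sketch {
    int                 activity;   /* > 0: that many readers; -1: one writer; 0: free */
    spinlock_t          wait_lock;  /* blanket lock covering every field */
    struct list_head    wait_list;  /* tasks waiting for the semaphore */
};

static void down_read_sketch(struct rw_semaphore_sketch *sem)
{
    spin_lock(&sem->wait_lock);
    if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
        /* no writer active and nobody queued: granted immediately,
           with a plain (non-atomic) increment under the spinlock */
        sem->activity++;
        spin_unlock(&sem->wait_lock);
        return;
    }
    /* otherwise queue ourselves on wait_list and sleep (slow path omitted) */
    spin_unlock(&sem->wait_lock);
}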

David


Attachments:
rwsem-opt.diff (29.25 kB)
rw-semaphore optimisations patch

2001-04-22 19:07:58

by Andrea Arcangeli

Subject: Re: [PATCH] rw_semaphores, optimisations

On Sun, Apr 22, 2001 at 01:27:20AM +0100, D . W . Howells wrote:
> This patch (made against linux-2.4.4-pre6) makes a number of changes to the
> rwsem implementation:
>
> (1) Fixes a subtle contention bug between up_write and the down_* functions.
>
> (2) Optimises the i386 fastpath implementation and changes the slowpath
> implementation to aid it.
> - The arch/i386/lib/rwsem.c is now gone.
> - Better inline asm constraints have been applied.
>
> (3) Changes the sparc64 fastpath implementation to use the revised slowpath
> interface.
> [Dave Miller: can you check this please]
>
> (4) Makes the generic spinlock implementation non-inline.
> - lib/rwsem.c has been duplicated to lib/rwsem-spinlock.c and a
> slightly different algorithm has been created. This one is simpler
> since it does not have to use atomic operations on the counters as
> all accesses to them are governed by a blanket spinlock.

I finally finished dropping the list_empty() check from my generic rwsemaphores.

The new behaviour of my rwsemaphores is:

- max sleepers and max simultaneous readers are both downgraded to
2^(BITS_PER_LONG>>1) for higher performance in the fast path (see the
sketch below)
- up_* cannot be called from irq/softirq context anymore
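
To make the first point concrete (this is only an illustration of the
arithmetic, not code from my patch):

/* splitting one long in half caps both quantities at 2^(BITS_PER_LONG>>1):
   2^16 = 65536 on 32-bit archs, 2^32 on 64-bit archs */
#define RWSEM_HALF_BITS     (BITS_PER_LONG >> 1)
#define RWSEM_MAX_READERS   (1UL << RWSEM_HALF_BITS)
#define RWSEM_MAX_SLEEPERS  (1UL << RWSEM_HALF_BITS)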

I'm using your latest patch on top of vanilla 2.4.4-pre6 to benchmark my
new generic rwsemaphores against yours.

To benchmark I'm using your tar.gz, but I kept my 50-second run of the rwtest
with many more readers and writers to stress it a bit more (I posted the
changes to the rwsem-rw.c test in the message where I reported the race in your
semaphores before pre6).

Here are the results (I did two runs for every benchmark, and as you can see
the numbers are quite stable).

rwsem-2.4.4-pre6 + my new generic rwsem (fast path in C inlined)

rw

reads taken: 6499121
writes taken: 3355701
reads taken: 6413447
writes taken: 3311328

r1

reads taken: 15218540
reads taken: 15203915

r2

reads taken: 5087253
reads taken: 5099084

ro

reads taken: 4274607
reads taken: 4280280

w1

writes taken: 14723159
writes taken: 14708296

wo

writes taken: 1778713
writes taken: 1776248

----------------------------------------------
rwsem-2.4.4-pre6 + my new generic rwsem (fast path in C out of line)

rw

reads taken: 6116063
writes taken: 3157816
reads taken: 6248542
writes taken: 3226122

r1

reads taken: 14092045
reads taken: 14112771

r2

reads taken: 4002635
reads taken: 4006940

ro

reads taken: 4150747
reads taken: 4150279

w1

writes taken: 13655019
writes taken: 13639011

wo

writes taken: 1757065
writes taken: 1758623

----------------------------------------------
RWSEM_GENERIC_SPINLOCK y in rwsem-2.4.4-pre6 + your latest #try2 (fast path in C out of line)

rw

reads taken: 5872682
writes taken: 3032179
reads taken: 5892582
writes taken: 3042346

r1

reads taken: 13079190
reads taken: 13104405

r2

reads taken: 3907702
reads taken: 3906729

ro

reads taken: 3005924
reads taken: 3006690

w1

writes taken: 12581209
writes taken: 12570627

wo

writes taken: 1376782
writes taken: 1328468

----------------------------------------------
RWSEM_XCHGADD y in rwsem-2.4.4-pre6 + your latest #try2 (fast path in asm, inlined)

rw

reads taken: 5789650
writes taken: 2989604
reads taken: 5776594
writes taken: 2982812

r1

reads taken: 16935488
reads taken: 16930738

r2

reads taken: 5646446
reads taken: 5651600

ro

reads taken: 4952654
reads taken: 4956992

w1

writes taken: 15432874
writes taken: 15408684

wo

writes taken: 814131
writes taken: 815551

To make it more readable I plotted the numbers in a graph (attached).

So my new C-based rw semaphores are faster than your C-based semaphores in
all tests, and they're even faster than your x86 asm inlined semaphores when
there's contention (however, it's probably normal that the asm semaphores are
slower under contention, so I'm not sure I can make the asm inlined semaphores
faster than yours). I will now try to boost my new rwsem with asm in the fast
path (it won't be easy, but possibly by changing only a few lines of the slow
path inside an #ifdef CONFIG_...) to see where I can go. Once I have an asm
inlined fast path for >=486 integrated I will upload a patch, but if you want a
patch now with only the new generic C implementation to verify the above
numbers, let me know of course (ah, and btw, the machine is a 2-way PII 450MHz).

I think the comparison of my generic rwsem when inlined and not inlined (first
four columns) is also interesting. BTW, I also tried moving my only lock into a
separate cacheline (growing the size of the rwsem beyond 32 bytes) and that
hurt a lot (see the sketch below).
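
The cacheline experiment looked roughly like this (illustrative layout, not my
exact structure); on this PII, with its 32-byte cache lines, the extra padding
was a clear loss:

#include <linux/spinlock.h>
#include <linux/list.h>

struct rwsem_padded_sketch {
    long                count;
    struct list_head    wait_list;
    /* force the lock onto its own 32-byte cache line: this grows the
       structure well past 32 bytes and turned out to hurt here */
    spinlock_t          wait_lock __attribute__((aligned(32)));
};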

Note also that your wo results with your asm version (the one in pre6 + your
latest try#2) score more than two times worse than my generic semaphores.

Andrea


Attachments:
rwsem.png (5.99 kB)

2001-04-22 19:17:10

by Andrea Arcangeli

Subject: Re: [PATCH] rw_semaphores, optimisations

On Sun, Apr 22, 2001 at 09:07:03PM +0200, Andrea Arcangeli wrote:
> On Sun, Apr 22, 2001 at 01:27:20AM +0100, D . W . Howells wrote:

btw, I noticed I answered your previous email, but for my benchmarks I really
used your latest #try2 posted today at 13:00 (not the one from last night at
1am). Just to avoid mistakes, this is the md5sum of the patch I applied on top
of pre6:

510c05d168c2b60e0d9c804381579b51 rwsem.diff

Andrea

2001-04-22 22:54:03

by D.W.Howells

Subject: Re: [PATCH] rw_semaphores, optimisations

Hello Andrea,

Interesting benchmarks... did you compile the test programs with "make
SCHED=yes" by any chance? Also what other software are you running?

The reason I ask is that with a full-blown KDE setup running in the
background, I get the following numbers on the rwsem-ro test (XADD optimised
kernel):

SCHED: 4615646, 4530769, 4534453 and 4628365
no SCHED: 6311620, 6312776, 6327772 and 6325508

Also quite stable as you can see.

> (ah and btw the machine is a 2-way PII 450mhz).

Your numbers were "4274607" and "4280280" for this kernel and test. This I
find a little surprising; I'd expect them to be about 10% higher than what I
get on my machine, given your faster CPUs.

What compiler are you using? I'm using the following:

Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-80)

Something else that I noticed: playing a music CD appears to improve the
benchmarks all round :-) Must be some interrupt effect of some sort, or maybe
they just like the music...

> rwsem-2.4.4-pre6 + my new generic rwsem (fast path in C inlined)

Linus wants out-of-line generic code only, I believe, which is why I made my
generic code out of line.

I have noticed one glaring potential slowdown in my generic code's down
functions. I've got the following in _both_ fastpaths!:

struct task_struct *tsk = current;

It shouldn't hurt _too_ much (it's only reg->reg anyway), but it will have an
effect. I'll have to move it and post another patch tomorrow.
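
The fix I have in mind looks roughly like this (a sketch with a hypothetical
helper, not the actual patch): only touch current once we already know we have
to block.

#include <linux/sched.h>    /* for current and struct task_struct */

struct rw_semaphore;
/* hypothetical fast-path attempt: returns non-zero if the lock was taken */
extern int rwsem_read_trylock_fast(struct rw_semaphore *sem);

/* slow path: the only place that actually needs the task pointer */
static void rwsem_down_read_failed_sketch(struct rw_semaphore *sem)
{
    struct task_struct *tsk = current;  /* now paid only on contention */
    /* ... put tsk on the wait list and sleep ... */
}

/* fast path: no load of "current" at all */
static inline void down_read_sketch(struct rw_semaphore *sem)
{
    if (!rwsem_read_trylock_fast(sem))
        rwsem_down_read_failed_sketch(sem);
}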

I've also been comparing the assembly from the two generic spinlock
implementations (having out-of-lined yours in what I think is the way you'd
have done it). I've noticed a number of things:

(1) My fastpaths have slightly fewer instructions in them

(2) gcc-2.96-20000731 produces what looks like much less efficient code
than gcc-snapshot-20010409 (to be expected, I suppose).

(3) Both compilers do insane things to registers (like in one instruction
moving %eax to %edx and then moving it back again in the next).

(4) If _any_ inline assembly is used, the compiler grabs extra chunks of
stack which it does not then use. It will then pop these into registers
under some circumstances. It'll also save random registers it doesn't
clobber under others.

(Basically, I had a lot of frustrating fun playing around with the spinlock
asm constraints trying to avoid the compiler clobbering registers
unnecessarily because of them).
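
As an illustration of the sort of constraints I ended up preferring (a toy
example, not the actual spinlock code): a single "+m" memory operand plus a
"memory" clobber names no registers at all, so the compiler has less excuse to
spill.

/* toy lock: the word is 1 when free, <= 0 when taken */
static inline void example_spin_lock(volatile int *lock)
{
    __asm__ __volatile__(
        "1: lock ; decl %0\n\t"
        "   jns 3f\n"
        "2: rep ; nop\n\t"
        "   cmpl $0, %0\n\t"
        "   jle 2b\n\t"
        "   jmp 1b\n"
        "3:"
        : "+m" (*lock)  /* the only operand lives in memory */
        :
        : "memory");    /* compiler barrier; no registers named */
}

static inline void example_spin_unlock(volatile int *lock)
{
    __asm__ __volatile__("movl $1, %0" : "=m" (*lock) : : "memory");
}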

I've attached the source file I've been playing with and an example
disassembly dump for your amusement. I used the snapshot gcc to do this (it
emits conditional chunks of code out of line more intelligently than the
older one does).

It's also interesting that your generic out-of-line semaphores are faster
given the fact that you muck around with EFLAGS and CLI/STI, and I don't.
Maybe I'm getting hit by an interrupt. I'll have to play around with it and
benchmark it again.

David


Attachments:
slowpath.c (1.94 kB)
rwsem implementations slowpath comparison C source code
slowpath.dis (5.35 kB)
disassembly of compiled slowpath.c

2001-04-23 01:12:58

by Andrea Arcangeli

Subject: Re: [PATCH] rw_semaphores, optimisations

On Mon, Apr 23, 2001 at 03:04:41AM +0200, Andrea Arcangeli wrote:
> that is supposed to be a performance optimization, I do the same too in my code.

Ah no, I see what you mean; yes, you are hurt by that. I'm waiting for your
#try3 against pre6, and by then I hope to be able to do a benchmark run of my
asm version too (I will grow the graph).

Andrea

2001-04-23 01:05:28

by Andrea Arcangeli

Subject: Re: [PATCH] rw_semaphores, optimisations

On Sun, Apr 22, 2001 at 11:52:29PM +0100, D . W . Howells wrote:
> Hello Andrea,
>
> Interesting benchmarks... did you compile the test programs with "make
> SCHED=yes" by any chance? Also what other software are you running?

No, I never tried SCHED=yes. However, in my modification of the rwsem-rw bench
I dropped the #ifdef SCHED completely, and I schedule the right way (checking
need_resched first) in a more interesting place (in the middle of the
critical section).

> The reason I ask is that running a full blown KDE setup running in the
> background, I get the following numbers on the rwsem-ro test (XADD optimised
> kernel):
>
> SCHED: 4615646, 4530769, 4534453 and 4628365
> no SCHED: 6311620, 6312776, 6327772 and 6325508

No, absolutely not: that machine has almost nothing but the kernel daemons
running in the background (even cron is disabled to make sure it doesn't screw
up the benchmarks). This is how the machine looks before running the bench.

andrea@laser:~ > ps xa
PID TTY STAT TIME COMMAND
1 ? S 0:03 init [2]
2 ? SW 0:00 [keventd]
3 ? SW 0:00 [kswapd]
4 ? SW 0:00 [kreclaimd]
5 ? SW 0:00 [bdflush]
6 ? SW 0:00 [kupdated]
7 ? SW< 0:00 [mdrecoveryd]
123 ? S 0:00 /sbin/dhcpcd -d eth0
150 ? S 0:00 /sbin/portmap
168 ? S 0:00 /usr/sbin/syslogd -m 1000
172 ? S 0:00 /usr/sbin/klogd -c 5
220 ? S 0:00 /usr/sbin/sshd
254 ? S 0:00 /usr/sbin/automount /misc file /etc/auto.misc
256 ? S 0:00 /usr/sbin/automount /net program /etc/auto.net
271 ? S 0:00 /usr/sbin/rpc.kstatd
276 ? S 0:00 /usr/sbin/rpc.kmountd
278 ? SW 0:00 [nfsd]
279 ? SW 0:00 [nfsd]
280 ? SW 0:00 [nfsd]
281 ? SW 0:00 [nfsd]
282 ? SW 0:00 [lockd]
283 ? SW 0:00 [rpciod]
459 ? S 0:00 /usr/sbin/inetd
461 tty1 S 0:00 /sbin/mingetty --noclear tty1
462 tty2 S 0:00 /sbin/mingetty tty2
463 tty3 S 0:00 /sbin/mingetty tty3
464 tty4 S 0:00 /sbin/mingetty tty4
465 tty5 S 0:00 /sbin/mingetty tty5
466 tty6 S 0:00 /sbin/mingetty tty6
1177 ? S 0:00 in.rlogind
1178 pts/0 S 0:00 login -- andrea
1179 pts/0 S 0:00 -bash
1186 pts/0 R 0:00 ps xa
andrea@laser:~ >

> > (ah and btw the machine is a 2-way PII 450mhz).
>
> Your numbers were "4274607" and "4280280" for this kernel and test This I
> find a little suprising. I'd expect them to be about 10% higher than I get on
> my machine given your faster CPUs.
>
> What compiler are you using? I'm using the following:
>
> Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
> gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-80)

andrea@athlon:~ > gcc -v
Reading specs from /home/andrea/bin/i686/gcc-2_95-branch-20010325/lib/gcc-lib/i686-pc-linux-gnu/2.95.4/specs
gcc version 2.95.4 20010319 (prerelease)
andrea@athlon:~ >

Ah, and btw, I also used __builtin_expect in all the fast paths, but the hints
were compiled out by the preprocessor because I'm compiling with a gcc older
than 2.96 (see the sketch below).
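
Just to show what I mean, this is the kind of guard involved (the macro names
below are mine, only for illustration):

/* __builtin_expect only exists from gcc 2.96 onwards, so on an older
   compiler (like the 2.95.4 snapshot above) the hints vanish at
   preprocessing time and cost nothing */
#if __GNUC__ > 2 || (__GNUC__ == 2 && __GNUC_MINOR__ >= 96)
#define expect_true(x)   __builtin_expect(!!(x), 1)  /* branch usually taken */
#define expect_false(x)  __builtin_expect(!!(x), 0)  /* branch rarely taken */
#else
#define expect_true(x)   (x)
#define expect_false(x)  (x)
#endif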

> Something else that I noticed: Playing a music CD appears to improve the
> benchmarks all round:-) Must be some interrupt effect of some sort, or maybe
> they just like the music...

The machine is a test box without a soundcard, and the disk was idle.

> > rwsem-2.4.4-pre6 + my new generic rwsem (fast path in C inlined)
>
> Linus wants out of line generic code only, I believe. Hence why I made my
> generic code out of line.

I also did a run with my code out of line, of course, and as you can see
it's not a significant penalty.

> I have noticed one glaring potential slowdown in my generic code's down
> functions. I've got the following in _both_ fastpaths!:
>
> struct task_struct *tsk = current;

That is supposed to be a performance optimization; I do the same in my code.

> It's also interesting that your generic out-of-line semaphores are faster
> given the fact that you muck around with EFLAGS and CLI/STI, and I don't.

As I said in my last email, I changed the semantics so that you cannot call
up_* from irq context anymore; in short, I'm not mucking with cli/sti/eflags
anymore (see the sketch below).
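
A minimal sketch of what that buys (illustrative code and field names, not my
actual implementation): the internal lock can be taken with a plain
spin_lock() instead of spin_lock_irqsave(), so the pushf/cli/popf traffic
disappears from every path.

#include <linux/spinlock.h>
#include <linux/list.h>

struct rwsem_sketch {
    long                count;      /* readers currently holding the sem */
    spinlock_t          wait_lock;
    struct list_head    wait_list;
};

static void up_read_sketch(struct rwsem_sketch *sem)
{
    /* no spin_lock_irqsave()/cli needed: up_*() can never run from
       irq context, so there is no self-deadlock to guard against */
    spin_lock(&sem->wait_lock);
    sem->count--;   /* drop one reader */
    /* ... wake the first waiter if we were the last one out ... */
    spin_unlock(&sem->wait_lock);
}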

Note that I haven't released anything but the bench yet; I am finishing
plugging an asm fast path on top of my slow path, and then I will run new
benchmarks and post some code.

But my generic semaphore is also smaller: it's 16 bytes in size even in SMP,
for both the asm-optimized rwsem and the C generic one (on 32-bit archs of
course; on 64-bit archs it's slightly bigger than 16 bytes).

Andrea