2001-04-25 20:08:07

by D.W.Howells

Subject: [PATCH] rw_semaphores, optimisations try #4

This patch (made against linux-2.4.4-pre6 + rwsem-opt3) somewhat improves
performance on the i386 XADD optimised implementation:

A patch against -pre6 can be obtained too:

ftp://infradead.org/pub/people/dwh/rwsem-pre6-opt4.diff

Here are some benchmarks (take them with a pinch of salt, of course):

TEST            NUM READERS     NUM WRITERS     CONTENTION      DURATION
=============== =============== =============== =============== ========
rwsem-r1        1               0               no              10s
rwsem-r2        2               0               no              10s
rwsem-ro        4               0               no              10s
rwsem-w1        0               1               no              10s
rwsem-wo        0               4               w-w only        10s
rwsem-rw        4               2               r-w & w-w       10s
rwsem-worm      30              1               r-w & w-w       10s
rwsem-xx        30              15              r-w & w-w       50s

                rwsem-opt4 (mine)               00_rwsem-9 (Andrea's)
                ------------------------------- -------------------------------
TEST            READERS         WRITERS         READERS         WRITERS
=============== =============== =============== =============== ===============
rwsem-r1        30347096        n/a             30130004        n/a
                30362972        n/a             30127882        n/a
rwsem-r2        11031268        n/a             11035072        n/a
                11038232        n/a             11030787        n/a
rwsem-ro        12641408        n/a             12722192        n/a
                12636058        n/a             12729404        n/a
rwsem-w1        n/a             28607326        n/a             28505470
                n/a             28609208        n/a             28508206
rwsem-wo        n/a             1607789         n/a             1783876
                n/a             1608603         n/a             1800982
rwsem-rw        1111545         557071          1106763         554698
                1109773         555901          1103090         552567
rwsem-worm      5229696         54807           1585755         52438
                5219531         54528           1588428         52222
rwsem-xx        5396096         2786619         5361894         2768893
                5398443         2787613         5400716         2788801

I've compared my patch to Andrea's 00_rwsem-9, both built on top of
linux-2.4.4-pre6.

David


Attachments:
rwsem-opt4.diff (18.28 kB)
rw-semaphore, further optimisations #4

2001-04-25 20:56:57

by Andrea Arcangeli

Subject: Re: [PATCH] rw_semaphores, optimisations try #4

On Wed, Apr 25, 2001 at 09:06:38PM +0100, D.W.Howells wrote:
> This patch (made against linux-2.4.4-pre6 + rwsem-opt3) somewhat improves
> performance on the i386 XADD optimised implementation:

It seems more similar to my code btw (you finally killed the useless
cmpxchg ;).

I only had a short look at your attached patch, but the results are quite
suspect to my eyes, because we should still be equally fast in the fast
path, and I should still beat you on the write fast path because I do a
much faster "subl; js" while you do "movl -1; xadd; js"; yet according to
your results you beat me on both. Do you have an explanation, or don't
you know the reason either? I will re-benchmark the whole thing shortly.
But before re-benchmarking, if you have time, could you fix the benchmark
to use the variable pointer and send me a new tarball? For your code it
probably doesn't matter, because you dereference the pointer by hand
anyway, but it matters for mine, and we want to benchmark the real-world
fast path of course.

Andrea

2001-04-26 07:39:32

by David Howells

Subject: Re: [PATCH] rw_semaphores, optimisations try #4

Andrea Arcangeli <[email protected]> wrote:
> It seems more similar to my code btw (you finally killed the useless
> cmpxchg ;).

CMPXCHG ought to make things better by avoiding the XADD(+1)/XADD(-1) loop;
however, I tried various combinations, and XADD beats CMPXCHG significantly.

Here's a quote from a Borland assembler manual I managed to dig out, giving
i486 timings on memory access:

	ADDL/SUBL	3 cycles
	XADDL		4 cycles
	CMPXCHG		8 cycles (success) / 10 cycles (failure)
	LOCK		+1 cycle minimum on this CPU

In reality, XADDL gives at least as good a result as ADDL/SUBL, maybe just a
little bit better, but it's hard to say. However, the penalty imposed on the
other CPU (when it has to flush its cache) probably more than makes up for
the difference.

> I only had a short look at your attached patch, but the results are quite
> suspect to my eyes, because we should still be equally fast in the fast
> path, and I should still beat you on the write fast path because I do a
> much faster "subl; js" while you do "movl -1; xadd; js"; yet according to
> your results you beat me on both. Do you have an explanation, or don't
> you know the reason either?

MOVL $1,EDX
SUBL EDX,(EAX)

Works out faster than:

SUBL $1,(EAX)

as well... probably due to an avoided stall when the instruction before the
snippet loads EAX from memory. Oh yes... "STC, SUBL" may be faster too.

> I will re-benchmark the whole thing shortly. But before re-benchmarking,
> if you have time, could you fix the benchmark to use the variable pointer
> and send me a new tarball? For your code it probably doesn't matter,
> because you dereference the pointer by hand anyway, but it matters for
> mine, and we want to benchmark the real-world fast path of course.

No, not till this evening now, I'm afraid.

As for real-world benchmarks, I suspect the fastpath is going to be
sufficiently few cycles that it's drowned out by whatever bit of code is
actually using it, like my Wine server module, which is where all this started
for me.

David