2006-01-04 14:42:13

by Ingo Molnar

Subject: [patch 00/21] mutex subsystem, -V14


this is version 14 of the generic mutex subsystem, against v2.6.15.

The patch-queue consists of 21 patches, which can also be downloaded
from:

http://redhat.com/~mingo/generic-mutex-subsystem/

Changes since -V13:

13 files changed, 113 insertions(+), 85 deletions(-)

- VFS: converted sb->s_lock to a mutex too. This improves the
create+unlink benchmark on ext3.

- further simplification of __mutex_lock_common(): no more gotos, and
only one atomic_xchg() is done. Code size is now extremely small on
both UP and SMP:

   text    data     bss     dec     hex  filename
    398       0       0     398     18e  mutex.o.UP

   text    data     bss     dec     hex  filename
    463       0       0     463     1cf  mutex.o.SMP

- synchro-test updates: max # of threads of 64, fairness stats,
better defaults if =y, cleanups. (David Howells, me)

- FASTCALL -> fastcall in mutex.h (suggested by David Howells)

- documentation updates

Ingo


2006-01-04 23:45:47

by Joel Schopp

Subject: Re: [patch 00/21] mutex subsystem, -V14

> this is version 14 of the generic mutex subsystem, against v2.6.15.
>
> The patch-queue consists of 21 patches, which can also be downloaded
> from:
>
> http://redhat.com/~mingo/generic-mutex-subsystem/
>

Took a glance at this on ppc64. Would it be useful if I contributed an arch
specific version like arm has? We'll either need an arch specific version or
have the generic changed.

Anyway, here is some disassembly of some of the code generated with my comments:

c00000000049bf9c <.mutex_lock>:
c00000000049bf9c: 7c 00 06 ac eieio
c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3
c00000000049bfa4: 31 29 ff ff addic r9,r9,-1
c00000000049bfa8: 7d 20 19 2d stwcx. r9,r0,r3
c00000000049bfac: 40 c2 ff f4 bne+ c00000000049bfa0 <.mutex_lock+0x4>
c00000000049bfb0: 4c 00 01 2c isync
c00000000049bfb4: 7d 20 4b 78 mr r0,r9
c00000000049bfb8: 78 09 0f e3 rldicl. r9,r0,33,63
c00000000049bfbc: 4d 82 00 20 beqlr
c00000000049bfc0: 4b ff ff 58 b c00000000049bf18
<.__mutex_lock_noinline>

The eieio is completely unnecessary; it got picked up from atomic_dec_return.
(Anton, why is there an eieio at the start of atomic_dec_return in the first
place?)

Ignore the + on the bne; the disassembler is wrong, it is really a -.

c00000000049bfc4 <.mutex_unlock>:
c00000000049bfc4: 7c 00 06 ac eieio
c00000000049bfc8: 7d 20 18 28 lwarx r9,r0,r3
c00000000049bfcc: 31 29 00 01 addic r9,r9,1
c00000000049bfd0: 7d 20 19 2d stwcx. r9,r0,r3
c00000000049bfd4: 40 c2 ff f4 bne+ c00000000049bfc8 <.mutex_unlock+0x4>
c00000000049bfd8: 4c 00 01 2c isync
c00000000049bfdc: 7d 20 07 b4 extsw r0,r9
c00000000049bfe0: 2c 00 00 00 cmpwi r0,0
c00000000049bfe4: 4d 81 00 20 bgtlr
c00000000049bfe8: 4b ff ff 40 b c00000000049bf28
<.__mutex_unlock_noinline>

That eieio should be an lwsync to avoid data corruption. And I think the isync
is superfluous.

Ditto the disassembler being wrong about the + vs -.

2006-01-05 02:38:32

by Nicolas Pitre

Subject: Re: [patch 00/21] mutex subsystem, -V14

On Wed, 4 Jan 2006, Joel Schopp wrote:

> > this is version 14 of the generic mutex subsystem, against v2.6.15.
> >
> > The patch-queue consists of 21 patches, which can also be downloaded from:
> >
> > http://redhat.com/~mingo/generic-mutex-subsystem/
> >
>
> Took a glance at this on ppc64. Would it be useful if I contributed an arch
> specific version like arm has? We'll either need an arch specific version or
> have the generic changed.

Don't change the generic version. You should provide a ppc specific
version if the generic ones don't look so good.


Nicolas

2006-01-05 02:52:47

by Linus Torvalds

Subject: Re: [patch 00/21] mutex subsystem, -V14



On Wed, 4 Jan 2006, Nicolas Pitre wrote:

> On Wed, 4 Jan 2006, Joel Schopp wrote:
>
> > > this is version 14 of the generic mutex subsystem, against v2.6.15.
> > >
> > > The patch-queue consists of 21 patches, which can also be downloaded from:
> > >
> > > http://redhat.com/~mingo/generic-mutex-subsystem/
> > >
> >
> > Took a glance at this on ppc64. Would it be useful if I contributed an arch
> > specific version like arm has? We'll either need an arch specific version or
> > have the generic changed.
>
> Don't change the generic version. You should provide a ppc specific
> version if the generic ones don't look so good.

Well, if the generic one generates _buggy_ code on ppc64, that means that
either the generic version is buggy, or one of the atomics that it uses is
buggily implemented on ppc64.

And I think it's the generic mutex stuff that is buggy. It seems to assume
memory barriers that aren't valid to assume.

A mutex is more than just updating the mutex count properly. You also have
to have the proper memory barriers there to make sure that the things that
the mutex _protects_ actually stay inside the mutex.

So while a ppc64-optimized mutex is probably a good idea per se, I think
the generic mutex code had better be fixed first and regardless of any
optimized version.

On x86/x86-64, the locked instructions automatically imply the proper
memory barriers, but that was just lucky, I think.

Linus

2006-01-05 03:21:56

by Nick Piggin

Subject: Re: [patch 00/21] mutex subsystem, -V14

Linus Torvalds wrote:
>
> On Wed, 4 Jan 2006, Nicolas Pitre wrote:
>
>
>>On Wed, 4 Jan 2006, Joel Schopp wrote:
>>
>>
>>>>this is version 14 of the generic mutex subsystem, against v2.6.15.
>>>>
>>>>The patch-queue consists of 21 patches, which can also be downloaded from:
>>>>
>>>> http://redhat.com/~mingo/generic-mutex-subsystem/
>>>>
>>>
>>>Took a glance at this on ppc64. Would it be useful if I contributed an arch
>>>specific version like arm has? We'll either need an arch specific version or
>>>have the generic changed.
>>
>>Don't change the generic version. You should provide a ppc specific
>>version if the generic ones don't look so good.
>
>
> Well, if the generic one generates _buggy_ code on ppc64, that means that
> either the generic version is buggy, or one of the atomics that it uses is
> buggily implemented on ppc64.
>
> And I think it's the generic mutex stuff that is buggy. It seems to assume
> memory barriers that aren't valid to assume.
>
> A mutex is more than just updating the mutex count properly. You also have
> to have the proper memory barriers there to make sure that the things that
> the mutex _protects_ actually stay inside the mutex.
>
> So while a ppc64-optimized mutex is probably a good idea per se, I think
> the generic mutex code had better be fixed first and regardless of any
> optimized version.
>
> On x86/x86-64, the locked instructions automatically imply the proper
> memory barriers, but that was just lucky, I think.
>

I think the generic code is correct according to Documentation/atomic_ops.txt
which basically defines any atomic_xxx operation which both modifies its
operand and returns something to have a full memory barrier before and after
its load/store operations.
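
For reference, a minimal sketch of the rule being cited (generic kernel-style
code, not from this thread; the 2.6-era helpers smp_mb__before_atomic_dec()
and smp_mb__after_atomic_dec() are the documented companions for the ops that
do not return a value):

#include <asm/atomic.h>
#include <asm/system.h>		/* smp_mb() and friends in 2.6-era kernels */

static void dec_without_implied_barrier(atomic_t *v)
{
	/* plain atomic_dec() implies no barrier; order it explicitly */
	smp_mb__before_atomic_dec();
	atomic_dec(v);
	smp_mb__after_atomic_dec();
}

static int dec_with_implied_barrier(atomic_t *v)
{
	/* a value-returning op is documented to imply a full memory
	 * barrier before and after its load/store */
	return atomic_dec_return(v);
}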

Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems to
be that it allows loads to be reordered before stores, but that's OK with
smp_wmb, right?

And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does
not? And rmb() does lwsync, which apparently does not order IO at all...

--
SUSE Labs, Novell Inc.


2006-01-05 03:48:55

by Anton Blanchard

Subject: Re: [patch 00/21] mutex subsystem, -V14


Hi,

> Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems
> to be that it allows loads to be reordered before stores, but that's
> OK with smp_wmb, right?

lwsync implies more ordering than eieio and so may take longer. lwsync
orders everything except store - load, and eieio just orders store - store.

On power3 an lwsync is a full sync which takes forever, although in
newer chips both lwsync and eieio tend to take the same number of
cycles.

> And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does
> not? And rmb() does lwsync, which apparently does not order IO at all...

Because people love to abuse the barrier macros :) grep for wmb in
drivers/net and look for the number of places wmb() is being used to
order memory and IO. Architecturally eieio is a store - store ordering
for IO and memory but not between the two. sync is slow but does
guarantee this.
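
To make that concrete, the usual driver pattern looks roughly like the sketch
below (not code from this thread; all names are invented). The driver wants
the descriptor store to system memory to be visible before the doorbell store
to I/O space, and spells that with wmb() - which is exactly the memory-vs-IO
ordering a plain eieio would not give you:

#include <linux/types.h>
#include <asm/io.h>		/* writel() */
#include <asm/system.h>		/* wmb() in 2.6-era kernels */

struct fake_desc {
	volatile u32 status;		/* descriptor in system memory */
};

#define FAKE_DESC_READY	1

static void fake_ring_doorbell(struct fake_desc *desc, void __iomem *doorbell)
{
	desc->status = FAKE_DESC_READY;	/* store to cacheable memory */
	wmb();				/* must order the store above against
					 * the I/O-space store below */
	writel(1, doorbell);		/* store to device I/O space */
}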

SGI's mmiowb() might be useful for some of these cases, but every time it's
brought up everyone ends up confused as to its real use.

Really we should have io_*mb() and smp_*mb(). At that stage we may even
be able to kill the base *mb() macros.

Anton

2006-01-05 14:35:41

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Joel Schopp <[email protected]> wrote:

> Took a glance at this on ppc64. Would it be useful if I contributed
> an arch specific version like arm has? We'll either need an arch
> specific version or have the generic changed.

feel free to implement an assembly mutex fastpath, and it would
certainly be welcome and useful - but i think you are wrong about SMP
synchronization:

> Anyway, here is some disassembly of some of the code generated with my
> comments:
>
> c00000000049bf9c <.mutex_lock>:
> c00000000049bf9c: 7c 00 06 ac eieio
> c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3
> c00000000049bfa4: 31 29 ff ff addic r9,r9,-1

> The eieio is completely unnecessary; it got picked up from
> atomic_dec_return (Anton, why is there an eieio at the start of
> atomic_dec_return in the first place?).

a mutex is like a spinlock, it must prevent loads and stores within the
critical section from 'leaking outside the critical section' [they must
not be reordered to before the mutex_lock(), nor to after the
mutex_unlock()] - hence the barriers added by atomic_dec_return() are
very much needed.
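
in other words (a trivial sketch, not from the patch queue; the lock and
counter names are invented, assuming the DEFINE_MUTEX() initializer from this
patch set):

#include <linux/mutex.h>

static DEFINE_MUTEX(example_lock);	/* hypothetical lock */
static int shared_count;		/* protected by example_lock */

static void bump_shared_count(void)
{
	mutex_lock(&example_lock);
	/* this store must not become visible before the lock is owned,
	 * nor after the unlock is visible to other CPUs - that is what
	 * the barriers in the fastpath are for */
	shared_count++;
	mutex_unlock(&example_lock);
}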

Ingo

2006-01-05 14:41:03

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Linus Torvalds <[email protected]> wrote:

> Well, if the generic one generates _buggy_ code on ppc64, that means
> that either the generic version is buggy, or one of the atomics that
> it uses is buggily implemented on ppc64.
>
> And I think it's the generic mutex stuff that is buggy. It seems to
> assume memory barriers that aren't valid to assume.

in those headers i'm only using atomic_dec_return(), atomic_cmpxchg()
and atomic_xchg() - all of which imply a barrier. It is
atomic_inc/add/sub/dec() that doesn't default to an implied SMP barrier.

but it's certainly not documented too well. If atomic_dec_return() is
not supposed to imply a barrier, then all the affected architectures
(sparc64, ppc64, mips, alpha, etc.) are overdoing synchronization
currently: they all have barriers for these primitives. [They also have
an implicit barrier for atomic_dec_test() - which is being relied on for
correctness - no kernel code adds an smp_mb__before_atomic_dec() barrier
around atomic_dec_test().]

the patch below adds the barriers to the asm-generic mutex routines, so
it's not like i'm lazy ;), but i really think this is unnecessary.
Adding this patch would add a second, unnecessary barrier for all the
arches that have barrier-less atomic ops.

it also makes sense: the moment you are interested in the 'previous
value' of the atomic counter in an atomic fashion, you very likely want
to use it for a critical section. (e.g. all the put-the-resource ops
that use atomic_dec_test() rely on this implicit barrier.)
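
for reference, the put-the-resource pattern alluded to above looks roughly
like this (a generic sketch, not code from this thread; the struct and
function names are invented, using the real atomic_dec_and_test() helper):

#include <asm/atomic.h>
#include <linux/slab.h>

struct resource_obj {
	atomic_t refcount;
	/* ... payload ... */
};

static void resource_put(struct resource_obj *obj)
{
	/* the implied full barrier makes sure all earlier accesses to
	 * *obj have completed before whoever drops the last reference
	 * frees the object */
	if (atomic_dec_and_test(&obj->refcount))
		kfree(obj);
}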

Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -19,6 +19,7 @@
*/
#define __mutex_fastpath_lock(count, fail_fn) \
do { \
+ smp_mb__before_atomic_dec(); \
if (unlikely(atomic_dec_return(count) < 0)) \
fail_fn(count); \
} while (0)
@@ -36,6 +37,7 @@ do { \
static inline int
__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
{
+ smp_mb__before_atomic_dec();
if (unlikely(atomic_dec_return(count) < 0))
return fail_fn(count);
else
@@ -59,6 +61,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
do { \
if (unlikely(atomic_inc_return(count) <= 0)) \
fail_fn(count); \
+ else \
+ smp_mb__after_atomic_inc(); \
} while (0)

#define __mutex_slowpath_needs_to_unlock() 1
@@ -92,6 +96,7 @@ __mutex_fastpath_trylock(atomic_t *count
* the mutex state would be.
*/
#ifdef __HAVE_ARCH_CMPXCHG
+ smp_mb__before_atomic_dec();
if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
return 1;
return 0;
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -24,6 +24,7 @@
*/
#define __mutex_fastpath_lock(count, fail_fn) \
do { \
+ smp_mb__before_atomic_dec(); \
if (unlikely(atomic_xchg(count, 0) != 1)) \
fail_fn(count); \
} while (0)
@@ -42,6 +43,7 @@ do { \
static inline int
__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
{
+ smp_mb__before_atomic_dec();
if (unlikely(atomic_xchg(count, 0) != 1))
return fail_fn(count);
else
@@ -64,6 +66,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
do { \
if (unlikely(atomic_xchg(count, 1) != 0)) \
fail_fn(count); \
+ else \
+ smp_mb__after_atomic_inc(); \
} while (0)

#define __mutex_slowpath_needs_to_unlock() 0
@@ -86,7 +90,10 @@ do { \
static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
{
- int prev = atomic_xchg(count, 0);
+ int prev;
+
+ smp_mb__before_atomic_dec();
+ prev = atomic_xchg(count, 0);

if (unlikely(prev < 0)) {
/*

2006-01-05 16:22:08

by Linus Torvalds

Subject: Re: [patch 00/21] mutex subsystem, -V14



On Thu, 5 Jan 2006, Ingo Molnar wrote:
>
> the patch below adds the barriers to the asm-generic mutex routines, so
> it's not like i'm lazy ;), but i really think this is unnecessary.
> Adding this patch would add a second, unnecessary barrier for all the
> arches that have barrier-less atomic ops.
>
> it also makes sense: the moment you are interested in the 'previous
> value' of the atomic counter in an atomic fashion, you very likely want
> to use it for a critical section. (e.g. all the put-the-resource ops
> that use atomic_dec_test() rely on this implicit barrier.)

Ok, fair enough. However, that still leaves the question of which way the
barrier works. Traditionally, we have only cared about one thing: that all
preceding writes have finished, because the "atomic_dec_return" thing is
used as a _reference_counter_, and we're going to release the thing.

However, that's not the case in a mutex. A mutex locking operation works
exactly the other way around: it doesn't really care about the previous
writes at all, since those operations were unlocked. It cares about the
_subsequent_ writes, since those have to be seen by others as being in the
critical region, and never be seen as happening before the lock.

So I _think_ your argument is bogus, and your patch is bogus. The use of
"atomic_dec_return()" in a mutex is _not_ the same barrier as using it for
reference counting. Not at all. Memory barriers aren't just one thing:
they are semi-permeable things in two different directions and with two
different operations: there are several different kinds of them.

> #define __mutex_fastpath_lock(count, fail_fn) \
> do { \
> + smp_mb__before_atomic_dec(); \
> if (unlikely(atomic_dec_return(count) < 0)) \
> fail_fn(count); \
> } while (0)

So I think the barrier has to come _after_ the atomic decrement (or
exchange).

Because as it is written now, any writes in the locked region could
percolate up to just before the atomic dec - ie _outside_ the region.
Which is against the whole point of a lock - it would allow another CPU to
see the write even before it sees that the lock was successful, as far as
I can tell.

But memory ordering is subtle, so maybe I'm missing something..
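
Concretely, the placement argued for above would look something like this (a
sketch only, in the style of the asm-generic headers; Ingo's actual follow-up
patch appears later in the thread):

#define __mutex_fastpath_lock(count, fail_fn)				\
do {									\
	if (unlikely(atomic_dec_return(count) < 0))			\
		fail_fn(count);						\
	smp_mb();	/* acquire: keep the critical section below */	\
} while (0)

#define __mutex_fastpath_unlock(count, fail_fn)				\
do {									\
	smp_mb();	/* release: keep the critical section above */	\
	if (unlikely(atomic_inc_return(count) <= 0))			\
		fail_fn(count);						\
} while (0)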

Linus

2006-01-05 16:42:29

by Joel Schopp

Subject: Re: [patch 00/21] mutex subsystem, -V14


>>Anyway, here is some disassembly of some of the code generated with my
>>comments:
>>
>>c00000000049bf9c <.mutex_lock>:
>>c00000000049bf9c: 7c 00 06 ac eieio
>>c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3
>>c00000000049bfa4: 31 29 ff ff addic r9,r9,-1
>
>
>>The eieio is completely unnecessary; it got picked up from
>>atomic_dec_return (Anton, why is there an eieio at the start of
>>atomic_dec_return in the first place?).
>
>
> a mutex is like a spinlock, it must prevent loads and stores within the
> critical section from 'leaking outside the critical section' [they must
> not be reordered to before the mutex_lock(), nor to after the
> mutex_unlock()] - hence the barriers added by atomic_dec_return() are
> very much needed.

The bne- and isync together form a sufficient import barrier. See PowerPC Book2
Appendix B.2.1.1

And if the eieio was necessary it should come after, not before, twiddling the
lock bits.

2006-01-05 18:04:40

by Jesse Barnes

Subject: Re: [patch 00/21] mutex subsystem, -V14

On Wednesday, January 4, 2006 7:39 pm, Anton Blanchard wrote:
> SGI's mmiowb() might be useful for some of these cases, but every time
> it's brought up everyone ends up confused as to its real use.

It's documented in Documentation/DocBook/deviceiobook.tmpl. If the
documentation isn't clear, we should fix it, rather than avoid using the
primitive altogether. If drivers/net really means mmiowb() in some
places, we should change it, and like you said maybe get rid of some of
these primitives so that their usage is clearer.

Jesse

2006-01-05 22:03:37

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Linus Torvalds <[email protected]> wrote:

> On Thu, 5 Jan 2006, Ingo Molnar wrote:
> >
> > the patch below adds the barriers to the asm-generic mutex routines, so
> > it's not like i'm lazy ;), but i really think this is unnecessary.
> > Adding this patch would add a second, unnecessary barrier for all the
> > arches that have barrier-less atomic ops.
> >
> > it also makes sense: the moment you are interested in the 'previous
> > value' of the atomic counter in an atomic fashion, you very likely want
> > to use it for a critical section. (e.g. all the put-the-resource ops
> > that use atomic_dec_test() rely on this implicit barrier.)
>
> Ok, fair enough. However, that still leaves the question of which way
> the barrier works. Traditionally, we have only cared about one thing:
> that all preceding writes have finished, because the
> "atomic_dec_return" thing is used as a _reference_counter_, and we're
> going to release the thing.

yeah, i think you are right. Here's a detailed analysis of why you are
right about atomic_dec_return():

there are 8 types of instruction reordering that can happen at the
beginning and at the end of a critical section. Firstly, here's the
programmed order of instructions:

pre-read
pre-write
[critical-START]
critical-read
critical-write
[critical-END]
post-read
post-write

the following reorderings are possible:

a pre-read crossing forwards into (and over) the critical section: OK
a pre-write crossing forwards into (and over) the critical section: OK
a critical-read crossing backwards across critical-START: BAD
a critical-write crossing backwards across critical-START: BAD
a critical-read crossing forwards across critical-END: BAD
a critical-write crossing forwards across critical-END: BAD
a post-read crossing backwards into (and over) the critical section: OK
a post-write crossing backwards into (and over) the critical section: OK

so critical-START needs to be a read and a write barrier for reads and
writes happening after it. I.e. it's a memory barrier that only lets
instruction reordering in a forward direction, not in a backwards
direction.

critical-END needs to be a read and write barrier that only lets
instructions into the critical section in a backwards direction - and
it's a full memory barrier otherwise.

AFAICS, currently we don't have such a 'half-conductor / diode' memory
barrier primitive: smp_mb() is a full barrier in both directions. But
let's assume they existed, and let's call them smp_mb_forwards() and
smp_mb_backwards().

furthermore, the locking and unlocking instruction must not cross into
the critical section, so the lock sequence must be at least:

lock
smp_rmb_for_lock_forwards()
smp_mb_backwards()
... critical section ...
smp_mb_forwards()
smp_wmb_for_lock_backwards()
unlock

and yet another requirement is that two subsequent critical sections for
the same lock must not reorder the 'lock' and 'unlock' instructions:

smp_mb_forwards()
unlock
... unlocked code ...
lock
smp_mb_backwards()

i.e. 'lock' and 'unlock' must not switch places. So the most relaxed
list of requirements would be:

smp_wmb_for_lock_backwards()
lock
smp_wmb_for_lock_forwards()
smp_mb_backwards()
... critical section ...
smp_mb_forwards()
smp_wmb_for_lock_backwards()
unlock

i also think this is the 'absolute minimum memory ordering requirement'
for critical sections: relaxing this any further is not possible without
breaking critical sections.

i doubt many (in fact, any) CPUs are capable of expressing it in such a
fine-grained way. With our existing primitives, probably the closest one
would be:

lock
smp_mb();
...
smp_mb();
unlock

as long as the CPU always executes the lock and unlock stores (which go
to the same address) in program order. (is there any CPU that doesn't do
that?)

in that sense, both atomic_dec_return() and atomic_inc_return() are
indeed incorrect (for the use of mutexes), e.g. on ppc64. They are both
done via:

eieio
... atomic-dec ...
isync

eieio is stronger than smp_wmb() - it is a barrier for system memory
and IO-space memory writes. isync is a read barrier - it throws away all
speculative register contents. So it is roughly equivalent to:

smp_wmb();
... atomic-dec ...
smp_rmb();

this fulfills the requirement of the critical section not leaking out of
the lock sequence itself, but (if i got the ppc64 semantics right) it
doesn't prevent a write within the critical section from crossing out over
the smp_rmb() and getting reordered with the atomic-dec - violating the
critical-section rules.

some other architectures are safe by accident, e.g. Alpha's
atomic_dec_return() does:

smp_mb();
... atomic-dec ...
smp_mb();

which is overkill, full read and write barrier on both sides.

Sparc64's atomic_dec_return() does yet another thing:

membar StoreLoad | LoadLoad
... atomic-load ...
... atomic-conditional-store ...
membar StoreLoad | StoreStore

AFAICS this violates the requirements: a load from within the critical
section may go before the atomic-conditional-store, and may thus be
executed before the critical section acquires the lock.

on MIPS, atomic_dec_return() does what is equivalent to:

... atomic-dec ...
smp_mb();

which is fine for a lock sequence, but atomic_inc_return() is not
adequate for an unlock sequence:

... atomic-inc ...
smp_mb();

because this allows reads and writes within the critical section to
reorder with the atomic-inc instructions.

to sum it up: atomic_dec/inc_return() alone is not enough to implement
critical sections, on a number of architectures. atomic_xchg() seems to
have similar problems too.

the patch below adds the smp_mb() barriers to the generic headers, which
should now fulfill all the ordering requirements, on every architecture.
It only relies on one property of the atomic primitives: that they won't
get reordered with respect to themselves, so an atomic_inc_ret() and an
atomic_dec_ret() cannot switch places.

Can you see any hole in this reasoning?

Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -21,6 +21,8 @@
do { \
if (unlikely(atomic_dec_return(count) < 0)) \
fail_fn(count); \
+ else \
+ smp_mb(); \
} while (0)

/**
@@ -38,8 +40,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
{
if (unlikely(atomic_dec_return(count) < 0))
return fail_fn(count);
- else
+ else {
+ smp_mb();
return 0;
+ }
}

/**
@@ -57,6 +61,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
*/
#define __mutex_fastpath_unlock(count, fail_fn) \
do { \
+ smp_mb(); \
if (unlikely(atomic_inc_return(count) <= 0)) \
fail_fn(count); \
} while (0)
@@ -92,8 +97,10 @@ __mutex_fastpath_trylock(atomic_t *count
* the mutex state would be.
*/
#ifdef __HAVE_ARCH_CMPXCHG
- if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
+ if (likely(atomic_cmpxchg(count, 1, 0)) == 1) {
+ smp_mb();
return 1;
+ }
return 0;
#else
return fail_fn(count);
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -26,6 +26,8 @@
do { \
if (unlikely(atomic_xchg(count, 0) != 1)) \
fail_fn(count); \
+ else \
+ smp_mb(); \
} while (0)


@@ -44,8 +46,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
{
if (unlikely(atomic_xchg(count, 0) != 1))
return fail_fn(count);
- else
+ else {
+ smp_mb();
return 0;
+ }
}

/**
@@ -62,6 +66,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
*/
#define __mutex_fastpath_unlock(count, fail_fn) \
do { \
+ smp_mb(); \
if (unlikely(atomic_xchg(count, 1) != 0)) \
fail_fn(count); \
} while (0)
@@ -104,6 +109,7 @@ __mutex_fastpath_trylock(atomic_t *count
if (prev < 0)
prev = 0;
}
+ smp_mb();

return prev;
}

2006-01-05 22:19:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 00/21] mutex subsystem, -V14



On Thu, 5 Jan 2006, Ingo Molnar wrote:
>
> [ long details removed ]
>
> to sum it up: atomic_dec/inc_return() alone is not enough to implement
> critical sections, on a number of architectures. atomic_xchg() seems to
> have similar problems too.

Yes.

> the patch below adds the smp_mb() barriers to the generic headers, which
> should now fulfill all the ordering requirements, on every architecture.
> It only relies on one property of the atomic primitives: that they won't
> get reordered with respect to themselves, so an atomic_inc_ret() and an
> atomic_dec_ret() cannot switch places.
>
> Can you see any hole in this reasoning?

No. The alternative is to just make the ordering requirements
for "atomic_dec_return()" and "atomic_xchg()" be absolute. Say that they
have to be full memory barriers, and push the problem into the low-level
architecture.

I _think_ your patch is the right approach, because most architectures are
likely to do their own fast-paths for mutexes, and as such the generic
ones are more of a template for how to do it, and hopefully aren't that
performance critical.

Linus

2006-01-05 22:21:42

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Joel Schopp <[email protected]> wrote:

> The bne- and isync together form a sufficient import barrier. See
> PowerPC Book2 Appendix B.2.1.1

ok. Please correct me if i'm wrong: the question is, could we on ppc64
use atomic_dec_return() for mutex_lock(), and atomic_inc_return() for
mutex_unlock()?

atomic_dec_return() does:

EIEIO_ON_SMP
"1: lwarx %0,0,%1 # atomic_dec_return\n\
addic %0,%0,-1\n"
PPC405_ERR77(0,%1)
" stwcx. %0,0,%1\n\
bne- 1b"
ISYNC_ON_SMP

the EIEIO_ON_SMP is in essence smp_wmb(), correct? (it's a bit stronger
because it also orders IO-space writes, but it doesn't impose any
restrictions on reads)

ISYNC_ON_SMP flushes all speculative reads currently in the queue - and
is hence a smp_rmb_backwards() primitive [per my previous mail] - but
does not affect writes - correct?

if that's the case, what prevents a store from within the critical
section going up to right after the EIEIO_ON_SMP, but before the
atomic-dec instructions? Do any of those instructions imply some
barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs),
and hence the store before the bne is an effective write-barrier?

Ingo

2006-01-05 22:44:17

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Linus Torvalds <[email protected]> wrote:

> I _think_ your patch is the right approach, because most architectures
> are likely to do their own fast-paths for mutexes, and as such the
> generic ones are more of a template for how to do it, and hopefully
> aren't that performance critical.

yeah, i think so too. We've got 3 architectures done in assembly so far,
and it seems people like optimizing such code. Also, since the generic
code does all the boring slowpath stuff, the architecture can
concentrate on the fun part alone: to make the fastpath really fast.
The generic code is still in full control of all the mutex semantics,
and can ignore asm/mutex.h when it wants/needs to. So i'm quite happy
with the current design and i'm not against more per-arch assembly fun,
at all.

there's one exception i think: atomic-xchg.h was pretty optimal on ARM,
and i'd expect it to be pretty optimal on the other atomic-swap
platforms too. So maybe we should specify atomic_xchg() to _not_ imply a
full barrier - it's a new API anyway. We cannot embed the barrier
within atomic_xchg(), because the barrier is needed at different ends
for lock and for unlock, and adding two barriers would be unnecessary.

asm-generic/mutex-dec.h is less optimal (and thus less critical), and i
can see no easy way to modify it, because i think it would be quite
confusing to enforce 'lock' ordering for atomic_dec_return(), and
'unlock' ordering for atomic_inc_return(). We cannot remove the existing
barriers either (and add them explicitly), because there are existing
users of these primitives. (although we could add explicit barriers to
those too - but probably not worth the trouble)

Ingo

2006-01-05 23:06:46

by Joel Schopp

Subject: Re: [patch 00/21] mutex subsystem, -V14
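
Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC.
It is completely untested, but compiles.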

Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
===================================================================
--- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600
+++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-05 16:25:41.%N -0600
@@ -1,9 +1,83 @@
/*
- * Pull in the generic implementation for the mutex fastpath.
+ * include/asm-powerpc/mutex.h
*
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * PowerPC optimized mutex locking primitives
+ *
+ * Please look into asm-generic/mutex-xchg.h for a formal definition.
+ * Copyright (C) 2006 Joel Schopp <[email protected]>, IBM
*/
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+#define __mutex_fastpath_lock(count, fail_fn)\
+do{ \
+ long tmp; \
+ __asm__ __volatile__( \
+"1: lwarx %0,0,%1\n" \
+" addic %0,%0,-1\n" \
+" stwcx. %0,0,%1\n" \
+" bne- 1b\n" \
+" isync \n" \
+ : "=&r" (tmp) \
+ : "r" (&(count)->counter) \
+ : "cr0", "memory"); \
+ if (unlikely(tmp < 0)) \
+ fail_fn(count); \
+} while (0)
+
+#define __mutex_fastpath_unlock(count, fail_fn)\
+do{ \
+ long tmp; \
+ __asm__ __volatile__(SYNC_ON_SMP \
+"1: lwarx %0,0,%1\n" \
+" addic %0,%0,1\n" \
+" stwcx. %0,0,%1\n" \
+" bne- 1b\n" \
+ : "=&r" (tmp) \
+ : "r" (&(count)->counter) \
+ : "cr0", "memory"); \
+ if (unlikely(tmp <= 0)) \
+ fail_fn(count); \
+} while (0)
+
+
+static inline int
+__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))
+{
+ long tmp;
+ __asm__ __volatile__(
+"1: lwarx %0,0,%1\n"
+" cmpwi 0,%0,1\n"
+" bne- 2f\n"
+" stwcx. %0,0,%1\n"
+" bne- 1b\n"
+" isync\n"
+"2:"
+ : "=&r" (tmp)
+ : "r" (&(count)->counter)
+ : "cr0", "memory");
+
+ return (int)tmp;
+
+}
+
+#define __mutex_slowpath_needs_to_unlock() 1

-#include <asm-generic/mutex-dec.h>
+static inline int
+__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))
+{
+ long tmp;
+ __asm__ __volatile__(
+"1: lwarx %0,0,%1\n"
+" addic %0,%0,-1\n"
+" stwcx. %0,0,%1\n"
+" bne- 1b\n"
+" isync \n"
+ : "=&r" (tmp)
+ : "r" (&(count)->counter)
+ : "cr0", "memory");
+ if (unlikely(tmp < 0))
+ return fail_fn(count);
+ else
+ return 0;
+}
+#endif



2006-01-05 23:28:04

by Linus Torvalds

Subject: Re: [patch 00/21] mutex subsystem, -V14



On Thu, 5 Jan 2006, Joel Schopp wrote:
>
> Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC.
> It is completely untested, but compiles.

Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it,
since DMA will never matter, and interrupts are precise.

Linus

2006-01-05 23:37:24

by Joel Schopp

Subject: Re: [patch 00/21] mutex subsystem, -V14

>>Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC.
>>It is completely untested, but compiles.
>
>
> Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it,
> since DMA will never matter, and interrupts are precise.
>
> Linus
>

I think the isync is necessary to keep heavily out of order processors from
getting ahead of themselves even on UP. Scanning back through the powerpc
spinlock code they seem to take the same view there as well.

2006-01-05 23:42:54

by Ingo Molnar

Subject: Re: [patch 00/21] mutex subsystem, -V14


* Joel Schopp <[email protected]> wrote:

> > Shouldn't you make that "isync" dependent on SMP too? UP doesn't
> > need it, since DMA will never matter, and interrupts are precise.
>
> I think the isync is necessary to keep heavily out of order processors
> from getting ahead of themselves even on UP. Scanning back through
> the powerpc spinlock code they seem to take the same view there as
> well.

the asm/spinlock.h ops are only built on SMP kernels. mutex.h is for
both UP and SMP. On UP you should need no synchronization, because the
only way another context could interfere with your critical section is
by getting interrupted, and interrupts are fully synchronizing, right?
On UP the only synchronization needed is when a device reads/writes
memory in parallel to the CPU.

Ingo

2006-01-06 00:30:20

by Olof Johansson

Subject: Re: [patch 00/21] mutex subsystem, -V14

On Thu, Jan 05, 2006 at 05:06:26PM -0600, Joel Schopp wrote:

> Here is a first pass at a powerpc file for the fast paths just as an
> FYI/RFC. It is completely untested, but compiles.

You really should test it; it saves reviewers' time. It's not that hard
to at least try booting it.

Besides the isync comments earlier, there are a bunch of whitespace issues
going on. Did you copy and paste the code from somewhere? If so, you
should move the original copyright over too.

All your macros use spaces instead of tabs up to the \; they should be
changed.

All tmp variables should be ints, since the atomic_t counter is a 32-bit
variable. If you use longs, and lwarx (which loads 32 bits without sign
extension), the comparison with < 0 will never be true.
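
A quick userspace illustration of that point (assuming the LP64 ppc64 ABI;
not code from this thread):

#include <stdio.h>

int main(void)
{
	/* lwarx zero-extends the 32-bit word into a 64-bit register, so
	 * a counter value of -1 shows up as 0xffffffff when kept in a
	 * long, while the same bits reinterpreted as an int are -1 */
	long as_long = 0xffffffffUL;
	int as_int = (int)as_long;

	printf("as long: %ld, < 0? %d\n", as_long, as_long < 0);
	printf("as int:  %d, < 0? %d\n", as_int, as_int < 0);
	return 0;
}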

> Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
> ===================================================================
> --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600
> +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-05 16:25:41.%N -0600
> @@ -1,9 +1,83 @@
> /*
> - * Pull in the generic implementation for the mutex fastpath.
> + * include/asm-powerpc/mutex.h

No need to keep filenames in files.

> *
> - * TODO: implement optimized primitives instead, or leave the generic
> - * implementation in place, or pick the atomic_xchg() based generic
> - * implementation. (see asm-generic/mutex-xchg.h for details)
> + * PowerPC optimized mutex locking primitives
> + *
> + * Please look into asm-generic/mutex-xchg.h for a formal definition.
> + * Copyright (C) 2006 Joel Schopp <[email protected]>, IBM
> */
> +#ifndef _ASM_MUTEX_H
> +#define _ASM_MUTEX_H
> +#define __mutex_fastpath_lock(count, fail_fn)\
> +do{ \
> + long tmp; \
> + __asm__ __volatile__( \
> +"1: lwarx %0,0,%1\n" \
> +" addic %0,%0,-1\n" \
> +" stwcx. %0,0,%1\n" \
> +" bne- 1b\n" \
> +" isync \n" \
> + : "=&r" (tmp) \
> + : "r" (&(count)->counter) \
> + : "cr0", "memory"); \
> + if (unlikely(tmp < 0)) \
> + fail_fn(count); \
> +} while (0)

trailing whitespace

> +
> +#define __mutex_fastpath_unlock(count, fail_fn)\
> +do{ \
> + long tmp; \
> + __asm__ __volatile__(SYNC_ON_SMP \
> +"1: lwarx %0,0,%1\n" \
> +" addic %0,%0,1\n" \
> +" stwcx. %0,0,%1\n" \
> +" bne- 1b\n" \

space vs tab

> + : "=&r" (tmp) \
> + : "r" (&(count)->counter) \
> + : "cr0", "memory"); \
> + if (unlikely(tmp <= 0)) \
> + fail_fn(count); \
> +} while (0)
> +
> +
> +static inline int

trailing whitespace

> +__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))

atomic_t *count

> +{
> + long tmp;
> + __asm__ __volatile__(
> +"1: lwarx %0,0,%1\n"
> +" cmpwi 0,%0,1\n"
> +" bne- 2f\n"
> +" stwcx. %0,0,%1\n"

space vs tab on the above 4 lines

Shouldn't you decrement the counter before the store?

> +" bne- 1b\n"
> +" isync\n"
> +"2:"
> + : "=&r" (tmp)
> + : "r" (&(count)->counter)
> + : "cr0", "memory");
> +
> + return (int)tmp;
> +
> +}
> +
> +#define __mutex_slowpath_needs_to_unlock() 1
>
> -#include <asm-generic/mutex-dec.h>
> +static inline int

trailing whitespace

> +__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))

atomic_t *count

> +{
> + long tmp;

counter is a 32-bit variable, so tmp should be too; otherwise the < 0
comparison can never be true.

> + __asm__ __volatile__(
> +"1: lwarx %0,0,%1\n"
> +" addic %0,%0,-1\n"
> +" stwcx. %0,0,%1\n"
> +" bne- 1b\n"
> +" isync \n"
> + : "=&r" (tmp)
> + : "r" (&(count)->counter)
> + : "cr0", "memory");
> + if (unlikely(tmp < 0))
> + return fail_fn(count);
> + else
> + return 0;
> +}
> +#endif

2006-01-06 03:51:52

by Keith Owens

Subject: Re: [patch 00/21] mutex subsystem, -V14

Ingo Molnar (on Thu, 5 Jan 2006 23:43:57 +0100) wrote:
>there's one exception i think: atomic-xchg.h was pretty optimal on ARM,
>and i'd expect it to be pretty optimal on the other atomic-swap
>platforms too. So maybe we should specify atomic_xchg() to _not_ imply a
>full barrier - it's a new API anyway. We cannot embed the barrier
>within atomic_xchg(), because the barrier is needed at different ends
>for lock and for unlock, and adding two barriers would be unnecessary.

IA64 defines two qualifiers for cmpxchg, specifically for
distinguishing between acquiring and releasing the lock.

cmpxchg<sz>.<sem>

<sz> is the data size, 1, 2, 4 or 8. <sem> is one of 'acq' or 'rel'.

sem        Ordering   Semaphore Operation
Completer  Semantics

acq        Acquire    The memory read/write is made visible prior to
                      all subsequent data memory accesses.
rel        Release    The memory read/write is made visible after all
                      previous data memory accesses.

cmpxchg4.acq prevents following data accesses from migrating before
taking the lock (critical R/W cannot precede critical-START).
cmpxchg4.rel prevents preceding data accesses from migrating after
releasing the lock (critical R/W cannot follow critical-END). I
suggest adding acq and rel hints to atomic_xchg, and letting architectures
that implement suitable operations use those hints.
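
As a purely hypothetical sketch of that suggestion (no such acq/rel-flavoured
helpers exist in this kernel; the names below are invented), the xchg-based
fastpath could then drop its explicit barriers:

/*
 * Hypothetical: atomic_xchg_acq() orders like cmpxchg4.acq (no later
 * access can move before it), atomic_xchg_rel() like cmpxchg4.rel (no
 * earlier access can move after it).
 */
#define __mutex_fastpath_lock(count, fail_fn)				\
do {									\
	if (unlikely(atomic_xchg_acq(count, 0) != 1))			\
		fail_fn(count);						\
} while (0)

#define __mutex_fastpath_unlock(count, fail_fn)				\
do {									\
	if (unlikely(atomic_xchg_rel(count, 1) != 0))			\
		fail_fn(count);						\
} while (0)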

2006-01-06 07:35:10

by Denis Vlasenko

Subject: Re: [patch 00/21] mutex subsystem, -V14

On Thursday 05 January 2006 18:21, Linus Torvalds wrote:
> On Thu, 5 Jan 2006, Ingo Molnar wrote:
> >
> > the patch below adds the barriers to the asm-generic mutex routines, so
> > it's not like i'm lazy ;), but i really think this is unnecessary.
> > Adding this patch would add a second, unnecessary barrier for all the
> > arches that have barrier-less atomic ops.
> >
> > it also makes sense: the moment you are interested in the 'previous
> > value' of the atomic counter in an atomic fashion, you very likely want
> > to use it for a critical section. (e.g. all the put-the-resource ops
> > that use atomic_dec_test() rely on this implicit barrier.)
>
> So I _think_ your argument is bogus, and your patch is bogus. The use of
> "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for
> reference counting. Not at all. Memory barriers aren't just one thing:
> they are semi-permeable things in two different directions and with two
> different operations: there are several different kinds of them.
>
> > #define __mutex_fastpath_lock(count, fail_fn) \
> > do { \
> > + smp_mb__before_atomic_dec(); \
> > if (unlikely(atomic_dec_return(count) < 0)) \
> > fail_fn(count); \
> > } while (0)
>
> So I think the barrier has to come _after_ the atomic decrement (or
> exchange).
>
> Because as it is written now, any writes in the locked region could
> percolate up to just before the atomic dec - ie _outside_ the region.
> Which is against the whole point of a lock - it would allow another CPU to
> see the write even before it sees that the lock was successful, as far as
> I can tell.
>
> But memory ordering is subtle, so maybe I'm missing something..

We mere humans^W device driver people get more confused with barriers
every day, as CPUs get more subtle in their out-of-order-ness.

I think adding longer-named-but-self-explanatory aliases for memory and io
barrier functions can help.

mmiowb => barrier_memw_iow
.... => barrier_memw_memw (a store-store barrier to mem)
....

General template for the name may be something like

[smp]barrier_{mem,io,memio}{r,w,rw}_{mem,io,memio}{r,w,rw}

Are there even more subtle cases?
--
vda

2006-01-07 17:49:34

by Joel Schopp

Subject: PowerPC fastpaths for mutex subsystem

Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
===================================================================
--- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600
+++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-06 17:36:09.%N -0600
@@ -1,9 +1,84 @@
/*
- * Pull in the generic implementation for the mutex fastpath.
+ * include/asm-powerpc/mutex.h
*
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * PowerPC optimized mutex locking primitives
+ *
+ * Please look into asm-generic/mutex-xchg.h for a formal definition.
+ * Copyright (C) 2006 Joel Schopp <[email protected]>, IBM
*/
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+#define __mutex_fastpath_lock(count, fail_fn)\
+do{ \
+ int tmp; \
+ __asm__ __volatile__( \
+"1: lwarx %0,0,%1\n" \
+" addic %0,%0,-1\n" \
+" stwcx. %0,0,%1\n" \
+" bne- 1b\n" \
+ ISYNC_ON_SMP \
+ : "=&r" (tmp) \
+ : "r" (&(count)->counter) \
+ : "cr0", "memory"); \
+ if (unlikely(tmp < 0)) \
+ fail_fn(count); \
+} while (0)
+
+#define __mutex_fastpath_unlock(count, fail_fn)\
+do{ \
+ int tmp; \
+ __asm__ __volatile__(SYNC_ON_SMP\
+"1: lwarx %0,0,%1\n" \
+" addic %0,%0,1\n" \
+" stwcx. %0,0,%1\n" \
+" bne- 1b\n" \
+ : "=&r" (tmp) \
+ : "r" (&(count)->counter) \
+ : "cr0", "memory"); \
+ if (unlikely(tmp <= 0)) \
+ fail_fn(count); \
+} while (0)
+
+
+static inline int
+__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))
+{
+ int tmp;
+ __asm__ __volatile__(
+"1: lwarx %0,0,%1\n"
+" cmpwi 0,%0,1\n"
+" bne- 2f\n"
+" addic %0,%0,-1\n"
+" stwcx. %0,0,%1\n"
+" bne- 1b\n"
+" isync\n"
+"2:"
+ : "=&r" (tmp)
+ : "r" (&(count)->counter)
+ : "cr0", "memory");
+
+ return (int)tmp;
+
+}
+
+#define __mutex_slowpath_needs_to_unlock() 1

-#include <asm-generic/mutex-dec.h>
+static inline int
+__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))
+{
+ int tmp;
+ __asm__ __volatile__(
+"1: lwarx %0,0,%1\n"
+" addic %0,%0,-1\n"
+" stwcx. %0,0,%1\n"
+" bne- 1b\n"
+" isync \n"
+ : "=&r" (tmp)
+ : "r" (&(count)->counter)
+ : "cr0", "memory");
+ if (unlikely(tmp < 0))
+ return fail_fn(count);
+ else
+ return 0;
+}
+#endif



2006-01-07 22:38:28

by Andrew Morton

Subject: Re: PowerPC fastpaths for mutex subsystem

Joel Schopp <[email protected]> wrote:
>
> This is the second pass at optimizing the fastpath for the new mutex subsystem
> on PowerPC. I think it is ready to be included in the series with the other
> mutex patches now. Tested on a 4 core (2 SMT threads/core) Power5 machine with
> gcc 3.3.2.
>
> Test results from synchro-test.ko:
>
> All tests run for default 5 seconds
> Threads    semaphores    mutexes       mutexes+attached
>  1         63,465,364    58,404,630    62,109,571
>  4         58,424,282    35,541,297    37,820,794
>  8         40,731,668    35,541,297    40,281,768
> 16         38,372,769    37,256,298    41,751,764
> 32         38,406,895    36,933,675    38,731,571
> 64         37,232,017    36,222,480    40,766,379
>

Doesn't this mean that the sped-up mutexes are still slower than semaphores?

2006-01-08 07:54:21

by Anton Blanchard

Subject: Re: PowerPC fastpaths for mutex subsystem


> Doesn't this mean that the sped-up mutexes are still slower than semaphores?

Wasn't most of the x86 mutex gain a result of going from fair to unfair
operation? The current ppc64 semaphores are unfair.

Anton

2006-01-08 08:01:14

by Andrew Morton

Subject: Re: PowerPC fastpaths for mutex subsystem

Anton Blanchard <[email protected]> wrote:
>
>
> > Doesn't this mean that the sped-up mutexes are still slower than semaphores?
>
> Wasn't most of the x86 mutex gain a result of going from fair to unfair
> operation? The current ppc64 semaphores are unfair.
>

What's "unfair"? Mutexes are FIFO, as are x86 semaphores.

2006-01-08 08:30:40

by Anton Blanchard

Subject: Re: PowerPC fastpaths for mutex subsystem


> What's "unfair"? Mutexes are FIFO, as are x86 semaphores.

The ppc64 semaphores don't force everyone into the slow path under
contention. So you could drop and pick up the semaphore even with
someone waiting. I thought that's how the new mutex code worked.

Anton

2006-01-08 09:49:20

by Ingo Molnar

Subject: Re: PowerPC fastpaths for mutex subsystem


* Joel Schopp <[email protected]> wrote:

> Tested on a 4 core (2 SMT threads/core) Power5 machine with gcc 3.3.2.

> Test results from synchro-test.ko:
>
> All tests run for default 5 seconds
> Threads    semaphores    mutexes       mutexes+attached
>  1         63,465,364    58,404,630    62,109,571
>  4         58,424,282    35,541,297    37,820,794
>  8         40,731,668    35,541,297    40,281,768
> 16         38,372,769    37,256,298    41,751,764
> 32         38,406,895    36,933,675    38,731,571
> 64         37,232,017    36,222,480    40,766,379

interesting. Could you try two things? Firstly, could you add some
minimal delays to the lock/unlock path, of at least 1 usec? E.g.
"synchro-test.ko load=1 interval=1". [but you could try longer delays
too, 10 usecs is still realistic.]

secondly, could you try the VFS creat+unlink test via the test-mutex.c
code below, with something like:

./test-mutex V 16 10

(this tests with 16 tasks, for 10 seconds.) You'll get a useful ops/sec
number out of this test, but the other stats will only be calculated if
you implement the rdtsc() macro to read cycles - right now it defaults
to 'always 0' on ppc; i386 and ia64 have it implemented. Also, beware
that the default atomic_inc()/dec() is unsafe (only i386 and ia64 have
the real thing implemented); you might want to add a safe PPC
implementation.
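
for what it's worth, an untested sketch of a ppc64 branch that could be added
to the #if chain in test-mutex.c (ldarx/stdcx. on the program's 64-bit
counters; the acquire/release barriers the LOCK() loop would also want on
ppc64 are left out here):

static inline void atomic_inc(unsigned long *flag)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldarx	%0,0,%1\n"
"	addi	%0,%0,1\n"
"	stdcx.	%0,0,%1\n"
"	bne-	1b"
	: "=&r" (tmp) : "r" (flag) : "cc", "memory");
}

static inline void atomic_dec(unsigned long *flag)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldarx	%0,0,%1\n"
"	addi	%0,%0,-1\n"
"	stdcx.	%0,0,%1\n"
"	bne-	1b"
	: "=&r" (tmp) : "r" (flag) : "cc", "memory");
}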

thirdly, could you run 'vmstat 1' during the tests, and post those lines
too? Here i'm curious about two things: the average runqueue length
(whether we have overscheduling), and CPU utilization and idle time left
(how efficiently cycles are preserved in contention). [btw., does ppc
have an idle=poll equivalent mode of idling?]

also, there seems to be some fluctuation in the numbers - could you try
to run a few more to see how stable the numbers are?

Ingo

------------
/*
* Copyright (C) 2005, Ingo Molnar <[email protected]>
*/

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>
#include <unistd.h>
#include <string.h>
#include <pwd.h>
#include <grp.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <regex.h>
#include <fcntl.h>
#include <time.h>
#include <sys/mman.h>
#include <dlfcn.h>
#include <popt.h>
#include <sys/socket.h>
#include <ctype.h>
#include <assert.h>
#include <sched.h>

#ifdef __ia64__
#include <sys/ioctl.h>
#include "mmtimer.h"
int mmtimer_fd;

unsigned long __mm_timer_clock_res;
unsigned long *__mm_clock_dev;
unsigned long __mm_clock_offset;
#endif

unsigned long *shared;

#define mutex_lock() gettimeofday((void *)0, (void *)10)
#define mutex_unlock() gettimeofday((void *)0, (void *)20)
#define down() gettimeofday((void *)0, (void *)100)
#define up() gettimeofday((void *)0, (void *)200)
#define down_write() gettimeofday((void *)0, (void *)1000)
#define up_write() gettimeofday((void *)0, (void *)2000)
#define down_read() gettimeofday((void *)0, (void *)10000)
#define up_read() gettimeofday((void *)0, (void *)20000)

/*
* Shared locks and variables between the test tasks:
*/
#define CACHELINE_SIZE (128/sizeof(long))

enum {
SHARED_DELTA_SUM = 0*CACHELINE_SIZE,
SHARED_DELTA_MAX = 1*CACHELINE_SIZE,
SHARED_DELTA2_SUM = 2*CACHELINE_SIZE,
SHARED_DELTA2_MAX = 3*CACHELINE_SIZE,
SHARED_DELTA3_SUM = 4*CACHELINE_SIZE,
SHARED_DELTA3_MAX = 5*CACHELINE_SIZE,
SHARED_DELTA_DELTA_SUM = 6*CACHELINE_SIZE,
SHARED_COUNT = 7*CACHELINE_SIZE,
SHARED_SUM = 8*CACHELINE_SIZE,
SHARED_LOCK = 9*CACHELINE_SIZE,
SHARED_END = 10*CACHELINE_SIZE,
};

#define SHARED(x) (*(shared + SHARED_##x))
#define SHARED_LL(x) (*(unsigned long long *)(shared + SHARED_##x))

#define BUG_ON(c) assert(!(c))

static unsigned long *setup_shared_var(void)
{
char zerobuff [4096] = { 0, };
int ret, fd;
unsigned long *buf;
char tmpfile[100];

sprintf(tmpfile, ".tmp_mmap-%d", getpid());

fd = creat(tmpfile, 0700);

BUG_ON(fd == -1);
close(fd);

fd = open(tmpfile, O_RDWR|O_CREAT|O_TRUNC);
unlink(tmpfile);
BUG_ON(fd == -1);
ret = write(fd, zerobuff, 4096);
BUG_ON(ret != 4096);

buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
BUG_ON(buf == (void *)-1);

close(fd);

return buf;
}

#define LOOPS 10000

#ifdef __ia64__
static int setup_mmtimer(void)
{
unsigned long regoff;
int fd, _t;
size_t pagesize;

if ((fd = open ("/dev/mmtimer", O_RDONLY)) == -1)
perror("missing /dev/mmtimer");
else {
pagesize = getpagesize();
__mm_clock_dev = mmap(0, pagesize, PROT_READ,
MAP_SHARED, fd, 0);
if (__mm_clock_dev != MAP_FAILED) {
regoff = ioctl(fd, MMTIMER_GETOFFSET, 0);
if (regoff >= 0) {
__mm_clock_dev += regoff;
__mm_clock_offset = *__mm_clock_dev;
} else
perror("reg offset ioctl failed");
_t = ioctl(fd, MMTIMER_GETFREQ, &__mm_timer_clock_res);
if (_t)
perror("get freq ioctl fail");
}
}
}


#define ia64_fetchadd8_rel(p, inc) \
({ \
__u64 ia64_intri_res; \
asm volatile ("fetchadd8.rel %0=[%1],%2" \
: "=r"(ia64_intri_res) : "r"(p), "i" (inc) \
: "memory"); \
\
ia64_intri_res; \
})

static inline void atomic_inc(unsigned long *flag)
{
ia64_fetchadd8_rel(flag, 1);
}

static inline void atomic_dec(unsigned long *flag)
{
ia64_fetchadd8_rel(flag, -1);
}
#elif defined(__i386__)
static inline void atomic_inc(unsigned long *flag)
{
__asm__ __volatile__(
"lock; incl %0\n"
: "=g"(*flag) : : "memory");
}

static inline void atomic_dec(unsigned long *flag)
{
__asm__ __volatile__(
"lock; decl %0\n"
: "=g"(*flag) : : "memory");
}
#else
static inline void atomic_inc(unsigned long *flag)
{
++*flag;
}

static inline void atomic_dec(unsigned long *flag)
{
--*flag;
}
#endif

static void LOCK(unsigned long *shared)
{
for (;;) {
atomic_inc(&SHARED(LOCK));
if (SHARED(LOCK) == 1)
break;
atomic_dec(&SHARED(LOCK));
usleep(1);
}
}

static void UNLOCK(unsigned long *shared)
{
atomic_dec(&SHARED(LOCK));
}


static void sigint(int sig)
{
atomic_inc(&SHARED(END));
}

static void print_status(unsigned long *shared)
{
unsigned long count;

count = SHARED(COUNT);
SHARED(COUNT) = 0;
SHARED_LL(SUM) += count;

printf("\r| loops/sec: %ld \r", count);
fflush(stdout);
}

enum {
TYPE_MUTEX,
TYPE_SEM,
TYPE_RSEM,
TYPE_WSEM,
TYPE_VFS,
NR_TYPES
};

const char * type_names[NR_TYPES] =
{ "Mutex",
"Semaphore",
"RW-semaphore Read",
"RW-semaphore Write",
"VFS"
};


typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;

#ifdef __ia64__
# define rdtscll(val) \
do { \
val = *__mm_clock_dev; \
} while (0)
#elif defined(__i386__)
# define rdtscll(val) \
do { \
__asm__ __volatile__("rdtsc" : "=A" (val)); \
} while (0)
#else
# define rdtscll(val) \
do { (val) = 0LL; } while (0)
#endif

#define rdtod(val) \
do { \
struct timeval tv; \
\
gettimeofday(&tv, NULL); \
(val) = tv.tv_sec * 1000000ULL + tv.tv_usec; \
} while (0)

#define max(x,y) ({ \
typeof(x) _x = (x); \
typeof(y) _y = (y); \
(void) (&_x == &_y); \
_x > _y ? _x : _y; })

#define unlikely(x) __builtin_expect(!!(x), 0)

int main(int argc, char **argv)
{
int i, parent, me, first = 1;
unsigned long cpus, tasks, seconds = 0;
cycles_t t0, t01, t1, delta, delta2, delta3, delta_sum = 0,
delta2_sum = 0, delta3_sum = 0, delta_delta,
delta_delta_sum = 0, prev_delta,
delta_max = 0, delta2_max = 0, delta3_max = 0;
char str[100];
double freq;
int type;

if (argc <= 1 || argc > 4) {
usage:
fprintf(stderr,
"usage: test-mutex [Mutex|Sem|Rsem|Wsem|Vfs creat+unlink] <threads> <seconds>\n");
exit(-1);
usage2:
fprintf(stderr, "the Mutex/Sem/Rsem/Wsem tests are not available.\n");
goto usage;
}

switch (argv[1][0]) {
case 'M': type = TYPE_MUTEX; goto usage2; break;
case 'S': type = TYPE_SEM; goto usage2; break;
case 'R': type = TYPE_RSEM; goto usage2; break;
case 'W': type = TYPE_WSEM; goto usage2; break;
case 'V': type = TYPE_VFS; break;
default: goto usage;
}

system("rm -f /tmp/* 2>/dev/null >/dev/null");
cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
cpus = WEXITSTATUS(cpus);

tasks = cpus;
if (argc >= 3) {
tasks = atol(argv[2]);
if (!tasks)
goto usage;
}

if (argc >= 4)
seconds = atol(argv[3]);
else
seconds = -1;

#ifdef __ia64__
setup_mmtimer();
#endif

printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks);
printf("checking %s performance.\n", type_names[type]);

shared = setup_shared_var();

signal(SIGINT, sigint);
signal(SIGHUP, sigint);

parent = getpid();

for (i = 0; i < tasks; i++)
if (!fork())
break;
sleep(1);
me = getpid();
sprintf(str, "/tmp/tmp-%d", me);

if (me == parent) {
unsigned long long total_count;
int i = 0, j;

for (;;) {
sleep(1);
if (i == seconds || SHARED(END))
break;
i++;
print_status(shared);
}
atomic_inc(&SHARED(END));
total_count = SHARED(SUM);
for (j = 0; j < tasks; j++)
wait(NULL);

if (i)
printf("\navg ops/sec: %Ld\n", total_count / i);
LOCK(shared);
// printf("delta_sum: %Ld\n", SHARED_LL(DELTA_SUM));
// printf("delta_delta_sum: %Ld\n", SHARED_LL(DELTA_DELTA_SUM));
#ifdef __ia64__
freq = 25.0;
#else
freq = 700.0;
#endif

printf("average cost per op: %.2f usecs\n",
(double)SHARED_LL(DELTA_SUM)/total_count/freq);
printf("average cost per lock: %.2f usecs\n",
(double)SHARED_LL(DELTA2_SUM)/total_count/freq);
printf("average cost per unlock: %.2f usecs\n",
(double)SHARED_LL(DELTA3_SUM)/total_count/freq);
printf("max cost per op: %.2f usecs\n",
(double)SHARED_LL(DELTA_MAX)/freq);
printf("max cost per lock: %.2f usecs\n",
(double)SHARED_LL(DELTA2_MAX)/freq);
printf("max cost per unlock: %.2f usecs\n",
(double)SHARED_LL(DELTA3_MAX)/freq);
printf("average deviance per op: %.2f usecs\n",
(double)SHARED_LL(DELTA_DELTA_SUM)/total_count/freq/2.0);
UNLOCK(shared);
exit(0);
}

for (;;) {
rdtscll(t0);

switch (type) {
case TYPE_MUTEX:
mutex_lock();
rdtscll(t01);
mutex_unlock();
break;
case TYPE_SEM:
down();
rdtscll(t01);
up();
break;
case TYPE_RSEM:
down_read();
rdtscll(t01);
up_read();
break;
case TYPE_WSEM:
down_write();
rdtscll(t01);
up_write();
break;
case TYPE_VFS:
{
int fd;

fd = creat(str, S_IRWXU);
rdtscll(t01);
close(fd);

break;
}
}
rdtscll(t1);

delta = t1-t0;
if (unlikely(delta > delta_max))
delta_max = delta;
delta_sum += delta;

delta2 = t01-t0;
if (unlikely(delta2 > delta2_max))
delta2_max = delta2;
delta2_sum += delta2;

delta3 = t1-t01;
if (unlikely(delta3 > delta3_max))
delta3_max = delta3;
delta3_sum += delta3;

if (!first) {
if (prev_delta < delta)
delta_delta = delta - prev_delta;
else
delta_delta = prev_delta - delta;
delta_delta_sum += delta_delta;
#if 0
printf("%Ld-%Ld {%Ld} prev: {%Ld} / [%Ld]\n",
t0, t1, delta, prev_delta, delta_delta);
printf(" {%Ld} - {%Ld}\n",
delta_sum, delta_delta_sum);
#endif
} else
first = 0;

prev_delta = delta;

atomic_inc(&SHARED(COUNT));

if (unlikely(SHARED(END))) {
LOCK(shared);
SHARED_LL(DELTA_SUM) += delta_sum;
SHARED_LL(DELTA_MAX) = max(SHARED_LL(DELTA_MAX),
delta_max);
SHARED_LL(DELTA2_SUM) += delta2_sum;
SHARED_LL(DELTA2_MAX) = max(SHARED_LL(DELTA2_MAX),
delta2_max);
SHARED_LL(DELTA3_SUM) += delta3_sum;
SHARED_LL(DELTA3_MAX) = max(SHARED_LL(DELTA3_MAX),
delta3_max);
SHARED_LL(DELTA_DELTA_SUM) += delta_delta_sum;
#if 0
printf("delta_sum: %Ld\n", delta_sum);
printf("delta_delta_sum: %Ld\n", delta_delta_sum);
printf("DELTA_SUM: %Ld\n", SHARED_LL(DELTA_SUM));
printf("DELTA_DELTA_SUM: %Ld\n", SHARED_LL(DELTA_DELTA_SUM));
#endif
UNLOCK(shared);

exit(0);
}
}

return 0;
}

2006-01-08 10:44:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem


looks good to me. Minor nit:

> +" isync\n"

> +" isync \n"

shouldn't these two be ISYNC_ON_SMP?
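
(For reference: ISYNC_ON_SMP expands to an isync only on SMP builds, so UP
kernels skip the barrier. The definition below is a paraphrase of the powerpc
barrier macros of that era, not the exact header text:)

#ifdef CONFIG_SMP
#define ISYNC_ON_SMP	"\n\tisync\n"	/* emit the acquire barrier on SMP */
#else
#define ISYNC_ON_SMP	""		/* UP kernels can omit it entirely */
#endif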

Ingo

2006-01-09 11:15:10

by David Howells

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem

Andrew Morton <[email protected]> wrote:

> > Wasnt most of the x86 mutex gain a result of going from fair to unfair
> > operation? The current ppc64 semaphores are unfair.
> >
>
> What's "unfair"? Mutexes are FIFO, as are x86 semaphores.

No, strictly speaking, Ingo's mutexes are neither completely fair nor completely FIFO.
It's possible for a process to jump the queue because unlock() always sets the
counter back to 1 before waking up the process at the front of the queue. This
means that the lock() fastpath in another process may steal the mutex out of
sequence before the wakee has a chance to grab it.
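
(For illustration only: a minimal userspace sketch of that window, using C11
atomics and a simplified counter protocol rather than the actual kernel code.)

#include <stdatomic.h>
#include <stdio.h>

static atomic_int count = 1;	/* 1 == unlocked, 0 == locked, <0 == waiters */

static int fastpath_lock(void)
{
	/* succeeds iff we were the one to take the count from 1 to 0 */
	return atomic_fetch_sub(&count, 1) == 1;
}

static void unlock(void)
{
	/* the count is published as 1 *before* the front waiter gets to run... */
	atomic_store(&count, 1);
	/* ...so a late fastpath_lock() here can steal the mutex ahead of the
	   process that unlock() is about to wake up */
}

int main(void)
{
	atomic_store(&count, 0);	/* some task currently owns the mutex */
	unlock();			/* owner releases; front waiter not yet running */
	printf("queue-jumped: %d\n", fastpath_lock());
	return 0;
}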

I'm not 100% convinced that x86 counting semaphores are completely fair or
completely FIFO. It's possible that they are because up() never arbitrarily
sets the count back to >0.

R/W semaphores are completely fair, but only as completely FIFO as the unfair
spinlocks permit. This is because it's much easier to guarantee their behaviour
(writer-starvation is a real problem with unfair rwsems). I have a simple
implementation of totally fair spinlocks for x86 which would also work on
anything that can emulate XADD, but I don't think it's worth the trouble.
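
(The XADD-based fair spinlock mentioned here is presumably along the lines of
the classic ticket lock; what follows is a hedged userspace sketch of that
idea, not David's actual implementation:)

#include <stdatomic.h>

/* Ticket lock: fetch-and-add (XADD) hands out strictly ordered tickets, so
   the lock is granted in FIFO order. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently allowed to hold the lock */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);	/* the XADD */

	while (atomic_load(&l->owner) != me)
		;	/* spin until our ticket comes up */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);	/* hand the lock to the next ticket */
}

int main(void)
{
	static struct ticket_lock l;	/* zero-initialized: ticket 0 runs first */

	ticket_lock(&l);
	ticket_unlock(&l);
	return 0;
}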

However, for Ingo's mutexes, I suspect this queue-jumping feature is
sufficiently low probability that we can ignore it. It is theoretically
possible to induce livelock, but in reality I think it extremely unlikely to
happen for any significant length of time.

David

2006-01-10 22:31:54

by Joel Schopp

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem

> interesting. Could you try two things? Firstly, could you add some
> minimal delays to the lock/unlock path, of at least 1 usec? E.g.
> "synchro-test.ko load=1 interval=1". [but you could try longer delays
> too, 10 usecs is still realistic.]

Graphs attached. The summary for those who don't like to look at attachments is
that the mutex fastpath (threads == 1), which I sent the optimized patch for, is
comparable to semaphores within the margin of error. The mutex common path
(threads > 1) gets embarrassed by semaphores. So the mutex common paths are not
yet ready as far as ppc64 is concerned.

>
> secondly, could you try the VFS creat+unlink test via the test-mutex.c
> code below, with something like:
>
> ./test-mutex V 16 10

Queued into my todo list.

>
> thirdly, could you run 'vmstat 1' during the tests, and post those lines
> too? Here i'm curious about two things: the average runqueue length
> (whether we have overscheduling), and CPU utilization and idle time left
> (how efficiently cycles are preserved in contention). [btw., does ppc
> have an idle=poll equivalent mode of idling?]

Also queued in my todo list.

>
> also, there seems to be some fluctuation in the numbers - could you try
> to run a few more to see how stable the numbers are?

For the graphs the line is the average of 5 runs, and the 5 runs are scatter
plotted as well.


Attachments:
semvsmux2.png (4.43 kB)
semvsmux3.png (4.37 kB)
semvsmux.png (4.69 kB)

2006-01-10 23:09:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem


* Joel Schopp <[email protected]> wrote:

> >interesting. Could you try two things? Firstly, could you add some
> >minimal delays to the lock/unlock path, of at least 1 usec? E.g.
> >"synchro-test.ko load=1 interval=1". [but you could try longer delays
> >too, 10 usecs is still realistic.]
>
> Graphs attached. The summary for those who don't like to look at
> attachments is that the mutex fastpath (threads == 1), which I sent the
> optimized patch for, is comparable to semaphores within the margin of
> error. The mutex common path (threads > 1) gets embarrassed by
> semaphores. So the mutex common paths are not yet ready as far as ppc64
> is concerned.

ok. I'll really need to look at "vmstat" output from these. We could
easily make the mutex slowpath behave like ppc64 semaphores, via the
attached (untested) patch, but i really think it's the wrong thing to
do, because it overloads the system with runnable tasks in an
essentially unlimited fashion [== overscheduling] - they'll all contend
for the same single mutex.

in synthetic workloads on idle systems such overscheduling can help,
because the 'luck factor' of the 'thundering herd' of tasks can generate
a higher total throughput - at the expense of system efficiency. At 8
CPUs i already measured a net performance loss at 3 tasks! So i think
the current 'at most 2 tasks runnable' approach of mutexes is the right
one on a broad range of hardware.

still, i'll try a different patch tomorrow, to keep the number of 'in
flight' tasks within a certain limit (say at 2) - i suspect that would
close the performance gap too, on this test.

but i really think the current 'at most one task in flight' logic is the
correct approach. I'm also curious about the VFS-test numbers (already
on your todo).

> >thirdly, could you run 'vmstat 1' during the tests, and post those lines
> >too? Here i'm curious about two things: the average runqueue length
> >(whether we have overscheduling), and CPU utilization and idle time left
> >(how efficiently cycles are preserved in contention). [btw., does ppc
> >have an idle=poll equivalent mode of idling?]
>
> Also queued in my todo list.

thanks!

> >also, there seems to be some fluctuation in the numbers - could you try
> >to run a few more to see how stable the numbers are?
>
> For the graphs the line is the average of 5 runs, and the 5 runs are
> scatter plotted as well.

ok, that should be more than enough.

Ingo

--- kernel/mutex.c.orig
+++ kernel/mutex.c
@@ -226,6 +226,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c

debug_mutex_wake_waiter(lock, waiter);

+ /* be (much) more agressive about wakeups: */
+ list_move_tail(&waiter.list, &lock->wait_list);
+
wake_up_process(waiter->task);
}

2006-01-11 10:52:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem


* Ingo Molnar <[email protected]> wrote:

> ok. I'll really need to look at "vmstat" output from these. We could
> easily make the mutex slowpath behave like ppc64 semaphores, via the
> attached (untested) patch, but i really think it's the wrong thing to
> do, because it overloads the system with runnable tasks in an
> essentially unlimited fashion [== overscheduling] - they'll all
> contend for the same single mutex.

find the working patch below. (the previous one had a syntax error)

Ingo

Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -227,6 +227,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c
debug_mutex_wake_waiter(lock, waiter);

wake_up_process(waiter->task);
+
+ /* be (much) more agressive about wakeups: */
+ list_move_tail(&waiter->list, &lock->wait_list);
}

debug_mutex_clear_owner(lock);

2006-01-11 17:45:14

by Joel Schopp

[permalink] [raw]
Subject: Re: PowerPC fastpaths for mutex subsystem

> ok. I'll really need to look at "vmstat" output from these. We could
> easily make the mutex slowpath behave like ppc64 semaphores, via the
> attached (untested) patch, but i really think it's the wrong thing to
> do, because it overloads the system with runnable tasks in an
> essentially unlimited fashion [== overscheduling] - they'll all contend
> for the same single mutex.
>
> in synthetic workloads on idle systems such overscheduling can help,
> because the 'luck factor' of the 'thundering herd' of tasks can generate
> a higher total throughput - at the expense of system efficiency. At 8
> CPUs i already measured a net performance loss at 3 tasks! So i think
> the current 'at most 2 tasks runnable' approach of mutexes is the right
> one on a broad range of hardware.
>
> still, i'll try a different patch tomorrow, to keep the number of 'in
> flight' tasks within a certain limit (say at 2) - i suspect that would
> close the performance gap too, on this test.

The fundamental problem is that there is a relatively large latency between
waking somebody up and that task actually running so it can acquire the lock.
In an ideal world there would always be exactly one waiter running and trying
to acquire the lock at the moment it is unlocked, and not running until then.

There are better solutions than just madly throwing more waiters in flight on an
unlock. Here are four possibilities off the top of my head:

1) It is possible to have a hybrid lock that spins a single waiting thread and
sleeps waiters 2..n, so there is always one waiter running and trying to acquire
the lock. It solves the latency problem whenever the lock is held for at least
as long as it takes to wake up the next waiter, but the spinning waiter burns
some cpu to buy the decreased latency.

2) You could also do the classic spin-for-a-while-then-sleep method. This
essentially turns low-latency locks into spinlocks while still sleeping on locks
that are held longer and/or are much more heavily contended. (A rough sketch of
this appears after the list.)

3) There is the option of looking at how idle the current cpu is and spinning
or sleeping based on that.

4) Accept that we have a cpu-efficient, high-latency lock and use it
appropriately.
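
(A hedged userspace sketch of the spin-then-sleep idea from option 2.
SPIN_LIMIT and the sched_yield() fallback are illustrative stand-ins; a real
implementation would block on a futex or waitqueue instead of yielding:)

#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT	200		/* arbitrary bound on busy-waiting */

static atomic_int lock_word;		/* 0 == unlocked, 1 == locked */

static void adaptive_lock(void)
{
	int spins = 0;

	for (;;) {
		int expected = 0;

		if (atomic_compare_exchange_weak(&lock_word, &expected, 1))
			return;		/* got the lock while spinning */
		if (++spins < SPIN_LIMIT)
			continue;	/* short hold times stay spinlock-cheap */
		sched_yield();		/* long holds / heavy contention: give up the cpu */
		spins = 0;
	}
}

static void adaptive_unlock(void)
{
	atomic_store(&lock_word, 0);
}

int main(void)
{
	adaptive_lock();
	adaptive_unlock();
	return 0;
}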

I'm not saying any of these four is what we should do. I'm just trying to say
there are options out there that don't involve thundering herds and luck to
address the problem.