2005-12-19 01:36:27

by Ingo Molnar

Subject: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


add two new atomic ops to x86_64: atomic_dec_call_if_negative() and
atomic_inc_call_if_nonpositive(), which atomically modify the counter
and call a function if the result is negative (resp. nonpositive).
Needed by the new mutex code.

Signed-off-by: Ingo Molnar <[email protected]>

----

include/asm-x86_64/atomic.h | 56 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 56 insertions(+)

Index: linux/include/asm-x86_64/atomic.h
===================================================================
--- linux.orig/include/asm-x86_64/atomic.h
+++ linux/include/asm-x86_64/atomic.h
@@ -203,6 +203,62 @@ static __inline__ int atomic_sub_return(
#define atomic_inc_return(v) (atomic_add_return(1,v))
#define atomic_dec_return(v) (atomic_sub_return(1,v))

+/**
+ * atomic_dec_call_if_negative - decrement and call function if negative
+ * @v: pointer of type atomic_t
+ * @fn: function to call if the result is negative
+ *
+ * Atomically decrements @v and calls a function if the result is negative.
+ * NOTE: the function is type-checked and must be a fastcall.
+ */
+#define atomic_dec_call_if_negative(v, fn_name) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = fn_name; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "decl (%%rdi)\n" \
+ "js 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#fn_name"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ : \
+ :"D" (v) \
+ :"memory"); \
+} while (0)
+
+/**
+ * atomic_inc_call_if_nonpositive - increment and call function if nonpositive
+ * @v: pointer of type atomic_t
+ * @fn: function to call if the result is nonpositive
+ *
+ * Atomically increments @v and calls a function if the result is nonpositive.
+ * NOTE: the function is type-checked and must be a fastcall.
+ */
+#define atomic_inc_call_if_nonpositive(v, fn_name) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = fn_name; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "incl (%%rdi)\n" \
+ "jle 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#fn_name"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ : \
+ :"D" (v) \
+ :"memory"); \
+} while (0)
+
/* An 64bit atomic type */

typedef struct { volatile long counter; } atomic64_t;


2005-12-19 17:44:15

by Zwane Mwaikambo

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Mon, 19 Dec 2005, Ingo Molnar wrote:

> +#define atomic_dec_call_if_negative(v, fn_name) \
> +do { \
> + fastcall void (*__tmp)(atomic_t *) = fn_name; \
> + \
> + (void)__tmp; \
> + typecheck(atomic_t *, v); \
> + \
> + __asm__ __volatile__( \
> + LOCK "decl (%%rdi)\n" \
> + "js 2f\n" \
> + "1:\n" \
> + LOCK_SECTION_START("") \
> + "2: call "#fn_name"\n\t" \
> + "jmp 1b\n" \
> + LOCK_SECTION_END \
> + : \
> + :"D" (v) \
> + :"memory"); \
> +} while (0)

Hi Ingo,
Doesn't this corrupt caller saved registers?

2005-12-19 20:59:16

by David Woodhouse

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> Hi Ingo,
> Doesn't this corrupt caller saved registers?

Looks like it. I _really_ don't like calling functions from inline asm.
It's not nice. Can't we use atomic_dec_return() for this?

--
dwmw2

2005-12-20 04:30:48

by Ingo Molnar

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


* Zwane Mwaikambo <[email protected]> wrote:

> On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> > +#define atomic_dec_call_if_negative(v, fn_name) \
> > +do { \
> > + fastcall void (*__tmp)(atomic_t *) = fn_name; \
> > + \
> > + (void)__tmp; \
> > + typecheck(atomic_t *, v); \
> > + \
> > + __asm__ __volatile__( \
> > + LOCK "decl (%%rdi)\n" \
> > + "js 2f\n" \
> > + "1:\n" \
> > + LOCK_SECTION_START("") \
> > + "2: call "#fn_name"\n\t" \
> > + "jmp 1b\n" \
> > + LOCK_SECTION_END \
> > + : \
> > + :"D" (v) \
> > + :"memory"); \
> > +} while (0)
>
> Hi Ingo,
> Doesn't this corrupt caller saved registers?

good catch - i correctly marked them clobbered on i386, but not on
x86_64. The function using this code uses nothing else, so this bug
caused no real corruption - but that was just pure luck.

Ingo

2005-12-20 04:31:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


* David Woodhouse <[email protected]> wrote:

> On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > Hi Ingo,
> > Doesn't this corrupt caller saved registers?
>
> Looks like it. I _really_ don't like calling functions from inline
> asm. It's not nice. Can't we use atomic_dec_return() for this?

we can use atomic_dec_return(), but that will add one more instruction
to the fastpath. OTOH, atomic_dec_return() is available on every
architecture, so it's a really tempting thing. I'll experiment with it.
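
for illustration, the atomic_dec_return() variant of the lock fastpath
would be something like this (just a sketch; __mutex_lock_nonatomic() is
the slowpath name used elsewhere in the mutex patches):

	static inline void mutex_lock(struct mutex *lock)
	{
		/* 1 -> 0 transition; a negative result means contention */
		if (unlikely(atomic_dec_return(&lock->count) < 0))
			__mutex_lock_nonatomic(lock);
	}

the extra instruction comes from having to test the value returned by
atomic_dec_return(), where the open-coded assembly branches directly on
the flags set by the locked decrement.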

Ingo

2005-12-20 06:30:17

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Ingo Molnar wrote:

>
> * David Woodhouse <[email protected]> wrote:
>
> > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > Hi Ingo,
> > > Doesn't this corrupt caller saved registers?
> >
> > Looks like it. I _really_ don't like calling functions from inline
> > asm. It's not nice. Can't we use atomic_dec_return() for this?
>
> we can use atomic_dec_return(), but that will add one more instruction
> to the fastpath. OTOH, atomic_dec_return() is available on every
> architecture, so it's a really tempting thing. I'll experiment with it.

Please consider using (a variant of) xchg() instead. Although
atomic_dec() is available on all architectures, its implementation is
far from being the most efficient thing to do for them all. For
example, see my discussion about swp on ARM:

http://www.ussg.iu.edu/hypermail/linux/kernel/0512.2/0727.html

What would be the most efficient implementation on ARM might look like:

static inline void mutex_lock(struct mutex *m)
{
if (unlikely(atomic_xchg(&m->count, 0) != 1))
__mutex_lock_nonatomic(m);
}

static inline void mutex_unlock(struct mutex *m)
{
if (unlikely(atomic_xchg(&m->count, 1) == -1))
__mutex_unlock_nonatomic(m);
}

Yet we might want to use special wrappers with non-standard calling
convention for getting to the contention handlers
(__mutex_lock_nonatomic() and __mutex_unlock_nonatomic() in this case).

Furthermore, trying to make atomic_inc_call_if_nonpositive() look like a
generic thing is rather counterproductive. It is likely to be useful
to the mutex code only anyway, so why not make it implicit what the
contention handlers are? This will allow each architecture to
implement the mutex interface with the best fast path it can come up
with.

I'd propose this:

First, rename some functions to make it clearer what they are used for:

s/__mutex_lock_nonatomic/__mutex_lock_contended/
s/__mutex_unlock_nonatomic/__mutex_unlock_contended/

Next, please make it possible for architecture-specific implementations
of mutex_lock(), mutex_unlock(), mutex_trylock(), and so on to exist.
A default implementation in include/asm-generic/mutex.h should probably
exist, and the two examples above are certainly a good start.

mutex_lock should have the following definition: it should try to lock
the mutex (set the count to 0, or any value < 1). If the count was 1
prior to setting it to 0 then the lock was successful and we're done.
Otherwise it should call __mutex_lock_contended. Knowing the previous
value and setting the new value must be done atomically. Whether it uses
atomic_xchg() or atomic_dec_return(), or even open codes some clever
assembly trick to achieve that, like your i386
atomic_dec_call_if_negative() implementation, is the architecture's own
business. I can imagine an ARM implementation whose inlined fast path
would be only 3 or 4 instructions. But that's possible only if the
implementation allows some flexibility in achieving the above.

Similarly, mutex_unlock should set the mutex count to 1. It also should
call __mutex_unlock_contended if the count wasn't 0 before. And again
the atomic nature of the count (which might be renamed to "state" since
"count" is misleading for a mutex) must be preserved of course. Again
the architecture is free to implement it in whatever way as long as it
has this behavior.
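
As an illustration of the above, here is how an atomic_dec_return() /
atomic_inc_return() based variant could look (a sketch only; the xchg()
examples above are equally valid, and the contended handlers are assumed
to repair the count):

	static inline void mutex_lock(struct mutex *m)
	{
		/* 1 = unlocked, 0 = locked, negative = contended */
		if (unlikely(atomic_dec_return(&m->count) < 0))
			__mutex_lock_contended(m);
	}

	static inline void mutex_unlock(struct mutex *m)
	{
		/* any result below 1 means there are waiters */
		if (unlikely(atomic_inc_return(&m->count) < 1))
			__mutex_unlock_contended(m);
	}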

The main header file for mutex users would be include/linux/mutex.h.
If mutex debugging is enabled, then the architecture fast path is
ignored, so mutex_lock() would simply become an alias for
__mutex_lock_contended() or whatever wrapper you have for that purpose.
Only if debugging is disabled is asm/mutex.h included from
linux/mutex.h to get the architecture fast path code (which possibly
includes asm-generic/mutex.h with reasonable reference/default
implementations).

What do you think?


Nicolas

2005-12-20 08:12:21

by Nick Piggin

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
>
>>* David Woodhouse <[email protected]> wrote:
>>
>>
>>>On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
>>>
>>>>Hi Ingo,
>>>> Doesn't this corrupt caller saved registers?
>>>
>>>Looks like it. I _really_ don't like calling functions from inline
>>>asm. It's not nice. Can't we use atomic_dec_return() for this?
>>
>>we can use atomic_dec_return(), but that will add one more instruction
>>to the fastpath. OTOH, atomic_dec_return() is available on every
>>architecture, so it's a really tempting thing. I'll experiment with it.
>
>
> Please consider using (a variant of) xchg() instead. Although
> atomic_dec() is available on all architectures, its implementation is
> far from being the most efficient thing to do for them all. For
> example, see my discussion about swp on ARM:
>

Considering that on UP, the arm should not need to disable interrupts
for this function (or has someone refuted Linus?), how about:

#ifndef CONFIG_SMP
typedef struct { volatile int counter; } mutex_counter_t;

static inline int mutex_counter_dec_return(mutex_counter_t *v)
{
	return --v->counter;
}

...
#else
#define mutex_counter_t atomic_t
...
#endif

Or does that get too bulky or have other problems?

MP ARMs should have an adequate atomic_dec_return.

--
SUSE Labs, Novell Inc.


2005-12-20 14:10:24

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Ingo Molnar wrote:
> >
> >
> > > * David Woodhouse <[email protected]> wrote:
> > >
> > >
> > > > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > >
> > > > > Hi Ingo,
> > > > > Doesn't this corrupt caller saved registers?
> > > >
> > > > Looks like it. I _really_ don't like calling functions from inline asm.
> > > > It's not nice. Can't we use atomic_dec_return() for this?
> > >
> > > we can use atomic_dec_return(), but that will add one more instruction to
> > > the fastpath. OTOH, atomic_dec_return() is available on every
> > > architecture, so it's a really tempting thing. I'll experiment with it.
> >
> >
> > Please consider using (a variant of) xchg() instead. Although atomic_dec()
> > is available on all architectures, its implementation is far from being the
> > most efficient thing to do for them all. For example, see my discussion
> > about swp on ARM:
> >
>
> Considering that on UP, the arm should not need to disable interrupts
> for this function (or has someone refuted Linus?), how about:

Kernel preemption.


Nicolas

2005-12-20 14:12:16

by Nick Piggin

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Nick Piggin wrote:

>>Considering that on UP, the arm should not need to disable interrupts
>>for this function (or has someone refuted Linus?), how about:
>
>
> Kernel preemption.
>

preempt_disable() ?

--
SUSE Labs, Novell Inc.


2005-12-20 14:35:10

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Nick Piggin wrote:
>
> > > Considering that on UP, the arm should not need to disable interrupts
> > > for this function (or has someone refuted Linus?), how about:
> >
> >
> > Kernel preemption.
> >
>
> preempt_disable() ?

Sure, and we're now more costly than the current implementation with irq
disabling.

If we go with simple mutexes that's because there is a gain, even a huge
one on ARM, especially for the fast uncontended case. If you guys
insist on making things so generic and rigid then there is no gain
anymore worth the bother.


Nicolas

2005-12-20 15:06:04

by Nick Piggin

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

Nicolas Pitre wrote:
> On Wed, 21 Dec 2005, Nick Piggin wrote:
>
>
>>Nicolas Pitre wrote:
>>
>>>On Tue, 20 Dec 2005, Nick Piggin wrote:
>>
>>>>Considering that on UP, the arm should not need to disable interrupts
>>>>for this function (or has someone refuted Linus?), how about:
>>>
>>>
>>>Kernel preemption.
>>>
>>
>>preempt_disable() ?
>
>
> Sure, and we're now more costly than the current implementation with irq
> disabling.
>

Why? It is just a plain increment of a location that will almost certainly
be in cache. I can't see how it would be more than half the cost of the
irq implementation (based on looking at your measurements). How do you
figure?

Also, preempt_disable is a very frequent operation on preempt kernels so
if you have CONFIG_PREEMPT then you don't care about preempt_disable in
down() (and if you do then you are calling down too often).

> If we go with simple mutexes that's because there is a gain, even a huge
> one on ARM, especially for the fast uncontended case. If you guys
> insist on making things so generic and rigid then there is no gain
> anymore worth the bother.
>

I guess there is no bother for you, but maintaining code for 1 generic
platform versus two dozen architectures is a huge win for many.

--
SUSE Labs, Novell Inc.


2005-12-20 16:36:22

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Wed, 21 Dec 2005, Nick Piggin wrote:
> >
> >
> > > Nicolas Pitre wrote:
> > >
> > > > On Tue, 20 Dec 2005, Nick Piggin wrote:
> > >
> > > > > Considering that on UP, the arm should not need to disable interrupts
> > > > > for this function (or has someone refuted Linus?), how about:
> > > >
> > > >
> > > > Kernel preemption.
> > > >
> > >
> > > preempt_disable() ?
> >
> >
> > Sure, and we're now more costly than the current implementation with irq
> > disabling.
> >
>
> Why? It is just a plain increment of a location that will almost certainly
> be in cache. I can't see how it would be more than half the cost of the
> irq implementation (based on looking at your measurements). How do you
> figure?

ARM is a load/store architecture, so the preempt_count has to be
loaded, incremented, and stored back (3 instructions just there). And
the preempt_count itself isn't directly addressable: it is reached
through the thread_info pointer, which requires 2 additional
instructions to generate. So we're already up to 5 instructions while
the disabling of interrupts takes 3.

OK now we want to decrement the semaphore count. One load, one
decrement, one store, i.e. 3 more instructions. We're up to 8.

Oh and we're still supposed to be in a fast path. And we even aren't
ready to decide if we acquired the semaphore just yet.

Now we have to call preempt_enable(). Given that preempt_count() will
use 2 instructions again to get to the thread_info pointer, let's say
for the demonstration that we instead open-coded our own
preempt_enable() and preempt_disable() so as to be able to cache the
thread_info pointer between the two. So let's forget about those 2
additional instructions.

Let's also suppose that we preserved the original preempt count so that
we don't have to re-load and decrement it, which would have cost 2 more
instructions. Good! But register pressure is increasing, and btw we're
using completely non-generic kernel infrastructure at this point.

We nevertheless still have to store the original preempt count back.
One instruction i.e. we're up to 9.

And with 9 instructions we're already worse than the implementation with
interrupt disabling. Maybe not in terms of cycles but hey we're not
done yet!

Any preempt_enable() equivalent looks at the TIF_NEED_RESCHED flag
(luckily we still have the thread_info pointer handy since we're not
using test_thread_flag() but loading the flag directly, i.e. one
instruction), tests it (one instruction), and conditionally branches to
preempt_schedule (one instruction). Now up to 12 instructions. And to
make things even worse: since the flag has to be tested right after it
has been loaded, you must account for result delays (more cycles) on
the load.

OK, did we acquire that damn semaphore or not? Oh since we clobbered
the processor's condition flag while testing the TIF_NEED_RESCHED flag
we are now forced to test that semaphore count again to see if it went
negative: one instruction. And finally branch to the contention routine
if the semaphore was already locked: one instruction.

So 14 instructions total with preemption disabling, and that's with the
best implementation possible by open coding everything instead of
relying on generic functions/macros.

Compare that to my 4 (or 3 when gcc can cse a constant)
instructions needed for the mutex equivalent.
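
For reference, the swp-based lock fast path could be as small as this
(an illustration only, assuming r0 holds the count pointer):

	mov	r1, #0
	swp	r1, r1, [r0]	@ r1 = old count, count = 0
	cmp	r1, #1		@ was it unlocked?
	blne	__mutex_lock_contended

(When gcc can cse a zero constant into some register the mov disappears,
which is the 3 instruction case.)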


Nicolas

2005-12-20 18:27:34

by Linus Torvalds

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch



On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Nick Piggin wrote:
>
> > > Considering that on UP, the arm should not need to disable interrupts
> > > for this function (or has someone refuted Linus?), how about:
> >
> >
> > Kernel preemption.
> >
>
> preempt_disable() ?

Yes. It should almost always be faster than physically disabling
interrupts anyway. Even if it's more instructions.

Linus

2005-12-20 18:29:03

by Linus Torvalds

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch



On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> Sure, and we're now more costly than the current implementation with irq
> disabling.

Do the timing. It may be more instructions, but I think it was you
yourself that timed the current thing at 23 cycles, no?

Special registers are almost always slower than nicely cached accesses
(and they all basically will be).

Linus

2005-12-20 19:20:45

by Russell King

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> So 14 instructions total with preemption disabling, and that's with the
> best implementation possible by open coding everything instead of
> relying on generic functions/macros.

I agree with your analysis Nicolas.

However, don't forget to compare this with our existing semaphore
implementation which is 9 instructions, or 8 for the SMP version.

In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
than our semaphores.

Forcing architectures down the "lets make everything generic" path
does not always hack it. It can't do by its very nature. Generic
implementations are *always* sub-optimal and it is pretty clear
that any gain that mutexes _may_ give is completely wasted on ARM
by the overhead of having a generic framework imposed upon us.

So, to sum up, if this path is pursued, mutexes will be a bloody
disaster on ARM. We'd be far better off sticking to semaphores
and ignoring this mutex stuff.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-20 19:32:46

by Arjan van de Ven

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


> So, to sum up, if this path is pursued, mutexes will be a bloody
> disaster on ARM. We'd be far better off sticking to semaphores
> and ignoring this mutex stuff.

on x86 the fastpath is the same for both basically.. is there a
fundamental reason it can't be for ARM?

2005-12-20 19:34:39

by Russell King

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, Dec 20, 2005 at 10:27:12AM -0800, Linus Torvalds wrote:
> On Tue, 20 Dec 2005, Nicolas Pitre wrote:
> >
> > Sure, and we're now more costly than the current implementation with irq
> > disabling.
>
> Do the timing. It may be more instructions, but I think it was you
> yourself that timed the current thing at 23 cycles, no?

That's PXA, which is Intel. Most other ARM CPUs are far faster
than that at IRQ disable. Typically you're looking at 6 cycles
to disable + 3 cycles to re-enable.

However, Nico's analysis of 14 instructions vs 9 instructions
pretty much paints the picture - those 14 instructions for the
preempt_disable _will_ be more heavy weight than Nico's idea.

Also, Nico has an alternative idea for mutexes which does not
involve decrementing or incrementing - it's an atomic swap.
That works out at about the same cycle count on non-Intel ARM
CPUs as the present semaphore path. I'm willing to bet that
it will be faster than the present semaphore path on Intel ARM
CPUs.

Here's the cycle counts for ARM926, assuming hot caches and the
failure path not taken for the existing semaphore code:

mrs ip, cpsr 2
orr lr, ip, #128 1
msr cpsr_c, lr 3
ldr lr, [%0] 2
subs lr, lr, %1 1
str lr, [%0] 1
msr cpsr_c, ip 3
movmi ip, %0 1
blmi failure 1
total: 15 cycles

Here's nico's version (with a couple of fixes to ensure we don't
schedule if preempt count is non-zero):

mov r0, sp, lsr #13 1
mov r0, r0, lsl #13 1
ldr r1, [r0, #PREEMPT_COUNT] 2
add r2, r1, #1 1
str r2, [r0, #PREEMPT_COUNT] 1
ldr r4, [r3] 2
sub r4, r4, #1 1
str r4, [r3] 1
str r1, [r0, #PREEMPT_COUNT] 1
teq r1, #0 1
bne no_preempt_check 1
ldr r1, [r0, #FLAGS] 2
tst r1, #TIF_NEED_RESCHED 1
blne schedule 1
no_preempt_check:
cmp r4, #0 1
blmi failed 1
total: 19 cycles

That's a 26% increase in the cost of a mutex implemented this way
over a plain semaphore.

Hence, mutexes implemented this way will be _more_ costly.
Significantly so. Enough to make them worthless.

I think the approach needs completely rethinking.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-20 19:37:19

by Russell King

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, Dec 20, 2005 at 08:32:35PM +0100, Arjan van de Ven wrote:
>
> > So, to sum up, if this path is pursued, mutexes will be a bloody
> > disaster on ARM. We'd be far better off sticking to semaphores
> > and ignoring this mutex stuff.
>
> on x86 the fastpath is the same for both basically.. is there a
> fundamental reason it can't be for ARM?

If we're talking about implementing mutexes as a semaphore, then they
will be identical. But what's the point of that? It's a semaphore
which is just called a mutex.

So you might as well just save the patch noise and do nothing instead.
It seems to me that you'll get _exactly_ the same result with a lot
less effort.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-20 19:46:15

by Steven Rostedt

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


On Tue, 20 Dec 2005, Russell King wrote:

> On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> > So 14 instructions total with preemption disabling, and that's with the
> > best implementation possible by open coding everything instead of
> > relying on generic functions/macros.
>
> I agree with your analysis Nicolas.
>
> However, don't forget to compare this with our existing semaphore
> implementation which is 9 instructions, or 8 for the SMP version.
>
> In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
> than our semaphores.
>
> Forcing architectures down the "let's make everything generic" path
> does not always hack it. It can't do by its very nature. Generic
> implementations are *always* sub-optimal and it is pretty clear
> that any gain that mutexes _may_ give is completely wasted on ARM
> by the overhead of having a generic framework imposed upon us.
>
> So, to sum up, if this path is pursued, mutexes will be a bloody
> disaster on ARM. We'd be far better off sticking to semaphores
> and ignoring this mutex stuff.
>

So what's wrong with having the generic code, and for those with a fast
semaphore, adding an arch-specific one?

#define mutex_lock down
#define mutex_unlock up
#define mutex_trylock(x) (!down_trylock(x))

Until the mutex code is updated to a fast arch specific implementation.

Let me restate: the generic code should not be this, but each arch can
have this if it already went to great lengths to make a fast semaphore.

Heck, put the above defines in the generic code, with a define:

linux/mutex.h:

#ifdef HAVE_ARCH_MUTEX
#include <asm/mutex.h>
#else

#ifdef HAVE_FAST_SEMAPHORE

#define <defines here>

#else

generic code here

#endif /* HAVE_FAST_SEMAPHORE */

#endif /* HAVE_ARCH_MUTEX */

-- Steve

2005-12-20 19:49:35

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Linus Torvalds wrote:

>
>
> On Tue, 20 Dec 2005, Nicolas Pitre wrote:
> >
> > Sure, and we're now more costly than the current implementation with irq
> > disabling.
>
> Do the timing. It may be more instructions, but I think it was you
> yourself that timed the current thing at 23 cycles, no?

Yes. And with my hand-optimized assembly version using
(the equivalent of) preempt_disable/preempt_enable I get 20 cycles over
14 instructions. If I let gcc do it using the standard macros it gets
worse.

The irq disabling implementation spends 23 cycles over 8 instructions.

The mutex swp-based implementation spends 8 cycles over 4 instructions.

If we make the preempt_disable/enable implementation out of line
that'll reduce the .text size bloat, but it'll add more cycles due to
the two additional branches plus preserving the return address, making
it above 23 cycles for sure.

> Special registers are almost always slower than nicely cached accesses
> (and they all basically will be).

Sure, but do the numbers above tell you anything else? It seems pretty
obvious to me that simple mutexes are a real gain. It suffices for the
implementation to let architectures do their best to lock the mutex and
fall back to generic code when there is contention (just like we already
do for semaphores).


Nicolas

2005-12-20 19:55:17

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Russell King wrote:

> On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> > So 14 instructions total with preemption disabling, and that's with the
> > best implementation possible by open coding everything instead of
> > relying on generic functions/macros.
>
> I agree with your analysis Nicolas.
>
> However, don't forget to compare this with our existing semaphore
> implementation which is 9 instructions, or 8 for the SMP version.
>
> In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
> than our semaphores.

I don't follow you at all. I'm arguing that mutexes are much less
expensive than any semaphore implementation (except on SMP where
semaphores and mutexes will probably look the same).

> Forcing architectures down the "lets make everything generic" path
> does not always hack it. It can't do by its very nature. Generic
> implementations are *always* sub-optimal and it is pretty clear
> that any gain that mutexes _may_ give is completely wasted on ARM
> by the overhead of having a generic framework imposed upon us.

Agreed. And that's why I'm suggesting that the mutex locking/unlocking
fast path should be architecture specific. And to that effect I'm
working on a patch against Ingo's mutex code to illustrate my point.

What should still remain generic is the contention fallback. That's
where the actual complexity lives and that part doesn't have to be
optimized for best inline performances.


Nicolas

2005-12-20 19:58:07

by Russell King

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, Dec 20, 2005 at 02:43:30PM -0500, Steven Rostedt wrote:
> So what's wrong with having the generic code, and for those with a fast
> semaphore, adding an arch-specific one?
>
> #define mutex_lock down
> #define mutex_unlock up
> #define mutex_trylock(x) (!down_trylock(x))
>
> Until the mutex code is updated to a fast arch specific implementation.
>
> Let me restate: the generic code should not be this, but each arch can
> have this if it already went to great lengths to make a fast semaphore.

I have no problem with this since we can then use Nico's swp-based
implementation. Great! What seems to be happening though is that
there's a move to make these operations be generic across all
architectures.

What both Nico and myself have demonstrated is that if architectures
are placed into the generic strait-jacket, any alleged performance
benefit of mutexes is completely swamped, which in turn makes the
whole mutex idea entirely pointless.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-20 19:59:28

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Arjan van de Ven wrote:

> on x86 the fastpath is the same for both basically.. is there a
> fundamental reason it can't be for ARM?

ARM prior to architecture level 6 (which means about 99% of all ARM
processors currently in the field) cannot do an atomic
decrement/increment; it must be done manually with interrupts disabled.

The only truly atomic instruction it has is a swap, which lends itself
pretty well to mutex support.


Nicolas

2005-12-20 20:12:39

by Linus Torvalds

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch



On Tue, 20 Dec 2005, Russell King wrote:
>
> Also, Nico has an alternative idea for mutexes which does not
> involve decrementing or incrementing - it's an atomic swap.
> That works out at about the same cycle count on non-Intel ARM
> CPUs as the present semaphore path. I'm willing to bet that
> it will be faster than the present semaphore path on Intel ARM
> CPUs.

Don't be so sure, especially not in the future.

An atomic "swap" operation is, from a CPU design standpoint, fundamentally
more expensive than a "load + store".

Now, most ARM architectures don't notice this, because they are all
in-order, and not SMP-aware anyway. No subtle memory ordering, no nothing.
Which is the only case when "swap" basically becomes a cheap "load+store".

What I'm trying to say is that a plain "load + store" is almost always
going to be the best option in the long run.

It's also almost certainly always the best option for UP + non-preempt,
for both present and future CPU's. The reason is simply that a
microarchitecture will _always_ be optimized for that case, since it's
pretty much by definition the common situation.

Is preemption even the common case on ARM? I'd assume not. Why are people
so interested in the preemption case? IOW, why don't you just do

ldr lr,[%0]
subs lr, lr, %1
str lr,[%0]
blmi failure

as the _base_ timings, since that should be the common case. That's the
drop-dead fastest UP case.

There's an additional advantage of the regular load/store case: if some
CPU has scheduling issues, you can actually write this out as C code (with
an extra empty ASM to make sure that the compiler doesn't move anything
out of the critical region). Again, that probably doesn't matter on most
ARM chips, but in the general case it sure does matter.
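
Something like this, say (a sketch only, with the kernel's barrier()
standing in for the empty asm, and a hypothetical failure() slow path as
in the asm above):

	static inline void mutex_lock(struct mutex *m)
	{
		int c;

		barrier();	/* empty asm: compiler can't move accesses across */
		c = m->count.counter - 1;
		m->count.counter = c;
		barrier();
		if (c < 0)
			failure(m);
	}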

(Btw, inlining _any_ of these except perhaps the above trivial case, is
probably wrong. None of the ARM chips tend to have caches all that big, I
bet).

Linus

2005-12-20 21:18:30

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Linus Torvalds wrote:

>
>
> On Tue, 20 Dec 2005, Russell King wrote:
> >
> > Also, Nico has an alternative idea for mutexes which does not
> > involve decrementing or incrementing - it's an atomic swap.
> > That works out at about the same cycle count on non-Intel ARM
> > CPUs as the present semaphore path. I'm willing to bet that
> > it will be faster than the present semaphore path on Intel ARM
> > CPUs.
>
> Don't be so sure, especially not in the future.

I think we agree. But we still live in the present.

> An atomic "swap" operation is, from a CPU design standpoint, fundamentally
> more expensive that a "load + store".
>
> Now, most ARM architectures don't notice this, because they are all
> in-order, and not SMP-aware anyway. No suble memory ordering, no nothing.
> Which is the only case when "swap" basically becomes a cheap "load+store".

Indeed.

> What I'm trying to say is that a plain "load + store" is almost always
> going to be the best option in the long run.
>
> It's also almost certainly always the best option for UP + non-preempt,
> for both present and future CPU's. The reason is simply that a
> microarchitecture will _always_ be optimized for that case, since it's
> pretty much by definition the common situation.

Which is why each architecture must always have the option of providing
its own fast path implementation according to a number of factors that
cannot be made into a generic layer. But this is the same issue whether
we talk about semaphores or mutexes.

> Is preemption even the common case on ARM? I'd assume not. Why are people
> so interested in the preemption case?

Because embedded people want it. ARM is also about the only
architecture besides x86 that currently sees most of the RT work for the
same reason. And yes preemption does make a difference.

> IOW, why don't you just do
>
> ldr lr,[%0]
> subs lr, lr, %1
> str lr,[%0]
> blmi failure
>
> as the _base_ timings, since that should be the common case. That's the
> drop-dead fastest UP case.

The above is 5 cycles, about the same as the preemption-safe swp-based
mutex implementation on non-Intel ARM. But it is broken wrt interrupts
where the swp version is not, and it doesn't work with preemption, while
the swp implementation is potentially smaller with cse and works in all
cases.

I mean...... what is it with mutexes that you dislike to the point of
bending backward that far, and even after seeing the numbers, with such
a semaphore implementation that _I_ even wouldn't trust people to use
correctly? Yes we should use complete() from interrupt handlers but I
can bet you that a lot of people are still using up() there, and with
the current up() implementation it just _works_ at least on ARM.

> (Btw, inlining _any_ of these except perhaps the above trivial case, is
> probably wrong. None of the ARM chips tend to have caches all that big, I
> bet).

On XScale you should add 2 cycles to branch to the out of line code and
2 other cycles to branch back.

The ARM mutex implementation can save that and have an extremely small
inlined footprint already.

Again what do you dislike so much about mutexes? It must not have much
to do with technical issues at this point. ;-)


Nicolas

2005-12-20 22:06:12

by Linus Torvalds

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch



On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> I mean...... what is it with mutexes that you dislike to the point of
> bending backward that far, and even after seeing the numbers, with such
> a semaphore implementation that _I_ even wouldn't trust people to use
> correctly?

Quite frankly, what has disgusted me about this mutex discussion is the
totally specious arguments for the new mutexes that just rub me entirely
the wrong way.

If it had _started_ with a mutex implementation that was faster, simpler,
and didn't rename the old and working semaphores, I'd have been perfectly
fine with it.

As it is, the discussion has been pretty much everything but that.

And then people who argue about single cycles end up dismissing the
single cycles when I argue that "ld+st" is faster - like you just did.

Be consistent, dammit. If single cycles matter, they matter. If they
don't, then the existing code is better, since it's existing and works.
You can't have it both ways.

In other words: if people didn't mix up issues that had nothing to do with
this into it, I'd be happier. I've already said that a mutex that does
_not_ replace semaphore (and doesn't mess with naming) is acceptable.

We've done that before. But do it RIGHT, dammit. And don't mix existing
semaphores into it (for example, completions didn't change any old users).

Linus

2005-12-20 22:21:05

by Steven Rostedt

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


On Tue, 20 Dec 2005, Linus Torvalds wrote:
>
> In other words: if people didn't mix up issues that had nothing to do with
> this into it, I'd be happier. I've already said that a mutex that does
> _not_ replace semaphore (and doesn't mess with naming) is acceptable.
>

At least I'm very happy with Linus' decision. Can we now end this thread
and move forward? You may bring up your issues in what comes next.

Thank you,

-- Steve

2005-12-20 22:43:13

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Linus Torvalds wrote:

> Quite frankly, what has disgusted me about this mutex discussion is the
> totally specious arguments for the new mutexes that just rub me entirely
> the wrong way.
>
[...]
>
> In other words: if people didn't mix up issues that had nothing to do with
> this into it, I'd be happier. I've already said that a mutex that does
> _not_ replace semaphore (and doesn't mess with naming) is acceptable.

Oh, if that is so, we are in _violent_ agreement then. I don't dispute
that at all and I pretty much agree with a separate namespace for
mutexes. Actually I think no one contested that either.

Current semaphores can be migrated to mutexes individually when that
makes sense to do so, separately.

With regards to my _implementation_ concerns about the proposed mutex
patches I guess I can discuss them with Ingo (and an actual patch is
coming to fix them).


Nicolas

2005-12-21 02:12:30

by Nick Piggin

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Linus Torvalds wrote:

>>IOW, why don't you just do
>>
>> ldr lr,[%0]
>> subs lr, lr, %1
>> str lr,[%0]
>> blmi failure
>>
>>as the _base_ timings, since that should be the common case. That's the
>>drop-dead fastest UP case.
>
>
> The above is 5 cycles, about the same as the preemption-safe swp-based
> mutex implementation on non-Intel ARM. But it is broken wrt interrupts
> where the swp version is not.

How is it broken WRT interrupts? (sorry, I haven't had it explained to me
in words of two syllables or less)

--
SUSE Labs, Novell Inc.


2005-12-21 04:16:23

by Nicolas Pitre

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

On Tue, 20 Dec 2005, Nicolas Pitre wrote:

> On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> >
> > * David Woodhouse <[email protected]> wrote:
> >
> > > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > > Hi Ingo,
> > > > Doesn't this corrupt caller saved registers?
> > >
> > > Looks like it. I _really_ don't like calling functions from inline
> > > asm. It's not nice. Can't we use atomic_dec_return() for this?
> >
> > we can use atomic_dec_return(), but that will add one more instruction
> > to the fastpath. OTOH, atomic_dec_return() is available on every
> > architecture, so it's a really tempting thing. I'll experiment with it.
>
> Please consider using (a variant of) xchg() instead. Although
> atomic_dec() is available on all architectures, its implementation is
> far from being the most efficient thing to do for them all.

Actually, the best thing to do is to let the architecture do what is the
most efficient. Please consider this patch that adds the flexibility
for any architecture to use the optimal mechanism it has for atomic
locking. It also lets each architecture decide whether inlining the
fast path is a good thing or not. This patch should leave i386
unchanged. It applies on top of your last posted mutex patch series.

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -14,6 +14,8 @@
#include <asm/atomic.h>
#include <linux/list.h>
#include <linux/spinlock_types.h>
+#include <linux/linkage.h>
+#include <asm/mutex.h>

/*
* Simple, straightforward mutexes with strict semantics:
@@ -93,12 +95,49 @@ extern void FASTCALL(__mutex_init(struct
#define DEFINE_MUTEX(mutexname) \
struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)

+/*
+ * We can speed up the lock-acquire, but only if the architecture
+ * supports it and if there's no debugging state
+ * to be set up (!DEBUG_MUTEXESS).
+ */
+#ifdef CONFIG_DEBUG_MUTEXESS
+#undef MUTEX_LOCKLESS_FASTPATH
+#endif
+
+/*
+ * We can inline the lock-acquire, but only if the architecture
+ * wants it, and if MUTEX_LOCKLESS_FASTPATH is active.
+ */
+#ifndef MUTEX_LOCKLESS_FASTPATH
+#undef MUTEX_INLINE_FASTPATH
+#endif
+
+/* out of line contention handlers */
+extern void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));
+extern void FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
+
+#ifdef MUTEX_INLINE_FASTPATH
+
+#define mutex_lock(lock) \
+ __arch_fast_mutex_lock(&lock->count, __mutex_lock_noinline)
+#define mutex_unlock(lock) \
+ __arch_fast_mutex_unlock(&lock->count, __mutex_unlock_noinline)
+#define mutex_trylock(lock) \
+ __arch_fast_mutex_trylock(&lock->count)
+#define mutex_is_locked(lock) \
+ ({ mb(); atomic_read(&lock->count) != 1; })
+
+#else
+
extern void FASTCALL(mutex_lock(struct mutex *lock));
-extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
-extern int FASTCALL(mutex_trylock(struct mutex *lock));
extern void FASTCALL(mutex_unlock(struct mutex *lock));
+extern int FASTCALL(mutex_trylock(struct mutex *lock));
extern int FASTCALL(mutex_is_locked(struct mutex *lock));

+#endif
+
+extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
+
/*
* Debugging variant of mutexes. The only difference is that they accept
* the semaphore APIs too:
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -21,20 +21,6 @@
#include <linux/interrupt.h>

/*
- * We can speed up the lock-acquire, if the architecture
- * supports cmpxchg and if there's no debugging state
- * to be set up (!DEBUG_MUTEXESS).
- *
- * trick: we can use cmpxchg on the release side too, if bit
- * 0 of lock->owner is set if there is at least a single pending
- * task in the wait_list. This way the release atomic-fastpath
- * can be a mirror image of the acquire path:
- */
-#if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_MUTEXESS)
-# define MUTEX_LOCKLESS_FASTPATH
-#endif
-
-/*
* In the debug case we carry the caller's instruction pointer into
* other functions, but we dont want the function argument overhead
* in the nondebug case - hence these macros:
@@ -373,9 +359,7 @@ repeat:
static int __mutex_trylock(struct mutex *lock __IP_DECL__)
{
#ifdef MUTEX_LOCKLESS_FASTPATH
- if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
- return 1;
- return 0;
+ return __arch_fast_mutex_trylock(&lock->count);
#else
struct thread_info *ti = current_thread_info();
unsigned long flags;
@@ -397,19 +381,6 @@ static int __mutex_trylock(struct mutex
#endif
}

-int fastcall mutex_is_locked(struct mutex *lock)
-{
- mb();
- return atomic_read(&lock->count) != 1;
-}
-
-EXPORT_SYMBOL_GPL(mutex_is_locked);
-
-int __sched fastcall mutex_trylock(struct mutex *lock)
-{
- return __mutex_trylock(lock __CALLER_IP__);
-}
-
/*
* Release the lock:
*/
@@ -457,7 +428,6 @@ static inline void __mutex_unlock_nonato
* We want the atomic op come first, to make sure the
* branch is predicted as default-untaken:
*/
-static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));

/*
* The locking fastpath is the 1->0 transition from
@@ -465,39 +435,41 @@ static __sched void FASTCALL(__mutex_loc
*/
static inline void __mutex_lock_atomic(struct mutex *lock)
{
- atomic_dec_call_if_negative(&lock->count, __mutex_lock_noinline);
+ __arch_fast_mutex_lock(&lock->count, __mutex_lock_noinline);
}

-static fastcall __sched void __mutex_lock_noinline(atomic_t *lock_count)
+fastcall __sched void __mutex_lock_noinline(atomic_t *lock_count)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);

__mutex_lock_nonatomic(lock);
}

+EXPORT_SYMBOL_GPL(__mutex_lock_noinline);
+
static inline void __mutex_lock(struct mutex *lock)
{
__mutex_lock_atomic(lock);
}

-static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
-
/*
* The unlocking fastpath is the 0->1 transition from
* 'locked' into 'unlocked' state:
*/
static inline void __mutex_unlock_atomic(struct mutex *lock)
{
- atomic_inc_call_if_nonpositive(&lock->count, __mutex_unlock_noinline);
+ __arch_fast_mutex_unlock(&lock->count, __mutex_unlock_noinline);
}

-static fastcall void __sched __mutex_unlock_noinline(atomic_t *lock_count)
+fastcall void __sched __mutex_unlock_noinline(atomic_t *lock_count)
{
struct mutex *lock = container_of(lock_count, struct mutex, count);

__mutex_unlock_nonatomic(lock);
}

+EXPORT_SYMBOL_GPL(__mutex_unlock_noinline);
+
static inline void __mutex_unlock(struct mutex *lock)
{
__mutex_unlock_atomic(lock);
@@ -517,6 +489,8 @@ static inline void __mutex_unlock(struct

#endif

+#ifndef MUTEX_INLINE_FASTPATH
+
void __sched fastcall mutex_lock(struct mutex *lock)
{
__mutex_lock(lock __CALLER_IP__);
@@ -532,6 +506,23 @@ void __sched fastcall mutex_unlock(struc

EXPORT_SYMBOL_GPL(mutex_unlock);

+int __sched fastcall mutex_trylock(struct mutex *lock)
+{
+ return __mutex_trylock(lock __CALLER_IP__);
+}
+
+EXPORT_SYMBOL_GPL(mutex_trylock);
+
+int fastcall mutex_is_locked(struct mutex *lock)
+{
+ mb();
+ return atomic_read(&lock->count) != 1;
+}
+
+EXPORT_SYMBOL_GPL(mutex_is_locked);
+
+#endif
+
int __sched fastcall mutex_lock_interruptible(struct mutex *lock)
{
return __mutex_lock_interruptible(lock, 0 __CALLER_IP__);
Index: linux-2.6/include/asm-generic/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/mutex.h
@@ -0,0 +1,22 @@
+/*
+ * linux/include/asm-generic/mutex.h
+ *
+ * Reference implementation for lockless fast path mutex operations
+ */
+
+#define MUTEX_LOCKLESS_FASTPATH
+
+#define __arch_fast_mutex_lock(count, contention_fn) \
+do { \
+ if (atomic_xchg(count, 0) != 1) \
+ contention_fn(count); \
+} while (0)
+
+#define __arch_fast_mutex_unlock(count, contention_fn) \
+do { \
+ if (atomic_xchg(count, 1) != 0) \
+ contention_fn(count); \
+} while (0)
+
+#define __arch_fast_mutex_trylock(count) \
+ (atomic_cmpxchg(count, 1, 0) == 1)
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-i386/mutex.h
@@ -0,0 +1,44 @@
+#define MUTEX_LOCKLESS_FASTPATH
+
+#define __arch_fast_mutex_lock(v, contention) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "decl (%%eax)\n" \
+ "js 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ : \
+ :"a" (v) \
+ :"memory","cx","dx"); \
+} while (0)
+
+#define __arch_fast_mutex_unlock(v, contention) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "incl (%%eax)\n" \
+ "jle 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ : \
+ :"a" (v) \
+ :"memory","cx","dx"); \
+} while (0)
+
+#define __arch_fast_mutex_trylock(v) \
+ (atomic_cmpxchg(v, 1, 0) == 1)
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm/mutex.h
@@ -0,0 +1,9 @@
+#include <asm/system.h>
+
+#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+
+#include <asm-generic/mutex.h>
+
+#ifndef swp_is_buggy
+#define MUTEX_INLINE_FASTPATH
+#endif


Nicolas

2005-12-21 06:01:37

by Ingo Molnar

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


* Linus Torvalds <[email protected]> wrote:

> If it had _started_ with a mutex implementation that was faster,
> simpler, and didn't rename the old and working semaphores, I'd have
> been perfectly fine with it.

oh, i'm totally OK with not doing the renames and leaving semaphores
alone!

Just in case it wasn't clear: i very much expected that the migration
helper patches would be controversial, and that they would probably not
go upstream. [Christoph said last week that they were fit for an
obfuscated C contest, not the kernel, and i didn't expect this sentiment
to change overnight.] Look at my patch order:

xfs-mutex-namespace-collision-fix.patch

add-atomic-xchg-i386.patch
add-atomic-xchg-x86_64.patch
add-atomic-xchg-ia64.patch
add-atomic-call-func-i386.patch
add-atomic-call-func-x86_64.patch
add-atomic-call-func-ia64.patch
add-atomic-call-wrappers-all-other-arches.patch

mutex-core.patch

mutex-debug.patch
mutex-debug-more.patch

mutex-migration-helper-i386.patch # not upstream from here
mutex-migration-helper-x86_64.patch
mutex-migration-helper-ia64.patch
mutex-migration-helper-core.patch

sx8-sem2completions.patch
cpu5wdt-sem2completions.patch
ide-gendev-sem-to-completion.patch
loop-to-completions.patch

arch-semaphores.patch

The first 11 patches are fine-grained and contain _zero_ of the migration
and rename stuff. I specifically created the patch series in such a way
that we could simply chop off the last few patches.

in the future i'll only send patches up to mutex-debug-more.patch, as
suggested by Christoph and you as well. So there's really no
controversy. Btw., that was true for my first queue already, as noticed
by Christoph [*].

Basically all of the activity in the last 2 days was in the first 11
patches. I'll send an updated queue later today.

Ingo

[*] the migration helpers were incredibly useful for pulling this off.
Without the wide exposure of mutexes to existing semaphore users i'd
not have been able to profile the system, to measure the effects of
mutexes on performance. I'd also not have been able to
say what percentage of semaphores could move over to mutexes. We
could also not have carried the mutex implementation in the -rt tree
for more than a year, in which year millions of lines were changed
in the upstream kernel! It would have been simply impossible to even
attempt this, without the type-sensitive APIs and the minimal
renames to categorize semaphores.

2005-12-21 06:21:11

by Ingo Molnar

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


* Nick Piggin <[email protected]> wrote:

> >>Considering that on UP, the arm should not need to disable interrupts
> >>for this function (or has someone refuted Linus?), how about:
> >
> >Kernel preemption.
>
> preempt_disable() ?

please take a look at kernel/mutex.c, there's a define at the top of the
file:

// #define MUTEX_IRQ_SAFE

which, if off, makes the mutex code use preempt_disable() and
preempt_enable() to make it preemption-safe. If it's on, the mutex
implementation uses IRQ flags.

in my current tree i've already eliminated this define, and have
switched the code to use preempt_disable()/preempt_enable() exclusively,
because preempt_*() is equally fast on all platforms, while IRQ disable
costs vary largely. (and they are rarely faster than preempt_disable()).

my current tree also provides a mechanism for architectures to hand-code
the mutex lock and unlock fastpath, if they choose to do so. So i think
we can really stop the cycle counting.

Ingo

2005-12-21 06:27:04

by Ingo Molnar

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch


* Nicolas Pitre <[email protected]> wrote:

> > Please consider using (a variant of) xchg() instead. Although
> > atomic_dec() is available on all architectures, its implementation is
> > far from being the most efficient thing to do for them all.
>
> Actually, the best thing to do is to let the architecture do what is
> the most efficient. [...]

i have already added something similar. Furthermore, i have also
eliminated the CMPXCHG requirement from MUTEX_FASTPATH, which should
open ARM up to the generic lockless fastpath. (but you can still choose
to reimplement it in the architecture) I'll send a new queue later
today.

Ingo

2005-12-21 11:59:50

by Horst H. von Brand

Subject: Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch

Steven Rostedt <[email protected]> wrote:

[...]

> Let me restate: the generic code should not be this, but each arch can
> have this if it already went to great lengths to make a fast semaphore.
>
> Heck put the above defines in the generic code, with a define
>
> linux/mutex.h:
>
> #ifdef HAVE_ARCH_MUTEX
> #include <asm/mutex.h>
> #else
>
> #ifdef HAVE_FAST_SEMAPHORE
>
> #define <defines here>
>
> #else
>
> generic code here

Anything to go here could/should very well be in the above arch-specific
file. Saves you a #define ;-)

> #endif /* HAVE_FAST_SEMAPHORE */
> #endif /* HAVE_ARCH_MUTEX */
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513