2005-12-22 11:42:34

by Ingo Molnar

[permalink] [raw]
Subject: [patch 0/9] mutex subsystem, -V4

this is the -V4 of the mutex subsystem patch-queue. It consists of the
following patches:

add-atomic-xchg.patch
add-atomic-call-func-i386.patch
add-atomic-call-func-x86_64.patch
add-atomic-call-wrappers-rest.patch
mutex-core.patch
mutex-switch-arm-to-xchg.patch
mutex-debug.patch
mutex-debug-more.patch
xfs-mutex-namespace-collision-fix.patch

the patches are against Linus' latest GIT tree, and they should work
fine on every Linux architecture.

Changes since -V3:

- imlemented an atomic_xchg() based mutex implementation. It integrated
pretty nicely into the generic code, and most of the code is still
shared.

- added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines
this then the generic mutex code will switch to the atomic_xchg()
implementation.

This should be conceptually equivalent to the variant Nicolas Pitre
posted - Nicolas, could you check out this one? It's much easier to
provide this in the generic implementation, and the code ends up
looking cleaner.

- eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for
architectures to override the generic code anymore, with the
introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.

- ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.

- ARM buildfix: move the new atomic primitives to the correct place.
(Nicolas Pitre)

- optimization: unlock the mutex outside of the spinlock (suggested by
Nicolas Pitre)

- removed leftover arch_semaphore reference from the XFS patch. (noticed
by Arjan van de Ven)

- better document the fact that mutex_trylock() follows spin_trylock()
semantics, not sem_trylock() semantics.

- cleanup: got rid of the MUTEX_LOCKLESS_FASTPATH define. (in -V3 this
became equivalent to DEBUG_MUTEXES, via the elimination of the
__HAVE_ARCH_CMPXCHG dependency on the fastpath.)

- further simplified and unified mutex_trylock().

- DocBook entries, and more documentation of internals.

- dropped the spinlock-debug fix, Linus merged it.

Ingo


2005-12-22 11:53:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote:
> this is the -V4 of the mutex subsystem patch-queue. It consists of the
> following patches:
>
> add-atomic-xchg.patch
> add-atomic-call-func-i386.patch
> add-atomic-call-func-x86_64.patch
> add-atomic-call-wrappers-rest.patch
> mutex-core.patch
> mutex-switch-arm-to-xchg.patch
> mutex-debug.patch
> mutex-debug-more.patch
> xfs-mutex-namespace-collision-fix.patch
>
> the patches are against Linus' latest GIT tree, and they should work
> fine on every Linux architecture.
>
> Changes since -V3:
>
> - imlemented an atomic_xchg() based mutex implementation. It integrated
> pretty nicely into the generic code, and most of the code is still
> shared.
>
> - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines
> this then the generic mutex code will switch to the atomic_xchg()
> implementation.
>
> This should be conceptually equivalent to the variant Nicolas Pitre
> posted - Nicolas, could you check out this one? It's much easier to
> provide this in the generic implementation, and the code ends up
> looking cleaner.
>
> - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for
> architectures to override the generic code anymore, with the
> introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
>
> - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.

I must admit I really really hat __ARCH_ stuff if we can avoid it.
An <asm/mutex.h> that usually includes two asm-generic variants is probably
a much better choice.

Actually just havign asm/mutex.h implement the faspath per-arch and get
rid of all the oddball atomic.h additions would be even better. While
this means we need per-arch code it also means the code is a lot easier
understandable, and we don't add odd public APIs.

2005-12-22 11:56:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Ingo Molnar <[email protected]> wrote:
>
> this is the -V4 of the mutex subsystem patch-queue.
>

I've only been following this with half an eye, with the apparently
erroneous expectation that future versions of the patchset would come with
some explanation of why on earth we'd want to merge all this stuff into the
kernel.

We've seen some benchmarks which indicate that there are performance gains
to be had if there's a lot of lock contention, but isn't it the case that
this is because of the semi-extra wake_up() in down()? Has anyone tried to
redo existing semaphores so they no longer have that disadvantage?

2005-12-22 12:21:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Andrew Morton <[email protected]> wrote:

> I've only been following this with half an eye, with the apparently
> erroneous expectation that future versions of the patchset would come
> with some explanation of why on earth we'd want to merge all this
> stuff into the kernel.

in my initial announcement i listed 10 good reasons to do so, and they
are still true:

http://people.redhat.com/mingo/generic-mutex-subsystem/mutex-announce.txt
[...]

But firstly, i'd like to answer the most important question:

"Why the hell do we need a new mutex subsystem, and what's wrong
with semaphores??"

This question is clearly nagging most of the doubters, so i'm listing my
answers first, before fully explaining the patchset. (For more
implementational details, see the subseqent sections.)

here are the top 10 reasons of why i think the generic mutex code should
be considered for upstream integration:

- 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
'struct mutex' is 16 bytes. A smaller structure size means less RAM
footprint, and better CPU-cache utilization.

- tighter code. On x86 i get the following .text sizes when
switching all mutex-alike semaphores in the kernel to the mutex
subsystem:

text data bss dec hex filename
3280380 868188 396860 4545428 455b94 vmlinux-semaphore
3255329 865296 396732 4517357 44eded vmlinux-mutex

that's 25051 bytes of code saved, or a 0.76% win - off the hottest
codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%)
Smaller code means better icache footprint, which is one of the
major optimization goals in the Linux kernel currently.

- the mutex subsystem is faster and has superior scalability for
contented workloads. On an 8-way x86 system, running a mutex-based
kernel and testing creat+unlink+close (of separate, per-task files)
in /tmp with 16 parallel tasks, the average number of ops/sec is:

Semaphores: Mutexes:

$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
checking VFS performance. checking VFS performance.
avg loops/sec: 34713 avg loops/sec: 84153
CPU utilization: 63% CPU utilization: 22%

i.e. in this workload, the mutex based kernel was 2.4 times faster
than the semaphore based kernel, _and_ it also had 2.8 times less CPU
utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
more efficient.)

the scalability difference is visible even on a 2-way P4 HT box:

Semaphores: Mutexes:

$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
4 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
checking VFS performance. checking VFS performance.
avg loops/sec: 127659 avg loops/sec: 181082
CPU utilization: 100% CPU utilization: 34%

(the straight performance advantage of mutexes is 41%, the per-cycle
efficiency of mutexes is 4.1 times better.)

- there are no fastpath tradeoffs, the mutex fastpath is just as tight
as the semaphore fastpath. On x86, the locking fastpath is 2
instructions:

c0377ccb <mutex_lock>:
c0377ccb: f0 ff 08 lock decl (%eax)
c0377cce: 78 0e js c0377cde <.text.lock.mutex>
c0377cd0: c3 ret

the unlocking fastpath is equally tight:

c0377cd1 <mutex_unlock>:
c0377cd1: f0 ff 00 lock incl (%eax)
c0377cd4: 7e 0f jle c0377ce5 <.text.lock.mutex+0x7>
c0377cd6: c3 ret

- the per-call-site inlining cost of the slowpath is cheaper and
smaller than that of semaphores, by one instruction, because the
mutex trampoline code does not do a "lea %0,%%eax" that the semaphore
code does before calling __down_failed. The mutex subsystem uses out
of line code currently so this makes only a small difference in .text
size, but in case we want to inline mutexes, they will be cheaper
than semaphores.

- No wholesale or dangerous migration path. The migration to mutexes is
fundamentally opt-in, safe and easy: multiple type-based and .config
based migration helpers are provided to make the migration to mutexes
easy. Migration is as finegrained as it gets, so robustness of the
kernel or out-of-tree code should not be jeopardized at any stage.
The migration helpers can be eliminated once migration is completed,
once the kernel has been separated into 'mutex users' and 'semaphore
users'. Out-of-tree code automatically defaults to semaphore
semantics, mutexes are not forced upon anyone, at any stage of the
migration.

- 'struct mutex' semantics are well-defined and are enforced if
CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have
virtually no debugging code or instrumentation. The mutex subsystem
checks and enforces the following rules:

* - only one task can hold the mutex at a time
* - only the owner can unlock the mutex
* - multiple unlocks are not permitted
* - recursive locking is not permitted
* - a mutex object must be initialized via the API
* - a mutex object must not be initialized via memset or copying
* - task may not exit with mutex held
* - memory areas where held locks reside must not be freed

furthermore, there are also convenience features in the debugging
code:

* - uses symbolic names of mutexes, whenever they are printed in debug output
* - point-of-acquire tracking, symbolic lookup of function names
* - list of all locks held in the system, printout of them
* - owner tracking
* - detects self-recursing locks and prints out all relevant info
* - detects multi-task circular deadlocks and prints out all affected
* locks and tasks (and only those tasks)

we have extensive experience with the mutex debugging code in the -rt
kernel, and it eases the debugging of mutex related bugs
considerably. A handful of upstream bugs were found as well this
way, and were contributed back to the vanilla kernel. We do believe
that improved debugging code is an important tool in improving the
fast-paced upstream kernel's quality.

a side-effect of the strict semantics is that mutexes are much easier
to analyze on a static level. E.g. Sparse could check the correctness
of mutex users, further improving the kernel's quality. Also, the
highest-level security and reliability validation techniques (and
certification processes) involve static code analysis.

- kernel/mutex.c is generic, and has minimal per-arch needs. No new
primitives have to be implemented to support spinlock-based generic
mutexes. Only 2 new atomic primitives have to be implemented for an
architecture to support optimized, lockless generic mutexes. In
contrast, to implement semaphores on a new architecture, hundreds of
lines of nontrivial (often assembly) code has to be written and
debugged.

- kernel/mutex.c is highly hackable. New locking features can be
implemented in C, and they carry over to every architecture.
Extensive self-consistency debugging checks of the mutex
implementation are done if CONFIG_DEBUG_MUTEXES is turned on. I do
think that hackability is the most important property of
kernel code.

- the generic mutex subsystem is also one more step towards enabling
the fully preemptable -rt kernel. Ok, this shouldnt bother the
upstream kernel too much at the moment, but it's a personal driving
force for me nevertheless ;-)

(NOTE: i consciously did not list 'Priority Inheritance' amongst the
reasons, because priority inheritance for blocking kernel locks
would be a misguided reason at best, and flat out wrong at worst.)

2005-12-22 12:45:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Christoph Hellwig <[email protected]> wrote:

> > Changes since -V3:
> >
> > - imlemented an atomic_xchg() based mutex implementation. It integrated
> > pretty nicely into the generic code, and most of the code is still
> > shared.
> >
> > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines
> > this then the generic mutex code will switch to the atomic_xchg()
> > implementation.
> >
> > This should be conceptually equivalent to the variant Nicolas Pitre
> > posted - Nicolas, could you check out this one? It's much easier to
> > provide this in the generic implementation, and the code ends up
> > looking cleaner.
> >
> > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for
> > architectures to override the generic code anymore, with the
> > introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
> >
> > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.
>
> I must admit I really really hat __ARCH_ stuff if we can avoid it. An
> <asm/mutex.h> that usually includes two asm-generic variants is
> probably a much better choice.

agreed. In my tree i've changed it to CONFIG_MUTEX_XCHG_ALGORITHM, which
is selected by ARM in its Kconfig.

Ingo

2005-12-22 13:09:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Ingo Molnar <[email protected]> wrote:
>
> ...
>
> here are the top 10 reasons of why i think the generic mutex code should
> be considered for upstream integration:

Appropriate for the changelog, please.

> - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
> 'struct mutex' is 16 bytes. A smaller structure size means less RAM
> footprint, and better CPU-cache utilization.

Because of the .sleepers thing. Perhaps a revamped semaphore wouldn't need
thsat field?

> - tighter code. On x86 i get the following .text sizes when
> switching all mutex-alike semaphores in the kernel to the mutex
> subsystem:
>
> text data bss dec hex filename
> 3280380 868188 396860 4545428 455b94 vmlinux-semaphore
> 3255329 865296 396732 4517357 44eded vmlinux-mutex
>
> that's 25051 bytes of code saved, or a 0.76% win - off the hottest
> codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%)
> Smaller code means better icache footprint, which is one of the
> major optimization goals in the Linux kernel currently.

Why is the mutex-using kernel any smaller? IOW: from where do these
savings come?

> - the mutex subsystem is faster and has superior scalability for
> contented workloads. On an 8-way x86 system, running a mutex-based
> kernel and testing creat+unlink+close (of separate, per-task files)
> in /tmp with 16 parallel tasks, the average number of ops/sec is:
>
> Semaphores: Mutexes:
>
> $ ./test-mutex V 16 10 $ ./test-mutex V 16 10
> 8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
> checking VFS performance. checking VFS performance.
> avg loops/sec: 34713 avg loops/sec: 84153
> CPU utilization: 63% CPU utilization: 22%
>
> i.e. in this workload, the mutex based kernel was 2.4 times faster
> than the semaphore based kernel, _and_ it also had 2.8 times less CPU
> utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
> performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
> performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
> more efficient.)
>
> the scalability difference is visible even on a 2-way P4 HT box:
>
> Semaphores: Mutexes:
>
> $ ./test-mutex V 16 10 $ ./test-mutex V 16 10
> 4 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
> checking VFS performance. checking VFS performance.
> avg loops/sec: 127659 avg loops/sec: 181082
> CPU utilization: 100% CPU utilization: 34%
>
> (the straight performance advantage of mutexes is 41%, the per-cycle
> efficiency of mutexes is 4.1 times better.)

Why is the mutex-using kernel more scalable?

Can semaphores be tuned to offer the same scalability improvements?

> ...
>
> - the per-call-site inlining cost of the slowpath is cheaper and
> smaller than that of semaphores, by one instruction, because the
> mutex trampoline code does not do a "lea %0,%%eax" that the semaphore
> code does before calling __down_failed. The mutex subsystem uses out
> of line code currently so this makes only a small difference in .text
> size, but in case we want to inline mutexes, they will be cheaper
> than semaphores.

Cannot the semaphore code be improved in the same manner?

> - No wholesale or dangerous migration path. The migration to mutexes is
> fundamentally opt-in, safe and easy: multiple type-based and .config
> based migration helpers are provided to make the migration to mutexes
> easy. Migration is as finegrained as it gets, so robustness of the
> kernel or out-of-tree code should not be jeopardized at any stage.
> The migration helpers can be eliminated once migration is completed,
> once the kernel has been separated into 'mutex users' and 'semaphore
> users'. Out-of-tree code automatically defaults to semaphore
> semantics, mutexes are not forced upon anyone, at any stage of the
> migration.

IOW: a complete PITA, sorry.

> - 'struct mutex' semantics are well-defined and are enforced if
> CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have
> virtually no debugging code or instrumentation. The mutex subsystem
> checks and enforces the following rules:
>
> * - only one task can hold the mutex at a time
> * - only the owner can unlock the mutex
> * - multiple unlocks are not permitted
> * - recursive locking is not permitted
> * - a mutex object must be initialized via the API
> * - a mutex object must not be initialized via memset or copying
> * - task may not exit with mutex held
> * - memory areas where held locks reside must not be freed

Pretty much all of that could be added to semaphores-when-used-as-mutexes.
Without introducing a whole new locking mechanism.

> furthermore, there are also convenience features in the debugging
> code:
>
> * - uses symbolic names of mutexes, whenever they are printed in debug output
> * - point-of-acquire tracking, symbolic lookup of function names
> * - list of all locks held in the system, printout of them
> * - owner tracking
> * - detects self-recursing locks and prints out all relevant info
> * - detects multi-task circular deadlocks and prints out all affected
> * locks and tasks (and only those tasks)

Ditto, I expect.

> - the generic mutex subsystem is also one more step towards enabling
> the fully preemptable -rt kernel. Ok, this shouldnt bother the
> upstream kernel too much at the moment, but it's a personal driving
> force for me nevertheless ;-)

Actually that's the only compelling reason I've yet seen, sorry.

What is it about these mutexes which is useful to RT and why cannot that
feature be provided by semaphores?


Ingo, there appear to be quite a few straw-man arguments here. You're
comparing a subsystem (semaphores) which obviously could do with a lot of
fixing and enhancing with something new which has had a lot of recent
feature work out into it.

I'd prefer to see mutexes compared with semaphores after you've put as much
work into improving semaphores as you have into developing mutexes.

2005-12-22 13:23:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


> > * - only one task can hold the mutex at a time
> > * - only the owner can unlock the mutex
> > * - multiple unlocks are not permitted
> > * - recursive locking is not permitted
> > * - a mutex object must be initialized via the API
> > * - a mutex object must not be initialized via memset or copying
> > * - task may not exit with mutex held
> > * - memory areas where held locks reside must not be freed
>
> Pretty much all of that could be added to semaphores-when-used-as-mutexes.
> Without introducing a whole new locking mechanism.

this is basically in direct conflict with what Linus has been asking
for.. "leave semaphore semantics alone".

I agree with Linus that if it has very different rules, it should have a
different name.

> Ingo, there appear to be quite a few straw-man arguments here. You're
> comparing a subsystem (semaphores) which obviously could do with a lot of
> fixing and enhancing with something new which has had a lot of recent
> feature work out into it.
>
> I'd prefer to see mutexes compared with semaphores after you've put as much
> work into improving semaphores as you have into developing mutexes.

semaphores have had a lot of work for the last... 10 years. To me that
is a sign that maybe they're close to their limit already.

You keep saying 10 times "so please enhance semaphores to do this".
Semaphores have far more complex rules for the slowpath (mutex semantics
are simple because the number of wakers is always at most one, and if
you hold the lock, you KNOW nobody else can do another release under
you). Adding dual rules or other complexity to it doesn't sound too
compelling to me actually; they're highly tricky things already (just
look at the i386 ones.. that extra wakeup was there to plug a hole (at
least that is my empirical conclusion based on "we remove it it hangs"
behavior).

Having 2 sets of behaviors for the same primitive also sounds not good
to me to be honest, that's bound to explode stuff left and right all the
time.

I realize you're not looking forward to a gradual conversion; yet Linus
says he doesn't want a wholesale change either but wants it gradual.


2005-12-22 13:45:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Arjan van de Ven <[email protected]> wrote:
>
>
> > > * - only one task can hold the mutex at a time
> > > * - only the owner can unlock the mutex
> > > * - multiple unlocks are not permitted
> > > * - recursive locking is not permitted
> > > * - a mutex object must be initialized via the API
> > > * - a mutex object must not be initialized via memset or copying
> > > * - task may not exit with mutex held
> > > * - memory areas where held locks reside must not be freed
> >
> > Pretty much all of that could be added to semaphores-when-used-as-mutexes.
> > Without introducing a whole new locking mechanism.
>
> this is basically in direct conflict with what Linus has been asking
> for.. "leave semaphore semantics alone".

No, it's mostly just adding new debug stuff. And we'd need to add
additional anotations to those few semaphores which are used as non-mutexes
to support that new debug stuff.

That's if the new debugging stuff is actually useful. I'm not sure it is,
really - I can't think of any kernel bugs which these features would have
exposed. Although I think Ingo found some, but I don't recall what they
were.

>
> > Ingo, there appear to be quite a few straw-man arguments here. You're
> > comparing a subsystem (semaphores) which obviously could do with a lot of
> > fixing and enhancing with something new which has had a lot of recent
> > feature work out into it.
> >
> > I'd prefer to see mutexes compared with semaphores after you've put as much
> > work into improving semaphores as you have into developing mutexes.
>
> semaphores have had a lot of work for the last... 10 years. To me that
> is a sign that maybe they're close to their limit already.

No they haven't - they're basically unchanged from four years ago (at
least).

> You keep saying 10 times "so please enhance semaphores to do this".

Well I think diligence requires that we be able to demonstrate why it's not
possible.

It's plain dumb for us to justify a fancy-pants new system based on
features which we could have added to the old one, no?

> Semaphores have far more complex rules for the slowpath (mutex semantics
> are simple because the number of wakers is always at most one, and if
> you hold the lock, you KNOW nobody else can do another release under
> you). Adding dual rules or other complexity to it doesn't sound too
> compelling to me actually; they're highly tricky things already (just
> look at the i386 ones.. that extra wakeup was there to plug a hole (at
> least that is my empirical conclusion based on "we remove it it hangs"
> behavior).
>
> Having 2 sets of behaviors for the same primitive also sounds not good
> to me to be honest, that's bound to explode stuff left and right all the
> time.

There's no need for two sets of behaviour. What I'm saying is that we
could add similar debugging features to semaphores (if useful). Yes, we
would have to tell the kernel at the source level to defeat that debugging
if a particular lock isn't being used as a mutex. That's rather less
intrusive than adding a whole new type of lock. But I'd question the value
even of doing that, given the general historical non-bugginess of existing
semaphore users.

2005-12-22 14:13:49

by Alan

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:
> > semaphores have had a lot of work for the last... 10 years. To me that
> > is a sign that maybe they're close to their limit already.
>
> No they haven't - they're basically unchanged from four years ago (at
> least).

Point still holds

> It's plain dumb for us to justify a fancy-pants new system based on
> features which we could have added to the old one, no?

The old one does one job well. Mutex wakeups are very different in
behaviour because of the single waker rule and the fact you can do stuff
like affine wakeups.

You could make it one function but it would inevitably end up full of

if(mutex) {
}

so you'd make it slower, destabilize both and not get some of the
winnings like code size.

Oh and of course you no doubt break the semaphores while doing it, while
at least this seperate way as Linus suggested you don't break the
existing functionality.

> There's no need for two sets of behaviour.

The fundamental reason for mutexes in terms of performance and
scalability requires the different wake up behaviours. The performance
numbers are pretty compelling. I was against the original proposal but
now the gradual changeover is proposed I'm for.


Alan

2005-12-22 15:19:49

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Andrew Morton wrote:

> I'd prefer to see mutexes compared with semaphores after you've put as much
> work into improving semaphores as you have into developing mutexes.

There is a fundamental difference between semaphores and mutexes. The
semaphore semantics _require_ atomic increments/decrements where mutexes
do not. This makes a huge difference on ARM where 99% of all ARM
processors out there can only perform atomic swaps which is sufficient
for mutexes but insufficient for semaphores. Therefore on ARM
performing an atomic increment/decrement (the semaphore fast
path) requires extra costly locking
and .text space (23 cycles over 8 instructions) while the mutex fast
path is about 2-3 instructions and
needs 7-8 cycles. I bet many other architectures are in the same camp.


Nicolas

2005-12-22 15:34:20

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Christoph Hellwig wrote:

> On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote:
> > this is the -V4 of the mutex subsystem patch-queue. It consists of the
> > following patches:
> >
> > add-atomic-xchg.patch
> > add-atomic-call-func-i386.patch
> > add-atomic-call-func-x86_64.patch
> > add-atomic-call-wrappers-rest.patch
> > mutex-core.patch
> > mutex-switch-arm-to-xchg.patch
> > mutex-debug.patch
> > mutex-debug-more.patch
> > xfs-mutex-namespace-collision-fix.patch
> >
> > the patches are against Linus' latest GIT tree, and they should work
> > fine on every Linux architecture.
> >
> > Changes since -V3:
> >
> > - imlemented an atomic_xchg() based mutex implementation. It integrated
> > pretty nicely into the generic code, and most of the code is still
> > shared.
> >
> > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines
> > this then the generic mutex code will switch to the atomic_xchg()
> > implementation.
> >
> > This should be conceptually equivalent to the variant Nicolas Pitre
> > posted - Nicolas, could you check out this one? It's much easier to
> > provide this in the generic implementation, and the code ends up
> > looking cleaner.
> >
> > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for
> > architectures to override the generic code anymore, with the
> > introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
> >
> > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.
>
> I must admit I really really hat __ARCH_ stuff if we can avoid it.
> An <asm/mutex.h> that usually includes two asm-generic variants is probably
> a much better choice.
>
> Actually just havign asm/mutex.h implement the faspath per-arch and get
> rid of all the oddball atomic.h additions would be even better. While
> this means we need per-arch code it also means the code is a lot easier
> understandable, and we don't add odd public APIs.

I'm with Christoph here. Please preserve my
arch_mutex_fast_lock/arch_mutex_fast_unlock helpers. I did it that way
because the most important thing they bring is flexibility where it is
needed i.e. in architecture specific implementations. And done that way
the architecture specific part is well abstracted with the minimum
semantics allowing flexibility in the implementation.

I insist on that because, even if ARM currently relies on the atomic
swap behavior, on ARMv6 at least this can be improved even further, but
a special implementation which is neither a fully qualified atomic
decrement nor an atomic swap is needed. That's why I insist that you
should keep my arch_mutex_fast_lock and friends (rename them if you
wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.


Nicolas

2005-12-22 15:39:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:

> There's no need for two sets of behaviour. What I'm saying is that we
> could add similar debugging features to semaphores (if useful). Yes, we
> would have to tell the kernel at the source level to defeat that debugging
> if a particular lock isn't being used as a mutex. That's rather less
> intrusive than adding a whole new type of lock. But I'd question the value
> even of doing that, given the general historical non-bugginess of existing
> semaphore users.

Semaphores need a counter, mutexes only a binary representation of the
locked/unlocked state, which makes it possible to use instructions like
xchg, swap, test_and_set_bit depending on what atomic operations are
available on the architecture. On many architectures this is more
efficient than the counter based implementation.

Also the wakeup rules allow smaller and faster implementations for
mutexes.

tglx


2005-12-22 15:41:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Nicolas Pitre <[email protected]> wrote:

> [...] on ARMv6 at least this can be improved even further, but a
> special implementation which is neither a fully qualified atomic
> decrement nor an atomic swap is needed. [...]

i'm curious, how would this ARMv6 solution look like, and what would be
the advantages over the atomic swap based variant?

Ingo

2005-12-22 16:32:16

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Ingo Molnar wrote:

>
> * Nicolas Pitre <[email protected]> wrote:
>
> > [...] on ARMv6 at least this can be improved even further, but a
> > special implementation which is neither a fully qualified atomic
> > decrement nor an atomic swap is needed. [...]
>
> i'm curious, how would this ARMv6 solution look like, and what would be
> the advantages over the atomic swap based variant?

On ARMv6 (which can be SMP) the atomic swap instruction is much more
costly than on former ARM versions. It however has ll/sc instructions
which allows it to implement a true atomic decrement, and the lock fast
path would look like:

__mutex_lock:
1: ldrex r1, [r0]
sub r1, r1, #1
strex r2, r1, [r0]
cmp r2, #0
bne 1b
cmp r1, #0
moveq pc, lr
b __mutex_lock_failed

With my ARMv6 implementation of arch_mutex_fast_lock() then it would
become:

__mutex_lock:
ldrex r1, [r0]
sub r1, r1, #1
strex r2, r1, [r0]
orrs r0, r2, r1
moveq pc, lr
b __mutex_lock_failed

This code sequence is not possible with any of the standard atomic
primitives. And the above would work even for the
arch_mutex_fast_lock_retval() used for mutex_lock_interruptible.

Giving complete freedom to the architecture in implementing those could
benefit architectures where disabling preemption while performing the
lock instead of attempting any true atomic operation would be cheaper
(Linus argued about that previously). And if we are UP with preemption
disabled then the lock could possibly be simplified even further if
someone dare to.

But those implementation issues don't belong in the common core code at
all IMHO.


Nicolas

2005-12-22 16:44:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Nicolas Pitre <[email protected]> wrote:

> > i'm curious, how would this ARMv6 solution look like, and what would be
> > the advantages over the atomic swap based variant?
>
> On ARMv6 (which can be SMP) the atomic swap instruction is much more
> costly than on former ARM versions. It however has ll/sc instructions
> which allows it to implement a true atomic decrement, and the lock
> fast path would look like: [...]

but couldnt you implement atomic_dec_return() with the ll/sc
instructions? Something like:

repeat:
ldrex r1, [r0]
sub r1, r1, #1
strex r2, r1, [r0]
orrs r0, r2, r1
jneq repeat

(shot-in-the-dark guess at ARMv6 assembly)

hm?

Ingo

2005-12-22 16:58:52

by Russell King

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 05:44:15PM +0100, Ingo Molnar wrote:
> * Nicolas Pitre <[email protected]> wrote:
> > > i'm curious, how would this ARMv6 solution look like, and what would be
> > > the advantages over the atomic swap based variant?
> >
> > On ARMv6 (which can be SMP) the atomic swap instruction is much more
> > costly than on former ARM versions. It however has ll/sc instructions
> > which allows it to implement a true atomic decrement, and the lock
> > fast path would look like: [...]
>
> but couldnt you implement atomic_dec_return() with the ll/sc
> instructions? Something like:
>
> repeat:
> ldrex r1, [r0]
> sub r1, r1, #1
> strex r2, r1, [r0]
> orrs r0, r2, r1
> jneq repeat
>
> (shot-in-the-dark guess at ARMv6 assembly)

atomic_dec_return() would be:

1: ldrex r1, [r0]
sub r1, r1, #1
strex r2, r1, [r0]
teq r2, #0
bne 1b
@ result in r1

But that's not really the main point Nico's making. Yes, on ARMv6
there is little difference. However, ARMv6 is _not_ mainstream yet.
The previous generation which do not have this is currently mainstream.

When it comes down to it, unlike x86 land where new CPU features are
taken up very quickly, the take up of new features on ARM CPUs is
a lot slower - it's a matter of years not a matter of months.
Therefore, we can expect ARMv5 architecture CPUs to be mainstream
at least for the next year or two, and these are the ones which
we should optimise for.

Nico's point still stands though - and I'd like to ask a more direct
question. There is an efficient implementation for ARMv5 CPUs which
it appears we're being denied the ability to use.

Given that this has been argued for using clear technical arguments
over the last two days, I have yet to see any explaination why you're
choosing to ignore it. Could you please explain why?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-22 16:58:35

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Ingo Molnar wrote:

>
> * Nicolas Pitre <[email protected]> wrote:
>
> > > i'm curious, how would this ARMv6 solution look like, and what would be
> > > the advantages over the atomic swap based variant?
> >
> > On ARMv6 (which can be SMP) the atomic swap instruction is much more
> > costly than on former ARM versions. It however has ll/sc instructions
> > which allows it to implement a true atomic decrement, and the lock
> > fast path would look like: [...]
>
> but couldnt you implement atomic_dec_return() with the ll/sc
> instructions? Something like:

NO. My first example was atomic_dec_return based. The second is
lighter and fulfill the semantics of arch_mutex_fast_lock() but is not a
common atomic primitive.


Nicolas

2005-12-22 17:17:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 05:44:13AM -0800, Andrew Morton wrote:
> > You keep saying 10 times "so please enhance semaphores to do this".
>
> Well I think diligence requires that we be able to demonstrate why it's not
> possible.
>
> It's plain dumb for us to justify a fancy-pants new system based on
> features which we could have added to the old one, no?

But why should we add features to the semaphores. There's very little
users of those real semaphore semantics, and they could do with a generic,
all-C implementation because they are not important fast-pathes. OTOH
we have lots of places needing plain mutex semantics, that are important
fastpathes. Let's optimize for the common case instead of the corner
case.

2005-12-22 17:20:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 10:34:18AM -0500, Nicolas Pitre wrote:
> I'm with Christoph here. Please preserve my
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers. I did it that way
> because the most important thing they bring is flexibility where it is
> needed i.e. in architecture specific implementations. And done that way
> the architecture specific part is well abstracted with the minimum
> semantics allowing flexibility in the implementation.
>
> I insist on that because, even if ARM currently relies on the atomic
> swap behavior, on ARMv6 at least this can be improved even further, but
> a special implementation which is neither a fully qualified atomic
> decrement nor an atomic swap is needed. That's why I insist that you
> should keep my arch_mutex_fast_lock and friends (rename them if you
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.

I think one of us should so a new version based on that scheme and without
all the new atomic helpers, then we can compare it against the current
version. I'll try to once I'll get some time.

2005-12-22 17:33:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:

> > Actually just havign asm/mutex.h implement the faspath per-arch and get
> > rid of all the oddball atomic.h additions would be even better. While
> > this means we need per-arch code it also means the code is a lot easier
> > understandable, and we don't add odd public APIs.
>
> I'm with Christoph here. Please preserve my
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers. I did it that way
> because the most important thing they bring is flexibility where it is
> needed i.e. in architecture specific implementations. And done that way
> the architecture specific part is well abstracted with the minimum
> semantics allowing flexibility in the implementation.
>
> I insist on that because, even if ARM currently relies on the atomic
> swap behavior, on ARMv6 at least this can be improved even further, but
> a special implementation which is neither a fully qualified atomic
> decrement nor an atomic swap is needed. That's why I insist that you
> should keep my arch_mutex_fast_lock and friends (rename them if you
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.
>

Not sure how well this is accepted, but would it be acceptable to have
the mutex_lock and friends covered with the (weak) attribute?

ie.

void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)

Then let the archs override them if they wish?

You would just need to make an extra slow path mutex visible to the archs.

-- Steve


2005-12-22 17:41:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4



On Thu, 22 Dec 2005, Thomas Gleixner wrote:
>
> Semaphores need a counter, mutexes only a binary representation of the
> locked/unlocked state

Actually, that's not true.

A _spinlock_ only needs a binary representation of the locked/unlocked
state.

A mutex needs a _ternary_ representation. It needs an additional
"contention" state to tell the wakeup that extra action is needed.

If you don't handle contention (and do extra action all the time), you're
screwed from a performance standpoint.

Linus

2005-12-22 18:24:43

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Steven Rostedt wrote:

> On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:
>
> > > Actually just havign asm/mutex.h implement the faspath per-arch and get
> > > rid of all the oddball atomic.h additions would be even better. While
> > > this means we need per-arch code it also means the code is a lot easier
> > > understandable, and we don't add odd public APIs.
> >
> > I'm with Christoph here. Please preserve my
> > arch_mutex_fast_lock/arch_mutex_fast_unlock helpers. I did it that way
> > because the most important thing they bring is flexibility where it is
> > needed i.e. in architecture specific implementations. And done that way
> > the architecture specific part is well abstracted with the minimum
> > semantics allowing flexibility in the implementation.
> >
> > I insist on that because, even if ARM currently relies on the atomic
> > swap behavior, on ARMv6 at least this can be improved even further, but
> > a special implementation which is neither a fully qualified atomic
> > decrement nor an atomic swap is needed. That's why I insist that you
> > should keep my arch_mutex_fast_lock and friends (rename them if you
> > wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.
> >
>
> Not sure how well this is accepted, but would it be acceptable to have
> the mutex_lock and friends covered with the (weak) attribute?
>
> ie.
>
> void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)
>
> Then let the archs override them if they wish?

Actually, my next request would be for architectures to actually inline
the fast path if they see benefit. On ARM, given the tight fast path
allowed by mutexes, it is indeed beneficial to inline them (the last
patch of my latest serie does just that). So weak symbols are not
really useful in that case.


Nicolas

2005-12-22 20:09:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


On Thu, 22 Dec 2005, Linus Torvalds wrote:
>
>
> On Thu, 22 Dec 2005, Thomas Gleixner wrote:
> >
> > Semaphores need a counter, mutexes only a binary representation of the
> > locked/unlocked state
>
> Actually, that's not true.
>
> A _spinlock_ only needs a binary representation of the locked/unlocked
> state.
>
> A mutex needs a _ternary_ representation. It needs an additional
> "contention" state to tell the wakeup that extra action is needed.

True, and that's exactly what Ingo has.

1 unlocked, 0 locked, -1 locked with waiters. But it still works well
with xchg.

-- Steve

>
> If you don't handle contention (and do extra action all the time), you're
> screwed from a performance standpoint.
>
> Linus
>

2005-12-22 21:05:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Russell King <[email protected]> wrote:

> > but couldnt you implement atomic_dec_return() with the ll/sc
> > instructions? Something like:
> >
> > repeat:
> > ldrex r1, [r0]
> > sub r1, r1, #1
> > strex r2, r1, [r0]
> > orrs r0, r2, r1
> > jneq repeat
> >
> > (shot-in-the-dark guess at ARMv6 assembly)
>
> atomic_dec_return() would be:
>
> 1: ldrex r1, [r0]
> sub r1, r1, #1
> strex r2, r1, [r0]
> teq r2, #0
> bne 1b
> @ result in r1
>
> But that's not really the main point Nico's making. Yes, on ARMv6
> there is little difference. However, ARMv6 is _not_ mainstream yet.
> The previous generation which do not have this is currently
> mainstream.

i think there is some miscommunication here. I am actually on your side,
and i spent all my day enabling the best mutex variant on both the v5
and v6 version of your CPU. What we were doing was to discuss the
details of getting there. So dont shoot the messenger, ok?

This is Nico's point i replied to:

> > [...] on ARMv6 at least this can be improved even further, but a
> > special implementation which is neither a fully qualified atomic
> > decrement nor an atomic swap is needed. [...]

and this was my reply:

> i'm curious, how would this ARMv6 solution look like, and what would
> be the advantages over the atomic swap based variant?

to which Nico replied with an ll/sc implementation, which very much
looked like atomic_dec_return(). That's all i said. Nobody is trying to
shut out anything from anywhere, and certainly not me. I am _enabling_
your stuff. I'm just trying to find the most maintainable and still most
flexible solution.

> Nico's point still stands though - and I'd like to ask a more direct
> question. There is an efficient implementation for ARMv5 CPUs which
> it appears we're being denied the ability to use.

do you realize that i've enabled this efficient implementation via
CONFIG_MUTEX_XCHG_ALGORITHM? Do you realize that you _dont_ have to use
extremely heavy IRQ-disabling ops on ARM to enable mutexes? Do you
realize that what we are down to now are 1-2 instructions of
differences?

The discussion was not about ARMv5 at all. The discussion was about the
claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it
isnt magical, it's a sane CPU that can implement a sane
atomic_dec_return(), or even better, a sane atomic_*_call_if_*()
primitive. Or whatever primitive we end up having - i dont mind if it's
called __mutex_whatever, as long as it has a _well defined_ meaning.

what i _DONT_ want is some over-opaque per-arch thing that will again
escallate into the same situation as semaphores: 23 different
implementations nobody is able to change at once, nobody is able to add
features or debugging to, and by today there's probably is no single
person on this planet who knows all 23 of them to begin with.

if you sense me trying to avoid something then that's it: ambiguities in
the arch-level implementation, because they end up stiffling the generic
code.

Ingo

2005-12-22 21:26:49

by Russell King

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 10:04:46PM +0100, Ingo Molnar wrote:
> i think there is some miscommunication here.

Quite possibly. TBH there's soo much waffle generated on the mutex
stuff over the last few days that I've probably lost track of where
everyone is on it. Sorry.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-22 21:27:48

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Ingo Molnar wrote:

> The discussion was not about ARMv5 at all. The discussion was about the
> claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it
> isnt magical, it's a sane CPU that can implement a sane
> atomic_dec_return(), or even better, a sane atomic_*_call_if_*()
> primitive. Or whatever primitive we end up having - i dont mind if it's
> called __mutex_whatever, as long as it has a _well defined_ meaning.

I'd like to point out that, while atomic_dec_call_if_* is really nice on
i386, it is probably only good for i386 since no other architecture
will be able to provide a better implementation than what can be done
with atomic_dec_return() anyway. Yet that IMHO overloaded
atomic_dec_call_if_* stuff appears in core code.

But like for i386, those other architectures might be able to do some
other tricks to achieve the same needed semantics. My ARMv6 is one of
them (and no it doesn't strictly follows the semantics of
atomic_dec_call_if_*). Are you willing to add more #if
defined(CONFIG_MUTEX_FOO_ALGO) in the core code as time goes by? I hope
not.

> what i _DONT_ want is some over-opaque per-arch thing that will again
> escallate into the same situation as semaphores: 23 different
> implementations nobody is able to change at once, nobody is able to add
> features or debugging to, and by today there's probably is no single
> person on this planet who knows all 23 of them to begin with.

I agree completely with you. But what I don't want is core code
needlessly restricting architecture specific implementation for short
and well defined fast paths. Maybe that's where we must agree upon: a
well defined fast path helper for mutexes?


Nicolas

2005-12-22 21:37:48

by Nicolas Pitre

[permalink] [raw]
Subject: [patch 1/2] mutex subsystem: basic per arch fast path primitives

On Thu, 22 Dec 2005, Nicolas Pitre wrote:

> On Thu, 22 Dec 2005, Ingo Molnar wrote:
>
> > what i _DONT_ want is some over-opaque per-arch thing that will again
> > escallate into the same situation as semaphores: 23 different
> > implementations nobody is able to change at once, nobody is able to add
> > features or debugging to, and by today there's probably is no single
> > person on this planet who knows all 23 of them to begin with.
>
> I agree completely with you. But what I don't want is core code
> needlessly restricting architecture specific implementation for short
> and well defined fast paths. Maybe that's where we must agree upon: a
> well defined fast path helper for mutexes?

OK, let's look at actual code please. Do you have anything against this
and the following patches?

This patch adds fast path mutex primitives for all architectures. It
replaces your atomic_*_call_if_* patches that didn't seem to please
everybody, mutex usage notwithstanding.

Comments?

Index: linux-2.6/include/asm-alpha/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-alpha/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-arm26/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm26/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-cris/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-cris/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-frv/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-frv/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-generic/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/mutex.h
@@ -0,0 +1,77 @@
+/*
+ * asm-generic/mutex.h
+ *
+ * Generic wrappers for architecture specific low-level mutex
+ * fast path locking and unlocking. Each architecture is welcome
+ * to provide optimized versions for those.
+ */
+
+#ifndef _ASM_GENERIC_MUTEX_H
+#define _ASM_GENERIC_MUTEX_H
+
+#include <asm/atomic.h>
+
+/**
+ * __mutex_fastpath_lock - lock mutex and call function if already locked
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was already locked
+ *
+ * Atomically locks @v and calls a function if @v was already locked.
+ * When @v == 1 it is unlocked, <= 0 means locked.
+ */
+#define __mutex_fastpath_lock(v, contention_fn) \
+do { \
+ if (atomic_xchg(v, 0) != 1) \
+ contention_fn(v); \
+} while (0)
+
+/**
+ * __mutex_fastpath_lock_retval - lock mutex and call function if already locked
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was already locked
+ *
+ * Atomically locks @v and calls a function if @v was already locked.
+ * When @v == 1 it is unlocked, <= 0 means locked.
+ *
+ * It also returns 0 if the lock was successful (@v wasn't locked), otherwise
+ * it returns what @contention_fn returned.
+ */
+#define __mutex_fastpath_lock_retval(v, contention_fn) \
+({ \
+ int __retval = 0; \
+ if (atomic_xchg(v, 0) != 1) \
+ __retval = contention_fn(v); \
+ __retval; \
+})
+
+/**
+ * __mutex_fastpath_unlock - unlock and call function if contended
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was contended
+ *
+ * Atomically unlocks @v and calls a function if @v was contended.
+ * When @v == 1 it is unlocked, 0 it is locked, any negative value means
+ * locked with contention.
+ *
+ * If @v was contended, its new value is implementation defined (it can be
+ * either locked or unlocked). The contention function (slow path) should
+ * use __mutex_slowpath_needs_unlock() to determine if the mutex needs
+ * manual unlocking.
+ */
+#define __mutex_fastpath_unlock(v, contention_fn) \
+do { \
+ if (atomic_xchg(v, 1) != 0) \
+ contention_fn(v); \
+} while (0)
+
+/**
+ * __mutex_slowpath_needs_unlock - tell the state of a contended unlock
+ *
+ * This is meant to be called from any contention function passed to
+ * __mutex_fastpath_unlock(). It tells the contention (slow path) function
+ * whether or not it has to unlock the mutex manually after unlocking a
+ * contended lock was attempted.
+ */
+#define __mutex_slowpath_needs_unlock() (0) /* xchg always unlocks */
+
+#endif
Index: linux-2.6/include/asm-h8300/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-h8300/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-i386/mutex.h
@@ -0,0 +1,70 @@
+/*
+ * asm-i386/mutex.h
+ *
+ * Optimized i386 low-level mutex fast path locking and unlocking.
+ */
+
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+
+#include <asm/atomic.h>
+
+/*
+ * Please look into asm-generic/mutex.h for a description of semantics
+ * expected for those.
+ */
+
+#define __mutex_fastpath_lock(v, contention_fn) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention_fn; \
+ unsigned int dummy; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "decl (%%eax)\n" \
+ "js 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention_fn"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ :"=a"(dummy) \
+ :"a" (v) \
+ :"memory", "ecx", "edx"); \
+} while (0)
+
+#define __mutex_fastpath_lock_retval(v, contention_fn) \
+({ \
+ int __retval = 0; \
+ if (unlikely(atomic_dec_return(v) < 0)) \
+ __retval = contention_fn(v); \
+ __retval; \
+})
+
+#define __mutex_fastpath_unlock(v, contention_fn) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention_fn; \
+ unsigned int dummy; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "incl (%%eax)\n" \
+ "jle 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention_fn"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ :"=a" (dummy) \
+ :"a" (v) \
+ :"memory", "ecx", "edx"); \
+} while (0)
+
+/* In the contended case, the unlock doesn't actually unlock */
+#define __mutex_slowpath_needs_unlock() (1)
+
+#endif
Index: linux-2.6/include/asm-ia64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ia64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m32r/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m32r/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68k/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68k/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68knommu/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68knommu/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-mips/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-mips/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-parisc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-parisc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-powerpc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-powerpc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-s390/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-s390/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-um/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-um/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-v850/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-v850/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -0,0 +1,72 @@
+/*
+ * asm-x86_64/mutex.h
+ *
+ * Optimized x86_64 low-level mutex fast path locking and unlocking.
+ */
+
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+
+#include <asm/atomic.h>
+
+/*
+ * Please look into asm-generic/mutex.h for a description of semantics
+ * expected for those.
+ */
+
+#define __mutex_fastpath_lock(v, contention_fn) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention_fn; \
+ unsigned long dummy; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "decl (%%rdi)\n" \
+ "js 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention_fn"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ :"=D" (dummy) \
+ :"D" (v) \
+ :"rax", "rsi", "rdx", "rcx", \
+ "r8", "r9", "r10", "r11", "memory"); \
+} while (0)
+
+#define __mutex_fastpath_lock_retval(v, contention_fn) \
+({ \
+ int __retval = 0; \
+ if (unlikely(atomic_dec_return(v) < 0)) \
+ __retval = contention_fn(v); \
+ __retval; \
+})
+
+#define __mutex_fastpath_unlock(v, contention_fn) \
+do { \
+ fastcall void (*__tmp)(atomic_t *) = contention_fn; \
+ unsigned long dummy; \
+ \
+ (void)__tmp; \
+ typecheck(atomic_t *, v); \
+ \
+ __asm__ __volatile__( \
+ LOCK "incl (%%rdi)\n" \
+ "jle 2f\n" \
+ "1:\n" \
+ LOCK_SECTION_START("") \
+ "2: call "#contention_fn"\n\t" \
+ "jmp 1b\n" \
+ LOCK_SECTION_END \
+ :"=D" (dummy) \
+ :"D" (v) \
+ :"rax", "rsi", "rdx", "rcx", \
+ "r8", "r9", "r10", "r11", "memory"); \
+} while (0)
+
+/* In the contended case, the unlock doesn't actually unlock */
+#define __mutex_slowpath_needs_unlock() (1)
+
+#endif
Index: linux-2.6/include/asm-xtensa/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-xtensa/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>

2005-12-22 21:41:03

by Nicolas Pitre

[permalink] [raw]
Subject: [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines


This switch the core over to the architecture defined fast path locking
primitives. It also adds __mutex_lock_interruptible_noinline and
friends.

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -266,13 +266,8 @@ __mutex_lock_interruptible_nonatomic(str
*/
int fastcall mutex_trylock(struct mutex *lock)
{
-#ifdef __HAVE_ARCH_CMPXCHG
if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
return 1;
-#else
- if (atomic_dec_return(&lock->count) == 0)
- return 1;
-#endif
return 0;
}

@@ -316,13 +311,14 @@ static inline void __mutex_unlock_nonato
* Waiters take care of themselves and stay in flight until
* necessary.
*
- * (in the xchg based implementation the fastpath has set the
+ * (depending on the implementation the fastpath has set the
* count to 1 already, so we must not set it here, because we
* dont own the lock anymore. In the debug case we must set
* the lock inside the spinlock.)
*/
-#if !defined(CONFIG_MUTEX_XCHG_ALGORITHM) && !defined(CONFIG_DEBUG_MUTEXES)
- atomic_set(&lock->count, 1);
+#if !defined(CONFIG_DEBUG_MUTEXES)
+ if (__mutex_slowpath_needs_unlock())
+ atomic_set(&lock->count, 1);
#endif
spin_lock_mutex(&lock->wait_lock);
#ifdef CONFIG_DEBUG_MUTEXES
@@ -349,21 +345,12 @@ static inline void __mutex_unlock_nonato
static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));

/*
- * Some architectures do not have fast dec_and_test atomic primitives,
- * for them we are providing an atomic_xchg() based mutex implementation,
- * if they enable CONFIG_MUTEX_XCHG_ALGORITHM.
- *
* The locking fastpath is the 1->0 transition from 'unlocked' into
* 'locked' state:
*/
static inline void __mutex_lock_atomic(struct mutex *lock)
{
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
- if (unlikely(atomic_xchg(&lock->count, 0) != 1))
- __mutex_lock_noinline(&lock->count);
-#else
- atomic_dec_call_if_negative(&lock->count, __mutex_lock_noinline);
-#endif
+ __mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
}

/*
@@ -383,16 +370,23 @@ static inline void __mutex_lock(struct m
__mutex_lock_atomic(lock);
}

+static __sched int FASTCALL(__mutex_lock_interruptible_noinline(atomic_t *lock_count));
+
+static int __mutex_lock_interruptible_atomic(struct mutex *lock)
+{
+ __mutex_fastpath_lock_retval(&lock->count, __mutex_lock_interruptible_noinline);
+}
+
+static int fastcall __sched __mutex_lock_interruptible_noinline(atomic_t *lock_count)
+{
+ struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+ return __mutex_lock_interruptible_nonatomic(lock);
+}
+
static inline int __mutex_lock_interruptible(struct mutex *lock)
{
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
- if (unlikely(atomic_xchg(&lock->count, 0) != 1))
- return __mutex_lock_interruptible_nonatomic(lock);
-#else
- if (unlikely(atomic_dec_return(&lock->count) < 0))
- return __mutex_lock_interruptible_nonatomic(lock);
-#endif
- return 0;
+ __mutex_lock_interruptible_atomic(lock);
}

static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
@@ -403,12 +397,7 @@ static void __sched FASTCALL(__mutex_unl
*/
static inline void __mutex_unlock_atomic(struct mutex *lock)
{
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
- if (unlikely(atomic_xchg(&lock->count, 1) != 0))
- __mutex_unlock_noinline(&lock->count);
-#else
- atomic_inc_call_if_nonpositive(&lock->count, __mutex_unlock_noinline);
-#endif
+ __mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
}

static void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -12,6 +12,7 @@
*/
#include <linux/list.h>
#include <linux/spinlock_types.h>
+#include <asm/mutex.h>

#include <asm/atomic.h>

2005-12-22 21:43:14

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Andrew Morton writes:

> Ingo Molnar <[email protected]> wrote:
> > - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
> > 'struct mutex' is 16 bytes. A smaller structure size means less RAM
> > footprint, and better CPU-cache utilization.
>
> Because of the .sleepers thing. Perhaps a revamped semaphore wouldn't need
> thsat field?

Note that semaphores on 32-bit PPC are only 16 bytes already, since I
got rid of the sleepers field ages ago. The fast path is just
atomic_dec/inc, and the slow path needs an atomic op that does

x = max(x, 0) + inc

atomically and returns the old value of x. That's easy to do with
lwarx/stwcx (just two more instructions than an atomic inc), and can
also be done quite straightforwardly with cmpxchg. Alpha, mips, s390
and sparc64 also use this scheme.

In fact I would go so far as to say that I cannot see how it would be
possible to do a mutex any smaller or faster than a counting semaphore
on these architectures.

Regards,
Paul.

2005-12-22 21:53:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 1/2] mutex subsystem: basic per arch fast path primitives

> OK, let's look at actual code please. Do you have anything against this
> and the following patches?
>
> This patch adds fast path mutex primitives for all architectures. It
> replaces your atomic_*_call_if_* patches that didn't seem to please
> everybody, mutex usage notwithstanding.

Thanks Nico, this is exactly what I wanted to see.

One question about the naming of the arch helpers, do we really need the
fastpath in there? Or just __mutex_* ?

2005-12-22 21:55:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Nicolas Pitre <[email protected]> wrote:

> I'd like to point out that, while atomic_dec_call_if_* is really nice
> on i386, it is probably only good for i386 since no other architecture
> will be able to provide a better implementation than what can be done
> with atomic_dec_return() anyway. Yet that IMHO overloaded
> atomic_dec_call_if_* stuff appears in core code.

i'd have no problem with going to atomic_dec_return() on i386 too.
atomic_dec_call_if_*() is just working around a gcc limitation: there's
no way to pass a condition from inline assembly into C code, it has to
go over a register. But the difference is small, just 1 extra 'sete'
instruction. So x86 would be just fine with atomic_dec_return() too.

anyway, it seems like everyone would like to hack a few instructions
from the fastpath, so we might as well go with your approach. As long as
the state is well-defined (and it has to be well-defined because it all
hits the common slowpath), and the function names are descriptive of
what happens, it's fine with me.

Ingo

2005-12-22 23:31:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Alan Cox <[email protected]> wrote:
>
> On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:
> > > semaphores have had a lot of work for the last... 10 years. To me that
> > > is a sign that maybe they're close to their limit already.
> >
> > No they haven't - they're basically unchanged from four years ago (at
> > least).
>
> Point still holds

No it does not.

Ingo's work has shown us two things:

a) semaphores use more kernel text than they should and

b) semaphores are less efficient than they could be.

Fine. Let's update the semaphore implementation to fix those things.
Nobody has addressed this code in several years. If we conclusively cannot
fix these things then that's the time to start looking at implementing new
locking mechanisms.

> > It's plain dumb for us to justify a fancy-pants new system based on
> > features which we could have added to the old one, no?
>
> The old one does one job well. Mutex wakeups are very different in
> behaviour because of the single waker rule and the fact you can do stuff
> like affine wakeups.
>
> You could make it one function but it would inevitably end up full of
>
> if(mutex) {
> }

We'd only need such constructs if we were trying to add all the mutex
runtime debugging features to semaphores. I don't think that's likely to
be very useful so fine, let's not do that.

> Oh and of course you no doubt break the semaphores while doing it, while
> at least this seperate way as Linus suggested you don't break the
> existing functionality.

Linus would not be averse to a patch which optimises semaphores.

> > There's no need for two sets of behaviour.
>
> The fundamental reason for mutexes in terms of performance and
> scalability requires the different wake up behaviours. The performance
> numbers are pretty compelling.

This where I have a bad feeling that I'm missing some vital clue.

An efficient semaphore implementation will wake a single process per up().
Just like mutexes. So why should mutexes have any performance advantage?

2005-12-22 23:34:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
> No it does not.
>
> Ingo's work has shown us two things:
>
> a) semaphores use more kernel text than they should and
>
> b) semaphores are less efficient than they could be.
>
> Fine. Let's update the semaphore implementation to fix those things.
> Nobody has addressed this code in several years. If we conclusively cannot
> fix these things then that's the time to start looking at implementing new
> locking mechanisms.

c) semaphores are total overkill for 99% percent of the users. Remember
this thing about optimizing for the common case?

Pretty much everywhere we do want mutex semantic. So let's have a proper
primitive exactly for that, and we can keep the current semaphore
implementation (with a much simpler implementation) for that handfull of
users in the kernel that really want a counting semaphore.

I really don't get why you hate mutex primitives so much.

2005-12-22 23:49:20

by Sean

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said:
> On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
>> No it does not.
>>
>> Ingo's work has shown us two things:
>>
>> a) semaphores use more kernel text than they should and
>>
>> b) semaphores are less efficient than they could be.
>>
>> Fine. Let's update the semaphore implementation to fix those things.
>> Nobody has addressed this code in several years. If we conclusively
>> cannot
>> fix these things then that's the time to start looking at implementing
>> new
>> locking mechanisms.
>
> c) semaphores are total overkill for 99% percent of the users. Remember
> this thing about optimizing for the common case?
>
> Pretty much everywhere we do want mutex semantic. So let's have a proper
> primitive exactly for that, and we can keep the current semaphore
> implementation (with a much simpler implementation) for that handfull of
> users in the kernel that really want a counting semaphore.
>
> I really don't get why you hate mutex primitives so much.
>

Yes it's hard to figure. It seems to be deeper than just hating mutex
primitives, he hates the timer core updates that come from Ingo too; this
may be a general dislike for all things -rt.

Sean

2005-12-22 23:53:54

by Randy Dunlap

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Sean wrote:

> On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said:
> > On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
> >> No it does not.
> >>
> >> Ingo's work has shown us two things:
> >>
> >> a) semaphores use more kernel text than they should and
> >>
> >> b) semaphores are less efficient than they could be.
> >>
> >> Fine. Let's update the semaphore implementation to fix those things.
> >> Nobody has addressed this code in several years. If we conclusively
> >> cannot
> >> fix these things then that's the time to start looking at implementing
> >> new
> >> locking mechanisms.
> >
> > c) semaphores are total overkill for 99% percent of the users. Remember
> > this thing about optimizing for the common case?
> >
> > Pretty much everywhere we do want mutex semantic. So let's have a proper
> > primitive exactly for that, and we can keep the current semaphore
> > implementation (with a much simpler implementation) for that handfull of
> > users in the kernel that really want a counting semaphore.
> >
> > I really don't get why you hate mutex primitives so much.
> >
>
> Yes it's hard to figure. It seems to be deeper than just hating mutex
> primitives, he hates the timer core updates that come from Ingo too; this
> may be a general dislike for all things -rt.

Andrew can surely answer that, but it could be something as
simple as wanting to build a more stable kernel (one without
so much churn), so that things have time to mature and
improve without breaking so many other things...

This (current) is a hectic development cycle style.
--
~Randy

2005-12-23 00:00:26

by Sean

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, December 22, 2005 6:53 pm, Randy.Dunlap said:

> Andrew can surely answer that, but it could be something as
> simple as wanting to build a more stable kernel (one without
> so much churn), so that things have time to mature and
> improve without breaking so many other things...
>
> This (current) is a hectic development cycle style.

Sure, its probably that simple. Just seems like the techincal arguments
clearly support adding a mutex.

Sean

2005-12-23 00:00:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


On Thu, 22 Dec 2005, Randy.Dunlap wrote:
>
> Andrew can surely answer that, but it could be something as
> simple as wanting to build a more stable kernel (one without
> so much churn), so that things have time to mature and
> improve without breaking so many other things...

Not to mention that when this gets into the kernel, Andrew's the one that
will have to deal with a ton of "switch semaphores to mutexes" patches
that will slowly update the entire kernel. These patches will probably
continue to come in for a few years. ;)

IIRC, that was one of his rational reasons why he didn't want the timer
name change.

-- Steve

2005-12-23 14:24:32

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Thu, 22 Dec 2005, Andrew Morton wrote:

> Christoph Hellwig <[email protected]> wrote:
> > I really don't get why you hate mutex primitives so much.
>
> I've just spelled out in considerable detail why this work is premature.
> How can you not "get" it? Why do I have to keep saying the same thing in
> different ways? It's really quite simple.
>
>
> So here is permutation #4:
>
> If we can optimise semaphores for speed and space, the mutexes are
> *unneeded*.

How can't you get the fact that semaphores could _never_ be as simple as
mutexes? This is a theoritical impossibility, which maybe turns out not
to be so true on x86, but which is damn true on ARM where the fast path
(the common case of a mutex) is significantly more efficient.

Semaphores _require_ an atomic decrement, mutexes do not. On some
architectures that makes a huge difference.

> And I think we _should_ optimise semaphores for speed and space. Don't you?

No one disagrees with that.

> If we can do that, there is no point at all in adding a new lock type which
> has no speed advantage and no space advantage and which has less
> functionality than semaphores.

The very point is that mutexes will always have a speed advantage by
nature.

> And, repeating myself yet again: if we can demonstrate that it is not
> feasible to optimise semaphores to the same performance and space efficiency
> of mutexes then (and only then) we have a reason for adding mutexes.

I spent the whole week making that demonstration repeatedly. Why are
you ignoring me?


Nicolas

2005-12-23 14:52:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Nicolas Pitre <[email protected]> wrote:
>
> How can't you get the fact that semaphores could _never_ be as simple as
> mutexes? This is a theoritical impossibility, which maybe turns out not
> to be so true on x86, but which is damn true on ARM where the fast path
> (the common case of a mutex) is significantly more efficient.
>

I did notice your comments. I'll grant that mutexes will save some tens of
fastpath cycles on one minor architecture. Sorry, but that doesn't seem
very important.

2005-12-23 14:58:04

by Russell King

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> Nicolas Pitre <[email protected]> wrote:
> > How can't you get the fact that semaphores could _never_ be as simple as
> > mutexes? This is a theoritical impossibility, which maybe turns out not
> > to be so true on x86, but which is damn true on ARM where the fast path
> > (the common case of a mutex) is significantly more efficient.
>
> I did notice your comments. I'll grant that mutexes will save some tens of
> fastpath cycles on one minor architecture. Sorry, but that doesn't seem
> very important.

Wow.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-23 15:00:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


On Fri, 23 Dec 2005, Andrew Morton wrote:

> Nicolas Pitre <[email protected]> wrote:
> >
> > How can't you get the fact that semaphores could _never_ be as simple as
> > mutexes? This is a theoritical impossibility, which maybe turns out not
> > to be so true on x86, but which is damn true on ARM where the fast path
> > (the common case of a mutex) is significantly more efficient.
> >
>
> I did notice your comments. I'll grant that mutexes will save some tens of
> fastpath cycles on one minor architecture. Sorry, but that doesn't seem
> very important.
>

"minor architecture"? Granted, I don't know of any ARM desktops or
servers, but there's a large number of ARM devices out in the real world.
Or are we giving up on Linux being an embedded OS?

-- Steve

2005-12-23 15:05:17

by Xavier Bestel

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Fri, 2005-12-23 at 15:57, Russell King wrote:
> On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> > Nicolas Pitre <[email protected]> wrote:
> > > How can't you get the fact that semaphores could _never_ be as simple as
> > > mutexes? This is a theoritical impossibility, which maybe turns out not
> > > to be so true on x86, but which is damn true on ARM where the fast path
> > > (the common case of a mutex) is significantly more efficient.
> >
> > I did notice your comments. I'll grant that mutexes will save some tens of
> > fastpath cycles on one minor architecture. Sorry, but that doesn't seem
> > very important.
>
> Wow.

Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom
cycles are important and ARM is king.

Xav


2005-12-23 15:29:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Xavier Bestel <[email protected]> wrote:
>
> On Fri, 2005-12-23 at 15:57, Russell King wrote:
> > On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> > > Nicolas Pitre <[email protected]> wrote:
> > > > How can't you get the fact that semaphores could _never_ be as simple as
> > > > mutexes? This is a theoritical impossibility, which maybe turns out not
> > > > to be so true on x86, but which is damn true on ARM where the fast path
> > > > (the common case of a mutex) is significantly more efficient.
> > >
> > > I did notice your comments. I'll grant that mutexes will save some tens of
> > > fastpath cycles on one minor architecture. Sorry, but that doesn't seem
> > > very important.
> >
> > Wow.
>
> Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom
> cycles are important and ARM is king.
>

Please, spare me the rhetoric.

Adding a new locking primitive is a big deal. We need good reasons for
doing it. And no, I don't think a few cycles on ARM is good enough. I'd
be very surprised if it was measurable - the busiest semaphore is probably
i_sem and when it's taken we're also doing heavy filesystem operations
which would swamp any benefit. And if we're going to churn i_sem then we
perhaps should turn it into an rwsem so we can perform concurrent lookups
(at least). We do disk I/O with that thing held.

Look, I'm not wildly against mutexes - it's not exactly a large amount of
code. I just think they're fairly pointless and I regret that we seem to
be diving into adding yet another locking primitive without having fully
investigated improving the existing one.

2005-12-25 22:11:32

by Roman Zippel

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Hi,

On Friday 23 December 2005 00:34, Christoph Hellwig wrote:

> c) semaphores are total overkill for 99% percent of the users. Remember
> this thing about optimizing for the common case?

Semaphores are not that different from mutexes.
What makes me suspicious is the large difference in the test results, that
either means something is wrong with the test or something is wrong with the
semaphores. From reading the discussion I still don't really know, why the
improvements to mutexes can't be applied to semaphores. I also haven't hardly
seen any discussion about why semaphores the way they are. Linus did suspect
there is a wakeup bug in the semaphore, but there was no conclusive followup
to that.
IMO there are still too many questions open, so I can understand Andrew. We
may only cover up the problem instead of fixing it. I understand that mutexes
have advantages, but if we compare them to semaphores it should be a fair
comparison, otherwise people start to think semaphores are something bad. The
majority of the discussion has been about microoptimisation, but on some
archs non-debug mutexes and semaphores may very well be the same thing.

bye, Roman

2005-12-25 22:55:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Roman Zippel <[email protected]> wrote:

> > c) semaphores are total overkill for 99% percent of the users. Remember
> > this thing about optimizing for the common case?
>
> [...] I also haven't hardly seen any discussion about why semaphores
> the way they are. Linus did suspect there is a wakeup bug in the
> semaphore, but there was no conclusive followup to that.

no conclusive follow-up because ... they are too complex for people to
answer such questions off the cuff? Something so frequently used in
trivial ways should have the complexity of that typical use, not the
complexity of the theoretical use. There is no problem with semaphores,
other than that they are not being used as semaphores all that often.

for which i think there is a rather simple practical reason: if i want
to control a counted resource within the kernel, it is rarely the
simplest solution to use a semaphore for it, because a semaphore cannot
be used to protect data structures in the 'resource is available' stage
[i.e. when the semaphore count is above zero]. It does decrement the
counter atomically, but that is just half of the job i have to do.

to control (allocate/free) the resource i _have to_ add some other
locking mechanism anyway in most cases (a spinlock most likely, to
protect the internal list and other internal state) - at which point
it's simpler and faster to simply add a counter and a waitqueue to those
existing internal variables, than to add a separate locking object to
around (or within) the whole construct.

semaphores would be nice in theory, if there was a way to attach the
'decrement counter atomically' operation to another set of atomic ops,
like list_del() or list_add(), making the whole thing transactional.
[this would also be a wholly new API, so it only applies to semaphores
as a concept, not our actual semaphore incarnation] So i see the
theoretical beauty of semaphores, but in practice, CPUs force us to work
with much simpler constructs.

there are some exceptions: e.g. when the resource is _nothing else_ but
a count.

Ingo

2005-12-25 23:05:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Roman Zippel <[email protected]> wrote:
>
> IMO there are still too many questions open, so I can understand Andrew. We
> may only cover up the problem instead of fixing it. I understand that mutexes
> have advantages, but if we compare them to semaphores it should be a fair
> comparison, otherwise people start to think semaphores are something bad. The
> majority of the discussion has been about microoptimisation, but on some
> archs non-debug mutexes and semaphores may very well be the same thing.

Ingo has told me offline that he thinks that we can indeed remove the
current semaphore implementation.

90% of existing semaphore users get migrated to mutexes.

9% of current semaphore users get migrated to completions.

The remaining 1% of semaphore users are using the counting feature. We
reimplement that in a mostly-arch-independent fashion and then remove the
current semaphore implementation. Note that there are no sequencing
dependencies in all the above.

It's a lot of churn, but we'll end up with a better end result and a
somewhat-net-simpler kernel, so I'm happy.


One side point on semaphores and mutexes: the so-called "fast path" is
generally not performance-critical, because we just don't take them at high
frequencies. Any workload which involves taking a semaphore at more than
50,000-100,000 times/second tends to have ghastly overscheduling failure
scenarios on SMP. So people hit those scenarios and the code gets
converted to a lockless algorithm or to use spinlocking.

For example, for a while ext3/JBD was doing 200,000 context-switches per
second due to taking lock_super() at high frequencies. When I converted
the whole fs to use spin locking throughout the performance in some
workloads went up by 1000%.

Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
is _designed_ to hit that lock hard. So if/when i_sem is converted to a
mutex, I figure that the benefits to ARM in that workload will be about a
0.01% performance increase. ie: about two hours' worth of Moore's law in a
dopey microbenchmark.

For these reasons, I think that with sleeping locks, the fastpath is
realtively unimportant. What _is_ important is not screwing up the
slowpath and not taking the lock at too high a frequency. Because either
one will cause overscheduling which is a grossly worse problem.

Also, there's very little point in adding lots of tricky arch-dependent
code and generally mucking up the kernel source to squeeze the last drop of
performance out of the sleeping lock fastpath. Because if those changes
actually make a difference, we've already lost - the code needs to be
changed to use spinlocking.

2005-12-25 23:23:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Andrew Morton <[email protected]> wrote:

> One side point on semaphores and mutexes: the so-called "fast path" is
> generally not performance-critical, because we just don't take them at
> high frequencies. Any workload which involves taking a semaphore at
> more than 50,000-100,000 times/second tends to have ghastly
> overscheduling failure scenarios on SMP. So people hit those
> scenarios and the code gets converted to a lockless algorithm or to
> use spinlocking.
>
> For example, for a while ext3/JBD was doing 200,000 context-switches
> per second due to taking lock_super() at high frequencies. When I
> converted the whole fs to use spin locking throughout the performance
> in some workloads went up by 1000%.

actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock
conversion pain could have been avoided by converting ext3 to the mutex
code. Mutexes definitely do not overschedule, even in very high
frequency lock/unlock scenarios. They behave and perform quite close to
spinlocks. (which property is obviously a must for the -rt kernel, where
all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes -
providing a big playground for locking constructs)

hm, can you see any easy way for me to test my bold assertion on ext3,
by somehow moving/hacking it back to semaphores? I could then apply the
MUTEX_DEBUG_FULL mechanism to it and redo those ext3 semaphore vs.
spinlock benchmarks on a couple of boxes, and add a mutex column to the
table of numbers.

also, often it's simpler to use a sleeping lock than to use a spinlock
for something, so we want to have that option open.

Ingo

2005-12-26 10:37:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Ingo Molnar <[email protected]> wrote:
>
>
> * Andrew Morton <[email protected]> wrote:
>
> > One side point on semaphores and mutexes: the so-called "fast path" is
> > generally not performance-critical, because we just don't take them at
> > high frequencies. Any workload which involves taking a semaphore at
> > more than 50,000-100,000 times/second tends to have ghastly
> > overscheduling failure scenarios on SMP. So people hit those
> > scenarios and the code gets converted to a lockless algorithm or to
> > use spinlocking.
> >
> > For example, for a while ext3/JBD was doing 200,000 context-switches
> > per second due to taking lock_super() at high frequencies. When I
> > converted the whole fs to use spin locking throughout the performance
> > in some workloads went up by 1000%.
>
> actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock
> conversion pain could have been avoided by converting ext3 to the mutex
> code. Mutexes definitely do not overschedule, even in very high
> frequency lock/unlock scenarios. They behave and perform quite close to
> spinlocks. (which property is obviously a must for the -rt kernel, where
> all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes -
> providing a big playground for locking constructs)

hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
of a short spin is much less than the cost of a sleep/wakeup. The machine
was doing 100,000 - 200,000 context switches per second.

> hm, can you see any easy way for me to test my bold assertion on ext3,
> by somehow moving/hacking it back to semaphores?

Not really. The problem was most apparent after the lock_kernel() removal
patches. The first thing a CPU hit when it entered the fs was previously
lock_kernel(). That became lock_super() and performance went down the
tubes. From memory, the bad kernel was tip-of-tree as of Memorial Weekend
2003 ;)

I guess you could re-add all the lock_super()s as per 2.5.x's ext3/jbd,
check that it sucks running SDET on 8-way then implement the lock_super()s
via a mutex.

2005-12-26 10:43:08

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


> hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
> of a short spin is much less than the cost of a sleep/wakeup. The machine
> was doing 100,000 - 200,000 context switches per second.

interesting.. this might be a good indication that a "spin a bit first"
mutex slowpath for some locks might be worth implementing...


2005-12-26 11:12:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Arjan van de Ven <[email protected]> wrote:
>
>
> > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
> > of a short spin is much less than the cost of a sleep/wakeup. The machine
> > was doing 100,000 - 200,000 context switches per second.
>
> interesting.. this might be a good indication that a "spin a bit first"
> mutex slowpath for some locks might be worth implementing...

If we see a workload which is triggering such high context switch rates,
maybe. But I don't think we've seen any such for a long time.

2005-12-26 15:29:52

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Sun, 25 Dec 2005, Andrew Morton wrote:

> Ingo has told me offline that he thinks that we can indeed remove the
> current semaphore implementation.
>
> 90% of existing semaphore users get migrated to mutexes.
>
> 9% of current semaphore users get migrated to completions.
>
> The remaining 1% of semaphore users are using the counting feature. We
> reimplement that in a mostly-arch-independent fashion and then remove the
> current semaphore implementation. Note that there are no sequencing
> dependencies in all the above.

Great! I'm glad you're amenable to such changes. Especially the
"remove the current semaphore implementation" part many people are
complaining about for all sorts of reasons.

> Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> is _designed_ to hit that lock hard. So if/when i_sem is converted to a
> mutex, I figure that the benefits to ARM in that workload will be about a
> 0.01% performance increase. ie: about two hours' worth of Moore's law in a
> dopey microbenchmark.

There is not only the question of cycles. There is also kernel text
size. And _that_ is significant. You could argue that only adding a
function call for every mutex lock/unlock cannot be made smaller.
However on ARM a function call is quite a costly operation with frame
pointers, (and frame pointers are on by default on ARM otherwise the
kernel is undebuggable). So given that current semaphore fast path is
inlined to save on the function call overhead, and given that on ARM the
transition from semaphores to mutexes will shrink that from 8
instruction down to 3 instructions, or even 2 instructions in some
cases, will not only save cycles, but a bunch of kernel .text bytes as
well without compromize on the cycle count. This is therefore an all
win situation for ARM.

> Also, there's very little point in adding lots of tricky arch-dependent
> code and generally mucking up the kernel source to squeeze the last drop of
> performance out of the sleeping lock fastpath. Because if those changes
> actually make a difference, we've already lost - the code needs to be
> changed to use spinlocking.

Consider my text size argument above which is something I believe you
value more.


Nicolas

2005-12-26 17:15:30

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote:
>Arjan van de Ven <[email protected]> wrote:
> >
> >
> > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
> > > of a short spin is much less than the cost of a sleep/wakeup. The
> machine
> > > was doing 100,000 - 200,000 context switches per second.
> >
> > interesting.. this might be a good indication that a "spin a bit first"
> > mutex slowpath for some locks might be worth implementing...
>
>If we see a workload which is triggering such high context switch rates,
>maybe. But I don't think we've seen any such for a long time.

Hmm. Is there a real workload where such a high context switch rate is
necessary? Every time I've seen a high (100,000 - 200,000 is beyond absurd
on my little box, but...) context switch rate, it's been because something
sucked.

-Mike

2005-12-26 17:39:47

by Lee Revell

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Mon, 2005-12-26 at 18:15 +0100, Mike Galbraith wrote:
> At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote:
> >Arjan van de Ven <[email protected]> wrote:
> > >
> > >
> > > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
> > > > of a short spin is much less than the cost of a sleep/wakeup. The
> > machine
> > > > was doing 100,000 - 200,000 context switches per second.
> > >
> > > interesting.. this might be a good indication that a "spin a bit first"
> > > mutex slowpath for some locks might be worth implementing...
> >
> >If we see a workload which is triggering such high context switch rates,
> >maybe. But I don't think we've seen any such for a long time.
>
> Hmm. Is there a real workload where such a high context switch rate is
> necessary? Every time I've seen a high (100,000 - 200,000 is beyond absurd
> on my little box, but...) context switch rate, it's been because something
> sucked.

I can trivially produce 20K per second on my little sub Ghz box so 100K
on a busy server is certainly plausible. Especially if for the purposes
of this discussion we are also worried about -rt + IRQ threading where
each IRQ costs two context switches (more if it raises a softirq).

Lee

2005-12-26 18:17:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4



On Mon, 26 Dec 2005, Arjan van de Ven wrote:
>
> > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
> > of a short spin is much less than the cost of a sleep/wakeup. The machine
> > was doing 100,000 - 200,000 context switches per second.
>
> interesting.. this might be a good indication that a "spin a bit first"
> mutex slowpath for some locks might be worth implementing...

No, please don't.

Almost always it's a huge sign that the locking is totally broken.

And yes, the fix was to _fix_ the ext3 locking. Not to make semaphores
spin. The ext3 locking was using a semaphore for totally the wrong
reasons, it made zero sense at all.

I think it's fixed now.

Linus

2005-12-26 22:27:51

by Roman Zippel

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Hi,

On Sunday 25 December 2005 23:54, Ingo Molnar wrote:

> > [...] I also haven't hardly seen any discussion about why semaphores
> > the way they are. Linus did suspect there is a wakeup bug in the
> > semaphore, but there was no conclusive followup to that.
>
> no conclusive follow-up because ... they are too complex for people to
> answer such questions off the cuff? Something so frequently used in
> trivial ways should have the complexity of that typical use, not the
> complexity of the theoretical use. There is no problem with semaphores,
> other than that they are not being used as semaphores all that often.

It shouldn't be that out of the blue for you. I don't mind the whole concept
of mutexes and I agree that that's what most semaphores are used for. Please
stop for a moment trying to sell mutexes, the basic question I'd like to get
answered is, what is the worst-case scenerio if we convert everything to
mutexes?
To make it very clear: I'm not arguing against mutexes, I only want a look at
the complete picture, I don't only want to see the undoubted advantages, but
also what are the risks? Is the semaphore wakeup behaviour really only a bug
or does it fix some problem that just nobody remembers (and maybe even
doesn't exist anymore)? What about the fairness issues mentioned, how easy is
it to starve waiters?
Ingo, you're working on it already for a while, so I would expect you already
thought about possible problems already, so why don't you take a look at the
risks for us instead of just explaining the advantages? What are the chances
we end up with semaphores just under a different name? Are there other
possible problems, which then can be only solved e.g. by adding priority
inheritance?
The point of this is to be prepared for any predictable problem, since this
change in its consequence is rather huge and we don't have the luxury of a
single development tree anymore, which is used by most developers.

bye, Roman

2005-12-27 00:33:12

by David Lang

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

On Mon, 26 Dec 2005, Andrew Morton wrote:

> Arjan van de Ven <[email protected]> wrote:
>>
>>
>>> hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost
>>> of a short spin is much less than the cost of a sleep/wakeup. The machine
>>> was doing 100,000 - 200,000 context switches per second.
>>
>> interesting.. this might be a good indication that a "spin a bit first"
>> mutex slowpath for some locks might be worth implementing...
>
> If we see a workload which is triggering such high context switch rates,
> maybe. But I don't think we've seen any such for a long time.
>

how does 'spin a bit' interact with virutal CPU's (HT and equivalent).

it would seem to me that on a multi true-core system the spin-a-bit is a
win becouse it allows the other CPU's to release the lock, but on virtual
CPU's all the spinning would do is to delay the other virtual CPU's
running and therefor delay the lock getting released.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-12-27 14:42:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4


* Andrew Morton <[email protected]> wrote:

> > hm, can you see any easy way for me to test my bold assertion on ext3,
> > by somehow moving/hacking it back to semaphores?
>
> Not really. The problem was most apparent after the lock_kernel()
> removal patches. The first thing a CPU hit when it entered the fs was
> previously lock_kernel(). That became lock_super() and performance
> went down the tubes. From memory, the bad kernel was tip-of-tree as
> of Memorial Weekend 2003 ;)
>
> I guess you could re-add all the lock_super()s as per 2.5.x's
> ext3/jbd, check that it sucks running SDET on 8-way then implement the
> lock_super()s via a mutex.

ok - does the patch below look roughly ok as a really bad (but
functional) hack to restore that old behavior, for ext2?

Ingo

fs/ext2/balloc.c | 19 ++++++++++++++++---
fs/ext2/ialloc.c | 15 +++++++++++++--
2 files changed, 30 insertions(+), 6 deletions(-)

Index: linux/fs/ext2/balloc.c
===================================================================
--- linux.orig/fs/ext2/balloc.c
+++ linux/fs/ext2/balloc.c
@@ -139,7 +139,7 @@ static void release_blocks(struct super_
}
}

-static int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
+static int group_reserve_blocks(struct super_block *sb, struct ext2_sb_info *sbi, int group_no,
struct ext2_group_desc *desc, struct buffer_head *bh, int count)
{
unsigned free_blocks;
@@ -147,12 +147,15 @@ static int group_reserve_blocks(struct e
if (!desc->bg_free_blocks_count)
return 0;

+ lock_super(sb);
spin_lock(sb_bgl_lock(sbi, group_no));
free_blocks = le16_to_cpu(desc->bg_free_blocks_count);
if (free_blocks < count)
count = free_blocks;
desc->bg_free_blocks_count = cpu_to_le16(free_blocks - count);
spin_unlock(sb_bgl_lock(sbi, group_no));
+ unlock_super(sb);
+
mark_buffer_dirty(bh);
return count;
}
@@ -164,10 +167,12 @@ static void group_release_blocks(struct
struct ext2_sb_info *sbi = EXT2_SB(sb);
unsigned free_blocks;

+ lock_super(sb);
spin_lock(sb_bgl_lock(sbi, group_no));
free_blocks = le16_to_cpu(desc->bg_free_blocks_count);
desc->bg_free_blocks_count = cpu_to_le16(free_blocks + count);
spin_unlock(sb_bgl_lock(sbi, group_no));
+ unlock_super(sb);
sb->s_dirt = 1;
mark_buffer_dirty(bh);
}
@@ -234,6 +239,7 @@ do_more:
"Block = %lu, count = %lu",
block, count);

+ lock_super(sb);
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
bit + i, bitmap_bh->b_data)) {
@@ -243,6 +249,7 @@ do_more:
group_freed++;
}
}
+ unlock_super(sb);

mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
@@ -377,7 +384,7 @@ int ext2_new_block(struct inode *inode,
goto io_error;
}

- group_alloc = group_reserve_blocks(sbi, group_no, desc,
+ group_alloc = group_reserve_blocks(sb, sbi, group_no, desc,
gdp_bh, es_alloc);
if (group_alloc) {
ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
@@ -389,8 +396,10 @@ int ext2_new_block(struct inode *inode,

ext2_debug("goal is at %d:%d.\n", group_no, ret_block);

+ lock_super(sb);
ret_block = grab_block(sb_bgl_lock(sbi, group_no),
bitmap_bh->b_data, group_size, ret_block);
+ unlock_super(sb);
if (ret_block >= 0)
goto got_block;
group_release_blocks(sb, group_no, desc, gdp_bh, group_alloc);
@@ -413,7 +422,7 @@ retry:
desc = ext2_get_group_desc(sb, group_no, &gdp_bh);
if (!desc)
goto io_error;
- group_alloc = group_reserve_blocks(sbi, group_no, desc,
+ group_alloc = group_reserve_blocks(sb, sbi, group_no, desc,
gdp_bh, es_alloc);
}
if (!group_alloc) {
@@ -425,8 +434,10 @@ retry:
if (!bitmap_bh)
goto io_error;

+ lock_super(sb);
ret_block = grab_block(sb_bgl_lock(sbi, group_no), bitmap_bh->b_data,
group_size, 0);
+ unlock_super(sb);
if (ret_block < 0) {
/*
* If a free block counter is corrupted we can loop inifintely.
@@ -485,12 +496,14 @@ got_block:
if (group_alloc && !*prealloc_count) {
unsigned n;

+ lock_super(sb);
for (n = 0; n < group_alloc && ++ret_block < group_size; n++) {
if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group_no),
ret_block,
(void*) bitmap_bh->b_data))
break;
}
+ unlock_super(sb);
*prealloc_block = block + 1;
*prealloc_count = n;
es_alloc -= n;
Index: linux/fs/ext2/ialloc.c
===================================================================
--- linux.orig/fs/ext2/ialloc.c
+++ linux/fs/ext2/ialloc.c
@@ -75,6 +75,7 @@ static void ext2_release_inode(struct su
return;
}

+ lock_super(sb);
spin_lock(sb_bgl_lock(EXT2_SB(sb), group));
desc->bg_free_inodes_count =
cpu_to_le16(le16_to_cpu(desc->bg_free_inodes_count) + 1);
@@ -82,6 +83,7 @@ static void ext2_release_inode(struct su
desc->bg_used_dirs_count =
cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) - 1);
spin_unlock(sb_bgl_lock(EXT2_SB(sb), group));
+ unlock_super(sb);
if (dir)
percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter);
sb->s_dirt = 1;
@@ -148,12 +150,16 @@ void ext2_free_inode (struct inode * ino
goto error_return;

/* Ok, now we can actually update the inode bitmaps.. */
+ lock_super(sb);
if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
- bit, (void *) bitmap_bh->b_data))
+ bit, (void *) bitmap_bh->b_data)) {
+ unlock_super(sb);
ext2_error (sb, "ext2_free_inode",
"bit already cleared for inode %lu", ino);
- else
+ } else {
+ unlock_super(sb);
ext2_release_inode(sb, block_group, is_directory);
+ }
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
@@ -507,8 +513,10 @@ repeat_in_this_group:
group = 0;
continue;
}
+ lock_super(sb);
if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group),
ino, bitmap_bh->b_data)) {
+ unlock_super(sb);
/* we lost this inode */
if (++ino >= EXT2_INODES_PER_GROUP(sb)) {
/* this group is exhausted, try next group */
@@ -519,6 +527,7 @@ repeat_in_this_group:
/* try to find free inode in the same group */
goto repeat_in_this_group;
}
+ unlock_super(sb);
goto got;
}

@@ -547,6 +556,7 @@ got:
if (S_ISDIR(mode))
percpu_counter_inc(&sbi->s_dirs_counter);

+ lock_super(sb);
spin_lock(sb_bgl_lock(sbi, group));
gdp->bg_free_inodes_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
@@ -560,6 +570,7 @@ got:
sbi->s_debts[group]--;
}
spin_unlock(sb_bgl_lock(sbi, group));
+ unlock_super(sb);

sb->s_dirt = 1;
mark_buffer_dirty(bh2);

2005-12-27 23:02:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

Ingo Molnar <[email protected]> wrote:
>
>
> * Andrew Morton <[email protected]> wrote:
>
> > > hm, can you see any easy way for me to test my bold assertion on ext3,
> > > by somehow moving/hacking it back to semaphores?
> >
> > Not really. The problem was most apparent after the lock_kernel()
> > removal patches. The first thing a CPU hit when it entered the fs was
> > previously lock_kernel(). That became lock_super() and performance
> > went down the tubes. From memory, the bad kernel was tip-of-tree as
> > of Memorial Weekend 2003 ;)
> >
> > I guess you could re-add all the lock_super()s as per 2.5.x's
> > ext3/jbd, check that it sucks running SDET on 8-way then implement the
> > lock_super()s via a mutex.
>
> ok - does the patch below look roughly ok as a really bad (but
> functional) hack to restore that old behavior, for ext2?
>

Hard to tell ;) 2.5.20's ext2 had 7 lock_super()s whereas for some reason
this patch adds 12...

I don't recall whether ext2 suffered wild context switches as badly as ext3
did. It becomes pretty obvious in testing.

The really bad workload was SDET, which isn't available to mortals. So
some hunting might be neded to find a suitable alternative. dbench would be
a good start I guess.

2006-01-03 17:54:51

by Abhijit Bhopatkar

[permalink] [raw]
Subject: Re: [patch 0/9] mutex subsystem, -V4

> > How can't you get the fact that semaphores could _never_ be as simple as
> > mutexes? This is a theoritical impossibility, which maybe turns out not
> > to be so true on x86, but which is damn true on ARM where the fast path
> > (the common case of a mutex) is significantly more efficient.
> >
>
> I did notice your comments. I'll grant that mutexes will save some tens of
> fastpath cycles on one minor architecture. Sorry, but that doesn't seem
> very important.

Heh !! i can't find words so i will just spell the emotion....
COMPLAIN HARD

2006-01-05 14:01:51

by Pavel Machek

[permalink] [raw]
Subject: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)

Hi!

> Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> is _designed_ to hit that lock hard. So if/when i_sem is converted to a
> mutex, I figure that the benefits to ARM in that workload will be about a
> 0.01% performance increase. ie: about two hours' worth of Moore's law in a
> dopey microbenchmark.

:-) Expressing performance increases in Moore's hours seems like
neat trick. OTOH I do not think it is valid any more. Single-threaded
performance stopped increasing 2 years ago AFAICS. Plus people are
pushing Linux onto smaller machines, that were unavailable 2 years
ago.


Pavel
--
Thanks, Sharp!

2006-01-05 16:23:20

by Andi Kleen

[permalink] [raw]
Subject: Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)

On Monday 26 December 2005 01:33, Pavel Machek wrote:

[cc list from hell trimmed down]

> > Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> > does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> > is _designed_ to hit that lock hard. So if/when i_sem is converted to a
> > mutex, I figure that the benefits to ARM in that workload will be about a
> > 0.01% performance increase. ie: about two hours' worth of Moore's law in a
> > dopey microbenchmark.

Moore's law actually doesn't say anything about performance increases,
just about the number of transistors available.
>
> :-) Expressing performance increases in Moore's hours seems like
> neat trick. OTOH I do not think it is valid any more. Single-threaded
> performance stopped increasing 2 years ago AFAICS.

It's not true. e.g. a 2.6 Ghz FX-57 is significantly faster than the
top end CPU you could get 2 years ago. And I'm sure this years CPUs
will be still faster than last years.

> Plus people are
> pushing Linux onto smaller machines, that were unavailable 2 years
> ago.

Even smaller systems are still getting faster.

-Andi

2006-01-05 19:11:31

by Pavel Machek

[permalink] [raw]
Subject: Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)

On Thu 05-01-06 16:30:59, Andi Kleen wrote:
> On Monday 26 December 2005 01:33, Pavel Machek wrote:
>
> [cc list from hell trimmed down]

> > :-) Expressing performance increases in Moore's hours seems like
> > neat trick. OTOH I do not think it is valid any more. Single-threaded
> > performance stopped increasing 2 years ago AFAICS.
>
> It's not true. e.g. a 2.6 Ghz FX-57 is significantly faster than the
> top end CPU you could get 2 years ago. And I'm sure this years CPUs
> will be still faster than last years.

Well, but it is no longer 2x as fast...

> > Plus people are
> > pushing Linux onto smaller machines, that were unavailable 2 years
> > ago.
>
> Even smaller systems are still getting faster.

Yep. OTOH this is the first time when *average* linux box is getting
*slower* -- mostly due to massive ammount of cheap, linux-based
wifi access points on the market.

(Average linux desktop box is still gettting faster, though)
--
Thanks, Sharp!