2005-12-19 01:35:15

by Ingo Molnar

Subject: [patch 00/15] Generic Mutex Subsystem


The following patchset is a split-up and streamlined version of the
generic mutex subsystem that we have in the -rt kernel, ported to the
upstream kernel. (To reduce the confusion: this code is unrelated to the
'simple mutex' code recently posted by David Howells.)

the patchset can also be found at:

http://redhat.com/~mingo/generic-mutex-subsystem/

This patchset is intended to address most (and hopefully all) of the
objections (and suggestions) raised here in last week's mutex
discussion. Since there were many issues raised in that thread (and i
really have read all of them), the answers are unfortunately quite
extensive too. I think i have the right answer to each of them, embedded
somewhere below, just be patient and bear with this long email before
forming an opinion about the feasibility of this approach ;-)

QA status: this is prototype code that supports i386 and x86_64 (SMP and
UP) at the moment - but it is what i believe could be ported to every
architecture, and could be merged into the upstream kernel in due
course. All suggestions to further improve it are welcome.

But firstly, i'd like to answer the most important question:

"Why the hell do we need a new mutex subsystem, and what's wrong
with semaphores??"

This question is clearly nagging most of the doubters, so i'm listing my
answers first, before fully explaining the patchset. (For more
implementation details, see the subsequent sections.)

here are the top 10 reasons why i think the generic mutex code should
be considered for upstream integration:

- 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
'struct mutex' is 16 bytes. A smaller structure size means less RAM
footprint, and better CPU-cache utilization.

- tighter code. On x86 i get the following .text sizes when
switching all mutex-alike semaphores in the kernel to the mutex
subsystem:

text data bss dec hex filename
3280380 868188 396860 4545428 455b94 vmlinux-semaphore
3255329 865296 396732 4517357 44eded vmlinux-mutex

that's 25051 bytes of code saved, or a 0.76% win - off the hottest
codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%)
Smaller code means better icache footprint, which is one of the
major optimization goals in the Linux kernel currently.

- the mutex subsystem is faster and has superior scalability for
contended workloads. On an 8-way x86 system, running a mutex-based
kernel and testing creat+unlink+close (of separate, per-task files)
in /tmp with 16 parallel tasks, the average number of ops/sec is:

Semaphores: Mutexes:

$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
checking VFS performance. checking VFS performance.
avg loops/sec: 34713 avg loops/sec: 84153
CPU utilization: 63% CPU utilization: 22%

i.e. in this workload, the mutex based kernel was 2.4 times faster
than the semaphore based kernel, _and_ it also had 2.8 times less CPU
utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
more efficient.)

the scalability difference is visible even on a 2-way P4 HT box:

Semaphores: Mutexes:

$ ./test-mutex V 16 10 $ ./test-mutex V 16 10
4 CPUs, running 16 tasks. 4 CPUs, running 16 tasks.
checking VFS performance. checking VFS performance.
avg loops/sec: 127659 avg loops/sec: 181082
CPU utilization: 100% CPU utilization: 34%

(the straight performance advantage of mutexes is 41%, the per-cycle
efficiency of mutexes is 4.1 times better.)

- there are no fastpath tradeoffs, the mutex fastpath is just as tight
as the semaphore fastpath. On x86, the locking fastpath is 2
instructions:

c0377ccb <mutex_lock>:
c0377ccb: f0 ff 08 lock decl (%eax)
c0377cce: 78 0e js c0377cde <.text.lock.mutex>
c0377cd0: c3 ret

the unlocking fastpath is equally tight:

c0377cd1 <mutex_unlock>:
c0377cd1: f0 ff 00 lock incl (%eax)
c0377cd4: 7e 0f jle c0377ce5 <.text.lock.mutex+0x7>
c0377cd6: c3 ret
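
in C terms, the fastpath above corresponds to roughly the following
(a sketch only - the slowpath helper names here are illustrative, not
the patchset's exact symbols):

/*
 * count == 1: unlocked
 * count == 0: locked, no waiters
 * count  < 0: locked, possible waiters on the wait-list
 */
static inline void sketch_mutex_lock(struct mutex *lock)
{
	if (atomic_dec_return(&lock->count) < 0)
		/* contended: queue up on the wait-list and sleep */
		sketch_mutex_lock_slowpath(lock);
}

static inline void sketch_mutex_unlock(struct mutex *lock)
{
	if (atomic_inc_return(&lock->count) <= 0)
		/* possible waiters: wake up the first one */
		sketch_mutex_unlock_slowpath(lock);
}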

- the per-call-site cost of entering the slowpath is one instruction
smaller than with semaphores, because the mutex trampoline code does not
need the "lea %0,%%eax" that the semaphore code does before calling
__down_failed. The mutex subsystem currently uses out-of-line code, so
this makes only a small difference in .text size, but should we ever
want to inline mutexes, they will be cheaper than semaphores.

- No wholesale or dangerous migration path. The migration to mutexes is
fundamentally opt-in, safe and easy: multiple type-based and .config
based migration helpers are provided to make the migration to mutexes
easy. Migration is as finegrained as it gets, so robustness of the
kernel or out-of-tree code should not be jeopardized at any stage.
The migration helpers can be eliminated once migration is completed,
once the kernel has been separated into 'mutex users' and 'semaphore
users'. Out-of-tree code automatically defaults to semaphore
semantics, mutexes are not forced upon anyone, at any stage of the
migration.

- 'struct mutex' semantics are well-defined and are enforced if
CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have
virtually no debugging code or instrumentation. The mutex subsystem
checks and enforces the following rules:

* - only one task can hold the mutex at a time
* - only the owner can unlock the mutex
* - multiple unlocks are not permitted
* - recursive locking is not permitted
* - a mutex object must be initialized via the API
* - a mutex object must not be initialized via memset or copying
* - task may not exit with mutex held
* - memory areas where held locks reside must not be freed

furthermore, there are also convenience features in the debugging
code:

* - uses symbolic names of mutexes, whenever they are printed in debug output
* - point-of-acquire tracking, symbolic lookup of function names
* - list of all locks held in the system, printout of them
* - owner tracking
* - detects self-recursing locks and prints out all relevant info
* - detects multi-task circular deadlocks and prints out all affected
* locks and tasks (and only those tasks)

we have extensive experience with the mutex debugging code in the -rt
kernel, and it eases the debugging of mutex related bugs
considerably. A handful of upstream bugs were found this way as well,
and the fixes were contributed back to the vanilla kernel. We do believe
that improved debugging code is an important tool in improving the
fast-paced upstream kernel's quality.

a side-effect of the strict semantics is that mutexes are much easier
to analyze on a static level. E.g. Sparse could check the correctness
of mutex users, further improving the kernel's quality. Also, the
highest-level security and reliability validation techniques (and
certification processes) involve static code analysis.

- kernel/mutex.c is generic, and has minimal per-arch needs. No new
primitives have to be implemented to support spinlock-based generic
mutexes. Only 2 new atomic primitives have to be implemented for an
architecture to support optimized, lockless generic mutexes. In
contrast, to implement semaphores on a new architecture, hundreds of
lines of nontrivial (often assembly) code have to be written and
debugged.
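
to illustrate how small those two primitives are, they can be thought
of as the following (plain generic C; the names are made up for this
sketch, and a real architecture would implement them in a couple of
instructions, like the i386 fastpath quoted above):

static inline void
sketch_atomic_dec_call_if_negative(atomic_t *count,
				   void (*fail_fn)(atomic_t *))
{
	if (atomic_dec_return(count) < 0)
		fail_fn(count);		/* contended: enter the slowpath */
}

static inline void
sketch_atomic_inc_call_if_nonpositive(atomic_t *count,
				      void (*fail_fn)(atomic_t *))
{
	if (atomic_inc_return(count) <= 0)
		fail_fn(count);		/* possible waiters: enter the slowpath */
}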

- kernel/mutex.c is highly hackable. New locking features can be
implemented in C, and they carry over to every architecture.
Extensive self-consistency debugging checks of the mutex
implementation are done if CONFIG_DEBUG_MUTEXES is turned on. I do
think that hackability is the most important property of
kernel code.

- the generic mutex subsystem is also one more step towards enabling
the fully preemptible -rt kernel. OK, this shouldn't bother the
upstream kernel too much at the moment, but it's a personal driving
force for me nevertheless ;-)

(NOTE: i consciously did not list 'Priority Inheritance' amongst the
reasons, because priority inheritance for blocking kernel locks
would be a misguided reason at best, and flat out wrong at worst.)


Implementation of mutexes
-------------------------

there are two central data types:

- 'struct mutex'
- 'struct arch_semaphore'

'struct mutex' is the new mutex type, defined in include/linux/mutex.h
and implemented in kernel/mutex.c. It is a counter-based mutex with a
spinlock and a wait-list.
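
in rough terms the layout is (field names follow the description above,
not necessarily the patchset's exact source):

struct mutex {
	atomic_t		count;		/* 1: unlocked, 0: locked, <0: locked, waiters */
	spinlock_t		wait_lock;	/* protects the wait-list */
	struct list_head	wait_list;	/* FIFO of waiters, protected by wait_lock */
};

which on 32-bit x86 adds up to the 16 bytes mentioned earlier.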

'struct arch_semaphore' is the 'struct semaphore' type and
implementation, defined and implemented in include/asm-*/semaphore.h and
in various other arch-level files.

NOTE: the patchset does _NO_ wholesale renaming of 'struct semaphore' to
'struct arch_semaphore'. The limited renaming of 'struct semaphore' to
'struct arch_semaphore' is only technical, and affects only a limited
amount of architecture-level code. It does _not_ spread out into the
generic kernel itself. The reason for this renaming is to make migration
to mutexes safer and easier.

the APIs of 'struct mutex' have been streamlined:

DEFINE_MUTEX(name);

mutex_init(mutex);

void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_trylock(struct mutex *lock);
void mutex_unlock(struct mutex *lock);
int mutex_is_locked(struct mutex *lock);
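
typical usage is what you would expect (a minimal, made-up example):

static DEFINE_MUTEX(foo_lock);
static int foo_state;

static int foo_set_state(int state)
{
	int ret = mutex_lock_interruptible(&foo_lock);

	if (ret)
		return ret;		/* interrupted by a signal */
	foo_state = state;
	mutex_unlock(&foo_lock);
	return 0;
}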

the APIs of 'struct arch_semaphore' are the well-known semaphore APIs:

DECLARE_ARCH_MUTEX(name)
DECLARE_ARCH_MUTEX_LOCKED(name)

void arch_sema_init(struct arch_semaphore *sem, int val);
void arch_init_MUTEX(struct arch_semaphore *sem);
void arch_init_MUTEX_LOCKED(struct arch_semaphore *sem);
void arch_down(struct arch_semaphore * sem);
int arch_down_interruptible(struct arch_semaphore * sem);
int arch_down_trylock(struct arch_semaphore * sem);
void arch_up(struct arch_semaphore * sem);
int arch_sem_is_locked(struct arch_semaphore *sem);
arch_sema_count(sem)

NOTE: no non-mutex subsystem will ever make use of these arch_-prefixed
API calls - they are all hidden within type-switching macros. So no
wholesale migration of the semaphore APIs is done.


The migration path to mutexes
-----------------------------

there are two other sub-types implemented as well, to ease migration and
to enable debugging. They are:

- 'struct semaphore'
- 'struct mutex_debug'

'Break The World' approaches are unacceptable. By default, 'struct
semaphore' is mapped to the plain semaphore implementation of the
architecture - bit for bit equivalent and compatible. The APIs are the
well-known down()/up()/etc. APIs, and they are bit for bit equivalent to
the vanilla kernel's semaphore implementation. This property is crucial
to be able to introduce the mutex subsystem in the stable kernel series.
Doing anything else would be a non-starter.

'struct mutex_debug' is a debugging variant of mutexes: it can be used
with both sets of APIs, both with the semaphore APIs and with the mutex
APIs. E.g.:

struct mutex_debug sem;

down(&sem);
...
up(&sem);
...
down(&sem);
...
up(&sem);

will be mapped by the kernel to mutex_lock() and mutex_unlock(). The
code can be changed back to a semaphore by changing the 'struct
mutex_debug sem' line to 'struct semaphore sem'. The actual down()/up()
calls in the code do not have to be modified for this type of migration.

thus 'struct mutex_debug' is the natural starting point for migrating a
kernel subsystem or driver over to mutexes: just change the 'struct
semaphore' data definition to 'struct mutex_debug', and rebuild the
kernel - the whole subsystem will use mutexes from that point on. If
there are any problems with the migration, switch the data type back to
'struct semaphore' and forget about mutexes. If the migration proves to
be successful, change the data type to 'struct mutex' and fix up all the
API uses to use the mutex variants.
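
as a concrete (made-up) 'foo_sem' example, the three stages described
above look like this:

/* stage 1: unknown/unreviewed - plain semaphore semantics */
struct semaphore foo_sem;
down(&foo_sem); ... up(&foo_sem);

/* stage 2: flip only the type - down()/up() map to mutex_lock()/mutex_unlock() */
struct mutex_debug foo_sem;
down(&foo_sem); ... up(&foo_sem);

/* stage 3: migration proven - final type, final APIs */
struct mutex foo_sem;
mutex_lock(&foo_sem); ... mutex_unlock(&foo_sem);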

there are .config debugging options that change the meaning of 'struct
semaphore': e.g. CONFIG_DEBUG_MUTEX_FULL switches all semaphores over to
mutexes - all semaphores, that is, except the ones that have been marked
as arch_semaphore (see the arch-semaphores.patch sub-patch). The
DEBUG_MUTEX_FULL debugging mode is fully functional on all of my
test systems, but since it's equivalent to a wholesale conversion to
mutexes, it cannot be guaranteed to be safe. But it is a nice debugging
option nevertheless, and can be used to verify the mutex subsystem.

so to summarize, these types enable the finegrained and robust
separation of the kernel's semaphores into 3 categories:

- mutexes (first 'struct mutex_debug', then 'struct mutex')

- counting semaphores (first 'struct semaphore', then
'struct arch_semaphore')

- unknown, unreviewed ('struct semaphore')

the 'mutex conversion' process would act on the 'unknown' pool, and
would move semaphores into one of the two groups, one by one.

out-of-tree semaphore users are safe by default in this scheme: they get
into the 'unknown' pool and are hence conservatively assumed to be full
semaphores.

once all semaphores have been safely categorized and converted to one
group or another, and all out-of-tree codebases are fixed and a
deprecation period has passed, we can rename arch_semaphores back to
'struct semaphore'.


Status of the patch-queue
-------------------------

the patch-series currently builds and boots on i386 and x86_64. It
consists of 15 finegrained patches:

add-atomic-xchg-i386.patch
add-atomic-xchg-x86_64.patch
add-atomic-call-func-i386.patch
add-atomic-call-func-x86_64.patch

mutex-core.patch

mutex-debug.patch
mutex-debug-more.patch

mutex-migration-helper-i386.patch
mutex-migration-helper-x86_64.patch
mutex-migration-helper-core.patch

sx8-sem2completions.patch
cpu5wdt-sem2completions.patch
ide-gendev-sem-to-completion.patch
loop-to-completions.patch

arch-semaphores.patch

these patches, on top of Linus' current -git tree, build and boot at all
of these stages, and the mutex-kernel is fully functional with all
patches applied, in both debug and non-debug mode.

reports, suggestions, fixes are welcome.

Ingo


2005-12-19 04:23:04

by Andi Kleen

Subject: Re: [patch 00/15] Generic Mutex Subsystem

> $ ./test-mutex V 16 10 $ ./test-mutex V 16 10
> 8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
> checking VFS performance. checking VFS performance.
> avg loops/sec: 34713 avg loops/sec: 84153
> CPU utilization: 63% CPU utilization: 22%
>
> i.e. in this workload, the mutex based kernel was 2.4 times faster
> than the semaphore based kernel, _and_ it also had 2.8 times less CPU
> utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
> performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
> performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
> more efficient.)

Do you have an idea where this big difference comes from? It doesn't look
it's from the fast path which is essentially the same. Do the mutexes have
that much better scheduling behaviour than semaphores? It is a bit hard to
believe.

I would perhaps understand your numbers if you used adaptive mutexes
or similar that spin for a bit, but just for pure sleeping locks it seems
weird. After all the scheduler should work in the same way for both.

-Andi

2005-12-19 04:29:32

by Steven Rostedt

Subject: Re: [patch 00/15] Generic Mutex Subsystem


On Mon, 19 Dec 2005, Andi Kleen wrote:

> > $ ./test-mutex V 16 10 $ ./test-mutex V 16 10
> > 8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
> > checking VFS performance. checking VFS performance.
> > avg loops/sec: 34713 avg loops/sec: 84153
> > CPU utilization: 63% CPU utilization: 22%
> >
> > i.e. in this workload, the mutex based kernel was 2.4 times faster
> > than the semaphore based kernel, _and_ it also had 2.8 times less CPU
> > utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
> > performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
> > performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
> > more efficient.)
>
> Do you have an idea where this big difference comes from? It doesn't look
> it's from the fast path which is essentially the same. Do the mutexes have
> that much better scheduling behaviour than semaphores? It is a bit hard to
> believe.
>
> I would perhaps understand your numbers if you used adaptive mutexes
> or similar that spin for a bit, but just for pure sleeping locks it seems
> weird. After all the scheduler should work in the same way for both.
>

Perhaps it's the smaller structures, as Ingo said, which would allow for
better cache handling.

-- Steve

2005-12-19 04:31:38

by Andi Kleen

Subject: Re: [patch 00/15] Generic Mutex Subsystem

> Perhaps it's the smaller structures, as Ingo said, which would allow for
> better cache handling.

That still doesn't seem credible on a large cached x86 CPU.
However if it's that then oprofile could tell I guess.

-Andi

2005-12-19 06:29:13

by Linus Torvalds

Subject: Re: [patch 00/15] Generic Mutex Subsystem



On Mon, 19 Dec 2005, Andi Kleen wrote:
>
> Do you have an idea where this big difference comes from? It doesn't look
> it's from the fast path which is essentially the same. Do the mutexes have
> that much better scheduling behaviour than semaphores? It is a bit hard to
> believe.

Are ingo's mutex'es perhaps not trying to be fair?

The normal mutexes try to make sure that if a process comes in and gets
stuck on a mutex, and then another CPU releases the mutex and immediately
tries to grab it again, the other CPU will _not_ get it.

That's a huge performance disadvantage, but it's done on purpose, because
otherwise you end up in a situation where the semaphore release code did
wake up the waiter, but before the waiter actually had time to grab it (it
has to go through the IPI and scheduling logic), the same CPU just grabbed
it again.

The original semaphores were unfair, and it works really well most of the
time. But then it really sucks in some nasty cases.

The numbers make me suspect that Ingo's mutexes are unfair too, but I've
not looked at the code yet.

NOTE! I'm not a huge fan of fairness per se. I think unfair is often ok,
and the performance advantages are certainly real. It may well be that the
cases that caused problems before are now done differently (eg we switched
the VM semaphore over to an rwsem), and that we can have an unfair and
fast mutex for those cases where we don't care.

I just suspect the comparison isn't fair.

Linus

2005-12-19 12:57:34

by Steven Rostedt

Subject: Re: [patch 00/15] Generic Mutex Subsystem


On Sun, 18 Dec 2005, Linus Torvalds wrote:

>
>
> On Mon, 19 Dec 2005, Andi Kleen wrote:
> >
> > Do you have an idea where this big difference comes from? It doesn't look
> > it's from the fast path which is essentially the same. Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to
> > believe.
>
> Are ingo's mutex'es perhaps not trying to be fair?
>
> The normal mutexes try to make sure that if a process comes in and gets
> stuck on a mutex, and then another CPU releases the mutex and immediately
> tries to grab it again, the other CPU will _not_ get it.
>
> That's a huge performance disadvantage, but it's done on purpose, because
> otherwise you end up in a situation where the semaphore release code did
> wake up the waiter, but before the waiter actually had time to grab it (it
> has to go through the IPI and scheduling logic), the same CPU just grabbed
> it again.
>
> The original semaphores were unfair, and it works really well most of the
> time. But then it really sucks in some nasty cases.
>
> The numbers make me suspect that Ingo's mutexes are unfair too, but I've
> not looked at the code yet.

Yes, Ingo's code does have this unfairness. Interesting also is that
Ingo's original code for his rt_mutexes was fair, and it killed
performance for high priority processes. I introduced a "lock stealing"
algorithm that would check if the process trying to grab the lock again
was a higher priority than the one about to get it, and if it was, it
would "steal" the lock from it unfairly as you said.

Now, you are forgetting about PREEMPT. Yes, on multiple CPUs, and that is
what Ingo is testing, to wait for the other CPU to schedule in and run is
probably not as bad as with PREEMPTION. (Ingo, did you have preemption on
in these tests?). The reason is that if you have a high priority process
release a lock (giving it to a lower priority process that hasn't woken up
yet), then try to grab it again, but a lower priority process was waiting
on it, the high priority process would need to schedule out and wait on
the lower priority process. Here's a case of priority inversion that can
be solved without priority inheritance.

The way this situation happens is if you have three processes, A B and C
where A is the highest and C is the lowest. C grabs the lock and is
preempted by A, A tries to grab the lock and goes to sleep, then B
runs and preempts C (remember, we don't have PI here), and then tries to
grab the lock. C releases the lock and gives it to A, then A releases the
lock (gives it to B) and then tries to grab it again.

Now you must wait for two schedules and B to release the lock, before high
priority process A gets to run again.

So when we have PREEMPT, your fairness is not being very fair.

-- Steve

2005-12-19 15:51:04

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Linus Torvalds <[email protected]> wrote:

> On Mon, 19 Dec 2005, Andi Kleen wrote:
> >
> > Do you have an idea where this big difference comes from? It doesn't look
> > it's from the fast path which is essentially the same. Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to
> > believe.
>
> Are Ingo's mutex'es perhaps not trying to be fair?
>
> The normal mutexes try to make sure that if a process comes in and
> gets stuck on a mutex, and then another CPU releases the mutex and
> immediately tries to grab it again, the other CPU will _not_ get it.

the VFS creat+unlink performance advantage i measured on SMP systems is
a pure 'struct mutex' vs. stock 'struct semaphore' measurement. I
measured the same kernel with a single .config option difference
(DEBUG_MUTEX_FULL to get the 'mutex' variant, vs. DEBUG_MUTEX_PARTIAL to
get the 'semaphore' variant), the systems were completely idle, and the
results are statistically stable.

fact is, we don't implement fairness in the upstream struct semaphore
implementation(s) either - it would be extremely bad for performance (as
you point out too).

both stock semaphores and generic mutexes are unfair in the following
sense: if there is a waiter 'in flight' (but has not grabbed the lock
yet), a 'lucky bastard may jump the queue' and may grab the lock,
without having waited at all. So comparing semaphores to generic
mutexes is an apples to apples comparison.

in fact, generic mutexes are _more_ fair than struct semaphore in their
wait logic. In the stock semaphore implementation, when a waiter is
woken up, it will retry the lock, and if it fails, it goes back to the
_tail_ of the queue again - waiting one full cycle again.

in the generic mutex implementation, the first waiter is woken up, and
it will retry the lock - but no other waiters are woken up until this
waiter has done its round. (a mutex implementation can do this, because
it knows that there can only be one owner, while a counting semaphore
has to be prepared for the possibility of multiple concurrent up()s, and
for the count going above 1.)

and this is also the crucial difference that gives mutexes the
performance edge in contended workloads (so this should also answer
Andi's question): stock semaphores easily end up creating a 'thundering
herd' scenario, if a 'fast lucky bastard' releases the lock - and wakes
up _yet another_ waiter. The end result is, that given a high enough
'semaphore use frequency', our wake-one logic in semaphores totally
falls apart and we end up having all waiters racing for the runqueues,
and racing for the lock itself. This causes much higher CPU utilization,
wasted resources, and slower total performance.

there is one more wakeup related optimization in generic mutexes: the
waiter->woken flag. It is set when the waiter is 'set into flight', and
subsequent wakeups first check this flag. Since new waiters are added to
the end of the wait-list, this waiter's data cachelines stay clean, and
the ->woken flag is nicely shared amongst CPUs. Doing a blind
wake_up_process() would at a minimum dirty the remote runqueue's lock
cacheline.
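
in (simplified) code, the wake-one plus ->woken logic looks something
like this - a sketch of the idea, not the patch's exact code:

struct mutex_waiter {
	struct list_head	list;
	struct task_struct	*task;
	int			woken;	/* set once this waiter is 'in flight' */
};

/* unlock slowpath: put only the first waiter in flight */
static void sketch_wake_first_waiter(struct mutex *lock)
{
	struct mutex_waiter *waiter;

	spin_lock(&lock->wait_lock);
	if (!list_empty(&lock->wait_list)) {
		waiter = list_entry(lock->wait_list.next,
				    struct mutex_waiter, list);
		/* subsequent unlocks only read the (clean) ->woken flag: */
		if (!waiter->woken) {
			waiter->woken = 1;
			wake_up_process(waiter->task);
		}
	}
	spin_unlock(&lock->wait_lock);
}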

> That's a huge performance disadvantage, but it's done on purpose,
> because otherwise you end up in a situation where the semaphore
> release code did wake up the waiter, but before the waiter actually
> had time to grab it (it has to go through the IPI and scheduling
> logic), the same CPU just grabbed it again.
>
> The original semaphores were unfair, and it works really well most of
> the time. But then it really sucks in some nasty cases.
>
> The numbers make me suspect that Ingo's mutexes are unfair too, but
> I've not looked at the code yet.

yes, they are unfair - just like stock semaphores.

[ Note that the generic rt-mutexes in the -rt tree are of course fair to
higher-prio tasks, and this fairness is implemented via
priority-queueing and priority inheritance: the highest prio (RT) task
gets the lock, and if a lower-prio task is holding the lock, it is
boosted up until the end of the critical section, at which point it
hands over the lock to the highprio task.

implementing any other method to achieve fairness would result in
really bad real-world performance. The -rt tree's performance is quite
close to the vanilla kernel (considering that it's a fully preemptible
kernel), and we couldn't do that without the mutex implementation. ]

generic mutexes, as present in this patchqueue, do not include priority
queueing and priority inheritance, so they are 'plain unfair', just like
stock semaphores are.

> NOTE! I'm not a huge fan of fairness per se. I think unfair is often
> ok, and the performance advantages are certainly real. It may well be
> that the cases that caused problems before are now done differently
> (eg we switched the VM semaphore over to an rwsem), and that we can
> have an unfair and fast mutex for those cases where we don't care.
>
> I just suspect the comparison isn't fair.

the comparison is 100% totally fair, because both generic mutexes, and
struct semaphores are 100% totally unfair :-) There's no difference in
the level of unfairness either: 'lucky bastards' are allowed to steal
the lock in both implementations.

we have only one (limited) notion of fairness in Linux synchronization
primitives: rwsems prefer waiting writers, and then block subsequent
readers. Note that rwsems are still reader-reader and writer-writer
unfair, hence the -rt tree replaces the rwsem implementation too, with
generic mutexes and PI.

Ingo

2005-12-19 16:22:55

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Andi Kleen <[email protected]> wrote:

> > $ ./test-mutex V 16 10 $ ./test-mutex V 16 10
> > 8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks.
> > checking VFS performance. checking VFS performance.
> > avg loops/sec: 34713 avg loops/sec: 84153
> > CPU utilization: 63% CPU utilization: 22%
> >
> > i.e. in this workload, the mutex based kernel was 2.4 times faster
> > than the semaphore based kernel, _and_ it also had 2.8 times less CPU
> > utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
> > performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
> > performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
> > more efficient.)
>
> Do you have an idea where this big difference comes from? It doesn't
> look it's from the fast path which is essentially the same. Do the
> mutexes have that much better scheduling behaviour than semaphores? It
> is a bit hard to believe.

yes, generic mutexes have that much better scheduling in this workload.
[ And no, there's no secret speedup magic hidden in the scheduler ;) ]
See my other reply to Linus about why there's such a difference.

> I would perhaps understand your numbers if you used adaptive mutexes
> or similar that spin for a bit, but just for pure sleeping locks it
> seems weird. After all the scheduler should work in the same way for
> both.

hm, i'm not so sure about adaptive mutexes - i'm a bit uneasy about
wasting cycles on spinning; it will inevitably waste resources in some
workloads. I think the right approach is to make scheduling fast
and cheap, and to improve the queueing/wakeup logic of kernel code.

but by all means feel free to experiment with adaptive mutexes, all the
mutex logic is located in kernel/mutex.c, and it is well-suited for
rapid prototyping of new locking logic.

Ingo

2005-12-19 16:56:37

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Steven Rostedt <[email protected]> wrote:

> > The numbers make me suspect that Ingo's mutexes are unfair too, but I've
> > not looked at the code yet.
>
> Yes, Ingo's code does have this unfairness. Interesting also is
> that Ingo's original code for his rt_mutexes was fair, and it killed
> performance for high priority processes. I introduced a "lock
> stealing" algorithm that would check if the process trying to grab the
> lock again was a higher priority than the one about to get it, and if
> it was, it would "steal" the lock from it unfairly as you said.

yes, it's unfair - but stock semaphores are unfair too, so what i've
measured is still a fair comparison of the two implementations.

lock stealing i've eliminated from this patch-queue, and i've moved the
point of acquire to after the schedule(). (lock-stealing is only
relevant for PI, where we always need to associate an owner with the
lock, hence we pass ownership at the point of release.)

> Now, you are forgetting about PREEMPT. Yes, on multiple CPUs, and
> that is what Ingo is testing, to wait for the other CPU to schedule in
> and run is probably not as bad as with PREEMPTION. (Ingo, did you have
> preemption on in these tests?). [...]

no, CONFIG_PREEMPT was disabled in every test result i posted. (but i
get similar results even with it enabled.)

Ingo

2005-12-19 19:12:00

by Linus Torvalds

Subject: Re: [patch 00/15] Generic Mutex Subsystem



On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> in fact, generic mutexes are _more_ fair than struct semaphore in their
> wait logic. In the stock semaphore implementation, when a waiter is
> woken up, it will retry the lock, and if it fails, it goes back to the
> _tail_ of the queue again - waiting one full cycle again.

Ingo, I don't think that is true.

It shouldn't be true, at least. The whole point with the "sleeper" count
was to not have that happen. Of course, bugs happen, so I won't guarantee
that's actually true, but ..

If you are woken up as a waiter on a semaphore, you shouldn't fail to get
it. You will be woken first, and nobody else will get at it, because the
count has been kept negative or zero even by the waiters, so that a
fast-path user shouldn't be able to get the lock without going through the
slow path and adding itself (last) to the list.

But hey, somebody should test it with <n> kernel threads that just do
down()/up() and some make-believe work in between to make sure there
really is contention, and count how many times they actually get the
semaphore. That code has been changed so many times that it may not work
the way it is advertized ;)

[ Oh. I'm looking at the semaphore code, and I realize that we have a
"wake_up(&sem->wait)" in the __down() path because we had some race long
ago that we fixed by band-aiding over it. Which means that we wake up
sleepers that shouldn't be woken up. THAT may well be part of the
performance problem.. The semaphores are really meant to wake up just
one at a time, but because of that race hack they'll wake up _two_ at a
time - once by up(), once by down().

That also destroys the fairness. Does anybody remember why it's that
way? ]

Ho humm.. That's interesting.

Linus

2005-12-19 19:28:57

by Benjamin LaHaise

Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Mon, Dec 19, 2005 at 11:11:03AM -0800, Linus Torvalds wrote:
> On Mon, 19 Dec 2005, Ingo Molnar wrote:
> >
> > in fact, generic mutexes are _more_ fair than struct semaphore in their
> > wait logic. In the stock semaphore implementation, when a waiter is
> > woken up, it will retry the lock, and if it fails, it goes back to the
> > _tail_ of the queue again - waiting one full cycle again.
>
> Ingo, I don't think that is true.
>
> It shouldn't be true, at least. The whole point with the "sleeper" count
> was to not have that happen. Of course, bugs happen, so I won't guarantee
> that's actually true, but ..

The only thing I can see as an improvement that a mutex can offer over
the current semaphore implementation is if we can perform the same
optimization that spinlocks perform in the unlock operation: don't use
a locked, serialising instruction in the up() codepath. That might be
a bit tricky to implement, but it's definitely a win on the P4 where the
cost of serialisation can be quite high.

> [ Oh. I'm looking at the semaphore code, and I realize that we have a
> "wake_up(&sem->wait)" in the __down() path because we had some race long
> ago that we fixed by band-aiding over it. Which means that we wake up
> sleepers that shouldn't be woken up. THAT may well be part of the
> performance problem.. The semaphores are really meant to wake up just
> one at a time, but because of that race hack they'll wake up _two_ at a
> time - once by up(), once by down().
>
> That also destroys the fairness. Does anybody remember why it's that
> way? ]

History? I think that code is very close to what was done in the pre-SMP
version of semaphores. It is certainly possible to get rid of the separate
sleepers -- parisc seems to have such an implementation. It updates
sem->count in the wakeup path of __down().

-ben
--
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <[email protected]>.

2005-12-19 19:56:38

by Linus Torvalds

Subject: Re: [patch 00/15] Generic Mutex Subsystem



On Mon, 19 Dec 2005, Benjamin LaHaise wrote:
>
> The only thing I can see as an improvement that a mutex can offer over
> the current semaphore implementation is if we can perform the same
> optimization that spinlocks perform in the unlock operation: don't use
> a locked, serialising instruction in the up() codepath. That might be
> a bit tricky to implement, but it's definitely a win on the P4 where the
> cost of serialisation can be quite high.

Good point. However, it really _is_ hard, because we also need to know if
the mutex was under contention. A spinlock doesn't care, so we can just
overwrite the lock value. A mutex would always care, in order to know
whether it needs to do the slow wakeup path.

So I suspect you can't avoid serializing the unlock path for a mutex. The
issue of "was there contention while I held it" fundamentally _is_ a
serializing question.

> > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> > "wake_up(&sem->wait)" in the __down() path because we had some race long
> > ago that we fixed by band-aiding over it. Which means that we wake up
> > sleepers that shouldn't be woken up. THAT may well be part of the
> > performance problem.. The semaphores are really meant to wake up just
> > one at a time, but because of that race hack they'll wake up _two_ at a
> > time - once by up(), once by down().
> >
> > That also destroys the fairness. Does anybody remember why it's that
> > way? ]
>
> History?

Oh, absolutely, I already checked the old BK history too, and that extra
wake_up() has been there at least since before we even started using BK.
So it's very much historical, I'm just wondering if somebody remembers far
enough back that we'd know.

I don't see why it's needed (since we re-try the "atomic_add_negative()"
inside the semaphore wait lock, and any up() that saw contention should
have always been guaranteed to do a wakeup that should fill the race in
between that atomic_add_negative() and the thing going to sleep).

It may be that it is _purely_ historical, and simply isn't needed. That
would be funny/sad, in the sense that we've had it there for years and
years ;)

Linus

2005-12-19 19:56:39

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Linus Torvalds <[email protected]> wrote:

> On Mon, 19 Dec 2005, Ingo Molnar wrote:
> >
> > in fact, generic mutexes are _more_ fair than struct semaphore in their
> > wait logic. In the stock semaphore implementation, when a waiter is
> > woken up, it will retry the lock, and if it fails, it goes back to the
> > _tail_ of the queue again - waiting one full cycle again.
>
> Ingo, I don't think that is true.
>
> It shouldn't be true, at least. The whole point with the "sleeper"
> count was to not have that happen. Of course, bugs happen, so I won't
> guarantee that's actually true, but ..

you are right, i based my comments on observed behavior, not on the
code's intentions.

I haven't actually traced the behavior of semaphores (being mostly
interested in mutexes ;-), but fairness algorithms always show up as
heavy context-switchers on SMP (because other CPUs queue themselves as
waiters, and wakeups go across CPUs all the time), and i'm quite sure
that contended scenarios with the current semaphore implementation never
showed such overscheduling. Hence current semaphores are very likely not
fair, and sleepers round-robin back to the queue quite often.

but i've got some measurements of how fairness plays out in practice.
The mutex based ops go:

mars:~> ./test-mutex V 16 10
8 CPUs, running 16 parallel test-tasks.
checking VFS performance.
avg ops/sec: 85130
average cost per op: 206.59 usecs
average deviance per op: 319.08 usecs

note the 'average latency of an op' (in absolute time), and the standard
deviation (which has been measured by summing up the deltas between
subsequent latencies and dividing by the total number of ops).

With semaphores the results go like this:

mars:~> ./test-mutex V 16 10
8 CPUs, running 16 parallel test-tasks.
checking VFS performance.
avg ops/sec: 34381
average cost per op: 512.13 usecs
average deviance per op: 573.10 usecs

note that even though the ratio between the per-op cost and the standard
deviance looks a bit better for semaphores, the pure fact that it was
much slower lengthened its standard deviance to well above that of the
mutexes.

So even if they are fairer, if the system ends up being slower, fairness
(==observed fluctuations, and resulting injustices) suffers more as a
result than due to the queueing logic. I'd choose this 200 +/- 150 usecs
behavior over 500 +/- 280 usecs behavior - even though the latter has
smaller relative fluctuations.

(although i'm still unsure which one is fairer, because i couldn't
create a scenario that allows a direct fairness comparison: the mutex
based workloads are always more efficient, and as a result they schedule
into the idle thread more often, which creates additional noise and may
be a reason why their relative standard deviation is higher. The
semaphore workloads are more saturated, which may flatten theirs.)

> If you are woken up as a waiter on a semaphore, you shouldn't fail to
> get it. You will be woken first, and nobody else will get at it,
> because the count has been kept negative or zero even by the waiters,
> so that a fast-path user shouldn't be able to get the lock without
> going through the slow path and adding itself (last) to the list.
>
> But hey, somebody should test it with <n> kernel threads that just do
> down()/up() and some make-believe work in between to make sure there
> really is contention, and count how many times they actually get the
> semaphore. That code has been changed so many times that it may not
> work the way it is advertized ;)
>
> [ Oh. I'm looking at the semaphore code, and I realize that we have a
> "wake_up(&sem->wait)" in the __down() path because we had some race long
> ago that we fixed by band-aiding over it. Which means that we wake up
> sleepers that shouldn't be woken up. THAT may well be part of the
> performance problem.. The semaphores are really meant to wake up just
> one at a time, but because of that race hack they'll wake up _two_ at a
> time - once by up(), once by down().
>
> That also destroys the fairness. Does anybody remember why it's that
> way? ]
>
> Ho humm.. That's interesting.

hm, removing that wakeup quickly causes hung test-tasks. (i booted an
all-mutexes kernel, and did the testing on arch_semaphores, using the
modified semaphore-sleepers.c code. The test ran for a few seconds, so
semaphores weren't _totally_ broken, but there's some clear race in terms
of not waking up a sleeper.)

and even considering that the current semaphore implementation may have
a fairness bug, i cannot imagine that making it more fair would also
speed it up. So it could end up having an even larger performance gap to
the mutex implementation. But in any case, it should be an interesting
comparison! I personally find the semaphore implementation clever but
too complex, maybe that's a reason why such bugs might be hiding there.
(possibly for many years already ...)

In any case, i concur with you that the fairness design of the two is
not comparable, and that semaphores _should_ be fairer.

Ingo

2005-12-19 20:12:05

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Benjamin LaHaise <[email protected]> wrote:

> > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> > "wake_up(&sem->wait)" in the __down() path because we had some race long
> > ago that we fixed by band-aiding over it. Which means that we wake up
> > sleepers that shouldn't be woken up. THAT may well be part of the
> > performance problem.. The semaphores are really meant to wake up just
> > one at a time, but because of that race hack they'll wake up _two_ at a
> > time - once by up(), once by down().
> >
> > That also destroys the fairness. Does anybody remember why it's that
> > way? ]
>
> History? I think that code is very close to what was done in the
> pre-SMP version of semaphores. It is certainly possible to get rid of
> the separate sleepers -- parisc seems to have such an implementation.
> It updates sem->count in the wakeup path of __down().

i think we also need to look at the larger picture. If this really is a
bug that hid for years, it shows that the semaphore code is too complex
to be properly reviewed and improved. Hence even assuming that the mutex
code does not bring direct code advantages (which i'm disputing :-), the
mutex code is far simpler and thus easier to improve. We humans have a
given number of neurons, which form a hard limit :) In fact it's the
mutex code that made it apparent that there's something wrong with
semaphores.

we saw that with the genirq code, with the spinlock code, with the
preempt code. Consolidation did not add anything drastically new, but
code consolidation _did_ make things more hackable, and improved the end
result far more than a splintered set of implementations would have.

Just look at the semaphore implementations of various architectures,
it's a quite colorful and inconsistent mix. Can you imagine adding
deadlock debugging to each of them?

Ingo

2005-12-19 20:13:56

by Linus Torvalds

Subject: Re: [patch 00/15] Generic Mutex Subsystem



On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> average cost per op: 206.59 usecs
> average cost per op: 512.13 usecs

(mutex vs semaphore).

That looks suspiciously like exactly double the cost, so I do believe that
the double wake_up() might be exactly what is going on.

However:

> hm, removing that wakeup quickly causes hung test-tasks.

So clearly it really is still hiding some bug.

> and even considering that the current semaphore implementation may have
> a fairness bug, i cannot imagine that making it more fair would also
> speed it up.

That's not the point. The extra wakeup() in the semaphore code wakes up
two processes for every single up(), so the semaphores end up not just
being unfair, they also end up doing twice the work (because it will
result in the other processes effectively just doing the down() twice).

> I personally find the semaphore implementation clever but too complex,
> maybe that's a reason why such bugs might be hiding there. (possibly
> for many years already ...)

Oh, absolutely. It is too complex.

And don't get me wrong: if it's easier to just ignore the performance bug,
and introduce a new "struct mutex" that just doesn't have it, I'm all for
it. However, if so, I do NOT want to do the unnecessary renaming. "struct
semaphore" should stay as "struct semaphore", and we should not affect old
code in the _least_.

Then code can switch to "struct mutex" if people want to. And if one
reason for it ends up being that the code avoids a performance bug in the
process, all the better ;)

IOW, I really think this should be a series of small patches that don't
touch old users of "struct semaphore" at all. None of this "semaphore" to
"arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_
of the names that the old "struct semaphore" uses.

Linus

2005-12-19 20:22:59

by Benjamin LaHaise

Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Mon, Dec 19, 2005 at 09:11:18PM +0100, Ingo Molnar wrote:
> i think we also need to look at the larger picture. If this really is a
> bug that hid for years, it shows that the semaphore code is too complex
> to be properly reviewed and improved. Hence even assuming that the mutex
> code does not bring direct code advantages (which i'm disputing :-), the
> mutex code is far simpler and thus easier to improve. We humans have a
> given number of neurons, which form a hard limit :) In fact it's the
> mutex code that made it apparent that there's something wrong with
> semaphores.

True, but then the original semaphores weren't designed with fairness in
mind so much as being able to operate quickly. The commodity SMP hardware
that we have today has significantly different characteristics than the
first dual Pentium I used. I think there is significant room for improving
the implementation while still making it as tight and lean as possible. To
that end, adding state diagrams that make it easier to visualise what is
going on would be a big help. With that in place, it will be easier to
provide optimized fast paths with understandable logic.

> Just look at the semaphore implementations of various architectures,
> it's a quite colorful and inconsistent mix. Can you imagine adding
> deadlock debugging to each of them?

Agreed.

-ben
--
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <[email protected]>.

2005-12-19 20:32:19

by Russell King

Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Mon, Dec 19, 2005 at 09:11:18PM +0100, Ingo Molnar wrote:
> We humans have a given number of neurons, which form a hard limit :)

That's also a very valid argument for keeping the number of different
locking mechanisms down to a small number.

> we saw that with the genirq code, with the spinlock code, with the
> preempt code. Consolidation did not add anything drastically new, but
> code consolidation _did_ make things more hackable, and improved the end
> result far more than a splintered set of implementations would have.
>
> Just look at the semaphore implementations of various architectures,
> it's a quite colorful and inconsistent mix. Can you imagine adding
> deadlock debugging to each of them?

However, the argument _against_ making things generic is that they
become less optimised for specific architectures. I'm still not
convinced that the genirq stuff is as optimal for ARM as the existing
code is, so I've little motivation to move to the genirq stuff.
(Though I will try to make things easier for those who would like to.)

The same argument applies to attempts to make atomic ops generic using
cmpxchg as a basis for that (as demonstrated on "that other list"), and
the fast path of semaphores.

On "that other list", we're debating saving between 9 and 13 CPU
cycles as an argument in favour of mutexes. Such a gain could
very probably be wiped out by any attempt to make them "generic".

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-19 20:58:27

by Ingo Molnar

Subject: Re: [patch 00/15] Generic Mutex Subsystem


* Russell King <[email protected]> wrote:

> However, the argument _against_ making things generic is that they
> become less optimised for specific architectures. I'm still not
> convinced that the genirq stuff is as optimal for ARM as the existing
> code is, so I've little motivation to move to the genirq stuff.
> (Though I will try to make things easier for those who would like to.)

i'm quite convinced that the final phase of the genirq conversion will
work out fine: because it mostly meant the conceptual adoption of your
ARM IRQ layer (the irqchips approach), with compatibility mechanisms for
all the other arches, with some minor SMP improvements ontop of it. So
i'd be surprised if you found _that_ one inadequate :-) If there's any
detail that ARM doesn't need, i'm sure we can find a non-runtime solution
for it. But i think i digress.

Ingo

2005-12-19 23:38:05

by Christoph Hellwig

Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Mon, Dec 19, 2005 at 12:12:26PM -0800, Linus Torvalds wrote:
> And don't get me wrong: if it's easier to just ignore the performance bug,
> and introduce a new "struct mutex" that just doesn't have it, I'm all for
> it. However, if so, I do NOT want to do the unnecessary renaming. "struct
> semaphore" should stay as "struct semaphore", and we should not affect old
> code in the _least_.
>
> Then code can switch to "struct mutex" if people want to. And if one
> reason for it ends up being that the code avoids a performance bug in the
> process, all the better ;)
>
> IOW, I really think this should be a series of small patches that don't
> touch old users of "struct semaphore" at all. None of this "semaphore" to
> "arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_
> of the names that the old "struct semaphore" uses.

That's exactly what Ingo's series does if you ignore the two odd patches ;-)

2005-12-20 08:03:12

by Nick Piggin

Subject: Re: [patch 00/15] Generic Mutex Subsystem

Linus Torvalds wrote:

> And don't get me wrong: if it's easier to just ignore the performance bug,
> and introduce a new "struct mutex" that just doesn't have it, I'm all for
> it. However, if so, I do NOT want to do the unnecessary renaming. "struct
> semaphore" should stay as "struct semaphore", and we should not affect old
> code in the _least_.
>

It would still be good to look at a fair mutex implementation first
IMO before making a choice to use unfair mutexes.

They'll often be held for longer than spinlocks so fairness may be
more important.

> Then code can switch to "struct mutex" if people want to. And if one
> reason for it ends up being that the code avoids a performance bug in the
> process, all the better ;)
>

Is this a good idea? Then we will have for a long time different
bits of code with exactly the same synchronisation requirements
using two different constructs that are slightly different. Not to
mention code specifically requiring semaphores would get confusing.

If we agree mutex is a good idea at all (and I think it is), then
wouldn't it be better to aim for a wholesale conversion rather than
"if people want to"?

--
SUSE Labs, Novell Inc.


2005-12-20 08:06:59

by Arjan van de Ven

Subject: Re: [patch 00/15] Generic Mutex Subsystem


> > Then code can switch to "struct mutex" if people want to. And if one
> > reason for it ends up being that the code avoids a performance bug in the
> > process, all the better ;)
> >
>
> Is this a good idea? Then we will have for a long time different
> bits of code with exactly the same synchronisation requirements
> using two different constructs that are slightly different. Not to
> mention code specifically requiring semaphores would get confusing.
>
> If we agree mutex is a good idea at all (and I think it is), then
> wouldn't it be better to aim for a wholesale conversion rather than
> "if people want to"?

well most of this will "only" take a few kernel releases ;-)


2005-12-20 08:21:14

by Nick Piggin

Subject: Re: [patch 00/15] Generic Mutex Subsystem

Arjan van de Ven wrote:

>>If we agree mutex is a good idea at all (and I think it is), then
>>wouldn't it be better to aim for a wholesale conversion rather than
>>"if people want to"?
>
>
> well most of this will "only" take a few kernel releases ;-)
>

OK, so in my mind that is aiming for a wholesale conversion. And
yes I would expect the semaphore to stay around for a while...

But I thought Linus phrased it as though mutex should be made
available and no large scale conversions.

--
SUSE Labs, Novell Inc.


2005-12-20 08:36:39

by Arjan van de Ven

Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Tue, 2005-12-20 at 19:21 +1100, Nick Piggin wrote:
> Arjan van de Ven wrote:
>
> >>If we agree mutex is a good idea at all (and I think it is), then
> >>wouldn't it be better to aim for a wholesale conversion rather than
> >>"if people want to"?
> >
> >
> > well most of this will "only" take a few kernel releases ;-)
> >
>
> OK, so in my mind that is aiming for a wholesale conversion. And
> yes I would expect the semaphore to stay around for a while...
>
> But I thought Linus phrased it as though mutex should be made
> available and no large scale conversions.

it's one at a time. Manual inspection etc etc
and I don't expect 99% coverage, but the important 50% after 6 months or
so...


2005-12-20 08:48:56

by Nick Piggin

Subject: Re: [patch 00/15] Generic Mutex Subsystem

Arjan van de Ven wrote:
> On Tue, 2005-12-20 at 19:21 +1100, Nick Piggin wrote:

>>But I thought Linus phrased it as though mutex should be made
>>available and no large scale conversions.
>
>
> it's one at a time. Manual inspection etc etc
> and I don't expect 99% coverage, but the important 50% after 6 months or
> so...
>

So long as there is a plan and movement, that is fine in my opinion.

--
SUSE Labs, Novell Inc.


2005-12-21 15:26:41

by Oleg Nesterov

Subject: Re: [patch 00/15] Generic Mutex Subsystem

Linus Torvalds wrote:
>
> > > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> > > "wake_up(&sem->wait)" in the __down() path because we had some race long
> > > ago that we fixed by band-aiding over it. Which means that we wake up
> > > sleepers that shouldn't be woken up. THAT may well be part of the
> > > performance problem.. The semaphores are really meant to wake up just
> > > one at a time, but because of that race hack they'll wake up _two_ at a
> > > time - once by up(), once by down().
> > >
> > > That also destroys the fairness. Does anybody remember why it's that
> > > way? ]
> >
> > History?
>
> Oh, absolutely, I already checked the old BK history too, and that extra
> wake_up() has been there at least since before we even started using BK.
> So it's very much historical, I'm just wondering if somebody remembers far
> enough back that we'd know.
>
> I don't see why it's needed (since we re-try the "atomic_add_negative()"
> inside the semaphore wait lock, and any up() that saw contention should
> have always been guaranteed to do a wakeup that should fill the race in
> between that atomic_add_negative() and the thing going to sleep).
>
> It may be that it is _purely_ historical, and simply isn't needed. That
> would be funny/sad, in the sense that we've had it there for years and
> years ;)

This does not look like a race fix or purely historical to me. I think
that without that "wake_up" a waiter can miss a wakeup if it is not the
only sleeper.

sem->count == 0, ->sleepers == 0. down() decrements ->count,

__down:
// ->count == -1

++sleepers; // == 1

for (;;) {
->count += (->sleepers - 1); // does nothing
if (->count >= 0) // NO
break;

->sleepers = 1;
schedule();
...

Another process calls down(), ->count == -2

__down:
++sleepers; // == 2;

for (;;) {
->count += (->sleepers - 1) // ->count == -1;

->sleepers = 1;
schedule();
...

up() makes ++->count == 0, and wakes one of them. It will see
->sleepers == 1, so atomic_add_negative(sleepers - 1) again
has no effect, sets ->sleepers == 0 and takes the semaphore.

Note that a subsequent up() will not call wake_up(): ->count == 0,
so it just increments it. That is why we wake the next waiter
in advance. When it gets the cpu, it will decrement ->count by 1,
because ->sleepers == 0. If up() (++->count) was already called,
it takes the semaphore. If not, it goes back to sleep.

Or is my understanding completely broken?

Oleg.
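
[ For reference while following the traces in this thread: below is a sketch
of the __down() slowpath being discussed, reconstructed from the 2.6-era
i386 semaphore code. Helper names and exact statement order may differ
slightly from the tree in question, so treat it as an illustration rather
than a verbatim copy. ]

fastcall void __sched __down(struct semaphore *sem)
{
        struct task_struct *tsk = current;
        DECLARE_WAITQUEUE(wait, tsk);
        unsigned long flags;

        tsk->state = TASK_UNINTERRUPTIBLE;
        spin_lock_irqsave(&sem->wait.lock, flags);
        add_wait_queue_exclusive_locked(&sem->wait, &wait);

        sem->sleepers++;
        for (;;) {
                int sleepers = sem->sleepers;

                /*
                 * Fold the other sleepers' decrements back into
                 * ->count; they are not playing, because we hold
                 * the wait-queue spinlock.
                 */
                if (!atomic_add_negative(sleepers - 1, &sem->count)) {
                        sem->sleepers = 0;
                        break;
                }
                sem->sleepers = 1;      /* us - see -1 above */
                spin_unlock_irqrestore(&sem->wait.lock, flags);

                schedule();

                spin_lock_irqsave(&sem->wait.lock, flags);
                tsk->state = TASK_UNINTERRUPTIBLE;
        }
        remove_wait_queue_locked(&sem->wait, &wait);
        wake_up_locked(&sem->wait);     /* the extra wakeup under discussion */
        spin_unlock_irqrestore(&sem->wait.lock, flags);
        tsk->state = TASK_RUNNING;
}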

2006-01-10 10:28:52

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> Linus Torvalds wrote:
> >
> > > > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> > > > "wake_up(&sem->wait)" in the __down() path because we had some race long
> > > > ago that we fixed by band-aiding over it. Which means that we wake up
> > > > sleepers that shouldn't be woken up. THAT may well be part of the
> > > > performance problem.. The semaphores are really meant to wake up just
> > > > one at a time, but because of that race hack they'll wake up _two_ at a
> > > > time - once by up(), once by down().
> > > >
> > > > That also destroys the fairness. Does anybody remember why it's that
> > > > way? ]
> > >
> > > History?
> >
> > Oh, absolutely, I already checked the old BK history too, and that extra
> > wake_up() has been there at least since before we even started using BK.
> > So it's very much historical, I'm just wondering if somebody remembers far
> > enough back that we'd know.
> >
> > I don't see why it's needed (since we re-try the "atomic_add_negative()"
> > inside the semaphore wait lock, and any up() that saw contention should
> > have always been guaranteed to do a wakeup that should fill the race in
> > between that atomic_add_negative() and the thing going to sleep).
> >
> > It may be that it is _purely_ historical, and simply isn't needed. That
> > would be funny/sad, in the sense that we've had it there for years and
> > years ;)
>
> This does not look like a band-aid for a race, or merely historical,
> to me. I think that without that "wake_up" a waiter can miss a wakeup
> if it is not the only sleeper.
>
> sem->count == 0, ->sleepers == 0. down() decrements ->count,
>
> __down:
> // ->count == -1
>
> ++sleepers; // == 1
>
> for (;;) {
> ->count += (->sleepers - 1); // does nothing
> if (->count >= 0) // NO
> break;
>
> ->sleepers = 1;
> schedule();
> ...
>
> Another process calls down(), ->count == -2
>
> __down:
> ++sleepers; // == 2;
>
> for (;;) {
> ->count += (->sleepers - 1) // ->count == -1;
>
> ->sleepers = 1;
> schedule();
> ...
>
> up() makes ++->count == 0, and wakes one of them. It will see
> ->sleepers == 1, so atomic_add_negative(sleepers - 1) again
> has no effect, sets ->sleepers == 0 and takes the semaphore.
>
> Note that a subsequent up() will not call wake_up(): ->count == 0,
> so it just increments it. That is why we wake the next waiter
> in advance. When it gets the cpu, it will decrement ->count by 1,
> because ->sleepers == 0. If up() (++->count) was already called,
> it takes the semaphore. If not, it goes back to sleep.
>
> Or is my understanding completely broken?

Sorry for the delayed response, but I just started looking at the double
wakeup issue.

The analysis looks a bit confusing to me. We start with an initial count == 0,
and after two down()'s and two up()'s the expected final value of count is 0,
but it seems to come out as +1.

My analysis is based on your analysis and I have used your cool convention!

Let's assume that there are two tasks, P1 and P2.

For a semaphore with ->count = 0 and ->sleepers = 0

If P1 does a down(), then

->count = -1 // decl

->sleepers++ //sleepers = 1

for (;;)
sleepers = 1;
->count += (sleepers - 1); // sleepers - 1 = 0, count = -1
if (->count >= 0) // count is -1
break;
->sleepers = 1;
schedule();


Now, if there is another down() by P2

->count = -2 // decl
->sleepers++ // sleepers = 2

for (;;)
sleepers = 2;
->count += (sleepers - 1); // (sleepers - 1) = 1, count = -1
if (->count >= 0) // count is -1
break;
->sleepers = 1;
schedule()


Consider some task doing an up()

->count++ //incl, count = 0, count was previously -1

This wakes up one of the sleepers, say P1. P1 will do the following:

....
spin_lock_irqsave(...)
tsk->state = TASK_UNINTERRUPTIBLE

and go back to the for loop

sleepers = 1;
->count += (sleepers - 1) // count += 0, count = 0
if (count >= 0) { // YES
->sleepers = 0;
break;
}

// outside the for loop
remove_wait_queue_locked(->wait, &wait);

wake_up_locked(->wait); // double wakeup
tsk->state = TASK_RUNNING

The *double wakeup* will cause P2 to be woken up; it will do the following:

...
....
spin_lock_irqsave(...)
tsk->state = TASK_UNINTERRUPTIBLE

and go back to the for loop
sleepers = 0; // the last task out of __down set ->sleepers to 0
->count += -1; // (sleepers - 1) = -1, count = -1

if (count >= 0) // NO, count = -1
break;

sleepers = 1;
spin_unlock_irqrestore(...)
schedule()

Without this *double wakeup*, the count would remain 0, which is incorrect.

If some other task does an up()
->count++ // incl, count = 0, was previously -1


This would wake up P2, which would do the following:

....
spin_lock_irqsave(...)
tsk->state = TASK_UNINTERRUPTIBLE

and go back to the for loop

sleepers = 1;
->count += 0; // count = 0
if (count >= 0) { // YES
->sleepers = 0;
break;
}


The question now remains as to why we have the atomic_add_negative()? Why do
we change the count in __down(), when down() has already decremented it
for us?

Comments?

Balbir

PS: This analysis involved no caffeine, so it might be incorrect or bogus.
I did try and do my best to check its validity.
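
[ The split between down() and __down() that the question above refers to is
easier to see next to the fastpath. The real i386 down() fastpath is inline
assembly; the following is only a C approximation of what it does, with a
made-up _sketch name so it is not mistaken for the real symbol. ]

/*
 * Every caller of down() unconditionally decrements ->count and only
 * falls into the contended slowpath if the result went negative.
 * Since every sleeper has already decremented ->count once, __down()
 * adds (sleepers - 1) back in - that is what the atomic_add_negative()
 * in the traces above is doing.
 */
static inline void down_sketch(struct semaphore *sem)
{
        might_sleep();
        if (atomic_dec_return(&sem->count) < 0)
                __down(sem);            /* contended: go to sleep */
}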

2006-01-10 16:46:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [patch 00/15] Generic Mutex Subsystem

Balbir Singh wrote:
>
> On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> >
> > Note that a subsequent up() will not call wake_up(): ->count == 0,
> > so it just increments it. That is why we wake the next waiter
> > in advance. When it gets the cpu, it will decrement ->count by 1,
> > because ->sleepers == 0. If up() (++->count) was already called,
> > it takes the semaphore. If not, it goes back to sleep.
> >
> > Or is my understanding completely broken?
>
> [ ... long snip ... ]
>
> The question now remains as to why we have the atomic_add_negative()? Why do
> we change the count in __down(), when down() has already decremented it
> for us?

... and why __down() always sets ->sleepers = 0 on return.

I don't have an answer, only a wild guess.

Note that if P1 releases this semaphore before pre-woken P2 actually
gets cpu what happens is:

P1->up() just increments ->count, no wake_up() (fastpath)

P2 takes the semaphore without schedule.

So *maybe* it was designed this way as some form of optimization:
in this scenario P2 has some chance of running with the sem held earlier.

Oleg.

2006-01-11 06:33:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 00/15] Generic Mutex Subsystem

On Tue, Jan 10, 2006 at 09:03:07PM +0300, Oleg Nesterov wrote:
> Balbir Singh wrote:
> >
> > On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> > >
> > > Note that a subsequent up() will not call wake_up(): ->count == 0,
> > > so it just increments it. That is why we wake the next waiter
> > > in advance. When it gets the cpu, it will decrement ->count by 1,
> > > because ->sleepers == 0. If up() (++->count) was already called,
> > > it takes the semaphore. If not, it goes back to sleep.
> > >
> > > Or is my understanding completely broken?
> >
> > [ ... long snip ... ]
> >
> > The question now remains as to why we have the atomic_add_negative()? Why do
> > we change the count in __down(), when down() has already decremented it
> > for us?
>
> ... and why __down() always sets ->sleepers = 0 on return.
>

I think sem->sleepers initially started as a counter, but was later converted
to a boolean value (0 or 1). The only possible values of sem->sleepers are 0, 1
or 2; we always use sem->sleepers - 1 and set the value back to either 0 or 1.

sem->sleepers is set to 0 so that, when the double wakeup is delivered on the
wait queue, the task that wakes up (P2) corrects the count by adding
(sem->sleepers - 1) = -1. This ensures that other tasks do not acquire
the semaphore with down() (count is -1); P2 then sets sem->sleepers back to 1
and sleeps.


> I don't have an answer, only a wild guess.
>
> Note that if P1 releases this semaphore before pre-woken P2 actually
> gets cpu what happens is:
>
> P1->up() just increments ->count, no wake_up() (fastpath)
>
> P2 takes the semaphore without schedule.
>
> So *maybe* it was designed this way as some form of optimization:
> in this scenario P2 has some chance of running with the sem held earlier.
>

P1->up() will do a wake_up() only if count < 0. For no wake_up() to
happen, the count must be >= 0 before the increment. This means that
there is no one sleeping on the semaphore.

Feel free to correct me,
Balbir

2006-01-11 08:05:50

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [patch 00/15] Generic Mutex Subsystem

Balbir Singh wrote:
>
> On Tue, Jan 10, 2006 at 09:03:07PM +0300, Oleg Nesterov wrote:
>
> > I don't have an answer, only a wild guess.
> >
> > Note that if P1 releases this semaphore before pre-woken P2 actually
> > gets cpu what happens is:
> >
> > P1->up() just increments ->count, no wake_up() (fastpath)
> >
> > P2 takes the semaphore without schedule.
> >
> > So *maybe* it was designed this way as some form of optimization:
> > in this scenario P2 has some chance of running with the sem held earlier.
> >
>
> P1->up() will do a wake_up() only if count < 0. For no wake_up() to
> happen, the count must be >= 0 before the increment. This means that
> there is no one sleeping on the semaphore.

And this is exactly what happens: P1 returns from __down() with ->count == 0
and P2 pre-woken.

Oleg.
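
[ To make the point above concrete, here is the matching up() side - again
only a C approximation of the i386 inline-assembly fastpath, with made-up
_sketch names. ]

/*
 * up() enters the wakeup slowpath only when ->count is still <= 0
 * after the increment, i.e. only when the pre-increment value was
 * negative.  In the scenario above P1 left __down() with ->count == 0,
 * so its up() sees 0 -> 1 and never calls wake_up() - which is exactly
 * why __down() pre-wakes the next waiter before returning.
 */
static void __up_sketch(struct semaphore *sem)
{
        wake_up(&sem->wait);            /* wake one exclusive waiter */
}

static inline void up_sketch(struct semaphore *sem)
{
        if (atomic_inc_return(&sem->count) <= 0)
                __up_sketch(sem);       /* someone is (or was) waiting */
}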