2000-11-18 01:32:11

by David Mansfield

Subject: [PATCH] semaphore fairness patch against test11-pre6

Index: linux/arch/i386/kernel/semaphore.c
===================================================================
RCS file: /home/kernel/cvs_master/linux/arch/i386/kernel/semaphore.c,v
retrieving revision 1.3
diff -u -r1.3 semaphore.c
--- linux/arch/i386/kernel/semaphore.c 2000/11/16 19:58:26 1.3
+++ linux/arch/i386/kernel/semaphore.c 2000/11/17 23:12:48
@@ -64,6 +64,14 @@
 
 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
+
 	for (;;) {
 		int sleepers = sem->sleepers;
 
@@ -76,6 +84,7 @@
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
+inside:
 		spin_unlock_irq(&semaphore_lock);
 
 		schedule();


Attachments:
sem-patch.test11-pre6 (747.00 B)
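
For reference, __down() with the patch applied reads roughly as follows. This is
reconstructed from the hunks above plus the test11 __down() quoted later in the
thread; the attached sem-patch.test11-pre6 is the authoritative version.

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	sem->sleepers++;

	/*
	 * Are there other people waiting for this?
	 * They get to go first.
	 */
	if (sem->sleepers > 1)
		goto inside;

	for (;;) {
		int sleepers = sem->sleepers;

		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
inside:
		spin_unlock_irq(&semaphore_lock);

		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}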

2000-11-18 10:22:58

by Christoph Rohland

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6

Hi David,

David Mansfield <[email protected]> writes:
> If you can find the time to check this out more completely, I recommend
> it, because it seems like a great improvement to be able to accurately
> see vmstat numbers in times of system load. I hope the other side
> effects are beneficial as well :-)

I wanted to point out that this may have some performance impact: we had
exactly this new behaviour on SYSV semaphores. It led to very bad behaviour
in high load situations, because for high-frequency, short critical paths it
produced very high context switch rates instead of letting the program use
its available time slice.
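
As a rough user-space analogy (purely illustrative, POSIX semaphores and a
made-up loop count, not the SYSV case itself): two threads hammering a short
critical section. With a strictly fair, FIFO hand-off semaphore every
contended post donates the semaphore to the sleeper, so the poster blocks on
its very next wait and each iteration costs a context switch; with the
current "unfair" behaviour the running thread usually re-acquires it and
spends its time slice doing work.

#include <pthread.h>
#include <semaphore.h>

static sem_t sem;

static void *hammer(void *unused)
{
	int i;

	/* High-frequency, short critical path: the kind of workload where
	 * strict FIFO hand-off means roughly one context switch per
	 * acquisition instead of running for a full time slice. */
	for (i = 0; i < 1000000; i++) {
		sem_wait(&sem);
		/* ... very short critical section ... */
		sem_post(&sem);
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	sem_init(&sem, 0, 1);
	pthread_create(&t1, NULL, hammer, NULL);
	pthread_create(&t2, NULL, hammer, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}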

We changed the behaviour of SYSV semaphores to the current kernel sem
behaviour and never had problems with that change.

I still think that your change is right since this is kernel space and
you do not have the notion of a time slice.

Christoph


2000-11-19 01:43:18

by Andrew Morton

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6

Christoph Rohland wrote:
>
> Hi David,
>
> David Mansfield <[email protected]> writes:
> > If you can find the time to check this out more completely, I recommend
> > it, because it seems like a great improvement to be able to accurately
> > see vmstat numbers in times of system load. I hope the other side
> > effects are beneficial as well :-)
>
> I wanted to point out that there may be some performance impacts by
> this: We had exactly the new behaviour on SYSV semaphores. It led to
> very bad behaviour in high load situations since for high frequency,
> short critical paths this led to very high context switch rates
> instead of using the available time slice for the program.
>
> We changed the behaviour of SYSV semaphores to the current kernel sem
> behaviour and never had problems with that change.
>
> I still think that your change is right since this is kernel space and
> you do not have the notion of a time slice.

Has anyone tried it on SMP? I get fairly repeatable instances of immortal
`D'-state processes with this patch.

The patch isn't right - it allows `sleepers' to increase without bound.
But it's a boolean!

If you cut out the unnecessary code and the incorrect comments from
__down() it looks like this:


void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	for (;;) {
		/*
		 * Effectively:
		 *
		 *	if (sem->sleepers)
		 *		sem->count++
		 *	if (sem->count >= 0)
		 *		sem->sleepers = 0;
		 *		break;
		 */
		if (!atomic_add_negative(sem->sleepers, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;
		spin_unlock_irq(&semaphore_lock);

		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}
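
For readers who don't have the header handy: atomic_add_negative(i, v)
atomically adds i to the counter and returns true if the result is negative.
A non-atomic sketch of the semantics (illustration only, not the real locked
x86 sequence):

static int add_negative_sketch(int i, int *counter)
{
	/* The real atomic_add_negative() does this as a single atomic
	 * operation, so no other CPU can observe an intermediate value. */
	*counter += i;
	return *counter < 0;
}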

I spent a couple of hours trying to get "fairness" working right
and have not been able to come up with a non-racy solution. That
semaphore algorithm is amazing. I'm really impressed.

2000-11-19 02:18:02

by Linus Torvalds

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6



On Sun, 19 Nov 2000, Andrew Morton wrote:
>
> Has anyone tried it on SMP? I get fairly repeatable instances of immortal
> `D'-state processes with this patch.

Too bad. I really thought it should be safe to do.

> The patch isn't right - it allows `sleepers' to increase without bound.
> But it's a boolean!

It's not a boolean. It's really a "bias count". It happens to take only the
values 0 and 1 simply because the logic is that we always account for all
the other people when any process goes to sleep, so "sleepers" only ever
counts the one process that went to sleep last.

But the algorithm itself should allow for other values. In fact, I think
that you'll find that it works fine if you switch to non-exclusive
wait-queues, and the only reason you see the repeatable D states is
exactly the case where we didn't "take" the semaphore even though we were
awake, and that basically makes us an exclusive process that didn't react
to an exclusive wakeup.

(Think of it this way: with the "inside" patch, the process does

	tsk->state = TASK_INTERRUPTIBLE;

twice, even though there was only one wakeup on the semaphore: we
basically "lost" a wakeup event, not because "sleepers" cannot be 2, but
because we didn't pick up the wakeup that we might have gotten.)

Instead of the "goto inside", how about just doing it without the "double
sleep", and doing something like

	tsk->state = TASK_INTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	sem->sleepers ++;
+	if (sem->sleepers > 1) {
+		spin_unlock_irq(&semaphore_lock);
+		schedule();
+		spin_lock_irq(&semaphore_lock);
+	}
	for (;;) {

The only difference between the above and the "goto inside" variant is
really that the above sets "tsk->state = TASK_INTERRUPTIBLE;" just once
per loop, not twice as the "inside" case did. So if we happened to get an
exclusive wakeup at just the right point, we won't go to sleep again and
miss it.
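
Assembled against the test11 __down() (quoted in full further down the
thread), the suggested variant would read roughly like this - a
reconstruction for clarity, not a tested patch:

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	sem->sleepers++;
	/* Others are already waiting: queue up behind them once,
	 * then fall into the normal loop. */
	if (sem->sleepers > 1) {
		spin_unlock_irq(&semaphore_lock);
		schedule();
		spin_lock_irq(&semaphore_lock);
	}
	for (;;) {
		int sleepers = sem->sleepers;

		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
		spin_unlock_irq(&semaphore_lock);

		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}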

But these things are very subtle. The current semaphore algorithm was
basically perfected over a week of some serious thinking. The fairness
change should similarly get a _lot_ of attention. It's way too easy to
miss things.

Does the above work for you even in SMP?

Linus

2000-11-19 13:21:36

by Andrew Morton

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6

Linus Torvalds wrote:
>
> ...
> But the algorithm itself should allow for other values. In fact, I think
> that you'll find that it works fine if you switch to non-exclusive
> wait-queues, and the only reason you see the repeatable D states is
> exactly the case where we didn't "take" the semaphore even though we were
> awake, and that basically makes us an exclusive process that didn't react
> to an exclusive wakeup.
>
> (Think of it this way: with the "inside" patch, the process does
>
> tsk->state = TASK_INTERRUPTIBLE;
>
> twice, even though there was only one semaphore that woke it up: we
> basically "lost" a wakeup event, not because "sleepers" cannot be 2, but
> because we didn't pick up the wakeup that we might have gotten.

I don't see a path where David's patch can cause a lost wakeup in the
way you describe.

> Instead of the "goto inside", how about just doing it without the "double
> sleep", and doing something like
>
> 	tsk->state = TASK_INTERRUPTIBLE;
> 	add_wait_queue_exclusive(&sem->wait, &wait);
>
> 	spin_lock_irq(&semaphore_lock);
> 	sem->sleepers ++;
> +	if (sem->sleepers > 1) {
> +		spin_unlock_irq(&semaphore_lock);
> +		schedule();
> +		spin_lock_irq(&semaphore_lock);
> +	}
> 	for (;;) {
>
> The only difference between the above and the "goto inside" variant is
> really that the above sets "tsk->state = TASK_INTERRUPTIBLE;" just once
> per loop, not twice as the "inside" case did. So if we happened to get an
> exclusive wakeup at just the right point, we won't go to sleep again and
> miss it.

This still causes stuck processes. There's a bonnie++ test which hammers
pipes. It's quite easy to get four instances of bonnie++ stuck on a pipe
semaphore. This is with your suggested change applied to both __down and
__down_interruptible, BTW. Switching both functions to use non-exclusive
waitqueues didn't help. Still missing a wakeup somewhere.

Moving on to this version:

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	if (sem->sleepers)
		goto inside;
	for (;;) {
		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sem->sleepers, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
inside:
		spin_unlock_irq(&semaphore_lock);

		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}

The only difference here is that we're not allowing `sem->sleepers'
to take the value '2' outside the spinlock. But it still turns
bonnie++ into Rip van Winkle.

Next step is to move the waitqueue and wakeup operations so they're
inside the spinlock. Nope. That doesn't work either.

Next step is to throw away the semaphore_lock and use the sem->wait
lock instead. That _does_ work. This is probably just a
fluke - it synchronises the waker with the sleepers and we get lucky.

But at least it's now benchmarkable. It works well. Kernel build
times are unaffected. bw_tcp is down a few percent. Worth pursuing.

David, could you please verify that the patch at
http://www.uow.edu.au/~andrewm/linux/semaphore.patch
still fixes the starvation problem?

> But these things are very subtle. The current semaphore algorithm was
> basically perfected over a week of some serious thinking. The fairness
> change should similarly get a _lot_ of attention. It's way too easy to
> miss things.

Agree. The fact that this algorithm can work while random CPUs are
asynchronously incrementing and decrementing sem->count makes
analysis and understanding a _lot_ harder.

There's a lot more work needed here.

2000-11-19 19:17:29

by Linus Torvalds

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6



On Sun, 19 Nov 2000, Andrew Morton wrote:
>
> I don't see a path where David's patch can cause a lost wakeup in the
> way you describe.

Basically, if there are two up() calls, they might end up waking up only
one process, because the same process goes to sleep twice. That's wrong.
It should wake up two processes.

However, thinking about it more, that's obviously possible only for
semaphores that are used for more than just mutual exclusion, and
basically nobody does that anyway.
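
A contrived example of that kind of use (illustration only; "struct item"
and the queue are made up): a semaphore counting queued work, where every
producer up() is supposed to wake exactly one sleeping consumer, so two
up() calls really do need to wake two processes.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/semaphore.h>

struct item {
	struct list_head list;
	/* ... payload ... */
};

static struct semaphore items;
static LIST_HEAD(item_queue);
static spinlock_t item_lock = SPIN_LOCK_UNLOCKED;

void item_init(void)
{
	sema_init(&items, 0);		/* nothing queued yet */
}

void item_produce(struct item *it)
{
	spin_lock(&item_lock);
	list_add_tail(&it->list, &item_queue);
	spin_unlock(&item_lock);
	up(&items);			/* should wake exactly one consumer */
}

struct item *item_consume(void)
{
	struct item *it;

	down(&items);			/* sleep until an item is available */
	spin_lock(&item_lock);
	it = list_entry(item_queue.next, struct item, list);
	list_del(&it->list);
	spin_unlock(&item_lock);
	return it;
}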

> Next step is to move the waitqueue and wakeup operations so they're
> inside the spinlock. Nope. That doesn't work either.
>
> Next step is to throw away the semaphore_lock and use the sem->wait
> lock instead. That _does_ work. This is probably just a
> fluke - it synchronises the waker with the sleepers and we get lucky.

Yes, especially on a two-cpu machine that kind of synchronization can
basically end up hiding real bugs.

I'll think about this some more. One thing I noticed is that the
"wake_up(&sem->wait);" at the end of __down() is kind of bogus: we don't
actually want to wake anybody up at that point at all, it's just that if
we don't wake anybody up we'll end up having "count = 0, sleepers = 0", and
when we unlock the semaphore the "__up()" logic won't trigger, and we
won't ever wake anybody up. That's just incredibly bogus.

Instead of the "wake_up()" at the end of __down(), we should have
something like this at the end of __down() instead:

	... for-loop ...
	}
	tsk->state = TASK_RUNNING;
	remove_wait_queue(&sem->wait, &wait);

	/* If there are others, mark the semaphore active */
	if (wait_queue_active(&sem->wait)) {
		atomic_dec(&sem->count);
		sem->sleepers = 1;
	}
	spin_unlock_irq(&semaphore_lock);
}

which would avoid an unnecessary reschedule, and cause the wakeup to
happen at the proper point, namely "__up()" when we release the
semaphore.

I suspect this may be part of the trouble with the games the "sleepers"
count is playing: we had these magic rules that I know I thought about when
the code was written, but that aren't really part of the "real" rules.

Linus

2000-11-20 14:10:11

by Andrew Morton

Subject: Re: [PATCH] semaphore fairness patch against test11-pre6

Linus Torvalds wrote:
>
> ...
>
> I'll think about this some more. One thing I noticed is that the
> "wake_up(&sem->wait);" at the end of __down() is kind of bogus: we don't
> actually want to wake anybody up at that point at all, it's just that if
> we don't wake anybody up we'll end up having "sem = 0, sleeper = 0", and
> when we unlock the semaphore the "__up()" logic won't trigger, and we
> won't ever wake anybody up. That's just incredibly bogus.

There's another reason why we need the wakeup in the tail of __down():

 1 void __down(struct semaphore * sem)
 2 {
 3 	struct task_struct *tsk = current;
 4 	DECLARE_WAITQUEUE(wait, tsk);
 5 	tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
 6 	add_wait_queue_exclusive(&sem->wait, &wait);
 7
 8 	spin_lock_irq(&semaphore_lock);
 9 	sem->sleepers++;
10 	for (;;) {
11 		int sleepers = sem->sleepers;
12
13 		/*
14 		 * Add "everybody else" into it. They aren't
15 		 * playing, because we own the spinlock.
16 		 */
17 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
18 			sem->sleepers = 0;
19 			break;
20 		}
21 		sem->sleepers = 1;	/* us - see -1 above */
22 		spin_unlock_irq(&semaphore_lock);
23
24 		schedule();
25 		tsk->state = TASK_UNINTERRUPTIBLE;
26 		spin_lock_irq(&semaphore_lock);
27 	}
28 	spin_unlock_irq(&semaphore_lock);
29 	remove_wait_queue(&sem->wait, &wait);
30 	tsk->state = TASK_RUNNING;
31 	wake_up(&sem->wait);
32 }

Suppose two tasks A and B are sleeping on the semaphore and somebody
does an up(). Task A wakes and sets TASK_UNINTERRUPTIBLE.

Task A then executes statements 26, 11, 17, 18, 19, 28 and 29 while
it's on the head of the exclusive waitqueue in state
TASK_UNINTERRUPTIBLE. If during this, another CPU does an up() on the
semaphore the wakeup will be directed to Task A and will be lost.

Hence Task A must do the wakeup once it has removed itself from the
waitqueue just in case this happened. Did anyone ever mention that
exclusive wakeups should atomically remove tasks from the waitqueue? :)


> Instead of the "wake_up()" at the end of __down(), we should have
> something like this at the end of __down() instead:
>
> 	... for-loop ...
> 	}
> 	tsk->state = TASK_RUNNING;
> 	remove_wait_queue(&sem->wait, &wait);
>
> 	/* If there are others, mark the semaphore active */
> 	if (wait_queue_active(&sem->wait)) {
> 		atomic_dec(&sem->count);
> 		sem->sleepers = 1;
> 	}
> 	spin_unlock_irq(&semaphore_lock);
> }
>
> which would avoid an unnecessary reschedule, and cause the wakeup to
> happen at the proper point, namely "__up()" when we release the
> semaphore.

This won't help, because in the failure scenario we never get to
exit the 'for' loop. Read on...

Here's why the `goto inside' thingy isn't working:

Assume the semaphore is free:

Task A does down():
-------------------

count->0
sleepers=0

Task B does down():
-------------------

count: 0 -> -1
sleepers: 0->1
if (sleepers > 1) is false
atomic_add_negative(1-1, -1): count -1 -> -1.
`if' fails.
sleepers: 1->1
schedule()

Task C does down():
-------------------

count: -1 -> -2
sleepers: 1->2
if (sleepers > 1) is true
schedule()

Someone does up():
------------------

(reminder: count=-2, sleepers=2)
count: -2 -> -1
wakeup()
-FINISHED-

Task C, CPU0                            CPU1

wakes up
tsk->state = TASK_UNINTERRUPTIBLE
spin_lock
                                        down()
                                        count: -1->-2
                                        spins....
(reminder: sleepers=2, count=-2)
atomic_add_negative(2-1, -2): count->-1
aargh! `if' fails!
sleepers->1
spin_unlock()
schedule()
                                        unspins....
                                        sem->sleepers++  sleepers: 1->2
                                        if (sleepers > 1) is true
                                        aargh! atomic_add_negative test would have succeeded!
                                        spin_unlock()
                                        schedule()


** Nobody woke up **


Someone does up(), but assume CPU1 gets the lock
------------------------------------------------

(reminder: count=-2, sleepers=2)

count: -2 -> -1
wakeup()
-FINISHED-

Task C, CPU0                            CPU1

wakes up
tsk->state = TASK_UNINTERRUPTIBLE
                                        down()
                                        count: -1->-2
                                        spin_lock
spins...
                                        sem->sleepers++  sleepers: 2->3
                                        if (sleepers > 1) is true
                                        spin_unlock()
                                        schedule()
unspins...
(reminder: sleepers=3, count=-2)
atomic_add_negative(3-1, -2): count->0
'if' succeeds
sleepers->0
break
<runs>

** OK, that worked, and it was "fair" **

Here's the story with the current (test11) implementation
=========================================================

Task A does down():
-------------------

count->0
sleepers=0

Task B does down():
-------------------

count: 0 -> -1
sleepers: 0->1
atomic_add_negative: count -1 -> -1. false.
sleepers->1
schedule()

Task C does down():
-------------------

count: -1 -> -2
sleepers: 1->2
atomic_add_negative: count -2 -> -1. false
sleepers->1
schedule()

Someone does up():
------------------

count: -1 -> 0
wakeup()
-FINISHED-

Task C, CPU0                            CPU1

wakes up
tsk->state = TASK_UNINTERRUPTIBLE
spin_lock
                                        down()
                                        count: 0 -> -1
                                        spins....
(reminder: count=-1, sleepers=1)
atomic_add_negative(1-1, -1): count->-1
`if' fails.
sleepers->1
spin_unlock()
schedule()
                                        unspins...
                                        sem->sleepers++: 1->2
                                        atomic_add_negative(2-1, -1): count: -1->0
                                        'if' is true.
                                        sleepers->0
                                        break
                                        <runs>

** OK, that worked too **


> I suspect this may be part of the trouble with the "sleepers" count
> playing: we had these magic rules that I know I thought about when the
> code was written, but that aren't really part of the "real" rules.

Linus, this code is too complex.
