2001-04-09 03:08:47

by Linus Torvalds

Subject: Re: rw_semaphores



On Sun, 8 Apr 2001, Andrew Morton wrote:
>
> One issue with the current implementation is the (necessary)
> design where one reader (or writer) is sleeping in
> down_x_failed_biased(), with its bias in place, whereas
> all other sleepers are sleeping in down_x_failed(), with
> their bias not in place.

Why not do this the same way the _real_ semaphores do it?

You have a fast-path that has a single integer, and a slow path which has
more state and is protected by a spinlock. The _only_ worry I see is to
make sure that we properly have a "contention" state, because both readers
and writers need to be able to know when they should wake things up. But
there's a _trivial_ contention state. Read on:

Forget about the BIAS stuff, and go back to the old simple "negative is
writer, positive is reader" implementation:

0 - unlocked
1..n - 1-n readers
0xc0000000 - writer
other <0 - contention

Do you see anything wrong with this?

Implementation:

- fast path:

down_read:
        lock incl (%sem)
        js __down_read_failed

down_write:
        xorl %eax,%eax
        movl $0xc0000000,%r
        lock cmpxchgl %r,(%sem)
        jne __down_write_failed

up_read:
        lock decl (%sem)
        js __up_read_wakeup

up_write:
        lock andl $0x3fffffff,(%sem)
        jne __up_write_wakeup

The above are all fairly obvious for the non-failure case, agreed?
Including the guarantee that _if_ there is contention, the "up()"
routines will always go to the slow path to handle the contention.
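
Spelled out as (deliberately non-atomic) C, purely to make the state
transitions and the fast/slow decision explicit - the function names here
are made up for illustration, and each helper stands for exactly one of the
lock-prefixed instructions above:

        #define RWSEM_WRITE_LOCKED      0xc0000000u

        /* each helper returns non-zero on the uncontended fast path,
           zero when the caller must drop into the spinlock-protected
           slow path */

        int down_read_fast(unsigned int *sem)
        {
                unsigned int new = *sem + 1;            /* lock incl (%sem) */
                *sem = new;
                return (int)new >= 0;                   /* js __down_read_failed */
        }

        int down_write_fast(unsigned int *sem)
        {
                if (*sem != 0)                          /* lock cmpxchgl %r,(%sem): */
                        return 0;                       /*  only 0 -> 0xc0000000 wins */
                *sem = RWSEM_WRITE_LOCKED;
                return 1;
        }

        int up_read_fast(unsigned int *sem)
        {
                unsigned int new = *sem - 1;            /* lock decl (%sem) */
                *sem = new;
                return (int)new >= 0;                   /* js __up_read_wakeup */
        }

        int up_write_fast(unsigned int *sem)
        {
                unsigned int new = *sem & 0x3fffffff;   /* lock andl $0x3fffffff,(%sem) */
                *sem = new;
                return new == 0;                        /* jne __up_write_wakeup */
        }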

Now, the _slow_ case could be your generic "protected by a spinlock" code,
although I have a suggestion. As follows:

The only half-way "subtle" case is that __down_write_failed needs to make
sure that it marks itself as a contender (the down_read() fast-case code
will have done so already by virtue of incrementing the counter, which is
guaranteed to have resulted in a "contention" value).

While down_read() automatically gets a "contention value" on failure,
"down_write()" needs to do extra work. The extra work is not all that
expensive: the simplest thing is to just do

subl $0x8000,(%sem)

at the beginning - which will cause a "contention" value regardless of
whether it was pure-reader before (it goes negative by the simple
assumption that there will never be more than 32k concurrent readers), or
whether it was locked by a writer (it stays negative due to the 0x40000000
bit in the write lock logic). In both cases, both a up_write() and a
up_read() will notice that they need to handle the contention of a waiting
writer-to-be.

Would you agree that the above fast-path implementation guarantees that we
always get to the spinlock-protected slow path?

Now, the above _heavily_ favours readers. There's no question that the
above is unfair. I think that's ok. I think fairness and efficiency tend
to be at odds. But queuing theory shows that the faster you can get out of
the critical region, the smaller your queue will be - exponentially. So
speed is worth it.

Let's take an example at this point:

- lock is zero

- writer(1) gets lock: lock is 0xc0000000

- reader(2) tries to get in: lock becomes 0xc0000001, synchronizes on
spinlock.

- another writer(3) tries to get in: lock becomes 0xbfff8001, synchronizes
on spinlock.

- writer(1) does up_write: lock becomes 0x3fff8001, != 0, so writer decides
it needs to wake up, and synchronizes on spinlock.

- another reader(4) comes in on another CPU, increments, and notices that
it can do so without it being negative: lock becomes 0x3fff8002 and
this one does NOT synchronize on the spinlock.

End result: reader(4) "stole base" and actually got the lock without ever
seeing any contention, and we now have (1), (2) and (3) who are serialized
inside the spinlock.
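
The arithmetic of that sequence is easy to check mechanically with a
throwaway user-space snippet (nothing to do with the kernel code itself, it
just replays the values above):

        #include <assert.h>

        int main(void)
        {
                unsigned int c = 0;                       /* lock is zero */

                c = 0xc0000000;                           /* writer(1): cmpxchg 0 -> 0xc0000000 */
                c += 1;          assert(c == 0xc0000001); /* reader(2): lock incl */
                c -= 0x8000;     assert(c == 0xbfff8001); /* writer(3): subl $0x8000 */
                c &= 0x3fffffff; assert(c == 0x3fff8001); /* writer(1): up_write's andl */
                c += 1;          assert(c == 0x3fff8002); /* reader(4): lock incl */
                assert((int)c > 0);                       /* reader(4) sees no contention */
                return 0;
        }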

So we get to what the serializers have to do, ie the slow path:

First, let's do the __up_read/write_wakeup() case, because that one is the
easy one. In fact, I think it ends up being the same function:

spin_lock(&sem->lock);
wake_up(&sem->waiters);
spin_unlock(&sem->lock);

and we're all done. The only optimization here (which we should do for
regular semaphores too) is to use the same spinlock for the semaphore lock
and the wait-queue lock.

The above is fairly obviously correct, and sufficient: we have shown that
we'll get here if there is contention, and the only other thing that the
wakeup could sanely do would possibly be to select which process to wake.
Let's not do that yet.
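
For concreteness, the pseudo-code that follows assumes a semaphore
structure roughly like this (only a sketch inferred from the code, not a
finished header; note that the wakeup above says sem->waiters where the
code below says sem->wait - they are the same wait queue):

        #include <linux/spinlock.h>
        #include <linux/wait.h>

        struct rw_semaphore {
                int                     value;  /* the fast-path counter */
                spinlock_t              lock;   /* slow path, doubling as the
                                                 * wait-queue lock */
                wait_queue_head_t       wait;   /* "waiters" in the wakeup code */
        };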

The harder case is __down_read/write_failed(). Here is my suggested
pseudo-code:

__down_write_failed(sem)
{
        DECLARE_WAIT_QUEUE(wait, current);

        lock subl $0x8000,(%sem)        /* Contention marker */
        spin_lock(&sem->lock);
        add_wait_queue_exclusive(&sem->wait, &wait);
        for (;;) {
                unsigned int value, newvalue;

                set_task_state(TASK_SLEEPING);
                value = sem->value;

                /*
                 * Ignore other pending writers: but if there are
                 * pending readers or a write holder we should
                 * sleep
                 */
                if (value & 0xc0007fff) {
                        spin_unlock(&sem->lock);
                        schedule();
                        spin_lock(&sem->lock);
                        continue;
                }

                /*
                 * This also undoes our contention marker thing, while
                 * leaving other waiters' contention markers in place
                 */
                newvalue = (value + 0x8000) | 0xc0000000;
                if (lock_cmpxchg(sem->value, value, newvalue))
                        break;          /* GOT IT! */

                /* Damn, somebody else changed it from under us */
                continue;
        }
        remove_wait_queue(&sem->wait, &wait);
        spin_unlock(&sem->lock);
}

The down_read() slow case is equivalent, but ends up being much simpler
(because we don't need to play with contention markers or ignore other
people's contention markers):

__down_read_failed(sem)
{
        DECLARE_WAIT_QUEUE(wait, current);

        spin_lock(&sem->lock);
        add_wait_queue(&sem->wait, &wait);
        for (;;) {
                set_task_state(TASK_SLEEPING);
                /*
                 * Yah! We already did our "inc", so if we ever see
                 * a positive value we're all done.
                 */
                if (sem->value > 0)
                        break;
                spin_unlock(&sem->lock);
                schedule();
                spin_lock(&sem->lock);
        }
        remove_wait_queue(&sem->wait, &wait);
        spin_unlock(&sem->lock);
}

Can anybody shoot any holes in this? I haven't actually tested it, but
race conditions in locking primitives are slippery things, and I'd much
rather have an algorithm we can _think_ about and prove to be working. And
I think the above one is provably correct.

Not that I want to go to those kinds of extremes.

Anybody? Andrew? Mind giving the above a whirl on your testbed? Or can you
see some thinko in it without even testing?

Note in particular how the above keeps the non-contention "down_read()" /
"up_read()" cases as two single instructions, no slower than a spinlock.

(There are the following assumptions in the code: there are at most 32k
active readers, and also at most 32k pending writers. The limits come from
the writer contention marker logic. You have the 32 bits split up as:

- 2 bits "write lock owner" (it's really only one bit, but the other bit
is set so that the writer contention marker won't turn the most
negative number into a positive one, so we have one "extra" bit set to
keep the thing negative for the whole duration of a write lock)
- 15 "reader count" bits
- 15 "pending writer count" bits

and the 15 bits are why you have the 32k user limitation. I think it's an
acceptable one - and if not you can expand on it by adding extra fields
that are only accessed within the spinlock).
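
Written out as masks - the names here are only for illustration, and
remember that the pending-writer count is a subtracted quantity, so it
shows up as a borrow in its bits:

        #define RWSEM_READERS_MASK      0x00007fff      /* bits  0-14: active readers */
        #define RWSEM_PENDING_MASK      0x3fff8000      /* bits 15-29: pending writers,
                                                         * one unit per "subl $0x8000" */
        #define RWSEM_WRITER_MASK       0xc0000000      /* bits 30-31: write lock held,
                                                         * both set to stay negative */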

Linus


2001-04-09 04:19:05

by Linus Torvalds

Subject: Re: rw_semaphores


The "down_writer_failed()" case was wrong:

On Sun, 8 Apr 2001, Linus Torvalds wrote:
>
> __down_write_failed(sem)
> {
> ....
> /*
> * Ignore other pending writers: but if there is
> * are pending readers or a write holder we should
> * sleep
> */
> if (value & 0xc0007fff) {
....

The "value & 0xc0007ffff" test is wrong, because it's actually always true
for some normal contention case (multiple pending writers waiting for a
reader to release the lock). Because even if there is no write lock
holder, other pending writers (and we're one) will have caused the high
bits to be set, so the above would end up causing us to think that we
can't get the lock even if there is no real lock holder.

The comment is right, though. It's just the test that is simplistic and
wrong.

The pending readers part is correct, and obvious enough:

(value & 0x7fff) != 0

implies that there are readers. In which case we should try again. Simple
enough.

The pending writers part is slightly more subtle: we know that there is at
least _one_ pending writer, namely us. It turns out that we must check the
two high bits, and the logic is:

- 11: only pending writers, no write lock holder (if we had a write lock
holder, he would have set the bits to 11, but a pending writer
would have borrowed from the lower bit, so you'd get bit pattern
10).
- 10: we have a real lock holder, and the pending writers borrowed from
the low lock bit when they did the "subl 0x8000" to mark off
contention.
- 01: must not happen. BUG.
- 00: we had a real write lock holder, but he released the lock and
cleared both bits.

So the "si there a write lock holder" test basically becomes

(value & 0xc0000000) == 0x80000000

and the full test should be

if ((value & 0x7fff) || ((value & 0xc0000000) == 0x80000000)) {
        spin_unlock();
        schedule();
        spin_lock();
        continue;
}

which might be rewritten some simpler way. I'm too lazy to think about it
even more.

For similar reasons, the "newvalue" calculation was subtly bogus: we must
make sure that we maintain the correct logic for the two upper bits in the
presence of _other_ pending writers. We can't just do the unconditional
binary "or" operation to set the two upper bits, because then we'd break
the above rules if there are other pending writers. So the newvalue
calculation should be something along the lines of

/* Undo _our_ contention marker */
newvalue = value + 0x8000;

/* Get rid of stale writer lock bits */
newvalue &= 0x3fffffff;

/*
 * If there were no other pending writers (newvalue == 0), we set
 * both high bits, otherwise we only set bit 31.
 * (see above on the "borrow bit being clear" logic).
 */
if (!newvalue)
        newvalue = 0xc0000000;
newvalue |= 0x80000000;

And THEN I think the algorithm in the email I'm following up to should
actually really work.
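
Folding both corrections back into the earlier pseudo-code, the write-lock
slow path would look roughly like this (still pseudo-code, still untested):

        __down_write_failed(sem)
        {
                DECLARE_WAIT_QUEUE(wait, current);

                lock subl $0x8000,(%sem)        /* contention marker */
                spin_lock(&sem->lock);
                add_wait_queue_exclusive(&sem->wait, &wait);
                for (;;) {
                        unsigned int value, newvalue;

                        set_task_state(TASK_SLEEPING);
                        value = sem->value;

                        /* sleep if there are pending readers, or if the
                         * two high bits say "10" = real write lock holder */
                        if ((value & 0x7fff) ||
                            ((value & 0xc0000000) == 0x80000000)) {
                                spin_unlock(&sem->lock);
                                schedule();
                                spin_lock(&sem->lock);
                                continue;
                        }

                        /* undo our own contention marker, drop any stale
                         * writer bits, then set 11 if we were the only
                         * pending writer, 10 otherwise */
                        newvalue = (value + 0x8000) & 0x3fffffff;
                        if (!newvalue)
                                newvalue = 0xc0000000;
                        newvalue |= 0x80000000;

                        if (lock_cmpxchg(sem->value, value, newvalue))
                                break;          /* GOT IT! */

                        /* somebody changed it from under us - retry */
                }
                remove_wait_queue(&sem->wait, &wait);
                spin_unlock(&sem->lock);
        }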

Does anybody find any other details I missed?

And no, I have not yet actually _tested_ any of this. But all my code
works on the first try (or, in this case, second try if you want to be a
stickler for details).

No wonder we didn't get this right first time through. It's not really all
that horribly complicated, but the _details_ kill you.

Linus

2001-04-09 13:56:45

by Benjamin LaHaise

Subject: Re: rw_semaphores

On Sun, 8 Apr 2001, Linus Torvalds wrote:

>
> The "down_writer_failed()" case was wrong:

Which is exactly the same problem in the original code. How about the
following patch against the original code? I hadn't sent it yet as the
test code isn't finished (hence, it's untested), but given that Andrew is
going full steam ahead, people might as well give this a try.

-ben

rwsem-2.4.4-pre1-A0.diff
diff -ur v2.4.4-pre1/arch/i386/kernel/semaphore.c work-2.4.4-pre1/arch/i386/kernel/semaphore.c
--- v2.4.4-pre1/arch/i386/kernel/semaphore.c Sat Nov 18 20:31:25 2000
+++ work-2.4.4-pre1/arch/i386/kernel/semaphore.c Mon Apr 9 09:47:02 2001
@@ -269,10 +269,9 @@
ret

2: call down_write_failed
- " LOCK "subl $" RW_LOCK_BIAS_STR ",(%eax)
- jz 1b
- jnc 2b
- jmp 3b
+ popl %ecx
+ popl %edx
+ ret
"
);

@@ -366,19 +365,56 @@
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);

- __up_write(sem); /* this takes care of granting the lock */
+ /* Originally we called __up_write here, but that
+ * doesn't work: the lock add operation could result
+ * in us failing to detect a bias grant. Instead,
+ * we'll use a compare and exchange to get the lock
+ * from a known state: either <= -BIAS while another
+ * waiter is still around, or > -BIAS if we were given
+ * the lock's bias.
+ */

+ do {
+ int old = atomic_read(&sem->count), new, res;
+ if (old > -RW_LOCK_BIAS)
+ return down_write_failed_biased(sem);
+ new = old + RW_LOCK_BIAS;
+ res = cmpxchg(&sem->count.counter, old, new);
+ } while (res != old);
+
+again:
+ /* We are now removed from the lock. Wait for all other
+ * waiting writers to go away.
+ */
add_wait_queue_exclusive(&sem->wait, &wait);

while (atomic_read(&sem->count) < 0) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
+ if (atomic_read(&sem->count) >= 0) {
break; /* we must attempt to acquire or bias the lock */
+ }
+
schedule();
}

remove_wait_queue(&sem->wait, &wait);
tsk->state = TASK_RUNNING;
+
+ /* Okay, try to grab the lock. */
+ for (;;) {
+ int old = atomic_read(&sem->count), new, res;
+ if (old < 0)
+ goto again;
+ new = old - RW_LOCK_BIAS;
+ res = cmpxchg(&sem->count.counter, old, new);
+ if (res != old)
+ continue;
+ if (old == RW_LOCK_BIAS)
+ break;
+ if (old >= 0)
+ return down_write_failed_biased(sem);
+ goto again;
+ }

return sem;
}

2001-04-10 02:42:03

by Tachino Nobuhiro

Subject: Re: rw_semaphores


Hello,

At Sun, 8 Apr 2001 20:08:13 -0700 (PDT),
Linus Torvalds wrote:
>
> Can anybody shoot any holes in this? I haven't actually tested it, but
> race conditions in locking primitives are slippery things, and I'd much
> rather have an algorithm we can _think_ about and prove to be working. And
> I think the above one is provably correct.

I am not familiar with semaphore or x86, so this may not be correct,
but if the following sequence is possible, the writer can call wake_up()
before the reader calls add_wait_queue() and reader may sleep forever.
Is it possible?


Reader                                  Writer

down_read:
        lock incl (%sem)
        js __down_read_failed
                                        up_write:
                                                lock andl $0x3fffffff,(%sem)
                                                jne __up_write_wakeup

                                        __up_write_wakeup:
                                                spin_lock(&sem->lock);
                                                wake_up(&sem->waiters);
                                                spin_unlock(&sem->lock);
__down_read_failed:
        spin_lock(&sem->lock);
        add_wait_queue(&sem->waiters, &wait);
        .
        .

2001-04-10 05:44:49

by Linus Torvalds

Subject: Re: rw_semaphores



On Tue, 10 Apr 2001, Tachino Nobuhiro wrote:
>
> I am not familiar with semaphore or x86, so this may not be correct,
> but if the following sequence is possible, the writer can call wake_up()
> before the reader calls add_wait_queue() and reader may sleep forever.
> Is it possible?

The ordering is certainly possible, but if it happens,
__down_read_failed() won't actually sleep, because it will notice that the
value is positive and just return immediately. So it will do some
unnecessary work (add itself to the wait-queue only to remove itself
immediately again), but it will do the right thing.

Linus

2001-04-10 06:34:31

by Tachino Nobuhiro

Subject: Re: rw_semaphores


At Mon, 9 Apr 2001 22:43:53 -0700 (PDT),
Linus Torvalds wrote:

> The ordering is certainly possible, but if it happens,
> __down_read_failed() won't actually sleep, because it will notice that the
> value is positive and just return immediately. So it will do some
> unnecessary work (add itself to the wait-queue only to remove itself
> immediately again), but it will do the right thing.
>
> Linus
>

I understand. Thank you.

2001-04-10 07:47:59

by David Howells

Subject: Re: rw_semaphores


Since you're willing to use CMPXCHG in your suggested implementation, would it
make life easier if you were willing to use XADD too?

Plus, are you really willing to limit the number of readers or writers to
32767? If so, I think I can suggest a way that limits it to ~65535 apiece
instead...

David

2001-04-10 18:03:12

by David Howells

Subject: [PATCH] i386 rw_semaphores fix

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
        char buf[1];
        int max, loop, fd, tmp, rw;

        fd = open("/proc/rwsem",O_RDWR);
        if (fd<0) {
                perror("open");
                return 1;
        }

        max = (argc>1) ? atoi(argv[1]) : 50;
        rw = max<0;
        max = abs(max);

        for (loop=max; loop>0; loop--) {
                switch (fork()) {
                case -1:
                        perror("fork");
                        return 1;
                case 0:
                        rw ? write(fd," ",1) : read(fd,buf,1);
                        exit(1);
                default:
                        break;
                }
        }

        for (loop=max; loop>0; loop--) {
                if (wait(&tmp)<0) {
                        perror("wait");
                        return 1;
                }
        }

        return 0;
}

2001-04-10 19:42:50

by Linus Torvalds

Subject: Re: [PATCH] i386 rw_semaphores fix



On Tue, 10 Apr 2001, David Howells wrote:
>
> Here's a patch that fixes RW semaphores on the i386 architecture. It is very
> simple in the way it works.

XADD only works on Pentium+.

That's no problem if we make this SMP-specific - I doubt anybody actually
uses SMP on i486's even if the machines exist, as I think they all had
special glue logic that Linux would have trouble with anyway. But the
advantages of being able to use one generic kernel that works on plain UP
i386 machines as well as SMP P6+ machines is big enough that I would want
to be able to say "CONFIG_X86_GENERIC" + "CONFIG_SMP".

Even if it would be noticeably slower (ie a fallback to a spinlock might
be perfectly ok).

If you do this, I would suggest having asm-i386/{rwsem.h|rwsem-xadd.h},
and just having a

#ifndef CONFIG_XADD
#include <asm/rwsem.h>
#else
#include <asm/rwsem-xadd.h>
#endif

(And adding "CONFIG_XADD" to the list of generated optimization
configuration options in arch/i386/config.in, of course).

That way we don't make the semaphore.h file even more unreadable.


Linus


2001-04-10 19:57:13

by Jeff Garzik

Subject: x86 cpu configuration (was: Re: [PATCH] i386 rw_semaphores fix)

Linus Torvalds wrote:
> That's no problem if we make this SMP-specific - I doubt anybody actually
> uses SMP on i486's even if the machines exist, as I think they all had
> special glue logic that Linux would have trouble with anyway. But the
> advantages of being able to use one generic kernel that works on plain UP
> i386 machines as well as SMP P6+ machines is big enough that I would want
> to be able to say "CONFIG_X86_GENERIC" + "CONFIG_SMP".

(slight tangent)
FWIW I think the i386 CPU selection logic in make config should be
reconsidered in 2.5+...

The alpha presents you with a list of platforms, and allows you to
select a specific one, or a generic option that includes all platforms.
The current way of doing things on x86 -- essentially selecting a
minimal level of CPU support -- is nice for vendors like Mandrake who
drops support for older CPUs. But it isn't nice for the case where a
user wants support for their specific processor and no other. I have a
P-II, why include code that only works on K7 or P-III? The embedded
people, I think, would find such a change useful too.

The only problem with an alpha-like configuration is that Mandrake
cannot select a minimum CPU level of support anymore... I guess that
could be solved by an "advanced" sub-menu, similar to that which is
found in drivers/video/Config.in, which allows fine-grained Y/N
selections of CPU support.

--
Jeff Garzik | Sam: "Mind if I drive?"
Building 1024 | Max: "Not if you don't mind me clawing at the dash
MandrakeSoft | and shrieking like a cheerleader."

2001-04-10 20:07:24

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Tue, Apr 10, 2001 at 12:42:07PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 10 Apr 2001, David Howells wrote:
> >
> > Here's a patch that fixes RW semaphores on the i386 architecture. It is very
> > simple in the way it works.
>
> XADD only works on Pentium+.

My Intel manual says it's 486+:

``
XADD-Exchange and Add
[...]
Intel Architecture Compatibility
Intel Architecture processors earlier than the Intel486TM processor do not
recognize this instruction. If this instruction is used, you should provide
an equivalent code sequence that runs on earlier processors.
''

I guess 386 could live with an exception handler that emulates it.

(BTW an generic exception handler for CMPXCHG would also be very useful
for glibc -- currently it has special checking code for 386 in its mutexes)
The 386 are so slow that nobody would probably notice a bit more slowness
by a few exceptions.

-Andi

2001-04-10 20:16:44

by Linus Torvalds

Subject: Re: [PATCH] i386 rw_semaphores fix



On Tue, 10 Apr 2001, Andi Kleen wrote:
>
> I guess 386 could live with an exception handler that emulates it.

That approach is fine, although I'd personally prefer to take the
exception just once and just rewrite the instruction as a "call". The
places that need xadd would have to follow some strict guidelines (long
modrms or other instructions to pad out to enough size, and have the
arguments in fixed registers).

> (BTW an generic exception handler for CMPXCHG would also be very useful
> for glibc -- currently it has special checking code for 386 in its mutexes)
> The 386 are so slow that nobody would probably notice a bit more slowness
> by a few exceptions.

Ehh. I find that the slower the machine is, the more easily I _notice_
that it is slow. So..

Linus

2001-04-10 21:56:20

by Alan

Subject: Re: [PATCH] i386 rw_semaphores fix

> That's no problem if we make this SMP-specific - I doubt anybody actually
> uses SMP on i486's even if the machines exist, as I think they all had

They do. There are two (total world wide) 486 SMP users I know about and they
mostly do it to be awkward ;)

> special glue logic that Linux would have trouble with anyway. But the

The Compaqs were custom, the non Compaq ones included several using Intel
MP 1.1.

Alan

2001-04-10 21:57:11

by Alan

Subject: Re: x86 cpu configuration (was: Re: [PATCH] i386 rw_semaphores fix)

> The current way of doing things on x86 -- essentially selecting a
> minimal level of CPU support -- is nice for vendors like Mandrake who

That isn't how the current x86 one works. It just sort of looks like that
for a common subset.

2001-04-10 21:59:40

by Alan

Subject: Re: [PATCH] i386 rw_semaphores fix

> I guess 386 could live with an exception handler that emulates it.

386 could use a simpler setup and is non SMP

> (BTW an generic exception handler for CMPXCHG would also be very useful
> for glibc -- currently it has special checking code for 386 in its mutexes)
> The 386 are so slow that nobody would probably notice a bit more slowness
> by a few exceptions.

Be serious. You can compile glibc without 386 support. Most vendors already
distribute 386/586 or 386/686 glibc sets.

2001-04-11 00:01:32

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Tue, Apr 10, 2001 at 11:00:31PM +0100, Alan Cox wrote:
> > I guess 386 could live with an exception handler that emulates it.
>
> 386 could use a simpler setup and is non SMP

The idea was to have a `generic' kernel that works on all architectures.
If you drop 386 support much is better already.

> > (BTW an generic exception handler for CMPXCHG would also be very useful
> > for glibc -- currently it has special checking code for 386 in its mutexes)
> > The 386 are so slow that nobody would probably notice a bit more slowness
> > by a few exceptions.
>
> Be serious. You can compile glibc without 386 support. Most vendors already
> distribute 386/586 or 386/686 glibc sets.

Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
(the other 686 optimizations like memcpy also work on 386)

-Andi

2001-04-11 00:14:07

by David Weinehall

Subject: Re: [PATCH] i386 rw_semaphores fix

On Wed, Apr 11, 2001 at 02:00:58AM +0200, Andi Kleen wrote:
> On Tue, Apr 10, 2001 at 11:00:31PM +0100, Alan Cox wrote:
> > > I guess 386 could live with an exception handler that emulates it.
> >
> > 386 could use a simpler setup and is non SMP
>
> The idea was to have a `generic' kernel that works on all architectures.
> If you drop 386 support much is better already.
>
> > > (BTW an generic exception handler for CMPXCHG would also be very useful
> > > for glibc -- currently it has special checking code for 386 in its mutexes)
> > > The 386 are so slow that nobody would probably notice a bit more slowness
> > > by a few exceptions.
> >
> > Be serious. You can compile glibc without 386 support. Most vendors already
> > distribute 386/586 or 386/686 glibc sets.
>
> Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> (the other 686 optimizations like memcpy also work on 386)

But the code would be much slower, and it's on 386's and similarly
slow beasts you need every cycle you can get, NOT on a Pentium IV.


/David
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2001-04-11 00:20:59

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Wed, Apr 11, 2001 at 02:13:18AM +0200, David Weinehall wrote:
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> But the code would be much slower, and it's on 386's and similarly
> slow beasts you need every cycle you can get, NOT on a Pentium IV.

My reasoning is that who uses a 386 is not interested in speed, so a little
bit more slowness is not that bad.

You realize that the alternative for distributions is to drop 386 support
completely?

Most 386 i've seen used for linux do not run multi threaded applications
anyways; they usually do things like ISDN routing. Also on early 386 with
the kernel mode wp bug it's a security hazard to use clone().

-Andi

2001-04-11 00:41:04

by Tim Wright

Subject: Re: [PATCH] i386 rw_semaphores fix

Sequent Symmetry machines supported SMP on i486 and even i386 going back
to the original 16MHz 386 processors. You could put up to 30 in a system.
I do not, however, envisage anyone porting Linux to these any time soon.
The hardware is just too "unusual" for it to be feasible. The later Symmetry
5000 machines might be slightly more practical, but I'm not sure if you'd have
room in your new house for one. You could probably do without central heating
if we did send you one :-)

Tim

On Tue, Apr 10, 2001 at 10:57:09PM +0100, Alan Cox wrote:
> > That's no problem if we make this SMP-specific - I doubt anybody actually
> > uses SMP on i486's even if the machines exist, as I think they all had
>
> They do. There are two (total world wide) 486 SMP users I know about and they
> mostly do it to be awkward ;)
>
> > special glue logic that Linux would have trouble with anyway. But the
>
> The Compaqs were custom, the non Compaq ones included several using Intel
> MP 1.1.
>
> Alan

--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2001-04-11 00:55:46

by Linus Torvalds

Subject: Re: [PATCH] i386 rw_semaphores fix



On Wed, 11 Apr 2001, David Weinehall wrote:
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> But the code would be much slower, and it's on 386's and similarly
> slow beasts you need every cycle you can get, NOT on a Pentium IV.

Note that the "fixup" approach is not necessarily very painful at all,
from a performance standpoint (either on 386 or on newer CPU's). It's not
really that hard to just replace the instruction in the "undefined
instruction" handler by having strict rules about how to use the "xadd"
instruction.

For example, you would not actually fix up the xadd to be a function call
to something that emulates "xadd" itself on a 386. You would fix up the
whole sequence of "inline down_write()" with a simple call to an
out-of-line "i386_down_write()" function.

Note that down_write() on an old 386 is likely to be complicated enough
that you want to do it out-of-line anyway, so the code-path you take
(after the first time you've trapped on that particular location) would be
the one you would take for an optimized 386 kernel anyway. And similarly,
the restrictions you place on non-386-code to make it fixable are simple
enough that it probably shouldn't make a difference for performance on
modern chips.

Linus



2001-04-11 00:57:17

by David Weinehall

Subject: Re: [PATCH] i386 rw_semaphores fix

On Wed, Apr 11, 2001 at 02:20:28AM +0200, Andi Kleen wrote:
> On Wed, Apr 11, 2001 at 02:13:18AM +0200, David Weinehall wrote:
> > >
> > > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > > (the other 686 optimizations like memcpy also work on 386)
> >
> > But the code would be much slower, and it's on 386's and similarly
> > slow beasts you need every cycle you can get, NOT on a Pentium IV.
>
> My reasoning is that who uses a 386 is not interested in speed, so a little
> bit more slowness is not that bad.

My reasoning is that the choice of computer is a direct function of
your financial situation. I can get hold of a lot of 386's/486's, but
however old a Pentium may be, people are still reluctant to give away
those. Doing the sometimes necessary updates on my 386:en is already
painfully slow, and I'd rather not take another performance hit.

> You realize that the alternative for distributions is to drop 386 support
> completely?

Yes, I realise that. But if _distribution X_ drops support for the 386,
there's always _distribution Y_ available with it still in, while if
we give the glibc-people the thumbs up saying "Ignore the 386 users from
now on", every distribution will get lousy performance on those machines.

> Most 386 i've seen used for linux do not run multi threaded applications
> anyways; they usually do things like ISDN routing. Also on early 386 with
> the kernel mode wp bug it's a security hazard to use clone().

Well, not all 386:en are early...


/David
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2001-04-11 01:04:58

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Wed, Apr 11, 2001 at 02:56:32AM +0200, David Weinehall wrote:
> My reasoning is that the choice of computer is a direct function of
> your financial situation. I can get hold of a lot of 386's/486's, but
> however old a Pentium may be, people are still reluctant to give away
> those. Doing the sometimes necessary updates on my 386:en is already
> painfully slow, and I'd rather not take another performance hit.

As long as you don't use multithreaded applications there is no performance
hit with a kernel-mode CMPXCHG handler. iirc most multithreaded applications
are either too bloated for a 386 anyways, or trivial so that it doesn't matter.

-Andi

2001-04-11 01:08:28

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Tue, Apr 10, 2001 at 05:55:09PM -0700, Linus Torvalds wrote:
> Note that the "fixup" approach is not necessarily very painful at all,
> from a performance standpoint (either on 386 or on newer CPU's). It's not
> really that hard to just replace the instruction in the "undefined
> instruction" handler by having strict rules about how to use the "xadd"
> instruction.

Fixup for user space is probably not that nice (CMPXCHG is used there by
linuxthreads)


-Andi


2001-04-11 01:12:58

by Linus Torvalds

Subject: Re: [PATCH] i386 rw_semaphores fix



On Wed, 11 Apr 2001, Andi Kleen wrote:
>
> Fixup for user space is probably not that nice (CMPXCHG is used there by
> linuxthreads)

In user space I'm not convinced that you couldn't do the same thing
equally well by just having the proper dynamically linked library. You'd
not get in-lined lock primitives, but that's probably fine.

Linus

2001-04-11 01:24:22

by Andi Kleen

Subject: Re: [PATCH] i386 rw_semaphores fix

On Tue, Apr 10, 2001 at 06:12:12PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 11 Apr 2001, Andi Kleen wrote:
> >
> > Fixup for user space is probably not that nice (CMPXCHG is used there by
> > linuxthreads)
>
> In user space I'm not convinced that you couldn't do the same thing
> equally well by just having the proper dynamically linked library. You'd
> not get in-lined lock primitives, but that's probably fine.

It's currently done this way, ld-linux.so looks in a special "686" path when
the ELF vector mentions it, otherwise normal path. There is a special 686
version of glibc and linuxthread. Just it's a very complicated and disk
space chewing solution for a simple problem (some distributions are starting
to drop support for 386 because of that)

-Andi

2001-04-11 07:39:20

by David Howells

Subject: Re: [PATCH] i386 rw_semaphores fix

Can CONFIG_X86_XADD be equated to CONFIG_X86_CMPXCHG?

David

2001-04-11 12:28:17

by Alan

Subject: Re: [PATCH] i386 rw_semaphores fix

> > 386 could use a simpler setup and is non SMP
>
> The idea was to have a `generic' kernel that works on all architectures.
> If you drop 386 support much is better already.

Having the 386 non SMP only means you don't have to worry about this. We don't
currently support an SMP kernel that boots reliably on all non SMP boxes.
At least not officially, although current 2.4 seems close.

> > > (BTW an generic exception handler for CMPXCHG would also be very useful
> > > for glibc -- currently it has special checking code for 386 in its mutexes)
> > > The 386 are so slow that nobody would probably notice a bit more slowness
> > > by a few exceptions.
> >
> > Be serious. You can compile glibc without 386 support. Most vendors already
> > distribute 386/586 or 386/686 glibc sets.
>
> Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> (the other 686 optimizations like memcpy also work on 386)

They would still be needed. The 686-built glibc materially improves performance
on 686-class machines. That one isn't an interesting problem to solve. Install
the right software instead.


2001-04-11 12:32:27

by Alan

Subject: Re: [PATCH] i386 rw_semaphores fix

> My reasoning is that who uses a 386 is not interested in speed, so a little
> bit more slowness is not that bad.
>
> You realize that the alternative for distributions is to drop 386 support
> completely?

Rubbish. Mandrake and Red Hat have been shipping multiple kernel images,
multiple gzips and multiple glibc's for a very long time. It is quite simply
not a problem.

2001-04-11 12:36:28

by Alan

Subject: Re: [PATCH] i386 rw_semaphores fix

> It's currently done this way, ld-linux.so looks in a special "686" path when
> the ELF vector mentions it, otherwise normal path. There is a special 686
> version of glibc and linuxthread. Just it's a very complicated and disk
> space chewing solution for a simple problem (some distributions are starting
> to drop support for 386 because of that)

So drop your support for that distribution. I do not see the problem here. Look
at PS/2 support as another example. Supporting the IBM PS/2 machines is
commercially impractical and has no business case. Red Hat won't install on
a PS/2 machine. Debian has different constraints and guess what - Debian
installs beautifully on a PS/2.

Alan

2001-04-11 12:40:40

by Maciej W. Rozycki

Subject: Re: [PATCH] i386 rw_semaphores fix

On Wed, 11 Apr 2001, David Howells wrote:

> Can CONFIG_X86_XADD be equated to CONFIG_X86_CMPXCHG?

Yes. Modulo very early i486s which used incorrect opcodes for cmpxchg.
They can probably be safely ignored.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-11 12:58:14

by David Howells

Subject: [PATCH] 2nd try: i386 rw_semaphores fix

diff -uNr linux-2.4.3/arch/i386/config.in linux/arch/i386/config.in
--- linux-2.4.3/arch/i386/config.in Thu Apr 5 14:44:04 2001
+++ linux/arch/i386/config.in Wed Apr 11 08:38:04 2001
@@ -47,11 +47,13 @@

if [ "$CONFIG_M386" = "y" ]; then
define_bool CONFIG_X86_CMPXCHG n
+ define_bool CONFIG_X86_XADD n
define_int CONFIG_X86_L1_CACHE_SHIFT 4
else
define_bool CONFIG_X86_WP_WORKS_OK y
define_bool CONFIG_X86_INVLPG y
define_bool CONFIG_X86_CMPXCHG y
+ define_bool CONFIG_X86_XADD y
define_bool CONFIG_X86_BSWAP y
define_bool CONFIG_X86_POPAD_OK y
fi
diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.c Thu Apr 5 14:38:34 2001
+++ linux/arch/i386/kernel/semaphore.c Wed Apr 11 12:49:28 2001
@@ -14,7 +14,6 @@
*/
#include <linux/config.h>
#include <linux/sched.h>
-
#include <asm/semaphore.h>

/*
@@ -237,19 +236,10 @@
__down_read_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_read_failed_biased
-
-1: popl %ecx
+ call down_read_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_read_failed
- " LOCK "subl $1,(%eax)
- jns 1b
- jnc 2b
- jmp 3b
"
);

@@ -260,169 +250,267 @@
__down_write_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_write_failed_biased
-
-1: popl %ecx
+ call down_write_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_write_failed
- " LOCK "subl $" RW_LOCK_BIAS_STR ",(%eax)
- jz 1b
- jnc 2b
- jmp 3b
"
);

-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+ pushl %edx
+ pushl %ecx
+ call rwsem_wake
+ popl %ecx
+ popl %edx
+ ret
+"
+);

-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));

-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement exchange and add functionality
+ */
+static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
-
- add_wait_queue(&sem->wait, &wait); /* put ourselves at the head of the list */
+ int tmp = delta;

- for (;;) {
- if (sem->read_bias_granted && xchg(&sem->read_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->read_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->wait, &wait);
- tsk->state = TASK_RUNNING;
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ __asm__ __volatile__(
+ LOCK_PREFIX "xadd %0,(%1)"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");
+
+#else
+
+ __asm__ __volatile__(
+ "# beginning rwsem_atomic_update\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchgl %0,(%1)\n\t" /* retrieve the old value */
+ " addl %0,(%1)\n\t" /* subtract 0x00010001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ ".previous\n"
+ "# ending rwsem_atomic_update\n\t"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");

- return sem;
+#endif
+ return tmp+delta;
}

-struct rw_semaphore *down_write_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement compare and exchange functionality on the rw-semaphore count LSW
+ */
+static inline __u16 rwsem_cmpxchgw(struct rw_semaphore *sem, __u16 old, __u16 new)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ return cmpxchg((__u16*)&sem->count.counter,0,RWSEM_ACTIVE_BIAS);

- add_wait_queue_exclusive(&sem->write_bias_wait, &wait); /* put ourselves at the end of the list */
+#else
+ __u16 prev;

- for (;;) {
- if (sem->write_bias_granted && xchg(&sem->write_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->write_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->write_bias_wait, &wait);
- tsk->state = TASK_RUNNING;
-
- /* if the lock is currently unbiased, awaken the sleepers
- * FIXME: this wakes up the readers early in a bit of a
- * stampede -> bad!
- */
- if (atomic_read(&sem->count) >= 0)
- wake_up(&sem->wait);
+ __asm__ __volatile__(
+ "# beginning rwsem_cmpxchgw\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " cmpw %w1,(%3)\n\t"
+ " jne 4f\n\t" /* jump if old doesn't match sem->count LSW */
+ " movw %w2,(%3)\n\t" /* replace sem->count LSW with the new value */
+ "2:\n\t"
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " movw (%3),%w0\n" /* we'll want to return the current value */
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending rwsem_cmpxchgw\n\t"
+ : "=r"(prev)
+ : "r0"(old), "r"(new), "r"(sem)
+ : "memory");

- return sem;
+ return prev;
+#endif
}

-/* Wait for the lock to become unbiased. Readers
- * are non-exclusive. =)
+/*
+ * wait for the read lock to be granted
+ * - need to repeal the increment made inline by the caller
+ * - need to throw a write-lock style spanner into the works (sub 0x00010000 from count)
*/
struct rw_semaphore *down_read_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;
+
+ rwsemdebug("[%d] Entering down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));

- __up_read(sem); /* this takes care of granting the lock */
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_READ,&wait.flags);

- add_wait_queue(&sem->wait, &wait);
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */
+
+ /* note that we're now waiting on the lock, but no longer actively read-locking */
+ count = rwsem_atomic_update(RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("X(%08x)\n",count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);

- while (atomic_read(&sem->count) < 0) {
+ /* wait to be given the lock */
+ for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
+ if (!test_bit(RWSEM_WAITING_FOR_READ,&wait.flags))
break;
schedule();
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-/* Wait for the lock to become unbiased. Since we're
- * a writer, we'll make ourselves exclusive.
+/*
+ * wait for the write lock to be granted
*/
struct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;

- __up_write(sem); /* this takes care of granting the lock */
+ rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+ current->pid,atomic_read(&sem->count));

- add_wait_queue_exclusive(&sem->wait, &wait);
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags);
+
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */

- while (atomic_read(&sem->count) < 0) {
+ /* note that we're waiting on the lock, but no longer actively locking */
+ count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("A(%08x)\n",count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);
+
+ /* wait to be given the lock */
+ for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
- break; /* we must attempt to acquire or bias the lock */
+ if (!test_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags))
+ break;
schedule();
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_write_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
- pushl %edx
- pushl %ecx
-
- jz 1f
- call rwsem_wake_readers
- jmp 2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ * - the 'active part' of the count (&0x0000ffff) reached zero (but may no longer be zero)
+ * - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+ int woken;

-1: call rwsem_wake_writer
+ rwsemdebug("[%d] Entering rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));

-2: popl %ecx
- popl %edx
- ret
-"
-);
+ /* try to grab an 'activity' marker
+ * - need to make sure two copies of rwsem_wake() don't do this for two separate processes
+ * simultaneously
+ * - be horribly naughty, and only deal with the LSW of the atomic counter
+ */
+ if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0)
+ goto out;

-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
- if (xchg(&sem->read_bias_granted, 1))
+ /* try to grant a single write lock if there's a writer at the front of the queue
+ * - note we leave the 'active part' of the count incremented by 1 and the waiting part
+ * incremented by 0x00010000
+ */
+ switch (wake_up_ctx(&sem->wait,1,-RWSEM_WAITING_FOR_WRITE)) {
+ case 1:
+ goto out;
+ case 0:
+ break;
+ default:
BUG();
- wake_up(&sem->wait);
- return sem;
-}
+ }

-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
- if (xchg(&sem->write_bias_granted, 1))
+ rwsemdebug("E\n");
+
+ /* grant an infinite number of read locks to the readers at the front of the queue
+ * - note we increment the 'active part' of the count by the number of readers just woken,
+ * less one for the activity decrement we've already done
+ */
+ woken = wake_up_ctx(&sem->wait,65535,-RWSEM_WAITING_FOR_READ);
+ if (woken>0) {
+ woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+ woken -= RWSEM_ACTIVE_BIAS;
+ rwsem_atomic_update(woken,sem);
+ }
+ else {
BUG();
- wake_up(&sem->write_bias_wait);
+ }
+
+ out:
+ rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));
return sem;
}

+/*
+ * rw spinlock fallbacks
+ */
#if defined(CONFIG_SMP)
asm(
"
@@ -451,4 +539,3 @@
"
);
#endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h Thu Jan 1 01:00:00 1970
+++ linux/include/asm-i386/rwsem-spin.h Wed Apr 11 13:25:44 2001
@@ -0,0 +1,228 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h and asm-i386/spinlock.h
+ */
+
+#ifndef _I386_RWSEM_SPIN_H
+#define _I386_RWSEM_SPIN_H
+
+#ifdef __KERNEL__
+
+#define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ spinlock_t lock;
+#define RWSEM_SPINLOCK_OFFSET_STR "4" /* byte offset of spinlock */
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if WAITQUEUE_DEBUG
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if WAITQUEUE_DEBUG
+#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+ __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
+ __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ spin_lock_init(&sem->lock);
+ init_waitqueue_head(&sem->wait);
+#if WAITQUEUE_DEBUG
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 2b\n"
+ ".previous"
+ "# ending __down_read\n\t"
+ : "=a"(sem)
+ : "a0"(sem)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " add %0,(%%eax)\n\t" /* subtract 0x00010001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp)
+ : "a"(sem)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " addl %0,(%%eax)\n\t" /* subtract 1, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " decl %0\n\t" /* xchg gave us the old count */
+ " testl %2,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 2b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " addl %1,(%%eax)\n\t" /* adds 0x00010001 */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ :
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_SPIN_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem-xadd.h linux/include/asm-i386/rwsem-xadd.h
--- linux-2.4.3/include/asm-i386/rwsem-xadd.h Thu Jan 1 01:00:00 1970
+++ linux/include/asm-i386/rwsem-xadd.h Wed Apr 11 13:20:15 2001
@@ -0,0 +1,183 @@
+/* rwsem-xadd.h: R/W semaphores implemented using XADD/CMPXCHG
+ *
+ * Written by David Howells ([email protected]), 2001.
+ * Derived from asm-i386/semaphore.h
+ *
+ *
+ * The MSW of the count is the negated number of active writers and waiting
+ * lockers, and the LSW is the total number of active locks
+ *
+ * The lock count is initialized to 0 (no active and no waiting lockers).
+ *
+ * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
+ * uncontended lock. This can be determined because XADD returns the old value.
+ * Readers increment by 1 and see a positive value when uncontended, negative
+ * if there are writers (and maybe) readers waiting (in which case it goes to
+ * sleep).
+ *
+ * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
+ * be extended to 65534 by manually checking the whole MSW rather than relying
+ * on the S flag.
+ *
+ * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ *
+ * This should be totally fair - if anything is waiting, a process that wants a
+ * lock will go to the back of the queue. When the currently active lock is
+ * released, if there's a writer at the front of the queue, then that and only
+ * that will be woken up; if there's a bunch of consequtive readers at the
+ * front, then they'll all be woken up, but no other readers will be.
+ */
+
+#ifndef _I386_RWSEM_XADD_H
+#define _I386_RWSEM_XADD_H
+
+#ifdef __KERNEL__
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if WAITQUEUE_DEBUG
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if WAITQUEUE_DEBUG
+#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
+ __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ init_waitqueue_head(&sem->wait);
+#if WAITQUEUE_DEBUG
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+LOCK_PREFIX " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+ " js 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 1b\n"
+ ".previous"
+ "# ending down_read\n\t"
+ :
+ : "a"(sem)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtract 0x00010001, returns the old value */
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp)
+ : "a"(sem)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtracts 1, returns the old value */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " decl %0\n\t" /* xadd gave us the old count */
+ " testl %2,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 1b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+LOCK_PREFIX " addl %1,(%%eax)\n\t" /* adds 0x00010001 */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ :
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_XADD_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem.h linux/include/asm-i386/rwsem.h
--- linux-2.4.3/include/asm-i386/rwsem.h Thu Jan 1 01:00:00 1970
+++ linux/include/asm-i386/rwsem.h Wed Apr 11 13:28:35 2001
@@ -0,0 +1,125 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h
+ */
+
+#ifndef _I386_RWSEM_H
+#define _I386_RWSEM_H
+
+#include <linux/linkage.h>
+
+#undef RWSEM_DEBUG
+
+#ifdef __KERNEL__
+
+#include <asm/system.h>
+#include <asm/atomic.h>
+#include <asm/spinlock.h>
+#include <linux/wait.h>
+
+#ifndef RWSEM_DEBUG
+#define rwsemdebug(FMT,...)
+#else
+#define rwsemdebug printk
+#endif
+
+#ifdef CONFIG_X86_XADD
+#include <asm/rwsem-xadd.h> /* use XADD based semaphores if possible */
+#else
+#include <asm/rwsem-spin.h> /* use spinlock based semaphores otherwise */
+#endif
+
+/* we use FASTCALL convention for the helpers */
+extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
+
+/*
+ * lock for reading
+ */
+static inline void down_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_read(sem);
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_inc(&sem->readers);
+#endif
+
+ rwsemdebug("Leaving down_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * lock for writing
+ */
+static inline void down_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_write(sem);
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ if (atomic_read(&sem->readers))
+ BUG();
+ atomic_inc(&sem->writers);
+#endif
+
+ rwsemdebug("Leaving down_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a read lock
+ */
+static inline void up_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_dec(&sem->readers);
+#endif
+ __up_read(sem);
+
+ rwsemdebug("Leaving up_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a write lock
+ */
+static inline void up_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->readers))
+ BUG();
+ if (atomic_read(&sem->writers) != 1)
+ BUG();
+ atomic_dec(&sem->writers);
+#endif
+ __up_write(sem);
+
+ rwsemdebug("Leaving up_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_H */
diff -uNr linux-2.4.3/include/asm-i386/semaphore.h linux/include/asm-i386/semaphore.h
--- linux-2.4.3/include/asm-i386/semaphore.h Thu Apr 5 14:50:36 2001
+++ linux/include/asm-i386/semaphore.h Wed Apr 11 13:29:12 2001
@@ -38,8 +38,8 @@

#include <asm/system.h>
#include <asm/atomic.h>
-#include <asm/rwlock.h>
#include <linux/wait.h>
+#include <asm/rwsem.h>

struct semaphore {
atomic_t count;
@@ -202,184 +202,6 @@
:"=m" (sem->count)
:"c" (sem)
:"memory");
-}
-
-/* rw mutexes (should that be mutices? =) -- throw rw
- * spinlocks and semaphores together, and this is what we
- * end up with...
- *
- * The lock is initialized to BIAS. This way, a writer
- * subtracts BIAS ands gets 0 for the case of an uncontended
- * lock. Readers decrement by 1 and see a positive value
- * when uncontended, negative if there are writers waiting
- * (in which case it goes to sleep).
- *
- * The value 0x01000000 supports up to 128 processors and
- * lots of processes. BIAS must be chosen such that subl'ing
- * BIAS once per CPU will result in the long remaining
- * negative.
- *
- * In terms of fairness, this should result in the lock
- * flopping back and forth between readers and writers
- * under heavy use.
- *
- * -ben
- */
-struct rw_semaphore {
- atomic_t count;
- volatile unsigned char write_bias_granted;
- volatile unsigned char read_bias_granted;
- volatile unsigned char pad1;
- volatile unsigned char pad2;
- wait_queue_head_t wait;
- wait_queue_head_t write_bias_wait;
-#if WAITQUEUE_DEBUG
- long __magic;
- atomic_t readers;
- atomic_t writers;
-#endif
-};
-
-#if WAITQUEUE_DEBUG
-#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
-#else
-#define __RWSEM_DEBUG_INIT /* */
-#endif
-
-#define __RWSEM_INITIALIZER(name,count) \
-{ ATOMIC_INIT(count), 0, 0, 0, 0, __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
- __WAIT_QUEUE_HEAD_INITIALIZER((name).write_bias_wait) \
- __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
-
-#define __DECLARE_RWSEM_GENERIC(name,count) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
-
-#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
-#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
-#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- atomic_set(&sem->count, RW_LOCK_BIAS);
- sem->read_bias_granted = 0;
- sem->write_bias_granted = 0;
- init_waitqueue_head(&sem->wait);
- init_waitqueue_head(&sem->write_bias_wait);
-#if WAITQUEUE_DEBUG
- sem->__magic = (long)&sem->__magic;
- atomic_set(&sem->readers, 0);
- atomic_set(&sem->writers, 0);
-#endif
-}
-
-/* we use FASTCALL convention for the helpers */
-extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
-
-static inline void down_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_read_lock(sem, "__down_read_failed");
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_inc(&sem->readers);
-#endif
-}
-
-static inline void down_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_write_lock(sem, "__down_write_failed");
-#if WAITQUEUE_DEBUG
- if (atomic_read(&sem->writers))
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- atomic_inc(&sem->writers);
-#endif
-}
-
-/* When a reader does a release, the only significant
- * case is when there was a writer waiting, and we've
- * bumped the count to 0: we must wake the writer up.
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_read\n\t"
- LOCK "incl %0\n\t"
- "jz 2f\n" /* only do the wake if result == 0 (ie, a writer) */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-/* releasing the writer is easy -- just release it and
- * wake up any sleepers.
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_write\n\t"
- LOCK "addl $" RW_LOCK_BIAS_STR ",%0\n"
- "jc 2f\n" /* only do the wake if the result was -'ve to 0/+'ve */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-static inline void up_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_dec(&sem->readers);
-#endif
- __up_read(sem);
-}
-
-static inline void up_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (atomic_read(&sem->writers) != 1)
- BUG();
- atomic_dec(&sem->writers);
-#endif
- __up_write(sem);
}

#endif
diff -uNr linux-2.4.3/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.4.3/include/linux/sched.h Thu Apr 5 14:50:36 2001
+++ linux/include/linux/sched.h Wed Apr 11 13:29:12 2001
@@ -548,6 +548,8 @@

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern int FASTCALL(__wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
+extern int FASTCALL(__wake_up_sync_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -566,6 +568,8 @@
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
+#define wake_up_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
+#define wake_up_sync_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -uNr linux-2.4.3/include/linux/wait.h linux/include/linux/wait.h
--- linux-2.4.3/include/linux/wait.h Thu Apr 5 14:50:36 2001
+++ linux/include/linux/wait.h Wed Apr 11 13:25:44 2001
@@ -26,6 +26,14 @@
struct __wait_queue {
unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
+#define WQ_FLAG_CONTEXT_0 8 /* context specific flag bit numbers */
+#define WQ_FLAG_CONTEXT_1 9
+#define WQ_FLAG_CONTEXT_2 10
+#define WQ_FLAG_CONTEXT_3 11
+#define WQ_FLAG_CONTEXT_4 12
+#define WQ_FLAG_CONTEXT_5 13
+#define WQ_FLAG_CONTEXT_6 14
+#define WQ_FLAG_CONTEXT_7 15
struct task_struct * task;
struct list_head task_list;
#if WAITQUEUE_DEBUG
diff -uNr linux-2.4.3/kernel/fork.c linux/kernel/fork.c
--- linux-2.4.3/kernel/fork.c Thu Apr 5 14:44:17 2001
+++ linux/kernel/fork.c Tue Apr 10 09:19:47 2001
@@ -39,7 +39,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = 0;
+ wait->flags &= ~WQ_FLAG_EXCLUSIVE;
__add_wait_queue(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
@@ -49,7 +49,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = WQ_FLAG_EXCLUSIVE;
+ wait->flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue_tail(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
diff -uNr linux-2.4.3/kernel/sched.c linux/kernel/sched.c
--- linux-2.4.3/kernel/sched.c Thu Apr 5 14:44:17 2001
+++ linux/kernel/sched.c Tue Apr 10 15:25:44 2001
@@ -739,7 +739,7 @@
state = p->state;
if (state & mode) {
WQ_NOTE_WAKER(curr);
- if (try_to_wake_up(p, sync) && curr->flags && !--nr_exclusive)
+ if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
}
@@ -763,6 +763,80 @@
__wake_up_common(q, mode, nr, 1);
wq_read_unlock_irqrestore(&q->lock, flags);
}
+}
+
+/*
+ * wake up processes in the wait queue depending on the state of a context bit in the flags
+ * - wakes up a process if the specified bit is set in the flags member
+ * - the context bit is cleared if the process is woken up
+ * - if the bit number is negative, then the loop stops at the first unset context bit encountered
+ * - returns the number of processes woken
+ */
+static inline int __wake_up_ctx_common (wait_queue_head_t *q, unsigned int mode,
+ int count, int bit, const int sync)
+{
+ struct list_head *tmp, *head;
+ struct task_struct *p;
+ int stop, woken;
+
+ woken = 0;
+ stop = bit<0;
+ if (bit<0) bit = -bit;
+
+ CHECK_MAGIC_WQHEAD(q);
+ head = &q->task_list;
+ WQ_CHECK_LIST_HEAD(head);
+ tmp = head->next;
+ while (tmp != head) {
+ unsigned int state;
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+
+ tmp = tmp->next;
+ CHECK_MAGIC(curr->__magic);
+ p = curr->task;
+ state = p->state;
+ if (state & mode) {
+ if (!test_and_clear_bit(bit,&curr->flags)) {
+ if (stop)
+ break;
+ continue;
+ }
+
+ WQ_NOTE_WAKER(curr);
+ if (!try_to_wake_up(p, sync) || !(curr->flags&WQ_FLAG_EXCLUSIVE))
+ continue;
+
+ woken++;
+ if (woken>=count)
+ break;
+ }
+ }
+
+ return woken;
+}
+
+int __wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, mode, count, bit, 0);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
+}
+
+int __wake_up_ctx_sync(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, mode, count, bit, 1);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
}

#define SLEEP_ON_VAR \

2001-04-11 14:17:47

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

> I'd like you to look over it. It seems newer GCC's (snapshots and the
> upcoming 3.0) will be more strict when modifying some values through
> assembler-passed pointers - in this case, the passed semaphore structure got
> freed too early, causing massive stack corruption on early bootup.
>
> The solution was to directly mention the modified element (in this case,
> sem->count) with a "=m" qualifier, which told GCC that the contents of the
> semaphore structure are still really needed. It does not seem to have any
> bad side effects on older GCC, but lets the code work for people trying to
> use the newer snapshots.

I've just consulted with one of the gcc people we have here, and he says that
the '"memory"' constraint should do the trick.

Do I take it that that is actually insufficient?

David

2001-04-11 14:32:49

by Andreas Franck

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

Hello David and people,

> I've just consulted with one of the gcc people we have here, and he says that
> the '"memory"' constraint should do the trick.
>
> Do I take it that that is actually insufficient?

I don't remember exactly, it's been a while, but I think it was not
sufficient when I came up with this change. I can look at it in a few
hours.

The GCC manual is not really precise here:

> If your assembler instruction modifies memory in an unpredictable fashion,
> add `memory' to the list of clobbered registers. This will cause GNU CC to
> not keep memory values cached in registers across the assembler
> instruction. You will also want to add the volatile keyword if the memory
> affected is not listed in the inputs or outputs of the
> asm, as the `memory' clobber does not count as a side-effect of the asm.

So 'memory' alone probably won't do the trick: the problem here is not
caching, but the unknown storage size of the semaphore.

Perhaps the __volatile__ will help, but I don't know.

What are the reasons against mentioning sem->count directly as a "=m"
reference? This makes the whole thing less fragile and no longer dependent
on the memory layout of the structure.
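
For concreteness, the two variants I mean look roughly like this (an
illustrative sketch only, with made-up names and a plain int standing in
for atomic_t, not the exact kernel code):

struct rwsem_sketch {			/* hypothetical stand-in for rw_semaphore */
	int count;
};

/* variant 1: address passed in a register, only a "memory" clobber */
static inline void up_read_clobber(struct rwsem_sketch *sem)
{
	__asm__ __volatile__(
		"lock; incl (%0)"
		: /* no outputs */
		: "r" (sem)
		: "memory");
}

/* variant 2: the modified word named explicitly as an operand */
static inline void up_read_named(struct rwsem_sketch *sem)
{
	__asm__ __volatile__(
		"lock; incl %0"
		: "=m" (sem->count)	/* tells gcc this object is written... */
		: "m" (sem->count)	/* ...and also read */
		: "memory");
}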

Greetings,
Andreas

--
GMX - Die Kommunikationsplattform im Internet.
http://www.gmx.net

2001-04-11 14:44:10

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

I've been discussing it with some other kernel and GCC people, and they think
that only "memory" is required.

> What are the reasons against mentioning sem->count directly as a "=m"
> reference? This makes the whole thing less fragile and no more dependent
> on the memory layout of the structure.

Apart from the risk of breaking it, you mean? Well, "=m" seems to reserve an
extra register to hold a second copy of the semaphore address, probably since
it thinks EAX might get clobbered.

Also, as a minor point, it probably ought to be "+m" not "=m".
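
Something like the fragment below is what I mean (a hypothetical helper
just to show the constraint, not the actual patch):

/* hypothetical: a single read-modify-write memory operand via "+m" */
static inline void atomic_inc_sketch(int *count)
{
	__asm__ __volatile__(
		"lock; incl %0"
		: "+m" (*count)		/* operand is both read and written */
		: /* no other inputs */
		: "memory");
}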

David

2001-04-11 15:01:10

by Andreas Franck

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

Hello David,

> I've been discussing it with some other kernel and GCC people, and they think
> that only "memory" is required.

Hmm.. I just looked at my GCC problem report from December, perhaps you're
interested, too:

http://gcc.gnu.org/ml/gcc-bugs/2000-12/msg00554.html

The example in there compiles out of the box and is much easier to
experiment with than the whole kernel :-)

It should reflect the situation in the kernel as of December 2000, where no
outputs were declared at all.

I can try these examples again with current GCC snapshots and will see if I
can find a working solution without reserving more registers.

> Apart from the risk of breaking it, you mean? Well, "=m" seems to reserve an
> extra register to hold a second copy of the semaphore address, probably since
> it thinks EAX might get clobbered.
>
> Also, as a minor point, it probably ought to be "+m" not "=m".

Perhaps; I'm no real expert on these things, and "=m" worked for me, so
I used it :)

Greetings,
Andreas

--
GMX - Die Kommunikationsplattform im Internet.
http://www.gmx.net

2001-04-11 15:14:30

by Bernd Schmidt

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

On Wed, 11 Apr 2001, Andreas Franck wrote:

> Hello David,
>
> > I've been discussing it with some other kernel and GCC people, and they think
> > that only "memory" is required.
>
> Hmm.. I just looked at my GCC problem report from December, perhaps you're
> interested, too:
>
> http://gcc.gnu.org/ml/gcc-bugs/2000-12/msg00554.html
>
> The example in there compiles out-of-the box and is much easier to
> experiment on than the whole kernel :-)

That example seems to fail because a "memory" clobber only tells the compiler
that memory is written, not that it is read.
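
That is, in a pattern like the one in that report (a simplified, purely
illustrative sketch below, with made-up names), the initialising store is
only ever read by the asm; a compiler that takes "memory" to mean "this
asm writes memory" but not "reads it" is entitled to treat that store as
dead and drop it:

struct sem_sketch { int count; };	/* hypothetical */

int demo(void)
{
	struct sem_sketch s = { 1 };	/* store read only by the asm */

	__asm__ __volatile__(
		"lock; decl (%0)"	/* reads and writes s.count */
		: /* no outputs */
		: "r" (&s)
		: "memory");		/* only says memory may be written */

	return s.count;
}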


Bernd

2001-04-11 16:37:31

by David Howells

[permalink] [raw]
Subject: [PATCH] 3rd try: i386 rw_semaphores fix

diff -uNr linux-2.4.3/arch/i386/config.in linux-rwsem/arch/i386/config.in
--- linux-2.4.3/arch/i386/config.in Thu Apr 5 14:44:04 2001
+++ linux-rwsem/arch/i386/config.in Wed Apr 11 08:38:04 2001
@@ -47,11 +47,13 @@

if [ "$CONFIG_M386" = "y" ]; then
define_bool CONFIG_X86_CMPXCHG n
+ define_bool CONFIG_X86_XADD n
define_int CONFIG_X86_L1_CACHE_SHIFT 4
else
define_bool CONFIG_X86_WP_WORKS_OK y
define_bool CONFIG_X86_INVLPG y
define_bool CONFIG_X86_CMPXCHG y
+ define_bool CONFIG_X86_XADD y
define_bool CONFIG_X86_BSWAP y
define_bool CONFIG_X86_POPAD_OK y
fi
diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux-rwsem/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.c Thu Apr 5 14:38:34 2001
+++ linux-rwsem/arch/i386/kernel/semaphore.c Wed Apr 11 12:49:28 2001
@@ -14,7 +14,6 @@
*/
#include <linux/config.h>
#include <linux/sched.h>
-
#include <asm/semaphore.h>

/*
@@ -237,19 +236,10 @@
__down_read_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_read_failed_biased
-
-1: popl %ecx
+ call down_read_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_read_failed
- " LOCK "subl $1,(%eax)
- jns 1b
- jnc 2b
- jmp 3b
"
);

@@ -260,169 +250,267 @@
__down_write_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_write_failed_biased
-
-1: popl %ecx
+ call down_write_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_write_failed
- " LOCK "subl $" RW_LOCK_BIAS_STR ",(%eax)
- jz 1b
- jnc 2b
- jmp 3b
"
);

-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+ pushl %edx
+ pushl %ecx
+ call rwsem_wake
+ popl %ecx
+ popl %edx
+ ret
+"
+);

-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));

-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement exchange and add functionality
+ */
+static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
-
- add_wait_queue(&sem->wait, &wait); /* put ourselves at the head of the list */
+ int tmp = delta;

- for (;;) {
- if (sem->read_bias_granted && xchg(&sem->read_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->read_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->wait, &wait);
- tsk->state = TASK_RUNNING;
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ __asm__ __volatile__(
+ LOCK_PREFIX "xadd %0,(%1)"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");
+
+#else
+
+ __asm__ __volatile__(
+ "# beginning rwsem_atomic_update\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchgl %0,(%1)\n\t" /* retrieve the old value */
+ " addl %0,(%1)\n\t" /* subtract 0x00010001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ ".previous\n"
+ "# ending rwsem_atomic_update\n\t"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");

- return sem;
+#endif
+ return tmp+delta;
}

-struct rw_semaphore *down_write_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement compare and exchange functionality on the rw-semaphore count LSW
+ */
+static inline __u16 rwsem_cmpxchgw(struct rw_semaphore *sem, __u16 old, __u16 new)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ return cmpxchg((__u16*)&sem->count.counter,0,RWSEM_ACTIVE_BIAS);

- add_wait_queue_exclusive(&sem->write_bias_wait, &wait); /* put ourselves at the end of the list */
+#else
+ __u16 prev;

- for (;;) {
- if (sem->write_bias_granted && xchg(&sem->write_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->write_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->write_bias_wait, &wait);
- tsk->state = TASK_RUNNING;
-
- /* if the lock is currently unbiased, awaken the sleepers
- * FIXME: this wakes up the readers early in a bit of a
- * stampede -> bad!
- */
- if (atomic_read(&sem->count) >= 0)
- wake_up(&sem->wait);
+ __asm__ __volatile__(
+ "# beginning rwsem_cmpxchgw\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " cmpw %w1,(%3)\n\t"
+ " jne 4f\n\t" /* jump if old doesn't match sem->count LSW */
+ " movw %w2,(%3)\n\t" /* replace sem->count LSW with the new value */
+ "2:\n\t"
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " movw (%3),%w0\n" /* we'll want to return the current value */
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending rwsem_cmpxchgw\n\t"
+ : "=r"(prev)
+ : "r0"(old), "r"(new), "r"(sem)
+ : "memory");

- return sem;
+ return prev;
+#endif
}

-/* Wait for the lock to become unbiased. Readers
- * are non-exclusive. =)
+/*
+ * wait for the read lock to be granted
+ * - need to repeal the increment made inline by the caller
+ * - need to throw a write-lock style spanner into the works (sub 0x00010000 from count)
*/
struct rw_semaphore *down_read_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;
+
+ rwsemdebug("[%d] Entering down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));

- __up_read(sem); /* this takes care of granting the lock */
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_READ,&wait.flags);

- add_wait_queue(&sem->wait, &wait);
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */
+
+ /* note that we're now waiting on the lock, but no longer actively read-locking */
+ count = rwsem_atomic_update(RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("X(%08x)\n",count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);

- while (atomic_read(&sem->count) < 0) {
+ /* wait to be given the lock */
+ for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
+ if (!test_bit(RWSEM_WAITING_FOR_READ,&wait.flags))
break;
schedule();
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-/* Wait for the lock to become unbiased. Since we're
- * a writer, we'll make ourselves exclusive.
+/*
+ * wait for the write lock to be granted
*/
struct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;

- __up_write(sem); /* this takes care of granting the lock */
+ rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+ current->pid,atomic_read(&sem->count));

- add_wait_queue_exclusive(&sem->wait, &wait);
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags);
+
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */

- while (atomic_read(&sem->count) < 0) {
+ /* note that we're waiting on the lock, but no longer actively locking */
+ count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("A(%08x)\n",count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);
+
+ /* wait to be given the lock */
+ for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
- break; /* we must attempt to acquire or bias the lock */
+ if (!test_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags))
+ break;
schedule();
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_write_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
- pushl %edx
- pushl %ecx
-
- jz 1f
- call rwsem_wake_readers
- jmp 2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ * - the 'active part' of the count (&0x0000ffff) reached zero (but may no longer be zero)
+ * - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+ int woken;

-1: call rwsem_wake_writer
+ rwsemdebug("[%d] Entering rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));

-2: popl %ecx
- popl %edx
- ret
-"
-);
+ /* try to grab an 'activity' marker
+ * - need to make sure two copies of rwsem_wake() don't do this for two separate processes
+ * simultaneously
+ * - be horribly naughty, and only deal with the LSW of the atomic counter
+ */
+ if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0)
+ goto out;

-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
- if (xchg(&sem->read_bias_granted, 1))
+ /* try to grant a single write lock if there's a writer at the front of the queue
+ * - note we leave the 'active part' of the count incremented by 1 and the waiting part
+ * incremented by 0x00010000
+ */
+ switch (wake_up_ctx(&sem->wait,1,-RWSEM_WAITING_FOR_WRITE)) {
+ case 1:
+ goto out;
+ case 0:
+ break;
+ default:
BUG();
- wake_up(&sem->wait);
- return sem;
-}
+ }

-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
- if (xchg(&sem->write_bias_granted, 1))
+ rwsemdebug("E\n");
+
+ /* grant an infinite number of read locks to the readers at the front of the queue
+ * - note we increment the 'active part' of the count by the number of readers just woken,
+ * less one for the activity decrement we've already done
+ */
+ woken = wake_up_ctx(&sem->wait,65535,-RWSEM_WAITING_FOR_READ);
+ if (woken>0) {
+ woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+ woken -= RWSEM_ACTIVE_BIAS;
+ rwsem_atomic_update(woken,sem);
+ }
+ else {
BUG();
- wake_up(&sem->write_bias_wait);
+ }
+
+ out:
+ rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));
return sem;
}

+/*
+ * rw spinlock fallbacks
+ */
#if defined(CONFIG_SMP)
asm(
"
@@ -451,4 +539,3 @@
"
);
#endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h Wed Apr 11 17:18:07 2001
@@ -0,0 +1,228 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h and asm-i386/spinlock.h
+ */
+
+#ifndef _I386_RWSEM_SPIN_H
+#define _I386_RWSEM_SPIN_H
+
+#ifdef __KERNEL__
+
+#define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ spinlock_t lock;
+#define RWSEM_SPINLOCK_OFFSET_STR "4" /* byte offset of spinlock */
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if WAITQUEUE_DEBUG
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if WAITQUEUE_DEBUG
+#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+ __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
+ __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ spin_lock_init(&sem->lock);
+ init_waitqueue_head(&sem->wait);
+#if WAITQUEUE_DEBUG
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 2b\n"
+ ".previous"
+ "# ending __down_read\n\t"
+ : "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " add %0,(%%eax)\n\t" /* subtract 0x00010001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp), "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " addl %0,(%%eax)\n\t" /* subtract 1, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " decl %0\n\t" /* xchg gave us the old count */
+ " testl %4,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 2b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp), "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " addl %3,(%%eax)\n\t" /* adds 0x00010001 */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ : "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_SPIN_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem-xadd.h linux-rwsem/include/asm-i386/rwsem-xadd.h
--- linux-2.4.3/include/asm-i386/rwsem-xadd.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-xadd.h Wed Apr 11 16:50:49 2001
@@ -0,0 +1,183 @@
+/* rwsem-xadd.h: R/W semaphores implemented using XADD/CMPXCHG
+ *
+ * Written by David Howells ([email protected]), 2001.
+ * Derived from asm-i386/semaphore.h
+ *
+ *
+ * The MSW of the count is the negated number of active writers and waiting
+ * lockers, and the LSW is the total number of active locks
+ *
+ * The lock count is initialized to 0 (no active and no waiting lockers).
+ *
+ * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
+ * uncontended lock. This can be determined because XADD returns the old value.
+ * Readers increment by 1 and see a positive value when uncontended, negative
+ * if there are writers (and maybe readers) waiting (in which case it goes to
+ * sleep).
+ *
+ * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
+ * be extended to 65534 by manually checking the whole MSW rather than relying
+ * on the S flag.
+ *
+ * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ *
+ * This should be totally fair - if anything is waiting, a process that wants a
+ * lock will go to the back of the queue. When the currently active lock is
+ * released, if there's a writer at the front of the queue, then that and only
+ * that will be woken up; if there's a bunch of consecutive readers at the
+ * front, then they'll all be woken up, but no other readers will be.
+ */
+
+#ifndef _I386_RWSEM_XADD_H
+#define _I386_RWSEM_XADD_H
+
+#ifdef __KERNEL__
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if WAITQUEUE_DEBUG
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if WAITQUEUE_DEBUG
+#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
+ __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ init_waitqueue_head(&sem->wait);
+#if WAITQUEUE_DEBUG
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+LOCK_PREFIX " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+ " js 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 1b\n"
+ ".previous"
+ "# ending down_read\n\t"
+ : "=m"(sem->count)
+ : "a"(sem), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtract 0x00010001, returns the old value */
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp), "=m"(sem->count)
+ : "a"(sem), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtracts 1, returns the old value */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " decl %0\n\t" /* xadd gave us the old count */
+ " testl %3,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 1b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp), "=m"(sem->count)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+LOCK_PREFIX " addl %2,(%%eax)\n\t" /* adds 0x00010001 */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ : "=m"(sem->count)
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS), "m"(sem->count)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_XADD_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem.h linux-rwsem/include/asm-i386/rwsem.h
--- linux-2.4.3/include/asm-i386/rwsem.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem.h Wed Apr 11 17:29:18 2001
@@ -0,0 +1,125 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h
+ */
+
+#ifndef _I386_RWSEM_H
+#define _I386_RWSEM_H
+
+#include <linux/linkage.h>
+
+//#undef RWSEM_DEBUG
+
+#ifdef __KERNEL__
+
+#include <asm/system.h>
+#include <asm/atomic.h>
+#include <asm/spinlock.h>
+#include <linux/wait.h>
+
+#ifndef RWSEM_DEBUG
+#define rwsemdebug(FMT,...)
+#else
+#define rwsemdebug printk
+#endif
+
+#ifdef CONFIG_X86_XADD
+#include <asm/rwsem-xadd.h> /* use XADD based semaphores if possible */
+#else
+#include <asm/rwsem-spin.h> /* use spinlock based semaphores otherwise */
+#endif
+
+/* we use FASTCALL convention for the helpers */
+extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
+
+/*
+ * lock for reading
+ */
+static inline void down_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_read(sem);
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_inc(&sem->readers);
+#endif
+
+ rwsemdebug("Leaving down_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * lock for writing
+ */
+static inline void down_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_write(sem);
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ if (atomic_read(&sem->readers))
+ BUG();
+ atomic_inc(&sem->writers);
+#endif
+
+ rwsemdebug("Leaving down_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a read lock
+ */
+static inline void up_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_dec(&sem->readers);
+#endif
+ __up_read(sem);
+
+ rwsemdebug("Leaving up_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a write lock
+ */
+static inline void up_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if WAITQUEUE_DEBUG
+ if (atomic_read(&sem->readers))
+ BUG();
+ if (atomic_read(&sem->writers) != 1)
+ BUG();
+ atomic_dec(&sem->writers);
+#endif
+ __up_write(sem);
+
+ rwsemdebug("Leaving up_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_H */
diff -uNr linux-2.4.3/include/asm-i386/semaphore.h linux-rwsem/include/asm-i386/semaphore.h
--- linux-2.4.3/include/asm-i386/semaphore.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/asm-i386/semaphore.h Wed Apr 11 13:29:12 2001
@@ -38,8 +38,8 @@

#include <asm/system.h>
#include <asm/atomic.h>
-#include <asm/rwlock.h>
#include <linux/wait.h>
+#include <asm/rwsem.h>

struct semaphore {
atomic_t count;
@@ -202,184 +202,6 @@
:"=m" (sem->count)
:"c" (sem)
:"memory");
-}
-
-/* rw mutexes (should that be mutices? =) -- throw rw
- * spinlocks and semaphores together, and this is what we
- * end up with...
- *
- * The lock is initialized to BIAS. This way, a writer
- * subtracts BIAS ands gets 0 for the case of an uncontended
- * lock. Readers decrement by 1 and see a positive value
- * when uncontended, negative if there are writers waiting
- * (in which case it goes to sleep).
- *
- * The value 0x01000000 supports up to 128 processors and
- * lots of processes. BIAS must be chosen such that subl'ing
- * BIAS once per CPU will result in the long remaining
- * negative.
- *
- * In terms of fairness, this should result in the lock
- * flopping back and forth between readers and writers
- * under heavy use.
- *
- * -ben
- */
-struct rw_semaphore {
- atomic_t count;
- volatile unsigned char write_bias_granted;
- volatile unsigned char read_bias_granted;
- volatile unsigned char pad1;
- volatile unsigned char pad2;
- wait_queue_head_t wait;
- wait_queue_head_t write_bias_wait;
-#if WAITQUEUE_DEBUG
- long __magic;
- atomic_t readers;
- atomic_t writers;
-#endif
-};
-
-#if WAITQUEUE_DEBUG
-#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
-#else
-#define __RWSEM_DEBUG_INIT /* */
-#endif
-
-#define __RWSEM_INITIALIZER(name,count) \
-{ ATOMIC_INIT(count), 0, 0, 0, 0, __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
- __WAIT_QUEUE_HEAD_INITIALIZER((name).write_bias_wait) \
- __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
-
-#define __DECLARE_RWSEM_GENERIC(name,count) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
-
-#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
-#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
-#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- atomic_set(&sem->count, RW_LOCK_BIAS);
- sem->read_bias_granted = 0;
- sem->write_bias_granted = 0;
- init_waitqueue_head(&sem->wait);
- init_waitqueue_head(&sem->write_bias_wait);
-#if WAITQUEUE_DEBUG
- sem->__magic = (long)&sem->__magic;
- atomic_set(&sem->readers, 0);
- atomic_set(&sem->writers, 0);
-#endif
-}
-
-/* we use FASTCALL convention for the helpers */
-extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
-
-static inline void down_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_read_lock(sem, "__down_read_failed");
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_inc(&sem->readers);
-#endif
-}
-
-static inline void down_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_write_lock(sem, "__down_write_failed");
-#if WAITQUEUE_DEBUG
- if (atomic_read(&sem->writers))
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- atomic_inc(&sem->writers);
-#endif
-}
-
-/* When a reader does a release, the only significant
- * case is when there was a writer waiting, and we've
- * bumped the count to 0: we must wake the writer up.
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_read\n\t"
- LOCK "incl %0\n\t"
- "jz 2f\n" /* only do the wake if result == 0 (ie, a writer) */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-/* releasing the writer is easy -- just release it and
- * wake up any sleepers.
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_write\n\t"
- LOCK "addl $" RW_LOCK_BIAS_STR ",%0\n"
- "jc 2f\n" /* only do the wake if the result was -'ve to 0/+'ve */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-static inline void up_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_dec(&sem->readers);
-#endif
- __up_read(sem);
-}
-
-static inline void up_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (atomic_read(&sem->writers) != 1)
- BUG();
- atomic_dec(&sem->writers);
-#endif
- __up_write(sem);
}

#endif
diff -uNr linux-2.4.3/include/linux/sched.h linux-rwsem/include/linux/sched.h
--- linux-2.4.3/include/linux/sched.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/linux/sched.h Wed Apr 11 13:29:12 2001
@@ -548,6 +548,8 @@

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern int FASTCALL(__wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
+extern int FASTCALL(__wake_up_sync_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -566,6 +568,8 @@
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
+#define wake_up_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
+#define wake_up_sync_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -uNr linux-2.4.3/include/linux/wait.h linux-rwsem/include/linux/wait.h
--- linux-2.4.3/include/linux/wait.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/linux/wait.h Wed Apr 11 13:25:44 2001
@@ -26,6 +26,14 @@
struct __wait_queue {
unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
+#define WQ_FLAG_CONTEXT_0 8 /* context specific flag bit numbers */
+#define WQ_FLAG_CONTEXT_1 9
+#define WQ_FLAG_CONTEXT_2 10
+#define WQ_FLAG_CONTEXT_3 11
+#define WQ_FLAG_CONTEXT_4 12
+#define WQ_FLAG_CONTEXT_5 13
+#define WQ_FLAG_CONTEXT_6 14
+#define WQ_FLAG_CONTEXT_7 15
struct task_struct * task;
struct list_head task_list;
#if WAITQUEUE_DEBUG
diff -uNr linux-2.4.3/kernel/fork.c linux-rwsem/kernel/fork.c
--- linux-2.4.3/kernel/fork.c Thu Apr 5 14:44:17 2001
+++ linux-rwsem/kernel/fork.c Tue Apr 10 09:19:47 2001
@@ -39,7 +39,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = 0;
+ wait->flags &= ~WQ_FLAG_EXCLUSIVE;
__add_wait_queue(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
@@ -49,7 +49,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = WQ_FLAG_EXCLUSIVE;
+ wait->flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue_tail(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
diff -uNr linux-2.4.3/kernel/sched.c linux-rwsem/kernel/sched.c
--- linux-2.4.3/kernel/sched.c Thu Apr 5 14:44:17 2001
+++ linux-rwsem/kernel/sched.c Tue Apr 10 15:25:44 2001
@@ -739,7 +739,7 @@
state = p->state;
if (state & mode) {
WQ_NOTE_WAKER(curr);
- if (try_to_wake_up(p, sync) && curr->flags && !--nr_exclusive)
+ if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
}
@@ -763,6 +763,80 @@
__wake_up_common(q, mode, nr, 1);
wq_read_unlock_irqrestore(&q->lock, flags);
}
+}
+
+/*
+ * wake up processes in the wait queue depending on the state of a context bit in the flags
+ * - wakes up a process if the specified bit is set in the flags member
+ * - the context bit is cleared if the process is woken up
+ * - if the bit number is negative, then the loop stops at the first unset context bit encountered
+ * - returns the number of processes woken
+ */
+static inline int __wake_up_ctx_common (wait_queue_head_t *q, unsigned int mode,
+ int count, int bit, const int sync)
+{
+ struct list_head *tmp, *head;
+ struct task_struct *p;
+ int stop, woken;
+
+ woken = 0;
+ stop = bit<0;
+ if (bit<0) bit = -bit;
+
+ CHECK_MAGIC_WQHEAD(q);
+ head = &q->task_list;
+ WQ_CHECK_LIST_HEAD(head);
+ tmp = head->next;
+ while (tmp != head) {
+ unsigned int state;
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+
+ tmp = tmp->next;
+ CHECK_MAGIC(curr->__magic);
+ p = curr->task;
+ state = p->state;
+ if (state & mode) {
+ if (!test_and_clear_bit(bit,&curr->flags)) {
+ if (stop)
+ break;
+ continue;
+ }
+
+ WQ_NOTE_WAKER(curr);
+ if (!try_to_wake_up(p, sync) || !(curr->flags&WQ_FLAG_EXCLUSIVE))
+ continue;
+
+ woken++;
+ if (woken>=count)
+ break;
+ }
+ }
+
+ return woken;
+}
+
+int __wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, mode, count, bit, 0);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
+}
+
+int __wake_up_ctx_sync(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, mode, count, bit, 1);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
}

#define SLEEP_ON_VAR \

2001-04-11 17:00:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

David Howells wrote:
>
> Here's a patch that fixes RW semaphores on the i386 architecture. It is very
> simple in the way it works.
>
> The lock counter is dealt with as two semi-independent words: the LSW is the
> number of active (granted) locks, and the MSW, if negated, is the number of
> active writers (0 or 1) plus the number of waiting lockers of any type.
>
> The fast paths are either two or three instructions long.
>
> This algorithm should also be totally fair! Contentious lockers get put on the
> back of the wait queue, and a waker function wakes them starting at the front,
> but only wakes either one writer or the first consecutive bundle of readers.

I think that's a very good approach. Sure, it's suboptimal when there
are three or more waiters (and they're the right type and order). But
that never happens. Nice design idea.

> The disadvantage is that the maximum number of active locks is 65535, and the
> maximum number of waiting locks is 32766 (this can be extended to 65534 by not
> relying on the S flag).

These numbers are infinity :)

> I've included a revised testing module (rwsem.c) that allows read locks to be
> obtained as well as write locks and a revised driver program (driver.c) that
> can use rwsem.c. Try the following tests:
>
> driver -200 & driver 200 # fork 200 writers and then 200 readers
> driver 200 & driver -200 # fork 200 readers and then 200 writers

You need sterner testing stuff :) I hit the BUG at the end of rwsem_wake()
in about a second running rwsem-4. Removed the BUG and everything stops
in D state.

Grab rwsem-4 from

http://www.uow.edu.au/~andrewm/linux/rwsem.tar.gz

It's very simple. But running fully in-kernel shortens the
code paths enormously and allows you to find those little
timing windows. Run rwsem-4 in two modes: one with
the schedule() in sched() enabled, and also with it
commented out. If it passes that, it works. When
you remove the module it'll print out the number of
read-grants versus write-grants. If these run at 6:1
with schedule() disabled then you've kicked butt.

Also, rwsem-4 checks that the rwsems are actually providing
exclusion between readers and writers, and between
writers and writers. A useful thing to check, that.


Some random comments:

- rwsemdebug(FMT, ...) doesn't compile with egcs-1.1.2. Need
to remove the comma.

- In include/linux/sched.h:INIT_MM() I suggest we remove the
second arg to __RWSEM_INITIALISER(). Your code doesn't use it,
other architectures don't use it. __RWSEM_INITIALIZER(name.mmap_sem)
seems appropriate.

- The comments in down_write and down_read() are inaccurate.
RWSEM_ACTIVE_WRITE_BIAS is 0xffff0001, not 0x00010001

- It won't compile when WAITQUEUE_DEBUG is turned on. I
guess you knew that. (Good luck trying to enable
WAITQUEUE_DEBUG in -ac kernels, BTW. Someone added
a config option for it (CONFIG_DEBUG_WAITK) but forgot
to add it to the config system!)

- The comments above the functions in semaphore.h need
updating.

- What on earth does __xg() do? (And why do people write
code like that without explaining why? Don't answer this
one).

- May as well kill the local variable `state' in the __wake_up
functions. Not sure why I left that in there...

- Somewhat offtopic: the `asm' statements in semaphore.c
are really dangerous. If you precede them with an init_module(),
the asm code gets assembled into the .initcall section and
gets unloaded when you least expected it. If you precede it
with an EXPORT_SYMBOL() then the code gets assembled into the
__ksymtab section and your kernel mysteriously explodes when
loading modules, providing you with a fun hour working out why
(I measured it). If you precede the asm with a data initialiser
then the code gets assembled into .data, etc...

I think the most elegant fix for this is to wrap the asm code
inside a dummy C function. This way, we allow the compiler
to put the code into whatever section gcc-of-the-day is putting
text into. Drawbacks of this are that the wrapper function
will need global scope to avoid a compile-time warning, and
it'll get dropped altogether if people are playing with
-ffunction-sections. Alternative is, of course, to
add `.text' to the asm itself.
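
For illustration, the two options look roughly like this (a sketch only -
the stub names are invented; the `.text' variant is what the patch below
actually does):

	/* (a) pin the stub explicitly with a section directive: */
	asm(
	".text\n"
	".align 4\n"
	".globl __example_stub\n"
	"__example_stub:\n\t"
		"ret\n"
	);

	/* (b) dummy C function wrapper: the asm body is emitted inside the
	 * wrapper, so it lands in whatever section the compiler is currently
	 * using for text.  As noted above, the wrapper wants global scope to
	 * avoid an unused-static warning, and it can be dropped entirely if
	 * someone plays with -ffunction-sections.
	 */
	void __example_stub_holder(void)
	{
		asm(
		".align 4\n"
		".globl __example_stub2\n"
		"__example_stub2:\n\t"
			"ret\n"
		);
	}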

2001-04-11 17:36:44

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

Andrew Morton wrote:
> I think that's a very good approach. Sure, it's suboptimal when there
> are three or more waiters (and they're the right type and order). But
> that never happens. Nice design idea.

Cheers.

> These numbers are infinity :)

I know, but I think Linus may be happy with the resolution for the moment. It
can be extended later by siphoning off excess quantities of waiters into a
separate counter (as is done now) and by making the access count use a larger
part of the variable.

Unfortunately, managing the count and siphoned-off count together is tricky.

> You need sterner testing stuff :) I hit the BUG at the end of rwsem_wake()
> in about a second running rwsem-4. Removed the BUG and everything stops
> in D state.
>
> Grab rwsem-4 from
> ...

Will do.

> It's very simple. But running fully in-kernel shortens the
> code paths enormously and allows you to find those little
> timing windows.

I thought I'd got them all by using an activity counter incremented by both
read and write lockers.

> - rwsemdebug(FMT, ...) doesn't compile with egcs-1.1.2. Need
> to remove the comma.

This is tricky... you get all sorts of horrible warnings with gcc-2.96 if you
remove the comma. What I've done is now ANSI-C99 compliant, but egcs is not.
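
For the record, the two spellings in question look like this (illustration
only; the second, GNU-style form is the one egcs/gcc-2.91 understands, and
its `##' also swallows the trailing comma when no extra arguments are
passed, which the strict C99 form does not):

	#define rwsemdebug_c99(FMT, ...)	printk(FMT, __VA_ARGS__)
	#define rwsemdebug_gnu(FMT, ARGS...)	printk(FMT, ##ARGS)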

> - The comments in down_write and down_read() are inaccurate.
> RWSEM_ACTIVE_WRITE_BIAS is 0xffff0001, not 0x00010001

Done.

> - It won't compile when WAITQUEUE_DEBUG is turned on. I
> guess you knew that.

Currently putting in separate debugging stuff for rwsems.

> - The comments above the functions in semaphore.h need
> updating.

Done. (BTW in the latest patch, they're actually split out into separate
header files as per Linus's suggestion).

> - What on earth does __xg() do? (And why do people write
> code like that without explaining why? Don't answer this
> one).

Stolen from the xchg() macro/function, but I'm not sure what it does. Plus I
don't use it now.

> - Somewhat offtopic: the `asm' statements in semaphore.c
> are really dangerous.

Now all got .text in.

2001-04-11 18:06:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

Followup to: <[email protected]>
By author: Linus Torvalds <[email protected]>
In newsgroup: linux.dev.kernel
>
> Note that the "fixup" approach is not necessarily very painful at all,
> from a performance standpoint (either on 386 or on newer CPU's). It's not
> really that hard to just replace the instruction in the "undefined
> instruction" handler by having strict rules about how to use the "xadd"
> instruction.
>
> For example, you would not actually fix up the xadd to be a function call
> to something that emulates "xadd" itself on a 386. You would fix up the
> whole sequence of "inline down_write()" with a simple call to an
> out-of-line "i386_down_write()" function.
>
> Note that down_write() on an old 386 is likely to be complicated enough
> that you want to do it out-of-line anyway, so the code-path you take
> (after the first time you've trapped on that particular location) would be
> the one you would take for an optimized 386 kernel anyway. And similarly,
> the restrictions you place on non-386-code to make it fixable are simple
> enough that it probably shouldn't make a difference for performance on
> modern chips.
>

This reminds me a lot of how FPU emulation was done on 16-bit x86
CPUs, which didn't have the #EM trap on FPU instructions. Each FPU
instruction would actually be assembled as CD + <FPU insn>, except
that the first byte of the FPU insn had its top bits modified. Of
course the CD is the first byte of the INT instruction, so it would
dispatch to a very small set of interrupt vectors based on the first
byte of the FPU instruction; in case there really was an FPU it would
patch in <FPU insn>+NOP, otherwise it would patch in CALL <FPU
emulation routine> if you were running in a small-code model, or just
emulate it in a large-code model (since the far CALL wouldn't fit.)

This is a very nice way to deal with this, since your performance
impact is virtually nil in either case, since you're only taking the
trap once per call site. A little bit of icache footprint, that's
all.

Now, if you're compiling for 486+ anyway, you would of course not add
the extra padding, and skip the trap handler.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-11 18:06:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

Followup to: <[email protected]>
By author: Alan Cox <[email protected]>
In newsgroup: linux.dev.kernel
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> They would still be needed. The 686 built glibc materially improves performance
> on 686 class machines. That one isnt an interesting problem to solve. Install
> the right software instead.
>

Yes, the big 686 optimization is CMOV, and that one is
ultra-pervasive.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-11 18:28:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix



On Wed, 11 Apr 2001, Bernd Schmidt wrote:
> >
> > The example in there compiles out-of-the box and is much easier to
> > experiment on than the whole kernel :-)
>
> That example seems to fail because a "memory" clobber only tells the compiler
> that memory is written, not that it is read.

The above makes no sense. The notion of "memory is written" is a true
superset of the "memory is read", and must be a complete memory barrier
(ie telling the compiler that "we read memory" is non-sensical: it can't
give the compiler any more information).

Because the memory clobber doesn't tell _what_ memory is clobbered, you
cannot consider memory dead after the instruction. As such, the compiler
HAS to treat a clobber as a "read-modify-write" - because on a very
fundamental level it _is_. Clobbering memory is logically 100% equivalent
to reading all of memory, modifying some of it, and writing the modified
information back.

(This is different from a "clobber specific register" thing, btw, where
the compiler can honestly assume that the register is dead after the
instruction and contains no useful data. That is a write-only dependency,
and means that gcc can validly use optimization techniques like removing
previous dead writes to the register.)

See? Do you see why a "memory" clobber is _not_ comparable to a "ax"
clobber? And why that non-comparability makes a memory clobber equivalent
to a read-modify-write cycle?
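
In code, the same point looks like this (a toy example, not from any patch):

	int flag;			/* global, so it is addressable memory */

	int example(void)
	{
		flag = 2;		/* must be emitted before the asm - the
					 * asm may read any memory */
		asm volatile ("" : : : "memory");
		return flag;		/* must be reloaded afterwards - the asm
					 * may have written any memory, so this
					 * cannot be folded into "return 2" */
	}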

In short: I disagree 100%. A "memory" clobber -does- effectively tell the
compiler that memory is read. If the compiler doesn't realize that, then
it's a compiler bug waiting to happen. No ifs, buts, or maybes.

Linus

2001-04-11 18:41:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix



On Wed, 11 Apr 2001, David Howells wrote:
>
> > These numbers are infinity :)
>
> I know, but I think Linus may be happy with the resolution for the moment. It
> can be extended later by siphoning off excess quantities of waiters into a
> separate counter (as is done now) and by making the access count use a larger
> part of the variable.

I'm certainly ok with the count being limited to "thousands". I don't see
people being able to exploit it in any practical way. But we should remember
to be careful: starting thousands of threads and trying to make them all
take page faults and overflow the read counter would be a _nasty_
attack. It would probably not be particularly easy to arrange, but still.

Note that blocking locks are different from spinlocks: for spinlocks we
can get by with just 7 bits in a byte, and be guaranteed that that is
enough for any machine with less than 128 processors. For the blocking
locks, that is not true.

(Right now the default "max user processes" ulimit already protects us
from this exploit, I just wanted to make sure that people _think_ about
this).

So a 16-bit count is _fine_. And I could live with less.

We should remember the issue, though. If we ever decide to expand it, it
would be easy enough to make an alternative "rwsem-reallybig.h" that uses
cmpxchg8b instead, or whatever. You don't need to start splitting the
counts up to expand them past 16 bits, you could have a simple system
where the _read_ locks only look at one (fast) 32-bit word for their fast
case, and only the write lock actually needs to use cmpxchg8b.

(I think it's reasonable to optimize the read-lock more than the
write-lock: in the cases where you want to do rw-locking, the common
reason is because you really _want_ to allow many concurrent readers. That
also implies that the read case is the common one).

So you could fairly easily expand past 16-bit counters by using a 31-bit
counter for the reader, and making the high bit in the reader count be the
"contention" bit. Then the slow path (and the write path) would use the
64-bit operations offered by cmpxchg8b.
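
In other words, something like this (hypothetical layout and field names,
not part of any posted patch):

	struct rwsem_reallybig {		/* hypothetical */
		unsigned int readers;		/* bits 0..30: reader count,
						 * bit 31: "contention" flag */
		unsigned int writers_waiters;	/* writer/waiter bookkeeping,
						 * only ever changed together
						 * with the word above using
						 * cmpxchg8b */
	};

	/* the read fast path touches only .readers; down_write() and the
	 * slow paths operate on the whole 8-byte pair with cmpxchg8b.
	 */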

And yes, it's a Pentium+ instruction (and this time I -checked- ;), but by
the time you worry about hundreds of thousands of threads I think you can
safely just say "you'd better be running on a big a modern machine", and
just make the code conditional on CONFIG_PENTIUM+

So no real trickiness needed for expanding the number space, but certainly
also no real _reason_ for it at this time.

Linus

2001-04-11 21:27:30

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix


> You need sterner testing stuff :) I hit the BUG at the end of rwsem_wake()
> in about a second running rwsem-4. Removed the BUG and everything stops
> in D state.
>
> Grab rwsem-4 from
>
> http://www.uow.edu.au/~andrewm/linux/rwsem.tar.gz
>
> It's very simple. But running fully in-kernel shortens the
> code paths enormously and allows you to find those little
> timing windows. Run rwsem-4 in two modes: one with
> the schedule() in sched() enabled, and also with it
> commented out. If it passes that, it works. When
> you remove the module it'll print out the number of
> read-grants versus write-grants. If these run at 6:1
> with schedule() disabled then you've kicked butt.
>
> Also, rwsem-4 checks that the rwsems are actually providing
> exclusion between readers and writers, and between
> writers and writers. A useful thing to check, that.

It now works (patch to follow).

schedule() enabled:
reads taken: 686273
writes taken: 193414

schedule() disabled:
reads taken: 585619
writes taken: 292997

David

2001-04-11 21:42:33

by David Howells

[permalink] [raw]
Subject: [PATCH] 4th try: i386 rw_semaphores fix

diff -uNr linux-2.4.3/arch/i386/config.in linux-rwsem/arch/i386/config.in
--- linux-2.4.3/arch/i386/config.in Thu Apr 5 14:44:04 2001
+++ linux-rwsem/arch/i386/config.in Wed Apr 11 08:38:04 2001
@@ -47,11 +47,13 @@

if [ "$CONFIG_M386" = "y" ]; then
define_bool CONFIG_X86_CMPXCHG n
+ define_bool CONFIG_X86_XADD n
define_int CONFIG_X86_L1_CACHE_SHIFT 4
else
define_bool CONFIG_X86_WP_WORKS_OK y
define_bool CONFIG_X86_INVLPG y
define_bool CONFIG_X86_CMPXCHG y
+ define_bool CONFIG_X86_XADD y
define_bool CONFIG_X86_BSWAP y
define_bool CONFIG_X86_POPAD_OK y
fi
diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux-rwsem/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.c Thu Apr 5 14:38:34 2001
+++ linux-rwsem/arch/i386/kernel/semaphore.c Wed Apr 11 22:30:59 2001
@@ -14,7 +14,6 @@
*/
#include <linux/config.h>
#include <linux/sched.h>
-
#include <asm/semaphore.h>

/*
@@ -179,6 +178,7 @@
* value..
*/
asm(
+".text\n"
".align 4\n"
".globl __down_failed\n"
"__down_failed:\n\t"
@@ -193,6 +193,7 @@
);

asm(
+".text\n"
".align 4\n"
".globl __down_failed_interruptible\n"
"__down_failed_interruptible:\n\t"
@@ -205,6 +206,7 @@
);

asm(
+".text\n"
".align 4\n"
".globl __down_failed_trylock\n"
"__down_failed_trylock:\n\t"
@@ -217,6 +219,7 @@
);

asm(
+".text\n"
".align 4\n"
".globl __up_wakeup\n"
"__up_wakeup:\n\t"
@@ -232,197 +235,292 @@

asm(
"
+.text
.align 4
.globl __down_read_failed
__down_read_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_read_failed_biased
-
-1: popl %ecx
+ call down_read_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_read_failed
- " LOCK "subl $1,(%eax)
- jns 1b
- jnc 2b
- jmp 3b
"
);

asm(
"
+.text
.align 4
.globl __down_write_failed
__down_write_failed:
pushl %edx
pushl %ecx
- jnc 2f
-
-3: call down_write_failed_biased
-
-1: popl %ecx
+ call down_write_failed
+ popl %ecx
popl %edx
ret
-
-2: call down_write_failed
- " LOCK "subl $" RW_LOCK_BIAS_STR ",(%eax)
- jz 1b
- jnc 2b
- jmp 3b
"
);

-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.text
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+ pushl %edx
+ pushl %ecx
+ call rwsem_wake
+ popl %ecx
+ popl %edx
+ ret
+"
+);

-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));

-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement exchange and add functionality
+ */
+static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
-
- add_wait_queue(&sem->wait, &wait); /* put ourselves at the head of the list */
+ int tmp = delta;

- for (;;) {
- if (sem->read_bias_granted && xchg(&sem->read_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->read_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->wait, &wait);
- tsk->state = TASK_RUNNING;
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ __asm__ __volatile__(
+ LOCK_PREFIX "xadd %0,(%1)"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");
+
+#else
+
+ __asm__ __volatile__(
+ "# beginning rwsem_atomic_update\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchgl %0,(%1)\n\t" /* retrieve the old value */
+ " addl %0,(%1)\n\t" /* add 0xffff0001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ ".previous\n"
+ "# ending rwsem_atomic_update\n\t"
+ : "+r"(tmp)
+ : "r"(sem)
+ : "memory");

- return sem;
+#endif
+ return tmp+delta;
}

-struct rw_semaphore *down_write_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement compare and exchange functionality on the rw-semaphore count LSW
+ */
+static inline __u16 rwsem_cmpxchgw(struct rw_semaphore *sem, __u16 old, __u16 new)
{
- struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+ return cmpxchg((__u16*)&sem->count.counter,0,RWSEM_ACTIVE_BIAS);

- add_wait_queue_exclusive(&sem->write_bias_wait, &wait); /* put ourselves at the end of the list */
+#else
+ __u16 prev;

- for (;;) {
- if (sem->write_bias_granted && xchg(&sem->write_bias_granted, 0))
- break;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!sem->write_bias_granted)
- schedule();
- }
-
- remove_wait_queue(&sem->write_bias_wait, &wait);
- tsk->state = TASK_RUNNING;
-
- /* if the lock is currently unbiased, awaken the sleepers
- * FIXME: this wakes up the readers early in a bit of a
- * stampede -> bad!
- */
- if (atomic_read(&sem->count) >= 0)
- wake_up(&sem->wait);
+ __asm__ __volatile__(
+ "# beginning rwsem_cmpxchgw\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " cmpw %w1,(%3)\n\t"
+ " jne 4f\n\t" /* jump if old doesn't match sem->count LSW */
+ " movw %w2,(%3)\n\t" /* replace sem->count LSW with the new value */
+ "2:\n\t"
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t" /* release the spinlock */
+#endif
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%3)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " movw (%3),%w0\n" /* we'll want to return the current value */
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending rwsem_cmpxchgw\n\t"
+ : "=r"(prev)
+ : "r0"(old), "r"(new), "r"(sem)
+ : "memory");

- return sem;
+ return prev;
+#endif
}

-/* Wait for the lock to become unbiased. Readers
- * are non-exclusive. =)
+/*
+ * wait for the read lock to be granted
+ * - need to repeal the increment made inline by the caller
+ * - need to throw a write-lock style spanner into the works (sub 0x00010000 from count)
*/
struct rw_semaphore *down_read_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;

- __up_read(sem); /* this takes care of granting the lock */
+ rwsemdebug("[%d] Entering down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));

- add_wait_queue(&sem->wait, &wait);
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_READ,&wait.flags);
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);

- while (atomic_read(&sem->count) < 0) {
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */
+
+ /* note that we're now waiting on the lock, but no longer actively read-locking */
+ count = rwsem_atomic_update(RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("X(%08x)\n",count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);
+
+ /* wait to be given the lock */
+ for (;;) {
+ if (!test_bit(RWSEM_WAITING_FOR_READ,&wait.flags))
break;
schedule();
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_read_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-/* Wait for the lock to become unbiased. Since we're
- * a writer, we'll make ourselves exclusive.
+/*
+ * wait for the write lock to be granted
*/
struct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(wait,tsk);
+ int count;

- __up_write(sem); /* this takes care of granting the lock */
+ rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+ current->pid,atomic_read(&sem->count));

- add_wait_queue_exclusive(&sem->wait, &wait);
+ /* this waitqueue context flag will be cleared when we are granted the lock */
+ __set_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags);
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);

- while (atomic_read(&sem->count) < 0) {
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (atomic_read(&sem->count) >= 0)
- break; /* we must attempt to acquire or bias the lock */
+ add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */
+
+ /* note that we're waiting on the lock, but no longer actively locking */
+ count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("[%d] updated(%08x)\n",current->pid,count);
+
+ /* if there are no longer active locks, wake the front queued process(es) up
+ * - it might even be this process, since the waker takes a more active part
+ */
+ if (!(count & RWSEM_ACTIVE_MASK))
+ rwsem_wake(sem);
+
+ /* wait to be given the lock */
+ for (;;) {
+ if (!test_bit(RWSEM_WAITING_FOR_WRITE,&wait.flags))
+ break;
schedule();
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}

- remove_wait_queue(&sem->wait, &wait);
+ remove_wait_queue(&sem->wait,&wait);
tsk->state = TASK_RUNNING;

+ rwsemdebug("[%d] Leaving down_write_failed(%08x)\n",current->pid,atomic_read(&sem->count));
+
return sem;
}

-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
- pushl %edx
- pushl %ecx
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ * - the 'active part' of the count (&0x0000ffff) reached zero (but may no longer be zero)
+ * - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+ int woken, count;

- jz 1f
- call rwsem_wake_readers
- jmp 2f
+ rwsemdebug("[%d] Entering rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));

-1: call rwsem_wake_writer
+ try_again:
+ /* try to grab an 'activity' marker
+ * - need to make sure two copies of rwsem_wake() don't do this for two separate processes
+ * simultaneously
+ * - be horribly naughty, and only deal with the LSW of the atomic counter
+ */
+ if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0) {
+ rwsemdebug("[%d] rwsem_wake: abort wakeup due to renewed activity\n",current->pid);
+ goto out;
+ }

-2: popl %ecx
- popl %edx
- ret
-"
-);
+ /* try to grant a single write lock if there's a writer at the front of the queue
+ * - note we leave the 'active part' of the count incremented by 1 and the waiting part
+ * incremented by 0x00010000
+ */
+ if (wake_up_ctx(&sem->wait,1,-RWSEM_WAITING_FOR_WRITE)==1)
+ goto out;

-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
- if (xchg(&sem->read_bias_granted, 1))
- BUG();
- wake_up(&sem->wait);
- return sem;
-}
+ /* grant an infinite number of read locks to the readers at the front of the queue
+ * - note we increment the 'active part' of the count by the number of readers just woken,
+ * less one for the activity decrement we've already done
+ */
+ woken = wake_up_ctx(&sem->wait,65535,-RWSEM_WAITING_FOR_READ);
+ if (woken<=0)
+ goto counter_correction;
+
+ woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+ woken -= RWSEM_ACTIVE_BIAS;
+ rwsem_atomic_update(woken,sem);

-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
- if (xchg(&sem->write_bias_granted, 1))
- BUG();
- wake_up(&sem->write_bias_wait);
+ out:
+ rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",current->pid,atomic_read(&sem->count));
return sem;
+
+ /* come here if we need to correct the counter for odd SMP-isms */
+ counter_correction:
+ count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+ rwsemdebug("[%d] corrected(%08x)\n",current->pid,count);
+ if (!(count & RWSEM_ACTIVE_MASK))
+ goto try_again;
+ goto out;
}

+/*
+ * rw spinlock fallbacks
+ */
#if defined(CONFIG_SMP)
asm(
"
@@ -451,4 +549,3 @@
"
);
#endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h Wed Apr 11 22:28:32 2001
@@ -0,0 +1,239 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h and asm-i386/spinlock.h
+ */
+
+#ifndef _I386_RWSEM_SPIN_H
+#define _I386_RWSEM_SPIN_H
+
+#ifdef __KERNEL__
+
+#define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ spinlock_t lock;
+#define RWSEM_SPINLOCK_OFFSET_STR "4" /* byte offset of spinlock */
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if RWSEM_DEBUG
+ int debug;
+#endif
+#if RWSEM_DEBUG_MAGIC
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if RWSEM_DEBUG
+#define __RWSEM_DEBUG_INIT , 0
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+#if RWSEM_DEBUG_MAGIC
+#define __RWSEM_DEBUG_MINIT(name) , (int)&(name).__magic, ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_MINIT(name) /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+ __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
+ __RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ spin_lock_init(&sem->lock);
+ init_waitqueue_head(&sem->wait);
+#if RWSEM_DEBUG
+ sem->debug = 0;
+#endif
+#if RWSEM_DEBUG_MAGIC
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 2b\n"
+ ".previous"
+ "# ending __down_read\n\t"
+ : "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " add %0,(%%eax)\n\t" /* add 0xffff0001, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 4f\n\t" /* jump if we weren't granted the lock */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp), "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " xchg %0,(%%eax)\n\t" /* retrieve the old value */
+ " addl %0,(%%eax)\n\t" /* subtract 1, result in memory */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " decl %0\n\t" /* xchg gave us the old count */
+ " testl %4,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 2b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp), "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX " decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+ " js 3f\n" /* jump if failed */
+ "1:\n\t"
+#endif
+ " addl %3,(%%eax)\n\t" /* adds 0x00010001 */
+#ifdef CONFIG_SMP
+ " movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+ " js 4f\n\t" /* jump if the lock is being waited upon */
+ "2:\n\t"
+ ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+ "3:\n\t" /* spin on the spinlock till we get it */
+ " cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t"
+ " rep;nop \n\t"
+ " jle 3b\n\t"
+ " jmp 1b\n"
+#endif
+ "4:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 2b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ : "=m"(sem->count), "=m"(sem->lock)
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS), "m"(sem->count), "m"(sem->lock)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_SPIN_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem-xadd.h linux-rwsem/include/asm-i386/rwsem-xadd.h
--- linux-2.4.3/include/asm-i386/rwsem-xadd.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-xadd.h Wed Apr 11 22:28:30 2001
@@ -0,0 +1,194 @@
+/* rwsem-xadd.h: R/W semaphores implemented using XADD/CMPXCHG
+ *
+ * Written by David Howells ([email protected]), 2001.
+ * Derived from asm-i386/semaphore.h
+ *
+ *
+ * The MSW of the count is the negated number of active writers and waiting
+ * lockers, and the LSW is the total number of active locks
+ *
+ * The lock count is initialized to 0 (no active and no waiting lockers).
+ *
+ * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
+ * uncontended lock. This can be determined because XADD returns the old value.
+ * Readers increment by 1 and see a positive value when uncontended, negative
+ * if there are writers (and maybe) readers waiting (in which case it goes to
+ * sleep).
+ *
+ * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
+ * be extended to 65534 by manually checking the whole MSW rather than relying
+ * on the S flag.
+ *
+ * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ *
+ * This should be totally fair - if anything is waiting, a process that wants a
+ * lock will go to the back of the queue. When the currently active lock is
+ * released, if there's a writer at the front of the queue, then that and only
+ * that will be woken up; if there's a bunch of consecutive readers at the
+ * front, then they'll all be woken up, but no other readers will be.
+ */
+
+#ifndef _I386_RWSEM_XADD_H
+#define _I386_RWSEM_XADD_H
+
+#ifdef __KERNEL__
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_t count;
+#define RWSEM_UNLOCKED_VALUE 0x00000000
+#define RWSEM_ACTIVE_BIAS 0x00000001
+#define RWSEM_ACTIVE_MASK 0x0000ffff
+#define RWSEM_WAITING_BIAS (-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+ wait_queue_head_t wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE WQ_FLAG_CONTEXT_1
+#if RWSEM_DEBUG
+ int debug;
+#endif
+#if RWSEM_DEBUG_MAGIC
+ long __magic;
+ atomic_t readers;
+ atomic_t writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if RWSEM_DEBUG
+#define __RWSEM_DEBUG_INIT , 0
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+#if RWSEM_DEBUG_MAGIC
+#define __RWSEM_DEBUG_MINIT(name) , (int)&(name).__magic, ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_MINIT(name) /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
+ __RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+ atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ init_waitqueue_head(&sem->wait);
+#if RWSEM_DEBUG
+ sem->debug = 0;
+#endif
+#if RWSEM_DEBUG_MAGIC
+ sem->__magic = (long)&sem->__magic;
+ atomic_set(&sem->readers, 0);
+ atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning down_read\n\t"
+LOCK_PREFIX " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+ " js 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_read_failed\n\t"
+ " jmp 1b\n"
+ ".previous"
+ "# ending down_read\n\t"
+ : "=m"(sem->count)
+ : "a"(sem), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = RWSEM_ACTIVE_WRITE_BIAS;
+ __asm__ __volatile__(
+ "# beginning down_write\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtract 0x00010001, returns the old value */
+ " testl %0,%0\n\t" /* was the count 0 before? */
+ " jnz 2f\n\t" /* jump if we weren't granted the lock */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __down_write_failed\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending down_write"
+ : "+r"(tmp), "=m"(sem->count)
+ : "a"(sem), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ int tmp;
+
+ tmp = -RWSEM_ACTIVE_READ_BIAS;
+ __asm__ __volatile__(
+ "# beginning __up_read\n\t"
+LOCK_PREFIX " xadd %0,(%%eax)\n\t" /* subtracts 1, returns the old value */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " decl %0\n\t" /* xadd gave us the old count */
+ " testl %3,%0\n\t" /* do nothing if still outstanding active readers */
+ " jnz 1b\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_read\n"
+ : "+r"(tmp), "=m"(sem->count)
+ : "a"(sem), "i"(RWSEM_ACTIVE_MASK), "m"(sem->count)
+ : "memory");
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ __asm__ __volatile__(
+ "# beginning __up_write\n\t"
+LOCK_PREFIX " addl %2,(%%eax)\n\t" /* adds 0x0000ffff */
+ " js 2f\n\t" /* jump if the lock is being waited upon */
+ "1:\n\t"
+ ".section .text.lock,\"ax\"\n"
+ "2:\n\t"
+ " call __rwsem_wake\n\t"
+ " jmp 1b\n"
+ ".previous\n"
+ "# ending __up_write\n"
+ : "=m"(sem->count)
+ : "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS), "m"(sem->count)
+ : "memory");
+}
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_XADD_H */
diff -uNr linux-2.4.3/include/asm-i386/rwsem.h linux-rwsem/include/asm-i386/rwsem.h
--- linux-2.4.3/include/asm-i386/rwsem.h Thu Jan 1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem.h Wed Apr 11 22:28:27 2001
@@ -0,0 +1,133 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([email protected]).
+ *
+ * Derived from asm-i386/semaphore.h
+ */
+
+#ifndef _I386_RWSEM_H
+#define _I386_RWSEM_H
+
+#include <linux/linkage.h>
+
+#define RWSEM_DEBUG 0
+#define RWSEM_DEBUG_MAGIC 0
+
+#ifdef __KERNEL__
+
+#include <asm/system.h>
+#include <asm/atomic.h>
+#include <asm/spinlock.h>
+#include <linux/wait.h>
+
+#if RWSEM_DEBUG
+#define rwsemdebug(FMT,...) do { if (sem->debug) printk(FMT,__VA_ARGS__); } while(0)
+#else
+#define rwsemdebug(FMT,...)
+#endif
+
+/* old gcc */
+#if RWSEM_DEBUG
+//#define rwsemdebug(FMT, ARGS...) do { if (sem->debug) printk(FMT,##ARGS); } while(0)
+#else
+//#define rwsemdebug(FMT, ARGS...)
+#endif
+
+#ifdef CONFIG_X86_XADD
+#include <asm/rwsem-xadd.h> /* use XADD based semaphores if possible */
+#else
+#include <asm/rwsem-spin.h> /* use spinlock based semaphores otherwise */
+#endif
+
+/* we use FASTCALL convention for the helpers */
+extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
+extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
+
+/*
+ * lock for reading
+ */
+static inline void down_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if RWSEM_DEBUG_MAGIC
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_read(sem);
+
+#if RWSEM_DEBUG_MAGIC
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_inc(&sem->readers);
+#endif
+
+ rwsemdebug("Leaving down_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * lock for writing
+ */
+static inline void down_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering down_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if RWSEM_DEBUG_MAGIC
+ if (sem->__magic != (long)&sem->__magic)
+ BUG();
+#endif
+
+ __down_write(sem);
+
+#if RWSEM_DEBUG_MAGIC
+ if (atomic_read(&sem->writers))
+ BUG();
+ if (atomic_read(&sem->readers))
+ BUG();
+ atomic_inc(&sem->writers);
+#endif
+
+ rwsemdebug("Leaving down_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a read lock
+ */
+static inline void up_read(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_read(count=%08x)\n",atomic_read(&sem->count));
+
+#if RWSEM_DEBUG_MAGIC
+ if (atomic_read(&sem->writers))
+ BUG();
+ atomic_dec(&sem->readers);
+#endif
+ __up_read(sem);
+
+ rwsemdebug("Leaving up_read(count=%08x)\n",atomic_read(&sem->count));
+}
+
+/*
+ * release a write lock
+ */
+static inline void up_write(struct rw_semaphore *sem)
+{
+ rwsemdebug("Entering up_write(count=%08x)\n",atomic_read(&sem->count));
+
+#if RWSEM_DEBUG_MAGIC
+ if (atomic_read(&sem->readers))
+ BUG();
+ if (atomic_read(&sem->writers) != 1)
+ BUG();
+ atomic_dec(&sem->writers);
+#endif
+ __up_write(sem);
+
+ rwsemdebug("Leaving up_write(count=%08x)\n",atomic_read(&sem->count));
+}
+
+
+#endif /* __KERNEL__ */
+#endif /* _I386_RWSEM_H */
diff -uNr linux-2.4.3/include/asm-i386/semaphore.h linux-rwsem/include/asm-i386/semaphore.h
--- linux-2.4.3/include/asm-i386/semaphore.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/asm-i386/semaphore.h Wed Apr 11 13:29:12 2001
@@ -38,8 +38,8 @@

#include <asm/system.h>
#include <asm/atomic.h>
-#include <asm/rwlock.h>
#include <linux/wait.h>
+#include <asm/rwsem.h>

struct semaphore {
atomic_t count;
@@ -202,184 +202,6 @@
:"=m" (sem->count)
:"c" (sem)
:"memory");
-}
-
-/* rw mutexes (should that be mutices? =) -- throw rw
- * spinlocks and semaphores together, and this is what we
- * end up with...
- *
- * The lock is initialized to BIAS. This way, a writer
- * subtracts BIAS ands gets 0 for the case of an uncontended
- * lock. Readers decrement by 1 and see a positive value
- * when uncontended, negative if there are writers waiting
- * (in which case it goes to sleep).
- *
- * The value 0x01000000 supports up to 128 processors and
- * lots of processes. BIAS must be chosen such that subl'ing
- * BIAS once per CPU will result in the long remaining
- * negative.
- *
- * In terms of fairness, this should result in the lock
- * flopping back and forth between readers and writers
- * under heavy use.
- *
- * -ben
- */
-struct rw_semaphore {
- atomic_t count;
- volatile unsigned char write_bias_granted;
- volatile unsigned char read_bias_granted;
- volatile unsigned char pad1;
- volatile unsigned char pad2;
- wait_queue_head_t wait;
- wait_queue_head_t write_bias_wait;
-#if WAITQUEUE_DEBUG
- long __magic;
- atomic_t readers;
- atomic_t writers;
-#endif
-};
-
-#if WAITQUEUE_DEBUG
-#define __RWSEM_DEBUG_INIT , ATOMIC_INIT(0), ATOMIC_INIT(0)
-#else
-#define __RWSEM_DEBUG_INIT /* */
-#endif
-
-#define __RWSEM_INITIALIZER(name,count) \
-{ ATOMIC_INIT(count), 0, 0, 0, 0, __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \
- __WAIT_QUEUE_HEAD_INITIALIZER((name).write_bias_wait) \
- __SEM_DEBUG_INIT(name) __RWSEM_DEBUG_INIT }
-
-#define __DECLARE_RWSEM_GENERIC(name,count) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
-
-#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
-#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
-#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- atomic_set(&sem->count, RW_LOCK_BIAS);
- sem->read_bias_granted = 0;
- sem->write_bias_granted = 0;
- init_waitqueue_head(&sem->wait);
- init_waitqueue_head(&sem->write_bias_wait);
-#if WAITQUEUE_DEBUG
- sem->__magic = (long)&sem->__magic;
- atomic_set(&sem->readers, 0);
- atomic_set(&sem->writers, 0);
-#endif
-}
-
-/* we use FASTCALL convention for the helpers */
-extern struct rw_semaphore *FASTCALL(__down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
-
-static inline void down_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_read_lock(sem, "__down_read_failed");
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_inc(&sem->readers);
-#endif
-}
-
-static inline void down_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->__magic != (long)&sem->__magic)
- BUG();
-#endif
- __build_write_lock(sem, "__down_write_failed");
-#if WAITQUEUE_DEBUG
- if (atomic_read(&sem->writers))
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- atomic_inc(&sem->writers);
-#endif
-}
-
-/* When a reader does a release, the only significant
- * case is when there was a writer waiting, and we've
- * bumped the count to 0: we must wake the writer up.
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_read\n\t"
- LOCK "incl %0\n\t"
- "jz 2f\n" /* only do the wake if result == 0 (ie, a writer) */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-/* releasing the writer is easy -- just release it and
- * wake up any sleepers.
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- __asm__ __volatile__(
- "# up_write\n\t"
- LOCK "addl $" RW_LOCK_BIAS_STR ",%0\n"
- "jc 2f\n" /* only do the wake if the result was -'ve to 0/+'ve */
- "1:\n\t"
- ".section .text.lock,\"ax\"\n"
- "2:\tcall __rwsem_wake\n\t"
- "jmp 1b\n"
- ".previous"
- :"=m" (sem->count)
- :"a" (sem)
- :"memory"
- );
-}
-
-static inline void up_read(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->writers))
- BUG();
- atomic_dec(&sem->readers);
-#endif
- __up_read(sem);
-}
-
-static inline void up_write(struct rw_semaphore *sem)
-{
-#if WAITQUEUE_DEBUG
- if (sem->read_bias_granted)
- BUG();
- if (sem->write_bias_granted)
- BUG();
- if (atomic_read(&sem->readers))
- BUG();
- if (atomic_read(&sem->writers) != 1)
- BUG();
- atomic_dec(&sem->writers);
-#endif
- __up_write(sem);
}

#endif
diff -uNr linux-2.4.3/include/linux/sched.h linux-rwsem/include/linux/sched.h
--- linux-2.4.3/include/linux/sched.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/linux/sched.h Wed Apr 11 13:29:12 2001
@@ -548,6 +548,8 @@

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern int FASTCALL(__wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
+extern int FASTCALL(__wake_up_sync_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -566,6 +568,8 @@
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
+#define wake_up_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
+#define wake_up_sync_ctx(x,count,bit) __wake_up_ctx((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,count,bit)
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -uNr linux-2.4.3/include/linux/wait.h linux-rwsem/include/linux/wait.h
--- linux-2.4.3/include/linux/wait.h Thu Apr 5 14:50:36 2001
+++ linux-rwsem/include/linux/wait.h Wed Apr 11 13:25:44 2001
@@ -26,6 +26,14 @@
struct __wait_queue {
unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
+#define WQ_FLAG_CONTEXT_0 8 /* context specific flag bit numbers */
+#define WQ_FLAG_CONTEXT_1 9
+#define WQ_FLAG_CONTEXT_2 10
+#define WQ_FLAG_CONTEXT_3 11
+#define WQ_FLAG_CONTEXT_4 12
+#define WQ_FLAG_CONTEXT_5 13
+#define WQ_FLAG_CONTEXT_6 14
+#define WQ_FLAG_CONTEXT_7 15
struct task_struct * task;
struct list_head task_list;
#if WAITQUEUE_DEBUG
diff -uNr linux-2.4.3/kernel/fork.c linux-rwsem/kernel/fork.c
--- linux-2.4.3/kernel/fork.c Thu Apr 5 14:44:17 2001
+++ linux-rwsem/kernel/fork.c Tue Apr 10 09:19:47 2001
@@ -39,7 +39,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = 0;
+ wait->flags &= ~WQ_FLAG_EXCLUSIVE;
__add_wait_queue(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
@@ -49,7 +49,7 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
- wait->flags = WQ_FLAG_EXCLUSIVE;
+ wait->flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue_tail(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
diff -uNr linux-2.4.3/kernel/sched.c linux-rwsem/kernel/sched.c
--- linux-2.4.3/kernel/sched.c Thu Apr 5 14:44:17 2001
+++ linux-rwsem/kernel/sched.c Wed Apr 11 22:30:11 2001
@@ -739,7 +739,7 @@
state = p->state;
if (state & mode) {
WQ_NOTE_WAKER(curr);
- if (try_to_wake_up(p, sync) && curr->flags && !--nr_exclusive)
+ if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
}
@@ -763,6 +763,75 @@
__wake_up_common(q, mode, nr, 1);
wq_read_unlock_irqrestore(&q->lock, flags);
}
+}
+
+/*
+ * wake up processes in the wait queue depending on the state of a context bit in the flags
+ * - wakes up a process if the specified bit is set in the flags member
+ * - the context bit is cleared if the process is woken up
+ * - if the bit number is negative, then the loop stops at the first unset context bit encountered
+ * - returns the number of processes woken
+ */
+static inline int __wake_up_ctx_common (wait_queue_head_t *q,
+ int count, int bit, const int sync)
+{
+ struct list_head *tmp, *head;
+ struct task_struct *p;
+ int stop, woken;
+
+ woken = 0;
+ stop = bit<0;
+ if (bit<0) bit = -bit;
+
+ CHECK_MAGIC_WQHEAD(q);
+ head = &q->task_list;
+ WQ_CHECK_LIST_HEAD(head);
+ tmp = head->next;
+ while (tmp != head) {
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+
+ tmp = tmp->next;
+ CHECK_MAGIC(curr->__magic);
+ p = curr->task;
+ if (!test_and_clear_bit(bit,&curr->flags)) {
+ if (stop)
+ break;
+ continue;
+ }
+
+ WQ_NOTE_WAKER(curr);
+ try_to_wake_up(p,sync);
+
+ woken++;
+ if (woken>=count)
+ break;
+ }
+
+ return woken;
+}
+
+int __wake_up_ctx(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, count, bit, 0);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
+}
+
+int __wake_up_ctx_sync(wait_queue_head_t *q, unsigned int mode, int count, int bit)
+{
+ int woken = 0;
+ if (q && count) {
+ unsigned long flags;
+ wq_read_lock_irqsave(&q->lock, flags);
+ woken = __wake_up_ctx_common(q, count, bit, 1);
+ wq_read_unlock_irqrestore(&q->lock, flags);
+ }
+ return woken;
}

#define SLEEP_ON_VAR \

2001-04-11 22:42:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

Alan Cox wrote:
>
> > Yes, the big 686 optimization is CMOV, and that one is
> > ultra-pervasive.
>
> Be careful there. CMOV is an optional instruction. gcc is arguably wrong
> in using cmov in '686' mode. Building libs with cmov makes sense though
> especially for the PIV with its ridiculously long pipeline
>

It is just a matter how you define "686 mode", otherwise the very concept
is meaningless.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-11 22:49:59

by Alan

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

> Yes, the big 686 optimization is CMOV, and that one is
> ultra-pervasive.

Be careful there. CMOV is an optional instruction. gcc is arguably wrong
in using cmov in '686' mode. Building libs with cmov makes sense though
especially for the PIV with its ridiculously long pipeline

>

2001-04-11 22:55:40

by Alan

[permalink] [raw]
Subject: Re: [PATCH] i386 rw_semaphores fix

> > Be careful there. CMOV is an optional instruction. gcc is arguably wrong
> > in using cmov in '686' mode. Building libs with cmov makes sense though
> > especially for the PIV with its ridiculously long pipeline
>
> It is just a matter how you define "686 mode", otherwise the very concept
> is meaningless.

I use the Pentium Pro documentation.

2001-04-11 23:03:39

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] 3rd try: i386 rw_semaphores fix


Hi,

> Here's the RW semaphore patch #3. This time with more asm constraints added.

Personally I care about sparc and ppc64 and as such would like to see the
slow paths end up in lib/rwsem.c protected by #ifndef __HAVE_ARCH_RWSEM
or something like that. If we couldn't get rwsems to work on x86, what
hope have we on other archs? :)
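
Concretely, the arrangement being asked for would look something like this
(a sketch, not existing code):

	/* include/asm-i386/rwsem.h: arch keeps its optimised version */
	#define __HAVE_ARCH_RWSEM

	/* lib/rwsem.c: generic slow paths for everyone else */
	#ifndef __HAVE_ARCH_RWSEM
	/* ... portable down_read_failed()/down_write_failed()/rwsem_wake() ... */
	#endif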

I have a few questions:

In arch/i386/kernel/semaphore.c:

static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
static inline __u16 rwsem_cmpxchgw(struct rw_semaphore *sem, __u16 old, __u16 new)

Can these end up in include/asm/rwsem*? The rest could then go into
lib/rwsem.c.

/* try to grab an 'activity' marker
* - need to make sure two copies of rwsem_wake() don't do this for two separate processes
* simultaneously
* - be horribly naughty, and only deal with the LSW of the atomic counter
*/
if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0)

Many archs don't have cmpxchg on 16-bit quantities. We can implement it,
but it will take extra instructions. Anyway, on 64-bit archs count will be
64-bit, so we will have two 32-bit quantities to cmpxchg on.

Now that I look at it, can you just do a cmpxchg on the complete sem->count
and retry if it failed because the high order bits have changed?
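
A minimal sketch of that idea (not from the posted patch; the count type and
the generic cmpxchg() helper used here are assumptions): read the whole count,
check that the LSW is zero, and cmpxchg the full word back, retrying if it
failed because other bits changed underneath us:

	/* sketch: grab the "activity" marker with a full-width cmpxchg */
	static inline int rwsem_grab_activity(struct rw_semaphore *sem)
	{
		unsigned long old, new;

		for (;;) {
			old = sem->count;
			if (old & 0xffff)	/* LSW non-zero: already active */
				return 0;
			new = old + RWSEM_ACTIVE_BIAS;
			if (cmpxchg(&sem->count, old, new) == old)
				return 1;	/* got the marker */
			/* count changed under us: re-check and retry */
		}
	}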

Anton

2001-04-12 08:38:35

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] 2nd try: i386 rw_semaphores fix

Linus Torvalds wrote:
>
>
> On Wed, 11 Apr 2001, Bernd Schmidt wrote:
> See? Do you see why a "memory" clobber is _not_ comparable to a "ax"
> clobber? And why that non-comparability makes a memory clobber equivalent
> to a read-modify-write cycle?

I had to think about this, so I'll explain it a different way in case it
is helpful.

An "ax" clobber says "the whole of eax is undefined after this". In
contrast, a "memory" clobber says "we are writing unspecified values to
unspecified addresses, but the values are actually valid if you read
memory later".

All earlier writes to any part of _addressed_ memory (or the named
register) must be either discarded, or placed earlier in the instruction
stream. For "memory" this is a "compiler write barrier". Addressed
memory means everything except local variables whose addresses are not
taken.

Later reads from memory could be reading from a region that was written
by the "memory" clobbering instruction. The compiler doesn't know, so
it must assume so. Therefore later reads must be placed later in the
instruction stream. This means that "memory" acts as a "compiler read
barrier".

> In short: I disagree 100%. A "memory" clobber -does- effectively tell the
> compiler that memory is read. If the compiler doesn't realize that, then
> it's a compiler bug waiting to happen. No ifs, buts of maybes.

Indeed.

-- Jamie

2001-04-12 18:21:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 4th try: i386 rw_semaphores fix

David Howells wrote:
>
> Here's the RW semaphore patch attempt #4. This fixes the bugs that Andrew
> Morton's test cases showed up.
>


It still doesn't compile with gcc-2.91.66 because of the
"#define rwsemdebug(FMT, ...)" thing. What can we do
about this?

I cooked up a few more tests, generally beat the thing
up. This code works. Good stuff. I'll do some more testing
later this week - put some delays inside the semaphore code
itself to stretch the timing windows, but I don't see how
it can break it.

Manfred says the rep;nop should come *before* the memory
operation, but I don't know if he's been heard...

parisc looks OK. ppc tried hard, but didn't get it
right. The spin_lock_irqsave() in up_read/up_write shouldn't be
necessary; plus, it puts the readers onto the
waitqueue with add_wait_queue_exclusive, so an up_write
will only release one reader. Looks like they'll end
up with stuck readers. Other architectures need work.
If they can live with a spinlock in the fastpath, then
the code at http://www.uow.edu.au/~andrewm/linux/rw_semaphore.tar.gz
is bog-simple and works.



Now I think we should specify the API a bit:

* A task is not allowed to down_read() the same rwsem
more than once. Otherwise another task can
do a down_write() in the middle and cause deadlock
(a sketch of this deadlock follows below).
(I don't think this has been specified for read_lock()
and write_lock(). There's no such deadlock on the
x86 implementation of these. Other architectures?
Heaven knows. They're probably OK).

* up_read() and up_write() may *not* be called from
interrupt context.

The latter is just a suggestion. It looks like your
code is safe for interrupt-context upping - but it
isn't tested for this. The guys who are going to
support rwsems for architectures which have less
rich atomic ops are going to *need* to know the rules
here. Let's tell 'em.

If we get agreement on these points, then please
drop a comment in there and I believe we're done.
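
To make the first rule concrete, this is the sort of pattern it forbids
(a hypothetical illustration, not code from the patch):

	void buggy(struct rw_semaphore *sem)
	{
		down_read(sem);
		/* if another task does down_write(sem) at this point, it
		 * queues up and blocks any new readers... */
		down_read(sem);	/* ...so this second down_read() sleeps
				 * behind the queued writer: deadlock */
		up_read(sem);
		up_read(sem);
	}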

In another email you said:

> schedule() disabled:
> reads taken: 585629
> writes taken: 292997

That's interesting. With two writer threads and
four reader threads hammering the rwsem, the write/read
grant ratio is 0.5003. Neat, but I'd have expected
readers to do better. Perhaps the writer-wakes-multiple
readers stuff isn't happening. I'll test for this.


There was some discussion regarding contention rates a few
days back. I did some instrumentation on the normal semaphores.
The highest contention rate I could observe was during a `make
-j3 bzImage' on dual CPU. With this workload, down() enters
__down() 2% of the time. That's a heck of a lot. It means that
the semaphore slow path is by a very large margin the most
expensive part.

The most contention was on i_sem in real_lookup(), then pipe_sem
and vfork_sem. ext2_lookup is slow; no news there.

But tweaking the semaphore slowpath won't help here
because when a process slept in __down(), it was always
the only sleeper.

With `dbench 12' the contention rate was much lower
(0.01%). But when someone slept, there was almost
always another sleeper, so the extra wake_up() at
the end of __down() is worth attention.

I don't have any decent multithread workloads which would
allow me to form an opinion on whether we need to optimise
the slowpath, or to more finely thread the contention areas.
Could be with some workloads the contention is significant and
threading is painful. I'll have another look at it sometime...

2001-04-12 21:20:59

by D.W.Howells

[permalink] [raw]
Subject: Re: [PATCH] 4th try: i386 rw_semaphores fix




Andrew Morton wrote:
> It still doesn't compile with gcc-2.91.66 because of the "#define
> rwsemdebug(FMT, ...)" thing. What can we do about this?

Hmmm... It probably needs to be made conditional on the version of the
compiler by means of the macros that hold the version numbers.

> I cooked up a few more tests, generally beat the thing
> up. This code works. Good stuff. I'll do some more testing
> later this week - put some delays inside the semaphore code
> itself to stretch the timing windows, but I don't see how
> it can break it.

Excellent. I tried to keep it as simple as possible to make it easier to test
and follow.

> Manfred says the rep;nop should come *before* the memory
> operation, but I don't know if he's been heard...

Probably... It's copied directly out of the spinlock stuff. I noticed it at
the time, but it didn't stick in my memory.

> The spin_lock_irqsave() in up_read/up_write shouldn't be
> necessary; plus, it puts the readers onto the
> waitqueue with add_wait_queue_exclusive, so an up_write
> will only release one reader.

Not so... have you looked at the new __wait_ctx_common() function? It
actually ignores the exclusive flag since it uses other criteria as to whom
to wake. All the add_wait_queue_exclusive() function does is put the process
on the _back_ of the queue as far as rwsems are concerned.
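
In other words, the sleeping side could look roughly like this (a sketch with
assumed names - the wait-queue field and the flag bit are not taken from the
actual patch): the waiter marks itself with a flag bit, joins the back of the
queue via add_wait_queue_exclusive(), and sleeps until the waker clears the
bit with test_and_clear_bit():

	static void rwsem_wait_to_be_granted(struct rw_semaphore *sem, int bit)
	{
		wait_queue_t wait;

		init_waitqueue_entry(&wait, current);
		set_bit(bit, &wait.flags);		/* the criterion the waker tests */
		add_wait_queue_exclusive(&sem->wait, &wait);	/* i.e. go to the back */

		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!test_bit(bit, &wait.flags))	/* cleared by the waker */
				break;
			schedule();
		}
		set_current_state(TASK_RUNNING);
		remove_wait_queue(&sem->wait, &wait);
	}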

> Other architectures need work.
> If they can live with a spinlock in the fastpath, then
> the code at http://www.uow.edu.au/~andrewm/linux/rw_semaphore.tar.gz
> is bog-simple and works.

Sorry to pre-empt you, but have you seen my "advanced" patch? I sent it to
the list in an email with the subject:

[PATCH] i386 rw_semaphores, general abstraction patch

I added a general fallback implementation with the inline implementation
functions written wholly in C and using spinlocks.

> Now I think we should specify the API a bit:

Agreed... I'll think more on it on Tuesday, though... after Easter.

Your points, however, look fairly reasonable.

> Perhaps the writer-wakes-multiple
> readers stuff isn't happening

It does. Using my test program & module, try:

driver -5 & sleep 1; driver 100 &

Keep doing "ps", and you'll see all the reader processes jump into the S
state at the same time, once all the writers have finished. You may need to
increase the delay between the ups and downs in the module to see it clearly.

David Howells

2001-04-16 14:37:10

by Victor Yodaiken

[permalink] [raw]
Subject: Re: rw_semaphores

On Tue, Apr 10, 2001 at 08:47:34AM +0100, David Howells wrote:
>
> Since you're willing to use CMPXCHG in your suggested implementation, would it
> make it make life easier if you were willing to use XADD too?
>
> Plus, are you really willing to limit the number of readers or writers to be
> 32767? If so, I think I can suggest a way that limits it to ~65535 apiece
> instead...

I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
major failure and I can't. To me: the result of an attempt by the 32,768th locker
should be a kernel panic. Is there a reasonable scenario where this is wrong?


>
> David

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2001-04-16 14:56:54

by Alan

[permalink] [raw]
Subject: Re: rw_semaphores

> I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
> major failure and I can't. To me: the result of an attempt by the 32,768th locker
> should be a kernel panic. Is there a reasonable scenario where this is wrong?

32000 threads all trying to lock the same piece of memory? It's not down
to reasonable scenarios, it's down to malicious scenarios.

2001-04-16 17:06:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: rw_semaphores



On Mon, 16 Apr 2001 [email protected] wrote:
>
> I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
> major failure and I can't. To me: the result of an attempt by the 32,768th locker
> should be a kernel panic. Is there a reasonable scenario where this is wrong?

Hint: "I'm trying to imagine a case when writing all zeroes to /etc/passwd
is anything but a major failure, but I can't. So why don't we make
/etc/passwd world-writable?"

Right. Security.

There is _never_ any excuse for panic'ing because of some inherent
limitation of the data structures. You can return -ENOMEM, -EAGAIN or
something like that, but you must _not_ allow a panic (or a roll-over,
which would just result in corrupted kernel data structures).

Note that the limit is probably really easy to work around even without
extending the number of bits: a sleeper that notices that the count is
even _halfway_ to rolling around could easily do something like:

- undo "this process" action
- sleep for 1 second
- try again from the beginning.
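
For instance, with the rwsem_atomic_update() helper mentioned earlier in the
thread (a sketch; RWSEM_OVERFLOW_MARK is a made-up "halfway to rolling over"
threshold, each reader is assumed to add one to the count, and
rwsem_atomic_update() is assumed to return the updated count):

	static void down_read_with_backoff(struct rw_semaphore *sem)
	{
		while (rwsem_atomic_update(1, sem) >= RWSEM_OVERFLOW_MARK) {
			rwsem_atomic_update(-1, sem);	/* undo "this process" action */
			set_current_state(TASK_UNINTERRUPTIBLE);
			schedule_timeout(HZ);		/* sleep for 1 second */
			/* and try again from the beginning */
		}
		/* the count is sane: continue with the normal down_read() path */
	}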

I certainly agree that no _reasonable_ pattern can cause the failure, but
we need to worry about people who are malicious. The above trivial
approach would take care of that, while not penalizing any non-malicious
users.

So I'm not worried about this at all. I just want people _always_ to think
about "how could I mis-use this if I was _truly_ evil", and making sure it
doesn't cause problems for others on the system.

(NOTE: This does not mean that the kernel has to do anything _reasonable_
under all circumstances. There are cases where Linux has decided that
"this is not something a reasonable program can do, and if you try to do
it, we'll give you random results back - but they will not be _security_
holes". We don't need to be _nice_ to unreasonable requests. We just must
never panic, otherwise crash or allow unreasonable requests to mess up
_other_ people)

Linus

2001-04-16 17:31:20

by Andrew Morton

[permalink] [raw]
Subject: Re: rw_semaphores

[email protected] wrote:
>
> On Tue, Apr 10, 2001 at 08:47:34AM +0100, David Howells wrote:
> >
> > Since you're willing to use CMPXCHG in your suggested implementation, would it
> > make it make life easier if you were willing to use XADD too?
> >
> > Plus, are you really willing to limit the number of readers or writers to be
> > 32767? If so, I think I can suggest a way that limits it to ~65535 apiece
> > instead...
>
> I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
> major failure and I can't. To me: the result of an attempt by the 32,768th locker
> should be a kernel panic. Is there a reasonable scenario where this is wrong?
>

It can't happen (famous last words).

- It is a bug for a task to acquire an rwsem more than once
(either for read or write), so we don't do this.

- Linux only supports, err, 31700 user processes.

2001-04-16 17:32:31

by Victor Yodaiken

[permalink] [raw]
Subject: Re: rw_semaphores

On Mon, Apr 16, 2001 at 10:05:57AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 16 Apr 2001 [email protected] wrote:
> >
> > I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
> > major failure and I can't. To me: the result of an attempt by the 32,768th locker
> > should be a kernel panic. Is there a reasonable scenario where this is wrong?
>
> Hint: "I'm trying to imagine a case when writing all zeroes to /etc/passwd
> is anything but a major failure, but I can't. So why don't we make
> /etc/passwd world-writable?"
>
> Right. Security.

The analogy is too subtle for me,
but my question was not whether the correct error
response should be to panic, but whether there was a good reason for allowing
such a huge number of users of a lock.

> There is _never_ any excuse for panic'ing because of some inherent
> limitation of the data structures. You can return -ENOMEM, -EAGAIN or
> something like that, but you must _not_ allow a panic (or a roll-over,
> which would just result in corrupted kernel data structures).

There's a difference between a completely reasonable situation in which
all of some resource has been committed
and a situation which in itself indicates some sort of fundamental error.
If 32K+ users of a lock is an error, then returning -ENOMEM may be
inadequate.

>
> Note that the limit is probably really easy to work around even without
> extending the number of bits: a sleeper that notices that the count is
> even _halfway_ to rolling around could easily do something like:
>
> - undo "this process" action
> - sleep for 1 second
> - try again from the beginning.
>
> I certainly agree that no _reasonable_ pattern can cause the failure, but
> we need to worry about people who are malicious. The above trivial
> approach would take care of that, while not penalizing any non-malicious
> users.

Ok. I'm too nice a guy to think about malicious users so I simply considered
the kernel error case.
You probably want a diagnostic so people who get mysterious slowdowns can
report:
/var/log/messages included the message "Too many users on lock 0x..."
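
Something as simple as this would do (a sketch; the counter and threshold
names are made up):

	if (sem_user_count > RWSEM_WARN_THRESHOLD)
		printk(KERN_WARNING "rwsem: too many users on lock %p\n", sem);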


>
> So I'm not worried about this at all. I just want people _always_ to think
> about "how could I mis-use this if I was _truly_ evil", and making sure it
> doesn't cause problems for others on the system.
>
> (NOTE: This does not mean that the kernel has to do anything _reasonable_
> under all circumstances. There are cases where Linux has decided that
> "this is not something a reasonable program can do, and if you try to do
> it, we'll give you random results back - but they will not be _security_
> holes". We don't need to be _nice_ to unreasonable requests. We just must
> never panic, otherwise crash or allow unreasonable requests to mess up
> _other_ people)
>
> Linus

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com