This patch (made against linux-2.4.4-pre6) makes a number of changes to the
rwsem implementation:
(1) Everything in try #2
plus
(2) Changes proposed by Linus for the generic semaphore code.
(3) Ideas from Andrea and how he implemented his semaphores.
Linus, you suggested that the generic list handling stuff would be faster (2
unconditional stores) than mine (1 unconditional store plus 1 conditional
store and a branch to jump round it). You are both right and wrong. The
generic code does two stores per _process_ woken up (list_del), whereas mine
does its 1 or 2 stores per _batch_ of processes woken up. So the generic way
is better when the queue is an even mixture of readers and writers, and my way
is better when there are far greater numbers of waiting readers. However, that
said, there is not much in it either way, so I've reverted it to the generic
list stuff.
David
On Mon, Apr 23, 2001 at 09:35:34PM +0100, D . W . Howells wrote:
> This patch (made against linux-2.4.4-pre6) makes a number of changes to the
> rwsem implementation:
>
> (1) Everything in try #2
>
> plus
>
> (2) Changes proposed by Linus for the generic semaphore code.
>
> (3) Ideas from Andrea and how he implemented his semaphores.
I benchmarked try3 on top of pre6 and I get this:
----------------------------------------------
RWSEM_GENERIC_SPINLOCK y in rwsem-2.4.4-pre6 + your latest #try3
rw
reads taken: 5842496
writes taken: 3016649
reads taken: 5823381
writes taken: 3006773
r1
reads taken: 13309316
reads taken: 13311722
r2
reads taken: 5010534
reads taken: 5023185
ro
reads taken: 3850228
reads taken: 3845954
w1
writes taken: 13012701
writes taken: 13021716
wo
writes taken: 1825789
writes taken: 1802560
----------------------------------------------
RWSEM_XCHGADD y in rwsem-2.4.4-pre6 + your latest #try3
rw
reads taken: 5789542
writes taken: 2989478
reads taken: 5801777
writes taken: 2995669
r1
reads taken: 16922653
reads taken: 16946132
r2
reads taken: 5650211
reads taken: 5647272
ro
reads taken: 4956250
reads taken: 4959828
w1
writes taken: 15431139
writes taken: 15439790
wo
writes taken: 813756
writes taken: 816005
Updated graph attached. So in short my fast path is still quite a bit faster
(r1/w1), and the slow path is comparable now (I still win in all tests except
wo, which is probably the least interesting one in real life [write
contention]). I still have room to improve the wo test by spending more
icache, btw, but it's probably not worth it.
Andrea
On Mon, 23 Apr 2001, D.W.Howells wrote:
>
> Linus, you suggested that the generic list handling stuff would be faster (2
> unconditional stores) than mine (1 unconditional store plus 1 conditional
> store and a branch to jump round it). You are both right and wrong. The
> generic code does two stores per _process_ woken up (list_del), whereas mine
> does its 1 or 2 stores per _batch_ of processes woken up. So the generic way
> is better when the queue is an even mixture of readers and writers, and my
> way is better when there are far greater numbers of waiting readers. However,
> that said, there is not much in it either way, so I've reverted it to the
> generic list stuff.
Note that the generic list structure already has support for "batching".
It only does it for multiple adds right now (see the "list_splice"
merging code), but there is nothing to stop people from doing it for
multiple deletions too. The code is something like
static inline void list_remove_between(struct list_head *x, struct list_head *y)
{
	x->next = y;
	y->prev = x;
}
and notice how it's still just two unconditional stores for _any_ number
of deleted entries.
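Concretely, a batched dequeue using that helper might look like this (the
helper is repeated with explicit types, and dequeue_waiter_batch() is a
hypothetical caller for illustration, not code from any of the patches in this
thread):

#include <linux/list.h>

/* unlink everything strictly between x and y with just two stores,
 * no matter how many entries that is */
static inline void list_remove_between(struct list_head *x, struct list_head *y)
{
	x->next = y;
	y->prev = x;
}

/* hypothetical caller: the first n waiters on the queue have been chosen
 * for waking; unlink the whole run in one go before waking any of them */
static void dequeue_waiter_batch(struct list_head *head, int n)
{
	struct list_head *last = head;

	while (n--)
		last = last->next;		/* last entry of the batch */

	list_remove_between(head, last->next);
}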
Anyway, I've already applied your #2, how about a patch relative to that?
Linus
On Mon, Apr 23, 2001 at 11:34:35PM +0200, Andrea Arcangeli wrote:
> On Mon, Apr 23, 2001 at 09:35:34PM +0100, D . W . Howells wrote:
> > This patch (made against linux-2.4.4-pre6) makes a number of changes to the
> > rwsem implementation:
> >
> > (1) Everything in try #2
> >
> > plus
> >
> > (2) Changes proposed by Linus for the generic semaphore code.
> >
> > (3) Ideas from Andrea and how he implemented his semaphores.
>
> I benchmarked try3 on top of pre6 and I get this:
>
> ----------------------------------------------
> RWSEM_GENERIC_SPINLOCK y in rwsem-2.4.4-pre6 + your latest #try3
>
> rw
>
> reads taken: 5842496
> writes taken: 3016649
> reads taken: 5823381
> writes taken: 3006773
>
> r1
>
> reads taken: 13309316
> reads taken: 13311722
>
> r2
>
> reads taken: 5010534
> reads taken: 5023185
>
> ro
>
> reads taken: 3850228
> reads taken: 3845954
>
> w1
>
> writes taken: 13012701
> writes taken: 13021716
>
> wo
>
> writes taken: 1825789
> writes taken: 1802560
>
> ----------------------------------------------
> RWSEM_XCHGADD y in rwsem-2.4.4-pre6 + your latest #try3
>
> rw
>
> reads taken: 5789542
> writes taken: 2989478
> reads taken: 5801777
> writes taken: 2995669
>
> r1
>
> reads taken: 16922653
> reads taken: 16946132
>
> r2
>
> reads taken: 5650211
> reads taken: 5647272
>
> ro
>
> reads taken: 4956250
> reads taken: 4959828
>
> w1
>
> writes taken: 15431139
> writes taken: 15439790
>
> wo
>
> writes taken: 813756
> writes taken: 816005
>
> Updated graph attached. So in short my fast path is still quite a bit faster
> (r1/w1), and the slow path is comparable now (I still win in all tests except
> wo, which is probably the least interesting one in real life [write
> contention]). I still have room to improve the wo test by spending more
> icache, btw, but it's probably not worth it.
OK, I have now finished my asm-optimized rwsemaphores, and I improved my
spinlock-based one a little without touching the icache usage.
Here is the code against vanilla pre6:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-8
Here is the same but against David's try#2:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-8-against-dh-try2
And here again the same but against David's latest try#3:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-8-against-dh-try3
The main advantages of my rewrite are:
- my x86 version is visibly faster than yours in the write fast path and it
also saves icache because it's smaller (read fast path is reproducibly a bit
faster too)
- my spinlock version is visibly faster in both the read and write fast paths
and still has a smaller icache footprint (of course my version is completely
out of line too, so I'm comparing apples to apples)
- at least to me, the slow path of my code seems to be much simpler than yours
(and I can actually understand it ;); for example, I don't feel any need for
an atomic exchange anywhere, and I admit that, comparing my code to yours, I
still don't know why you need it
- the common code automatically extends itself to support 2^32 concurrent
sleepers on 64bit archs
- there is no code duplication at all in supporting xchgadd common code logic
for other archs (and I prepared a skeleton to fill for the alpha)
The only disadvantage is that sparc64 will have to fix up some bits again.
Here are the numbers for the above patches on top of vanilla 2.4.4-pre6:
----------------------------------------------
RWSEM_XCHGADD y in rwsem-2.4.4-pre6 + my latest rwsem
rw
reads taken: 5736978
writes taken: 2962325
reads taken: 5799163
writes taken: 2994404
r1
reads taken: 17044842
reads taken: 17053405
r2
reads taken: 5603085
reads taken: 5601647
ro
reads taken: 4831655
reads taken: 4833518
w1
writes taken: 16064773
writes taken: 16037018
wo
writes taken: 860791
writes taken: 864103
----------------------------------------------
RWSEM_SPINLOCK y in rwsem-2.4.4-pre6 + my latest rwsem
rw
reads taken: 6061713
writes taken: 3129801
reads taken: 6099046
writes taken: 3148951
r1
reads taken: 14251500
reads taken: 14265389
r2
reads taken: 4972932
reads taken: 4936267
ro
reads taken: 4253814
reads taken: 4253432
w1
writes taken: 13652385
writes taken: 13632914
wo
writes taken: 1751857
writes taken: 1753608
I drew a graph with the above numbers compared to the previously quoted
numbers I generated a few hours ago with your latest try#3 (attached). (W1 and
R1 are respectively the benchmarks of the write fast path and the read fast
path, i.e. only one writer and only one reader, and they're of course the most
interesting ones.)
So I'd suggest that Linus apply one of my above -8 patches for pre7. (I hope I
won't need a second try.)
This is a diffstat of the rwsem-8 patch:
arch/alpha/config.in | 1
arch/alpha/kernel/alpha_ksyms.c | 4
arch/arm/config.in | 2
arch/cris/config.in | 1
arch/i386/config.in | 4
arch/ia64/config.in | 1
arch/m68k/config.in | 1
arch/mips/config.in | 1
arch/mips64/config.in | 1
arch/parisc/config.in | 1
arch/ppc/config.in | 1
arch/ppc/kernel/ppc_ksyms.c | 2
arch/s390/config.in | 1
arch/s390x/config.in | 1
arch/sh/config.in | 1
arch/sparc/config.in | 1
arch/sparc64/config.in | 4
include/asm-alpha/compiler.h | 9 -
include/asm-alpha/rwsem_xchgadd.h | 27 +++
include/asm-alpha/semaphore.h | 2
include/asm-i386/rwsem.h | 225 --------------------------------
include/asm-i386/rwsem_xchgadd.h | 93 +++++++++++++
include/asm-sparc64/rwsem.h | 35 ++---
include/linux/compiler.h | 13 +
include/linux/rwsem-spinlock.h | 57 --------
include/linux/rwsem.h | 106 ---------------
include/linux/rwsem_spinlock.h | 61 ++++++++
include/linux/rwsem_xchgadd.h | 96 +++++++++++++
include/linux/sched.h | 2
lib/Makefile | 6
lib/rwsem-spinlock.c | 245 -----------------------------------
lib/rwsem.c | 265 --------------------------------------
lib/rwsem_spinlock.c | 124 +++++++++++++++++
lib/rwsem_xchgadd.c | 88 ++++++++++++
35 files changed, 531 insertions(+), 951 deletions(-)
Andrea
> OK, I have now finished my asm-optimized rwsemaphores, and I improved my
> spinlock-based one a little without touching the icache usage.
And I can break it. There's a very good reason that I changed __up_write() to
use CMPXCHG instead of SUBL. I found a sequence of operations that locked up
on this.
Unfortunately, I can't remember what it was. I do have it written down at
home, I think, so I'll see about sending it to you later.
>
> The main advantages of my rewrite are:
>
> - my x86 version is visibly faster than yours in the write fast path and
> it also saves icache because it's smaller
That's because your fast write path is wrong. You can get away without
clobbering EDX and ECX, which I can't.
> (read fast path is reproducibly a bit faster too)
| +static inline void __down_read(struct rw_semaphore *sem)
...
| + : "+m" (sem->count), "+a" (sem)
From what I've been told, you're lucky here... you avoid a pipeline stall
between whatever loads sem into EAX and the INCL that increments what it
points to because the "+a" constraint says that EAX will be changed. This
means that the compiler saves EAX before entering the inline asm, thus
delaying the INCL and thus avoiding the stall. However, you generally seem to
gain this at the expense of clobbering another register, say EDX.
So, for a function that just calls down_read twice, I get:
20: 8b 44 24 04 mov 0x4(%esp,1),%eax
24: f0 ff 00 lock incl (%eax)
27: 0f 88 22 00 00 00 js 4f <dr2+0x2f>
2d: f0 ff 00 lock incl (%eax)
30: 0f 88 30 00 00 00 js 66 <dr2+0x46>
36: c3 ret
And you get:
0: 83 ec 08 sub $0x8,%esp
3: 8b 54 24 0c mov 0xc(%esp,1),%edx
7: 89 d0 mov %edx,%eax
9: f0 ff 02 lock incl (%edx)
c: 0f 88 fc ff ff ff js e <dr+0xe>
12: 89 d0 mov %edx,%eax
14: f0 ff 02 lock incl (%edx)
17: 0f 88 0f 00 00 00 js 2c <dr2+0xc>
1d: 58 pop %eax
1e: 5a pop %edx
1f: c3 ret
Note also your asm constraints cause the compiler to eat an extra 8 bytes of
stack and then to pop it into registers to get rid of it. This is a gcc bug,
and will only hurt if the up_read and down_read are done in separate
subroutines.
| +static inline void __up_read(struct rw_semaphore *sem)
...
| + "jnz 1b\n\t"
| + "pushl %%ecx\n\t"
| + "call rwsem_wake\n\t"
Putting a PUSHL or two there hurt performance when I tried it, because, I'm
told, it introduces a pipeline stall.
> - the common code automatically extends itself to support 2^32
> concurrent sleepers on 64bit archs
You shouldn't do this in the XADD case, since you are explicitly using 32-bit
registers and instructions.
Actually, both of our generic cases allow 2^31 sleepers on a 32-bit arch, and
by changing mine to a long I can make it so we both support up to 2^63 on a
64-bit arch. However, I suspect that is overkill...
> - there is no code duplication at all in supporting xchgadd common code
> logic for other archs (and I prepared a skeleton to fill for the alpha)
Why not make it shareable? It doesn't have to be mandatory...
> So I'd suggest that Linus apply one of my above -8 patches for pre7. (I hope
> I won't need a second try.)
I recommend against this at the moment (there's a bug in __up_write in his X86
optimised code).
David
On Tue, Apr 24, 2001 at 09:56:11AM +0100, David Howells wrote:
> > OK, I have now finished my asm-optimized rwsemaphores, and I improved my
> > spinlock-based one a little without touching the icache usage.
>
> And I can break it. There's a very good reason that I changed __up_write() to
> use CMPXCHG instead of SUBL. I found a sequence of operations that locked up
> on this.
I'd love to hear this sequence. Certainly regression testing never generated
this sequence yet but yes that doesn't mean anything. Note that your slow path
is very different than mine.
My rule is really, really simple:
- the last one that moves the lower half of the word from 1 to 0 has to do one
wakeup
I don't care who does it or where it happens.
That's all. No other real rules. Of course both the wakeup and down_failed
paths then have to do their retire logic with the spinlock acquired, to make
sure they serialize, but that doesn't change the real rule. I don't feel the
need of any xchg to enforce additional serialization.
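To make the rule concrete, here is a rough C sketch of the up_read side (the
exact count layout and the "is anybody waiting" test are just assumptions for
the sketch, and rwsem_xchgadd() is shown with a gcc builtin standing in for
the locked xadd, so this is an illustration rather than the real patch code):

/* assumed layout: low 16 bits of count = "active" count, upper bits = waiters */
#define RWSEM_READ_MASK		0x0000ffff

struct rw_semaphore {
	volatile int count;
	/* wait_lock, wait_list etc. omitted */
};

extern void rwsem_wake(struct rw_semaphore *sem);

/* stand-in for the arch's locked XADD: atomically add, return the old value */
static inline int rwsem_xchgadd(int delta, volatile int *count)
{
	return __sync_fetch_and_add(count, delta);
}

static inline void up_read_sketch(struct rw_semaphore *sem)
{
	int old = rwsem_xchgadd(-1, &sem->count);

	/* whoever takes the low half from 1 to 0 owes the single wakeup;
	 * the upper bits being non-zero is the assumed "somebody is
	 * waiting" test */
	if ((old & RWSEM_READ_MASK) == 1 && (old & ~RWSEM_READ_MASK))
		rwsem_wake(sem);
}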
Anyway, if you can provide a sequence that breaks my simple algorithm I would
love to know it; it would then also be possible to write a kernel module to
exploit it and reproduce the hang in real life. It is possible that I missed
something, but right now the logic seems simple enough to be right (mine, I
mean; yours plays with cmpxchg in a way that I still cannot understand).
> Unfortunately, I can't remember what it was. I do have it written down at
> home, I think, so I'll see about sending it to you later.
Ok thanks!
> > The main advantages of my rewrite are:
> >
> > - my x86 version is visibly faster than yours in the write fast path and
> > it also saves icache because it's smaller
>
> That's because your fast write path is wrong. You can get away without
> clobbering EDX and ECX, which I can't.
>
> > (read fast path is reproducibly a bit faster too)
>
> | +static inline void __down_read(struct rw_semaphore *sem)
> ...
> | + : "+m" (sem->count), "+a" (sem)
>
> From what I've been told, you're lucky here... you avoid a pipeline stall
> between whatever loads sem into EAX and the INCL that increments what it
> points to because the "+a" constraint says that EAX will be changed. This
In fact EAX will be changed because it will be clobbered by the slow path, so I
have to. In fact you are not using the +a like I do there, and you don't save
EAX explicitly on the stack; I think that's "your" bug.
> means that the compiler saves EAX before entering the inline asm, thus
> delaying the INCL and thus avoiding the stall. However, you generally seem to
> gain this at the expense of clobbering another register, say EDX.
Again it's not a performance issue, the "+a" (sem) is a correctness issue
because the slow path will clobber it.
As for the reason I'm faster than you in the down_write fast path: it's that I
can do the subl instead of the cmpxchg. You say this is my big fault; I think
my algorithm allows me to do that, but maybe I'm wrong.
> So, for a function that just calls down_read twice, I get:
> 20: 8b 44 24 04 mov 0x4(%esp,1),%eax
> 24: f0 ff 00 lock incl (%eax)
> 27: 0f 88 22 00 00 00 js 4f <dr2+0x2f>
> 2d: f0 ff 00 lock incl (%eax)
> 30: 0f 88 30 00 00 00 js 66 <dr2+0x46>
> 36: c3 ret
>
> And you get:
> 0: 83 ec 08 sub $0x8,%esp
> 3: 8b 54 24 0c mov 0xc(%esp,1),%edx
> 7: 89 d0 mov %edx,%eax
> 9: f0 ff 02 lock incl (%edx)
> c: 0f 88 fc ff ff ff js e <dr+0xe>
> 12: 89 d0 mov %edx,%eax
> 14: f0 ff 02 lock incl (%edx)
> 17: 0f 88 0f 00 00 00 js 2c <dr2+0xc>
> 1d: 58 pop %eax
> 1e: 5a pop %edx
> 1f: c3 ret
>
> Note also your asm constraints cause the compiler to eat an extra 8 bytes of
> stack and then to pop it into registers to get rid of it. This is a gcc bug,
> and will only hurt if the up_read and down_read are done in separate
> subroutines.
>
> | +static inline void __up_read(struct rw_semaphore *sem)
> ...
> | + "jnz 1b\n\t"
> | + "pushl %%ecx\n\t"
> | + "call rwsem_wake\n\t"
>
> Putting a PUSHL or two there hurt performance when I tried it, because, I'm
> told, it introduces a pipeline stall.
Unfortunately I "have" to put the pushl there because I don't want to save %ecx
in the fast path (if I declare ecx clobbered it's even worse no?).
> > - the common code automatically extends itself to support 2^32
> > concurrent sleepers on 64bit archs
>
> You shouldn't do this in the XADD case, since you are explicitly using 32-bit
> registers and instructions.
I said on 64bit archs. Of course on x86-64 there is xaddq and the rex
registers.
> Actually, both of our generic cases allow 2^31 sleepers on a 32-bit arch, and
Both my generic and asm code only allow 2^16 sleepers on 32bit archs; beyond
that I don't know what happens, and if it works it wasn't intentional ;).
> by changing mine to a long I can make it so we both support up to 2^63 on a
> 64-bit arch. However, I suspect that is overkill...
>
> > - there is no code duplication at all in supporting xchgadd common code
> > logic for other archs (and I prepared a skeleton to fill for the alpha)
>
> Why not make it shareable? It doesn't have to be mandatory...
It isn't mandatory, if you don't want the xchgadd infrastructure then you don't
set even CONFIG_RWSEM_XCHGADD, no?
> I recommend against this at the moment (there's a bug in __up_write in his X86
> optimised code).
Certainly, if there's a bug I agree ;). It will be really fun for me to get a
kernel module that deadlocks my algorithm. Thanks for the auditing effort!
Andrea
Linus Torvalds <[email protected]> wrote:
> Note that the generic list structure already has support for "batching".
> It only does it for multiple adds right now (see the "list_splice"
> merging code), but there is nothing to stop people from doing it for
> multiple deletions too. The code is something like
>
> static inline void list_remove_between(struct list_head *x, struct list_head *y)
> {
> 	x->next = y;
> 	y->prev = x;
> }
>
> and notice how it's still just two unconditional stores for _any_ number
> of deleted entries.
Yes but the "struct rwsem_waiter" batch would have to be entirely deleted from
the list before any of them are woken, otherwise the waking processes may
destroy their "rwsem_waiter" blocks before they are dequeued (this destruction
is not guarded by a spinlock).
This would then reintroduce a second loop to find out which was the last block
we would be waking.
> Anyway, I've already applied your #2, how about a patch relative to that?
Attached.
David
=================================
diff -uNr linux-rwsem-opt2/include/linux/rwsem-spinlock.h linux/include/linux/rwsem-spinlock.h
--- linux-rwsem-opt2/include/linux/rwsem-spinlock.h Tue Apr 24 10:51:58 2001
+++ linux/include/linux/rwsem-spinlock.h Tue Apr 24 08:40:20 2001
@@ -1,6 +1,8 @@
/* rwsem-spinlock.h: fallback C implementation
*
* Copyright (c) 2001 David Howells ([email protected]).
+ * - Derived partially from ideas by Andrea Arcangeli <[email protected]>
+ * - Derived also from comments by Linus
*/
#ifndef _LINUX_RWSEM_SPINLOCK_H
@@ -11,6 +13,7 @@
#endif
#include <linux/spinlock.h>
+#include <linux/list.h>
#ifdef __KERNEL__
@@ -19,14 +22,16 @@
struct rwsem_waiter;
/*
- * the semaphore definition
+ * the rw-semaphore definition
+ * - if activity is 0 then there are no active readers or writers
+ * - if activity is +ve then that is the number of active readers
+ * - if activity is -1 then there is one active writer
+ * - if wait_list is not empty, then there are processes waiting for the semaphore
*/
struct rw_semaphore {
- __u32 active;
- __u32 waiting;
+ __s32 activity;
spinlock_t wait_lock;
- struct rwsem_waiter *wait_front;
- struct rwsem_waiter **wait_back;
+ struct list_head wait_list;
#if RWSEM_DEBUG
int debug;
#endif
@@ -42,7 +47,7 @@
#endif
#define __RWSEM_INITIALIZER(name) \
-{ 0, 0, SPIN_LOCK_UNLOCKED, NULL, &(name).wait_front __RWSEM_DEBUG_INIT }
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
#define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)
diff -uNr linux-rwsem-opt2/lib/rwsem-spinlock.c linux/lib/rwsem-spinlock.c
--- linux-rwsem-opt2/lib/rwsem-spinlock.c Tue Apr 24 10:51:58 2001
+++ linux/lib/rwsem-spinlock.c Tue Apr 24 08:40:20 2001
@@ -2,13 +2,15 @@
* implementation
*
* Copyright (c) 2001 David Howells ([email protected]).
+ * - Derived partially from idea by Andrea Arcangeli <[email protected]>
+ * - Derived also from comments by Linus
*/
#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/module.h>
struct rwsem_waiter {
- struct rwsem_waiter *next;
+ struct list_head list;
struct task_struct *task;
unsigned int flags;
#define RWSEM_WAITING_FOR_READ 0x00000001
@@ -19,7 +21,8 @@
void rwsemtrace(struct rw_semaphore *sem, const char *str)
{
if (sem->debug)
- printk("[%d] %s({%d,%d})\n",current->pid,str,sem->active,sem->waiting);
+ printk("[%d] %s({%d,%d})\n",
+ current->pid,str,sem->activity,list_empty(&sem->wait_list)?0:1);
}
#endif
@@ -28,11 +31,9 @@
*/
void init_rwsem(struct rw_semaphore *sem)
{
- sem->active = 0;
- sem->waiting = 0;
+ sem->activity = 0;
spin_lock_init(&sem->wait_lock);
- sem->wait_front = NULL;
- sem->wait_back = &sem->wait_front;
+ INIT_LIST_HEAD(&sem->wait_list);
#if RWSEM_DEBUG
sem->debug = 0;
#endif
@@ -48,60 +49,58 @@
*/
static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
{
- struct rwsem_waiter *waiter, *next;
- int woken, loop;
+ struct rwsem_waiter *waiter;
+ int woken;
rwsemtrace(sem,"Entering __rwsem_do_wake");
- waiter = sem->wait_front;
-
- if (!waiter)
- goto list_unexpectedly_empty;
-
- next = NULL;
+ waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
/* try to grant a single write lock if there's a writer at the front of the queue
* - we leave the 'waiting count' incremented to signify potential contention
*/
if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
- sem->active++;
- next = waiter->next;
+ sem->activity = -1;
+ list_del(&waiter->list);
waiter->flags = 0;
wake_up_process(waiter->task);
- goto discard_woken_processes;
+ goto out;
}
/* grant an infinite number of read locks to the readers at the front of the queue */
woken = 0;
do {
- woken++;
- waiter = waiter->next;
- } while (waiter && waiter->flags&RWSEM_WAITING_FOR_READ);
-
- sem->active += woken;
- sem->waiting -= woken;
-
- waiter = sem->wait_front;
- for (loop=woken; loop>0; loop--) {
- next = waiter->next;
+ list_del(&waiter->list);
waiter->flags = 0;
wake_up_process(waiter->task);
- waiter = next;
- }
+ woken++;
+ if (list_empty(&sem->wait_list))
+ break;
+ waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
+ } while (waiter->flags&RWSEM_WAITING_FOR_READ);
- discard_woken_processes:
- sem->wait_front = next;
- if (!next) sem->wait_back = &sem->wait_front;
+ sem->activity += woken;
out:
rwsemtrace(sem,"Leaving __rwsem_do_wake");
return sem;
+}
+
+/*
+ * wake a single writer
+ */
+static inline struct rw_semaphore *__rwsem_wake_one_writer(struct rw_semaphore *sem)
+{
+ struct rwsem_waiter *waiter;
+
+ sem->activity = -1;
- list_unexpectedly_empty:
- printk("__rwsem_do_wake(): wait_list unexpectedly empty\n");
- printk("[%d] %p = { %d, %d })\n",current->pid,sem,sem->active,sem->waiting);
- BUG();
- goto out;
+ waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
+ list_del(&waiter->list);
+
+ waiter->flags = 0;
+ wake_up_process(waiter->task);
+ return sem;
}
/*
@@ -110,29 +109,27 @@
void __down_read(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;
- struct task_struct *tsk = current;
+ struct task_struct *tsk;
rwsemtrace(sem,"Entering __down_read");
spin_lock(&sem->wait_lock);
- if (!sem->waiting) {
+ if (sem->activity>=0 && list_empty(&sem->wait_list)) {
/* granted */
- sem->active++;
+ sem->activity++;
spin_unlock(&sem->wait_lock);
goto out;
}
- sem->waiting++;
+ tsk = current;
set_task_state(tsk,TASK_UNINTERRUPTIBLE);
/* set up my own style of waitqueue */
- waiter.next = NULL;
waiter.task = tsk;
waiter.flags = RWSEM_WAITING_FOR_READ;
- *sem->wait_back = &waiter; /* add to back of queue */
- sem->wait_back = &waiter.next;
+ list_add_tail(&waiter.list,&sem->wait_list);
/* we don't need to touch the semaphore struct anymore */
spin_unlock(&sem->wait_lock);
@@ -158,30 +155,27 @@
void __down_write(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;
- struct task_struct *tsk = current;
+ struct task_struct *tsk;
rwsemtrace(sem,"Entering __down_write");
spin_lock(&sem->wait_lock);
- if (!sem->waiting && !sem->active) {
+ if (sem->activity==0 && list_empty(&sem->wait_list)) {
/* granted */
- sem->active++;
- sem->waiting++;
+ sem->activity = -1;
spin_unlock(&sem->wait_lock);
goto out;
}
- sem->waiting++;
+ tsk = current;
set_task_state(tsk,TASK_UNINTERRUPTIBLE);
/* set up my own style of waitqueue */
- waiter.next = NULL;
waiter.task = tsk;
waiter.flags = RWSEM_WAITING_FOR_WRITE;
- *sem->wait_back = &waiter; /* add to back of queue */
- sem->wait_back = &waiter.next;
+ list_add_tail(&waiter.list,&sem->wait_list);
/* we don't need to touch the semaphore struct anymore */
spin_unlock(&sem->wait_lock);
@@ -209,8 +203,8 @@
spin_lock(&sem->wait_lock);
- if (--sem->active==0 && sem->waiting)
- __rwsem_do_wake(sem);
+ if (--sem->activity==0 && !list_empty(&sem->wait_list))
+ sem = __rwsem_wake_one_writer(sem);
spin_unlock(&sem->wait_lock);
@@ -226,9 +220,9 @@
spin_lock(&sem->wait_lock);
- sem->waiting--;
- if (--sem->active==0 && sem->waiting)
- __rwsem_do_wake(sem);
+ sem->activity = 0;
+ if (!list_empty(&sem->wait_list))
+ sem = __rwsem_do_wake(sem);
spin_unlock(&sem->wait_lock);
> I see what you meant here and no, I'm not lucky, I thought about that. gcc
> 2.95.* seems smart enough to produce (%%eax) that you hardcoded when the
> sem is not a constant (I'm not clobbering another register, if it does it's
> stupid and I consider this a compiler mistake).
It is a compiler mistake... the compiler clobbers another register for
you. The compiler does not, however, know about timing issues with the
contents of the inline assembly... otherwise it'd stick a delay in front of
the XADD in my stuff.
> I tried with a variable pointer and gcc as I expected generated the (%%eax)
> but instead when it's a constant like in the bench my way it avoids to stall
> the pipeline by using the constant address for the locked incl, exactly as
> you said and that's probably why I beat you on the down read fast path too.
> (I also benchmarked with a variable semaphore and it was running a little
> slower)
*grin* Fun ain't it... Try it on a dual athlon or P4 and the answer may come
out differently.
David
On Tue, Apr 24, 2001 at 11:33:13AM +0100, David Howells wrote:
> *grin* Fun ain't it... Try it on a dual athlon or P4 and the answer may come
> out differently.
Compile with -mathlon and the compiler should then generate the (%%eax) if
that's faster, even if the sem is a constant; that's a compiler issue IMHO, I
just give the compiler the flexibility to make the right choice.
Andrea
On Tue, Apr 24, 2001 at 11:25:23AM +0100, David Howells wrote:
> > I'd love to hear this sequence. Certainly regression testing never generated
> > this sequence yet but yes that doesn't mean anything. Note that your slow
> > path is very different than mine.
>
> One of my testcases fell over on it...
So did you reproduce a deadlock with my patch applied, or are you saying you
discovered that case with one of your testcases?
I'm asking because I have been running regression tests on my patch for
several hours now and they haven't shown up anything wrong yet (however, I'm
mainly using my rwsem program to better stress the rwsem in an interesting
environment with different timings as well; your stress tests by default look
less aggressive than my rwsem, and in fact the bug I found in your code was
never triggered by your testcases until I changed them).
> > I don't feel the need of any xchg to enforce additional serialization.
>
> I don't use XCHG anywhere... do you mean CMPXCHG?
yes of course, you use rwsem_cmpxchgw that is unnecessary.
>
> > yours plays with cmpxchg in a way that I still cannot understand
>
> It tries not to let the "active count" transition 1->0 happen if it can avoid
I don't try to avoid it at all. I let it happen every time it wants to,
immediately, as soon as it can happen.
> it (ie: it would rather wake someone up and not decrement the count). It also
I always retire first.
> only calls __rwsem_do_wake() if the caller manages to transition the active
> count 0->1.
I call rwsem_wake if, while retiring, the counter has gone down to 0, so it's
time to wake up somebody according to my rule; then I handle the additional
0->1 synchronization inside rwsem_wake, and if that fails I break out of
rwsem_wake in the middle while doing the usual retire check from 1 to 0.
That's why up_write is a no-brainer for me as far as I can see, and that's
probably why I can provide a much faster up_write fast path, as the benchmark
shows.
> > In fact EAX will be changed because it will be clobbered by the slow path,
> > so I have to. In fact you are not using the +a like I do there, and you
> > don't save EAX explicitly on the stack; I think that's "your" bug.
>
> Not so... my down-failed slowpath functions return sem in EAX.
Ah, OK, I hadn't had that idea; I will optimize that bit in my code now.
> > Again it's not a performance issue, the "+a" (sem) is a correctness issue
> > because the slow path will clobber it.
>
> There must be a performance issue too, otherwise our read up/down fastpaths
> are the same. Which clearly they're not.
I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count)
that is written as a constant, that was definitely intentional idea. For
me right now the "+a" (sem) is a correctness issue that is hurting me (probably
not in the benchmark), and I can optimize it the same way you did.
> > I said on 64bit archs. Of course on x86-64 there is xaddq and the rex
> > registers.
>
> But the instructions you've specified aren't 64-bit.
And in fact on 32bit archs I use 32bit regs, xaddl, and at most 2^16 sleepers.
> > It isn't mandatory, if you don't want the xchgadd infrastructure then you
> > don't set even CONFIG_RWSEM_XCHGADD, no?
>
> My point is that mine isn't mandatory either... You just don't set the XADD
> config option.
My point is that when you set XADD you still force duplication of the header
stuff into the asm/*
Andrea
There is a bug in both the C version and the asm version of my rwsem, and it
is in the slow path, where I forgot to drop the _irq part from the spinlock
calls ;) Silly bug. (I inherited it in the asm fast-path version too, because
I started hacking from the same C slow path.)
I caught it now because it locks my alpha up hard as soon as I play with any
rwsem testcase; I'm not sure why x86 is apparently immune to the hard lockup.
I then also added your trick of returning the semaphore, so I can declare "a"
(sem) as read only (that is an improvement for the fast path).
Because of those changes I reran all the benchmarks. I have now finished
re-benchmarking the asm version, and it runs even faster than before under
write contention; the other numbers are basically unchanged. The read down
fast path now runs exactly like yours (so yes, it seems the "+a" was giving a
nonsensical improvement to my code for the down read fast path).
Of course my down write fast path remains significantly faster than yours, and
that really makes sense because my smarter algorithm allows me to avoid all
your cmpxchg stuff.
I'm starting the benchmarks of the C version and I will post a number update
and a new patch in a few minutes.
If you can ship me the testcase (even a theoretical one) that breaks my
algorithm in the next few minutes, that would help.
Andrea
On Tue, Apr 24, 2001 at 02:19:28PM +0200, Andrea Arcangeli wrote:
> I'm starting the benchmarks of the C version and I will post a number update
> and a new patch in a few minutes.
(sorry for the wrap-around below, just grow your terminal to read it straight)
                                 RW (reads)  RW (writes)        R1        R2        RO        W1       WO
dh generic out of line try3         5842496      3016649  13309316   5010534   3850228  13012701  1825789
dh generic out of line try3 #2      5823381      3006773  13311722   5023185   3845954  13021716  1802560
aa generic out of line buggy        6061713      3129801  14251500   4972932   4253814  13652385  1751857
aa generic out of line #2 buggy     6099046      3148951  14265389   4936267   4253432  13632914  1753608
aa generic out of line              6133756      3167138  14244991   5122637   4254504  13656896  1797627
aa generic out of line #2           6093079      3145761  14259141   5126506   4254532  13658447  1803505
dh x86 asm in line try3             5789542      2989478  16922653   5650211   4956250  15431139   813756
dh x86 asm in line try3 #2          5801777      2995669  16946132   5647272   4959828  15439790   816005
aa x86 asm in line buggy            5736978      2962325  17044842   5603085   4831655  16064773   860791
aa x86 asm in line #2 buggy         5799163      2994404  17053405   5601647   4833518  16037018   864103
aa generic in line                  5706875      2946931  16943038   5644018   4837576  16085859   870833
aa generic in line #2               5755126      2971578  16924502   5639379   4836111  16073916   873499
I tagged my previous rows as "buggy", I left your try#3 at the start of each
version, and I added the new numbers with the -9 fixed revision of my rwsem at
the end. A new graph is attached.
So nothing interesting has changed in the numbers as far as I can tell after
the fixes and the improvement of the fast path using "a" instead of "+a".
Unless you can provide a testcase that fails with my smarter and more compact
algorithm, I suggest that Linus merge my code into pre7.
Against pre6:
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-9
against David's try2:
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-9-against-dh-try2
against David's try3:
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre6/rwsem-9-against-dh-try3
I will of course keep doing regression testing over the next few hours.
Andrea
> So did you reproduce a deadlock with my patch applied, or are you saying you
> discovered that case with one of your testcases?
It was my implementation that triggered it (I haven't tried it with yours),
but the bug occurred because the SUBL happened to make the change outside of
the spinlocked region in the slowpath at the same time as the wakeup routine
was running on the other CPU.
I'll have a look at the sequence and make sure that it does actually apply to
your implementation. It may not... but it doesn't hurt to check.
The thing occurred with one of my simple testcases, but only happened once in
a number of runs, fortunately whilst I had the rwsemtrace()'s enabled.
> yes of course, you use rwsem_cmpxchgw that is unnecessary.
Actually, I use this to try and avoid the following loop that you've got in
your code:
> +static void __rwsem_wake(struct rw_semaphore *sem)
...
> + again:
> + count = rwsem_xchgadd(-wait->retire, &sem->count);
> + if (!wake_read && (count & RWSEM_READ_MASK)) {
> + count = rwsem_xchgadd(wait->retire, &sem->count);
> + if ((count & RWSEM_READ_MASK) == 1)
> + goto again;
I now only have that loop in the rwsem_up_write_wake() function.
But! In mine, if __up_write()'s CMPXCHG failed, then it has also read the
counter, which it then passes as an argument to rwsem_up_write_wake(). This
means I can avoid the aforementioned loop in most cases, I suspect, by seeing
if the active counter was 1 at the time of the failed CMPXCHG.
This also means that if an ex-writer wakes up a writer-to-be, the only atomic
instruction performed on sem->count is the CMPXCHG in __up_write().
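In rough C the fast path being described looks like this (a sketch, not the
actual asm: RWSEM_ACTIVE_WRITE_BIAS is an assumed encoding for "one active
writer, nobody else around", the __sync builtin stands in for LOCK CMPXCHG,
and the rwsem_up_write_wake() signature shown is a guess):

/* assumed encoding for "one active writer, nobody waiting or active" */
#define RWSEM_ACTIVE_WRITE_BIAS	0x00010001

struct rw_semaphore {
	volatile int count;
	/* wait_lock, wait_list omitted */
};

/* guessed signature: takes the count value read by the failed CMPXCHG */
extern void rwsem_up_write_wake(int old_count, struct rw_semaphore *sem);

static inline void up_write_sketch(struct rw_semaphore *sem)
{
	/* try to go straight from "one uncontended writer" back to zero */
	int old = __sync_val_compare_and_swap(&sem->count,
					      RWSEM_ACTIVE_WRITE_BIAS, 0);

	if (old != RWSEM_ACTIVE_WRITE_BIAS)
		/* contended: hand the value we already read to the slow path,
		 * which can usually avoid re-reading or looping on the count */
		rwsem_up_write_wake(old, sem);
}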
For the ex-writer waking up readers case, we have to perform an additional
XADD, but this must be done before anyone is woken up, else __up_read() can
get called to decrement the count before we've incremented it. I count the
number of things I want to wake up, adjust the count (one LOCKED ADD only) and
then wake the batch up.
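In rough C, that ordering is as follows (rwsem_atomic_add() stands in for the
single LOCK ADD, the waiter fields are as in the spinlock version, and the
wait_lock is assumed to be held by the caller; again a sketch, not the actual
XADD implementation):

#include <linux/list.h>
#include <linux/sched.h>

#define RWSEM_WAITING_FOR_READ	0x00000001

struct rwsem_waiter {
	struct list_head	list;
	struct task_struct	*task;
	unsigned int		flags;
};

struct rw_semaphore {
	volatile int		count;
	struct list_head	wait_list;
	/* wait_lock omitted; assume the caller holds it */
};

/* stand-in for a single LOCK ADD on sem->count */
static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
{
	__sync_fetch_and_add(&sem->count, delta);
}

static void wake_reader_batch(struct rw_semaphore *sem)
{
	struct list_head *p;
	struct rwsem_waiter *waiter;
	int woken = 0;

	/* 1) count the run of readers at the front of the queue */
	for (p = sem->wait_list.next; p != &sem->wait_list; p = p->next) {
		waiter = list_entry(p, struct rwsem_waiter, list);
		if (!(waiter->flags & RWSEM_WAITING_FOR_READ))
			break;
		woken++;
	}

	/* 2) grant them all with one locked add, *before* waking anyone, so a
	 *    woken reader's __up_read() can't decrement the count too early */
	rwsem_atomic_add(woken, sem);

	/* 3) only now dequeue and wake the batch */
	while (woken--) {
		waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
		list_del(&waiter->list);
		waiter->flags = 0;
		wake_up_process(waiter->task);
	}
}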
You dive into a LOCKED XADD/XADD loop for each process you wake, which in the
best case will give you one LOCKED XADD per process.
Looking again at rwsem_up_read_wake()... I can't actually eliminate the
CMPXCHG there because the active count has already been decremented, and so
will need to be incremented again prior to a wake-up being performed. However,
if the increment was performed and someone else had incremented the count in
the meantime, we have to decrement it again... but this can cause a transition
back to zero, which we have to check for... and if that occurred...
You get the idea, anyway.
Oh yes... this idea should be faster on SPARC/SPARC64 and IA64 which don't
have useful XADD instructions (FETCHADD can only use small immediate values),
only CMPXCHG/CAS are really useful there.
> I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count)
> that is written as a constant, that was definitely intentional idea.
"+m" doesn't avoid the stall. That's just shorthand for:
: "=m"(sem->count) : "m"(sem->count)
which is what mine has.
"a+" luckily causes the avoidance by saying EAX gets clobbered, causing the
compiler to save via an additional register. Note that the compiler saves
anyway, even if the register will be discarded after being restored - yuk!
I think this one will depend on the surrounding code anyway. I suspect in some
chunks of C mine will be faster, and in others yours will be, all depending on
when EAX is loaded.
> My point is that when you set XADD you still force duplication of the header
> stuff into the asm/*
If you want to use my XADD algorithm, then I think that's going to be fair
enough for now. You have to provide some extra routines anyway.
David
On Tue, Apr 24, 2001 at 02:07:47PM +0100, David Howells wrote:
> It was my implementation that triggered it (I haven't tried it with yours),
> but the bug occurred because the SUBL happened to make the change outside of
> the spinlocked region in the slowpath at the same time as the wakeup routine
> was running on the other CPU.
That problem seems not to apply to my slowpath algorithm.
> I'll have a look at the sequence and make sure that it does actually apply to
> your implementation. It may not... but it doesn't hurt to check.
indeed, I'd be glad if you could verify of course, thanks!
> > yes of course, you use rwsem_cmpxchgw that is unnecessary.
>
> Actually, I use this to try and avoid the following loop that you've got in
> your code:
>
> > +static void __rwsem_wake(struct rw_semaphore *sem)
> ...
> > + again:
> > + count = rwsem_xchgadd(-wait->retire, &sem->count);
> > + if (!wake_read && (count & RWSEM_READ_MASK)) {
> > + count = rwsem_xchgadd(wait->retire, &sem->count);
> > + if ((count & RWSEM_READ_MASK) == 1)
> > + goto again;
>
> I now only have that loop in the rwsem_up_write_wake() function.
I don't care about the slow path performance, mainly because to make it faster
I would have to slow down up_write with a cmpxchg too.
> But! In mine, if __up_write()'s CMPXCHG failed, then it has also read the
> counter, which it then passes as an argument to rwsem_up_write_wake(). This
> means I can avoid the aforementioned loop in most cases, I suspect, by seeing
> if the active counter was 1 at the time of the failed CMPXCHG.
>
> This also means that if a ex-writer wakes up a writer-to-be, the only atomic
> instruction performed on sem->count is the CMPXCHG in __up_write().
>
> For the ex-writer waking up readers case, we have to perform an additional
> XADD, but this must be done before anyone is woken up, else __up_read() can
> get called to decrement the count before we've incremented it. I count the
> number of things I want to wake up, adjust the count (one LOCKED ADD only) and
> then wake the batch up.
>
> You dive into a LOCKED XADD/XADD loop for each process you wake, which in the
> best case will give you one LOCKED XADD per process.
I don't dive into the loop for each process I wake; I only execute one locked
xadd for each process I wake (that's the de-retire logic), just as I also
execute 1 xadd for each process that goes to sleep in schedule() in the slow
path (that's the retire logic). That's really clean behaviour of my slow path,
I think.
> Looking again at rwsem_up_read_wake()... I can't actually eliminate the
> CMPXCHG there because the active count has already been decremented, and so
> will need to be incremented again prior to a wake-up being performed. However,
> if the increment was performed and someone else had incremented the count in
> the meantime, we have to decrement it again... but this can cause a transition
> back to zero, which we have to check for... and if that occurred...
I think what you describe is similar to what I'm doing in my algorithm (and
this way I don't need any cmpxchg in either the slow or the fast path).
> You get the idea, anyway.
>
> Oh yes... this idea should be faster on SPARC/SPARC64 and IA64 which don't
> have useful XADD instructions (FETCHADD can only use small immediate values),
> only CMPXCHG/CAS are really useful there.
>
> > I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count)
> > that is written as a constant, that was definitely intentional idea.
>
> "+m" doesn't avoid the stall. That's just shorthand for:
>
> : "=m"(sem->count) : "m"(sem->count)
>
> which is what mine has.
You have it, yes, but you don't use it ;). Whereas I use it, so my compiler can
choose whether to dereference the constant address directly in the incl etc. or
to use the (%%eax) form that you hardcoded.
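To illustrate the difference, here are the two styles side by side (neither
fragment is the actual patch code, just the constraint forms being discussed):

struct rw_semaphore {
	volatile int count;
	/* other fields omitted */
};

/* address pinned in EAX, (%%eax) hardcoded in the template */
static inline void inc_count_pinned(struct rw_semaphore *sem)
{
	__asm__ __volatile__(
		"lock; incl (%%eax)"
		: "=m" (sem->count)
		: "a" (sem), "m" (sem->count));
}

/* plain memory operand: gcc picks the addressing mode itself, which can be a
 * constant address when the semaphore is a compile-time constant */
static inline void inc_count_mem(struct rw_semaphore *sem)
{
	__asm__ __volatile__(
		"lock; incl %0"
		: "+m" (sem->count));
}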
> "a+" luckily causes the avoidance by saying EAX gets clobbered, causing the
> compiler to save via an additional register. Note that the compiler saves
> anyway, even if the register will be discarded after being restored - yuk!
>
> I think this one will depend on the surrounding code anyway. I suspect in some
> chunks of C mine will be faster, and in others yours will be, all depending on
> when EAX is loaded.
OK, but you save only in the slow path, so the fast path should not be
affected. So doing "a" instead of "+a" shouldn't be able to hurt the fast
path; it should only help, I think (that's why I changed it too).
> > My point is that when you set XADD you still force duplication of the header
> > stuff into the asm/*
>
> If you want to use my XADD algorithm, then I think that's going to be fair
> enough for now. You have to provide some extra routines anyway.
From the arch part you "only" have to provide some extra routines (not the
headers); that's why I included <asm/rwsem_xchgadd.h> from <linux/rwsem_xchgadd.h> ;)
Andrea
On Tue, 24 Apr 2001, David Howells wrote:
>
> Yes but the "struct rwsem_waiter" batch would have to be entirely deleted from
> the list before any of them are woken, otherwise the waking processes may
> destroy their "rwsem_waiter" blocks before they are dequeued (this destruction
> is not guarded by a spinlock).
Look again.
Yes, they may destroy the list, but nobody cares.
Why?
- nobody will look up the list because we do have the spinlock at this
point, so a destroyed list doesn't actually _matter_ to anybody
You were actually depending on this earlier, although maybe not on
purpose.
- list_remove_between() doesn't care about the integrity of the entries
it destroys. It only uses, and only changes, the entries that are still
on the list.
Subtlety is fine. It might warrant a comment, though.
Linus
On Tue, 24 Apr 2001, Andrea Arcangeli wrote:
>
> > > Again it's not a performance issue, the "+a" (sem) is a correctness issue
> > > because the slow path will clobber it.
> >
> > There must be a performance issue too, otherwise our read up/down fastpaths
> > are the same. Which clearly they're not.
>
> I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count)
> that is written as a constant, that was definitely intentional idea.
Guys.
You're arguing over stalls that are (a) compiler-dependent and (b) in code
that doesn't happen _anywhere_ except in the specific benchmark you're
using.
Get over it.
- The benchmark may use constant addresses. None of the kernel does. The
benchmark is fairly meaningless in this regard.
- the stalls will almost certainly depend on the code around the thing,
and will also depend on the compiler version. If you're down to
haggling about issues like that, then there is no real difference
between the code.
So calm down guys. And improving the benchmark might not be a bad idea.
Linus
Linus Torvalds <[email protected]> wrote:
> - nobody will look up the list because we do have the spinlock at this
> point, so a destroyed list doesn't actually _matter_ to anybody
I suppose that it'll be okay, provided I take care not to access a block for a
task I've just woken up.
> - list_remove_between() doesn't care about the integrity of the entries
> it destroys. It only uses, and only changes, the entries that are still
> on the list.
True. Okay, I can change it to use that.
David
On Tue, Apr 24, 2001 at 09:56:11AM +0100, David Howells wrote:
> | + : "+m" (sem->count), "+a" (sem)
^^^^^^^^^^ I think you were commenting on
the +m, not the +a, OK
>
> From what I've been told, you're lucky here... you avoid a pipeline stall
I see what you meant here, and no, I'm not lucky: I thought about that. gcc
2.95.* seems smart enough to produce the (%%eax) that you hardcoded when the
sem is not a constant (I'm not clobbering another register; if it does, that's
stupid and I consider it a compiler mistake). I tried with a variable pointer
and, as I expected, gcc generated the (%%eax); but when it's a constant, like
in the bench, my way avoids stalling the pipeline by using the constant
address for the locked incl, exactly as you said, and that's probably why I
beat you on the down read fast path too. (I also benchmarked with a variable
semaphore and it was running a little slower.)
Andrea
> I'd love to hear this sequence. Certainly regression testing never generated
> this sequence yet but yes that doesn't mean anything. Note that your slow
> path is very different than mine.
One of my testcases fell over on it...
> I don't feel the need of any xchg to enforce additional serialization.
I don't use XCHG anywhere... do you mean CMPXCHG?
> yours plays with cmpxchg in a way that I still cannot understand
It tries not to let the "active count" transition 1->0 happen if it can avoid
it (ie: it would rather wake someone up and not decrement the count). It also
only calls __rwsem_do_wake() if the caller manages to transition the active
count 0->1.
This avoids another subtle bug that I think I have a sequence written down for
too.
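As a sketch (with an assumed RWSEM_ACTIVE_MASK and count layout, and the
__sync builtin standing in for LOCK CMPXCHG), the guard amounts to:

/* assumed: low half of count is the active count */
#define RWSEM_ACTIVE_MASK	0x0000ffff

struct rw_semaphore {
	volatile int count;
	/* wait_lock, wait_list omitted */
};

/* only the caller that wins the 0 -> 1 transition of the active count goes on
 * to call __rwsem_do_wake(), so concurrent wakers can't double-wake anybody */
static inline int rwsem_claim_wakeup(struct rw_semaphore *sem)
{
	int old;

	do {
		old = sem->count;
		if (old & RWSEM_ACTIVE_MASK)
			return 0;	/* someone else is active; not our job */
	} while (__sync_val_compare_and_swap(&sem->count, old, old + 1) != old);

	return 1;			/* we made the 0 -> 1 transition */
}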
> In fact EAX will be changed because it will be clobbered by the slow path, so
> I have to. In fact you are not using the +a like I do there, and you don't
> save EAX explicitly on the stack; I think that's "your" bug.
Not so... my down-failed slowpath functions return sem in EAX.
> Again it's not a performance issue, the "+a" (sem) is a correctness issue
> because the slow path will clobber it.
There must be a performance issue too, otherwise our read up/down fastpaths
are the same. Which clearly they're not.
> As for the reason I'm faster than you in the down_write fast path: it's that
> I can do the subl instead of the cmpxchg. You say this is my big fault; I
> think my algorithm allows me to do that, but maybe I'm wrong.
I used to do that.
> Unfortunately I "have" to put the pushl there because I don't want to save
> %ecx in the fast path (if I declare ecx clobbered it's even worse no?).
It benchmarked faster without it for me. I suspect this will be different on
different CPUs anyway.
I'm going to have to have a play with Intel's VTUNE program and see what it
says.
> I said on 64bit archs. Of course on x86-64 there is xaddq and the rex
> registers.
But the instructions you've specified aren't 64-bit.
> It isn't mandatory, if you don't want the xchgadd infrastructure then you
> don't set even CONFIG_RWSEM_XCHGADD, no?
My point is that mine isn't mandatory either... You just don't set the XADD
config option.
David