This is a userspace implementation of rwlocks on top of futexes.
Release was delayed because tdbtorture started crashing... turns out
it's unrelated 2.5.6-pre2 weirdness (after a good 6 hours of debugging
<SIGH>).
So I don't have numbers, but I'm pretty sure this is as good as it
gets without explicit kernel support for rwlocks. Kudos to Paul
Mackerras for the brainwork on this one. Blame me for the name.
ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/futex-1.2.tar.gz
Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell wrote:
> This is a userspace implementation of rwlocks on top of futexes.
>
With the futex approach in mind: does anybody think it's desirable to have
pthread_cond_wait/signal and pthread_mutex_* with inter-process scope built
into the kernel as system calls?
The only issue I see so far is that libpthread should get a "reserved" namespace
entry ( /dev/shm/.linuxthreads-locks ?) to hold all the PTHREAD_PROCESS_SHARED
locks/condvars.
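(For concreteness, this is roughly what the application side would look like if
PTHREAD_PROCESS_SHARED locks actually worked. Standard POSIX calls only; the shm
name below is just an illustration, not the reserved libpthread entry suggested
above, and error handling is omitted:)

    /* Sketch: a PTHREAD_PROCESS_SHARED mutex living in a shared mapping.
     * Compile with -lpthread -lrt.  Error handling omitted for brevity. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Example name only; a "reserved" libpthread entry would be
         * managed by the library, not by the application. */
        int fd = shm_open("/example-shared-lock", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(pthread_mutex_t));

        pthread_mutex_t *m = mmap(NULL, sizeof(*m),
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);   /* usable by any process that
                                           maps the same shm object */
        pthread_mutex_lock(m);
        /* ... critical section shared across processes ... */
        pthread_mutex_unlock(m);
        return 0;
    }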
OTOH IRIX seems to implement inter-process locks as syscalls, so that the kernel does
all the bookkeeping. That approach prevents a malicious program from trashing all the
locks in the system...
Hmm, then we could implement a per-user /dev/shm/.linuxthreads-lock-<uid> with
tight permissions?
What do you think?
Rusty Russell wrote:
>
> This is a userspace implementation of rwlocks on top of futexes.
Question: if rwlocks aren't actually slower in the fast path than futexes,
would it make sense to only do the rw variant, and in some userspace layer
map "traditional" semaphores to write locks?
Saves half the implementation and testing....
On Thursday 07 March 2002 07:40 am, Peter Wächtler wrote:
> Rusty Russell wrote:
> > This is a userspace implementation of rwlocks on top of futexes.
>
> With the futex approach in mind: Does anybody think it's desirable to have
>
> pthread_cond_wait/signal and pthread_mutex_* with inter process scope build
> into the kernel as system call?
>
Yes, I talked with Bill Abt from IBM's NPthreads package about it in
December. Huge value, as it would provide full POSIX compliance.
There are differences depending on whether you have a 1:1 threading model
or an M:N threading model.
Either way, this could be implemented using futexes.
M:N is surely more tricky. The problem is that the calling process/kernel
thread cannot block but has to return to user level to continue another
user-level thread. What needs to happen is something like a signaling
mechanism.
> The only issue I see so far, is that libpthread should get a "reserved"
> namespace entry ( /dev/shm/.linuxthreads-locks ?) to hold all the
> PTHREAD_PROCESS_SHARED locks/condvars.
>
> OTOH Irix seems to implement inter process locks as syscall, so that the
> kernel does all the bookkeeping. That approach denies a malicious program
> to trash all locks in the system...
>
> Hmh, then we could implement a per user /dev/shm/.linuxthreads-lock-<uid>
> with tight permissions?
>
> What do you think?
--
-- Hubertus Franke ([email protected])
On Thursday 07 March 2002 07:11 am, Rusty Russell wrote:
> This is a userspace implementation of rwlocks on top of futexes.
>
> Release was delayed because tdbtorture started crashing... turns out
> it's unrelated 2.5.6-pre2 weirdness (after a good 6 hours of debugging
> <SIGH>).
>
> So I don't have numbers, but I'm pretty sure this is as good as it
> gets without explicit kernel support for rwlocks. Kudos to Paul
> Mackerras for the brainwork on this one. Blame me for the name.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/futex-1.2.tar.gz
>
> Cheers!
> Rusty.
I'll integrate these into ulockflex and see what the numbers are
and whether the implementation is correct.
--
-- Hubertus Franke ([email protected])
On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
> Rusty Russell wrote:
> > This is a userspace implementation of rwlocks on top of futexes.
>
> question: if rwlocks aren't actually slower in the fast path than
> futexes,
> would it make sense to only do the rw variant and in some userspace
> layer
> map "traditional" semaphores to write locks ?
> Saves half the implementation and testing....
I'm not in favor of that. The dominant lock will be mutexes.
--
-- Hubertus Franke ([email protected])
On Thu, Mar 07, 2002 at 10:33:32AM -0500, Hubertus Franke wrote:
> On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
> > Rusty Russell wrote:
> > > This is a userspace implementation of rwlocks on top of futexes.
> >
> > question: if rwlocks aren't actually slower in the fast path than
> > futexes,
> > would it make sense to only do the rw variant and in some userspace
> > layer
> > map "traditional" semaphores to write locks ?
> > Saves half the implementation and testing....
>
> I'm not in favor of that. The dominant lock will be mutexes.
If there's no extra cost, I don't care which is dominant; having one well-tested
path is worth it then. If there is extra cost, then yes, a split is better.
On Thursday 07 March 2002 10:42 am, Arjan van de Ven wrote:
> On Thu, Mar 07, 2002 at 10:33:32AM -0500, Hubertus Franke wrote:
> > On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
> > > Rusty Russell wrote:
> > > > This is a userspace implementation of rwlocks on top of futexes.
> > >
> > > question: if rwlocks aren't actually slower in the fast path than
> > > futexes,
> > > would it make sense to only do the rw variant and in some userspace
> > > layer
> > > map "traditional" semaphores to write locks ?
> > > Saves half the implementation and testing....
> >
> > I'm not in favor of that. The dominant lock will be mutexes.
>
> if there's no extra cost I don't care which is dominant; having one well
> tested path is worth it then. If there is extra cost then yes a split is
> better.
Take a look at Rusty's futex-1.2; the code is not that different. However,
if it's all inlined it creates additional code on the critical path,
and why do it if not necessary?
In this case the futexes are the well-tested path; the rest is a kludge on
top of it.
--
-- Hubertus Franke ([email protected])
Followup to: <[email protected]>
By author: Hubertus Franke <[email protected]>
In newsgroup: linux.dev.kernel
>
> Take a look at Rusty's futex-1.2, the code is not that different, however
> if it's all inlined it creates additional code on the critical path
> and why do it if not necessary.
>
> In this case the futexes are the well-tested path; the rest is a kludge on
> top of it.
>
Perhaps someone could give a high-level description of how these
"futexes" work?
-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
In message <[email protected]> you write:
> On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
> > Rusty Russell wrote:
> > > This is a userspace implementation of rwlocks on top of futexes.
> >
> > question: if rwlocks aren't actually slower in the fast path than
> > futexes,
> > would it make sense to only do the rw variant and in some userspace
> > layer
> > map "traditional" semaphores to write locks ?
> > Saves half the implementation and testing....
>
> I'm not in favor of that. The dominant lock will be mutexes.
To clarify: I'd love this, but rwlocks in the kernel aren't even
vaguely fair. With a steady stream of overlapping readers, a writer
will never get the lock.
Hope that clarifies,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Followup to: <[email protected]>
By author: Rusty Russell <[email protected]>
In newsgroup: linux.dev.kernel
>
> > I'm not in favor of that. The dominant lock will be mutexes.
>
> To clarify: I'd love this, but rwlocks in the kernel aren't even
> vaguely fair. With a steady stream of overlapping readers, a writer
> will never get the lock.
>
Note that there really are two kinds of rwlocks: rwlocks with read
priority, and rwlocks with write priority. They're actually fairly
different operations. I guess one can envision other schemes, too,
but that's the main distinction.
Neither is particularly hard to implement; however, it's probably
better if they are considered different types (perhaps we can call the
ones with write priority "wrlocks" instead of "rwlocks").
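(To make the distinction concrete, here is a small illustration of the rule each
kind implies for an arriving reader; the state structure is hypothetical, not
taken from any posted patch:)

    /* Illustration only: the admission rule a new reader sees under each
     * policy.  "struct rwstate" is a made-up descriptor for this sketch. */
    struct rwstate {
        int active_readers;   /* readers currently holding the lock */
        int active_writer;    /* 1 if a writer holds the lock */
        int waiting_writers;  /* writers queued behind the lock */
    };

    /* Read priority: readers stream past queued writers (can starve them). */
    static int reader_may_enter_rdprio(const struct rwstate *s)
    {
        return !s->active_writer;
    }

    /* Write priority ("wrlock"): a waiting writer blocks new readers. */
    static int reader_may_enter_wrprio(const struct rwstate *s)
    {
        return !s->active_writer && s->waiting_writers == 0;
    }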
-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
On 7 Mar 2002 12:17:52 -0800
"H. Peter Anvin" <[email protected]> wrote:
> Followup to: <[email protected]>
> By author: Hubertus Franke <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > Take a look at Rusty's futex-1.2, the code is not that different, however
> > if it's all inlined it creates additional code on the critical path
> > and why do it if not necessary.
> >
> > In this case the futexes are the well-tested path; the rest is a kludge on
> > top of it.
> >
>
> Perhaps someone could give a high-level description of how these
> "futexes" work?
Certainly! This is how Futexes IV (and Futexes V, ignoring the fairness
stuff that it adds) works:
One or more userspace processes share address space, so they can all do
simple atomic operations on the same memory (hence the new PROT_SEM flag to
mmap/mprotect for architectures which need to know). They agree that "this
address contains an integer which we use as a mutex" (aka. struct futex).
The futex starts at 1 (available). down() looks like:
    if (atomic_dec_and_test(futex)) return 0; /* We got it! */
    else sys_futex(futex, DOWN); /* Go to kernel, wait. */
up() looks like:
    if (atomic_inc_positive(futex)) return 0; /* No one waiting */
    else sys_futex(futex, UP); /* Go to kernel, wake people. */
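(For readers who want something compilable, here is a minimal sketch of the same
idea. This is NOT the futex-1.2 code: it uses the FUTEX_WAIT/FUTEX_WAKE style
interface that ended up in mainline Linux instead of the UP/DOWN calls above,
C11 atomics instead of the library's asm helpers, and the simpler 0 = free,
1 = locked, 2 = locked-with-waiters protocol rather than the counter described
here:)

    #include <linux/futex.h>    /* FUTEX_WAIT, FUTEX_WAKE */
    #include <stdatomic.h>
    #include <sys/syscall.h>    /* SYS_futex */
    #include <unistd.h>         /* syscall() */

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void lock(atomic_int *f)
    {
        int c = 0;
        /* Fast path: 0 -> 1 means we took an uncontended lock. */
        if (atomic_compare_exchange_strong(f, &c, 1))
            return;
        /* Slow path: mark the lock contended (2) and sleep in the kernel
         * while it stays contended, retrying after every wakeup. */
        if (c != 2)
            c = atomic_exchange(f, 2);
        while (c != 0) {
            futex(f, FUTEX_WAIT, 2);    /* sleeps only if *f is still 2 */
            c = atomic_exchange(f, 2);
        }
    }

    static void unlock(atomic_int *f)
    {
        /* If the old value was 2, somebody may be sleeping: wake one up. */
        if (atomic_fetch_sub(f, 1) != 1) {
            atomic_store(f, 0);
            futex(f, FUTEX_WAKE, 1);
        }
    }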
Inside the kernel, we do what you'd expect. For sys_futex(futex, DOWN):
    Pin the page containing the futex, for convenience.
    Hash the kernel address of the futex to get a waitqueue.
    Add ourselves to the waitqueue.
    While !atomic_dec_and_test() (ie. futex still unavailable):
        sleep
        if signal pending, break;
    Unhash ourselves from the waitqueue.
    Unpin the page.
    Return success or -EINTR.
For sys_futex(futex, UP):
    Pin the page, for convenience.
    Hash the kernel address of the futex to get the waitqueue.
    Set the futex to 1 (ie. available).
    Wake up the first one on the waitqueue waiting for this futex.
    Unpin the page.
The only two twists are: (1) we don't keep atomic_dec_and_test'ing in the
loop forever (if it's already negative we don't bother), so the counter
doesn't wrap; and (2) we don't use actual waitqueues, because they are
shared and we want to know which futex each waiter is waiting on.
Pretty simple, no?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell wrote:
> On 7 Mar 2002 12:17:52 -0800
> "H. Peter Anvin" <[email protected]> wrote:
>
>
>>Followup to: <[email protected]>
>>By author: Hubertus Franke <[email protected]>
>>In newsgroup: linux.dev.kernel
>>
>>>Take a look at Rusty's futex-1.2, the code is not that different, however
>>>if it's all inlined it creates additional code on the critical path
>>>and why do it if not necessary.
>>>
>>>In this case the futexes are the well-tested path; the rest is a kludge on
>>>top of it.
>>>
>>>
>>Perhaps someone could give a high-level description of how these
>>"futexes" work?
>>
>
> Certainly! This is how Futexes IV (and Futexes V, ignoring the fairness
> stuff that adds) works:
>
> One or more userspace processes share address space, so they can both do
> simple atomic operations on the same memory (hence the new PROT_SEM flag to
> mmap/mprotect for architectures which need to know). They agree that "this
> address contains an integer which we use as a mutex" (aka. struct futex).
>
> The futex starts at 1 (available). down() looks like:
> if (atomic_dec_and_test(futex)) return 0; /* We got it! */
> else sys_futex(futex, DOWN); /* Go to kernel, wait. */
>
> up() looks like:
> if (atomic_inc_positive(futex)) return 0; /* No one waiting */
> else sys_futex(futex, UP); /* go to kernel, wake people */
>
> Inside the kernel, we do what you'd expect. For sys_futex(futex, DOWN):
> Pin the page containing the futex, for convenience.
> Hash the kernel address of the futex to get a waitqueue.
> Add ourselves to the waitqueue.
> While !atomic_dec_and_test() (ie. futex still unavailable):
> sleep
> if signal pending, break;
> Unhash from waitqueue
> unpin page.
> return success or -EINTR.
>
> For sys_futex(futex, UP):
> Pin page for convenience.
> Hash kernel address of futex to get the waitqueue.
> set futex to 1 (ie. available).
> Wake up the first one on the waitqueue waiting for this futex.
> unpin page
>
> The only two twists to add are that we don't keep atomic_dec_and_test'ing
> in the loop forever (if it's already negative we don't bother), so counter
> doesn't wrap, and we don't use actual waitqueues because we share them, but
> we want to know which futex each waiter is waiting on.
>
Okay, dumb question...
What can this do that shared memory + existing semaphores can't do?
-hpa
In message <[email protected]> you write:
> Okay, dumb question...
>
> What can this do that shared memory + existing semaphores can't do?
>
> -hpa
Ignoring speed issues and the fundamentally horrible interface of SysV
semaphores, there are two main issues:
1) There's a limit on the number of SysV semaphores you can have.
2) Instead of one object to deal with (ie. a memory region), you now
have two, with different lifetimes.
Hope that clarifies,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell wrote:
> In message <[email protected]> you write:
>
>>On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
>>
>>>Rusty Russell wrote:
>>>
>>>>This is a userspace implementation of rwlocks on top of futexes.
>>>>
>>>question: if rwlocks aren't actually slower in the fast path than
>>>futexes,
>>>would it make sense to only do the rw variant and in some userspace
>>>layer
>>>map "traditional" semaphores to write locks ?
>>>Saves half the implementation and testing....
>>>
>>I'm not in favor of that. The dominant lock will be mutexes.
>>
>
> To clarify: I'd love this, but rwlocks in the kernel aren't even
> vaguely fair. With a steady stream of overlapping readers, a writer
> will never get the lock.
>
> Hope that clarifies,
But you are talking about the current implementation, aren't you?
Is there something to prevent an implementation of rwlocks in the
kernel where a wrlock will block (postpone) further rdlock requests?
I mean: the wrlocker prevents new rdlocks from succeeding, waits for the
current rdlockers to release the lock, and then gets the lock...
On Friday 08 March 2002 04:21 am, Peter Wächtler wrote:
> Rusty Russell wrote:
> > In message <[email protected]> you write:
> >>On Thursday 07 March 2002 07:50 am, Arjan van de Ven wrote:
> >>>Rusty Russell wrote:
> >>>>This is a userspace implementation of rwlocks on top of futexes.
> >>>
> >>>question: if rwlocks aren't actually slower in the fast path than
> >>>futexes,
> >>>would it make sense to only do the rw variant and in some userspace
> >>>layer
> >>>map "traditional" semaphores to write locks ?
> >>>Saves half the implementation and testing....
> >>
> >>I'm not in favor of that. The dominant lock will be mutexes.
> >
> > To clarify: I'd love this, but rwlocks in the kernel aren't even
> > vaguely fair. With a steady stream of overlapping readers, a writer
> > will never get the lock.
> >
> > Hope that clarifies,
>
> But you talk about the current implementation, don't you?
> Is there something to prevent an implementation of rwlocks in the
> kernel, where a wrlock will lock (postponed) further rdlock requests?
>
> I mean: the wrlocker prevents newly rdlocks to succeed and waits for the
> current rdlockers to release the lock and then gets the lock..
Correct, the idea is to have four functionalities:
(a) writer preference
    if any writer is waiting, then wake that one up.
(b) reader preference
    if any reader is waiting, wake up all the readers in the queue.
(c) fifo preference
    if the first waiter is a writer, wake it up; otherwise wake up all readers.
(d) fifo-fair preference
    like (c), but only wake up readers until the next writer is encountered.
(a) - (c) can be implemented with Rusty's two-user-queue approach as long
as the wakeup type is always the same. The last one can't (?).
In the kernel this is easy to implement, but the trouble is the status
word in user space; I'm still thinking about it.
It also requires compare and swap.
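(To make the status-word problem concrete, here is one hypothetical single-word
layout and the user-space fast paths it would allow. This is an illustration
only, not the encoding used in any of the patches under discussion; the
contended cases would still fall into sys_futex:)

    /* Hypothetical status word, illustration only: bit 31 = writer holds
     * the lock, bit 30 = at least one writer is waiting, bits 0-29 = number
     * of active readers. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define WRITER_HELD     (1u << 31)
    #define WRITERS_WAITING (1u << 30)

    /* Fast-path read lock with writer preference: refuse the fast path if a
     * writer holds the lock or is waiting for it. */
    static bool read_trylock(atomic_uint *status)
    {
        unsigned int old = atomic_load(status);
        while (!(old & (WRITER_HELD | WRITERS_WAITING))) {
            if (atomic_compare_exchange_weak(status, &old, old + 1))
                return true;            /* one more active reader */
            /* CAS failed: 'old' was reloaded; re-check the bits. */
        }
        return false;                   /* caller takes the slow path */
    }

    /* Fast-path write lock: succeeds only when nobody holds the lock. */
    static bool write_trylock(atomic_uint *status)
    {
        unsigned int expected = 0;
        return atomic_compare_exchange_strong(status, &expected, WRITER_HELD);
    }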
Also, we still need the verdict on the FUTEX_UP and FUTEX_UP_FAIR issue.
Rusty, I noticed you have not said anything about my combining the two things
into the FUTEX V submission. Could you respond with your take on these issues?
--
-- Hubertus Franke ([email protected])
Rusty Russell wrote:
> One or more userspace processes share address space, so they can both do
> simple atomic operations on the same memory (hence the new PROT_SEM flag to
> mmap/mprotect for architectures which need to know). They agree that "this
> address contains an integer which we use as a mutex" (aka. struct futex).
FWIW, there is a precedent from FreeBSD for a MAP_HASSEMAPHORE flag
meaning "this region may contain semaphores".
I don't know if that implies anything about the support of futex system
calls, but it may be the appropriate flag to indicate "atomic operations
are seen as such between processes on these pages".
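(For reference, on FreeBSD the flag is simply passed at mmap time. A minimal
sketch, noting that MAP_HASSEMAPHORE is a BSD flag and not currently meaningful
on Linux:)

    /* BSD usage sketch: MAP_HASSEMAPHORE marks a mapping that may contain
     * semaphores.  FreeBSD-specific; shown only to illustrate the precedent. */
    #include <stddef.h>
    #include <sys/mman.h>

    void *map_semaphore_region(size_t len)
    {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_SHARED | MAP_HASSEMAPHORE, -1, 0);
    }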
-- Jamie
In message <[email protected]> you write:
> Rusty Russell wrote:
> > To clarify: I'd love this, but rwlocks in the kernel aren't even
> > vaguely fair. With a steady stream of overlapping readers, a writer
> > will never get the lock.
> >
> > Hope that clarifies,
>
>
> But you talk about the current implementation, don't you?
> Is there something to prevent an implementation of rwlocks in the
> kernel, where a wrlock will lock (postponed) further rdlock requests?
It's proven hard to do without performance impact. Also, we can't do
rw semaphores in a single word: you need two.
Disproving me by implementation VERY welcome!
Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Friday 08 March 2002 11:50 pm, Rusty Russell wrote:
> In message <[email protected]> you write:
> > Rusty Russell wrote:
> > > To clarify: I'd love this, but rwlocks in the kernel aren't even
> > > vaguely fair. With a steady stream of overlapping readers, a writer
> > > will never get the lock.
> > >
> > > Hope that clarifies,
> >
> > But you talk about the current implementation, don't you?
> > Is there something to prevent an implementation of rwlocks in the
> > kernel, where a wrlock will lock (postponed) further rdlock requests?
>
> It's proven hard to do without performance impact. Also, we can't do
> rw semaphores in a single word: you need two.
>
> Disproving me by implementation VERY welcome!
>
> Hope that helps,
> Rusty.
Well, I did it with one word and cmpxchg, and two queues in the kernel.
The two queues can be consolidated into one if the wait_element for the
queue has a third field, namely read or write.
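(Roughly, the consolidated queue could look like the sketch below; this is just
an illustration of the idea, not code from the patch. The wake-up pass shown
implements the fifo-fair policy (d): wake a writer alone if it is at the head,
otherwise wake readers up to the next writer:)

    /* Sketch of a single wait queue whose elements carry a read/write tag. */
    #include <stddef.h>

    enum waiter_type { WAITER_READ, WAITER_WRITE };

    struct wait_element {
        struct wait_element *next;
        enum waiter_type type;              /* the extra field: read or write */
        void (*wake)(struct wait_element *);
    };

    /* fifo-fair wakeup, policy (d): if the head is a writer wake only it;
     * otherwise wake readers up to, but not including, the next writer. */
    static void wake_fifo_fair(struct wait_element *head)
    {
        if (head == NULL)
            return;
        if (head->type == WAITER_WRITE) {
            head->wake(head);
            return;
        }
        for (struct wait_element *w = head;
             w != NULL && w->type == WAITER_READ;
             w = w->next)
            w->wake(w);
    }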
Did you have a chance to think about the
FUTEX_UP and FUTEX_UP_FAIR issue?
This should be simple enough, as shown in the patch that pulled
your two approaches together. It doesn't seem to come at any
additional cost for the simple FUTEX_UP case.
I thought about it some more and talked to some other folks here.
One option is to wake up multiple waiters as well, to avoid starvation
and increase throughput.
E.g. FUTEX_UP_SOME, where you wake up #cpu tasks.
They would then go and contend for the lock again.
--
-- Hubertus Franke ([email protected])