I was toying with a scalable rw_mutex and found that it gives ~10% reduction in
system time on ebizzy runs (without the MADV_FREE patch).
2-way x86_64 Pentium D box:
2.6.21
/usr/bin/time ./ebizzy -m -P
60.10user 137.72system 1:49.59elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
/usr/bin/time ./ebizzy -m -P
59.73user 139.50system 1:50.28elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555878minor)pagefaults 0swaps
/usr/bin/time ./ebizzy -m -P
59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
2.6.21-rw_mutex
/usr/bin/time ./ebizzy -m -P
57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
/usr/bin/time ./ebizzy -m -P
58.09user 124.11system 1:43.18elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555876minor)pagefaults 0swaps
/usr/bin/time ./ebizzy -m -P
57.36user 124.92system 1:43.52elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
--
* Peter Zijlstra <[email protected]> wrote:
> I was toying with a scalable rw_mutex and found that it gives ~10%
> reduction in system time on ebizzy runs (without the MADV_FREE patch).
>
> 2-way x86_64 pentium D box:
>
> 2.6.21
>
> /usr/bin/time ./ebizzy -m -P
> 59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
>
> 2.6.21-rw_mutex
>
> /usr/bin/time ./ebizzy -m -P
> 57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
nice! This 6% runtime reduction on a 2-way box will, i suspect, get
exponentially better on systems with more CPUs/cores.
i also like the design, a lot: instead of doing a full new lock type
(with per-arch changes, extra lockdep support, etc.) you layered the
new abstraction on top of mutexes. This makes this hard locking
abstraction look really, really simple, while the percpu_counter trick
makes it scale _perfectly_ for the reader case. Congratulations!
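(To make the idea concrete, here is a heavily simplified, untested sketch
of that layering. The field and function names are illustrative guesses,
not the actual patch; init/teardown and the real slow-path refinements are
omitted. Readers only touch a CPU-local counter; a writer takes a plain
mutex, flags itself, and waits for the per-CPU reader counts to drain.)

/* needs <linux/percpu_counter.h>, <linux/mutex.h>, <linux/wait.h> */

struct rw_mutex {
        struct percpu_counter readers;    /* per-CPU reader count */
        struct mutex write_mutex;         /* serializes writers */
        wait_queue_head_t wait;           /* writers wait here for readers */
        int writer_pending;               /* tells readers to back off */
};

static void rw_mutex_read_lock(struct rw_mutex *rwm)
{
        for (;;) {
                percpu_counter_inc(&rwm->readers);   /* CPU-local fast path */
                smp_mb();
                if (likely(!rwm->writer_pending))
                        return;
                /* a writer wants in: back out and queue behind it */
                percpu_counter_dec(&rwm->readers);
                smp_mb();
                wake_up(&rwm->wait);
                mutex_lock(&rwm->write_mutex);
                mutex_unlock(&rwm->write_mutex);
        }
}

static void rw_mutex_read_unlock(struct rw_mutex *rwm)
{
        percpu_counter_dec(&rwm->readers);
        smp_mb();
        if (unlikely(rwm->writer_pending))
                wake_up(&rwm->wait);      /* last reader lets the writer go */
}

static void rw_mutex_write_lock(struct rw_mutex *rwm)
{
        mutex_lock(&rwm->write_mutex);    /* one writer at a time */
        rwm->writer_pending = 1;          /* stop new readers */
        smp_mb();
        /* O(NR_CPUS): wait for every per-CPU reader count to drain */
        wait_event(rwm->wait, percpu_counter_sum(&rwm->readers) == 0);
}

static void rw_mutex_write_unlock(struct rw_mutex *rwm)
{
        rwm->writer_pending = 0;
        smp_mb();
        mutex_unlock(&rwm->write_mutex);  /* blocked readers re-enter here */
}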
given how nice this looks already, have you considered completely
replacing rwsems with this? I suspect you could test the correctness of
that without doing a mass API changeover, by embedding struct rw_mutex
in struct rwsem and implementing kernel/rwsem.c's API that way. (the
real patch would just flip it all over to rw-mutexes)
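(A minimal sketch of what that embedding could look like, reusing the
illustrative rw_mutex operations sketched above and showing only the plain
lock/unlock entry points, no trylock/downgrade:)

/* hypothetical kernel/rwsem.c built on top of struct rw_mutex */

struct rw_semaphore {
        struct rw_mutex rwm;
};

void down_read(struct rw_semaphore *sem)
{
        rw_mutex_read_lock(&sem->rwm);
}

void up_read(struct rw_semaphore *sem)
{
        rw_mutex_read_unlock(&sem->rwm);
}

void down_write(struct rw_semaphore *sem)
{
        rw_mutex_write_lock(&sem->rwm);
}

void up_write(struct rw_semaphore *sem)
{
        rw_mutex_write_unlock(&sem->rwm);
}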
Ingo
Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
>> I was toying with a scalable rw_mutex and found that it gives ~10%
>> reduction in system time on ebizzy runs (without the MADV_FREE patch).
>>
>> 2-way x86_64 pentium D box:
>>
>> 2.6.21
>>
>> /usr/bin/time ./ebizzy -m -P
>> 59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
>>
>> 2.6.21-rw_mutex
>>
>> /usr/bin/time ./ebizzy -m -P
>> 57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
>
> nice! This 6% runtime reduction on a 2-way box will i suspect get
> exponentially better on systems with more CPUs/cores.
As long as you only have readers, yes.
But I personally find this new rw_mutex not scalable at all if you have
some writers around.
percpu_counter_sum is just an L1 cache eater, and O(NR_CPUS).
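(For reference, percpu_counter_sum() boils down to roughly the following,
locking omitted, which is where the O(NR_CPUS) cache-line traffic comes
from:)

/* roughly what percpu_counter_sum() has to do: read one per-CPU
 * counter, typically one remote cache line, per possible CPU */
static s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
{
        s64 ret = fbc->count;
        int cpu;

        for_each_possible_cpu(cpu)
                ret += *per_cpu_ptr(fbc->counters, cpu);
        return ret;
}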
On Fri, 11 May 2007, Ingo Molnar wrote:
> given how nice this looks already, have you considered completely
> replacing rwsems with this? I suspect you could test the correctness of
> that without doing a mass API changeover, by embedding struct rw_mutex
> in struct rwsem and implementing kernel/rwsem.c's API that way. (the
> real patch would just flip it all over to rw-mutexes)
Ummmm... How much memory would that cost on large systems?
On Fri, 2007-05-11 at 18:52 +0200, Eric Dumazet wrote:
> Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> >> I was toying with a scalable rw_mutex and found that it gives ~10%
> >> reduction in system time on ebizzy runs (without the MADV_FREE patch).
> >>
> >> 2-way x86_64 pentium D box:
> >>
> >> 2.6.21
> >>
> >> /usr/bin/time ./ebizzy -m -P
> >> 59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
> >> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
> >>
> >> 2.6.21-rw_mutex
> >>
> >> /usr/bin/time ./ebizzy -m -P
> >> 57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
> >> 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
> >
> > nice! This 6% runtime reduction on a 2-way box will i suspect get
> > exponentially better on systems with more CPUs/cores.
>
> As long as you only have readers, yes.
>
> But I personally find this new rw_mutex not scalable at all if you have some
> writers around.
>
> percpu_counter_sum is just a L1 cache eater, and O(NR_CPUS)
Yeah, that is true; there are two occurrences, and the one in
rw_mutex_read_unlock() is not strictly needed for correctness.
Write locks are indeed quite expensive. But given the ratio of
reader:writer locks on mmap_sem (I'm not all that familiar with other
rwsem users) this trade-off seems workable.
On Fri, 11 May 2007, Peter Zijlstra wrote:
>
> I was toying with a scalable rw_mutex and found that it gives ~10% reduction in
> system time on ebizzy runs (without the MADV_FREE patch).
>
You break priority inheritance on user-space futexes! :-(
The problem is that the futex waiter has to take the mmap_sem. And as
your rw_mutex isn't PI enabled you get priority inversions :-(
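(For reference, this is the user-space side of what is at stake: a
priority-inheritance pthread mutex, which glibc implements on top of the
PI futex operations:)

#include <pthread.h>

static pthread_mutex_t pi_lock;

int init_pi_lock(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* ask for the PI protocol: contended lock/unlock goes through
         * the kernel's PI futex path and boosts the lock owner */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        return pthread_mutex_init(&pi_lock, &attr);
}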
Esben
On Sat, 2007-05-12 at 11:27 +0200, Esben Nielsen wrote:
>
> On Fri, 11 May 2007, Peter Zijlstra wrote:
>
> >
> > I was toying with a scalable rw_mutex and found that it gives ~10% reduction in
> > system time on ebizzy runs (without the MADV_FREE patch).
> >
>
> You break priority inheritance on user-space futexes! :-(
> The problem is that the futex waiter has to take the mmap_sem. And as
> your rw_mutex isn't PI enabled you get priority inversions :-(
Do note that rwsems have no PI either.
PI is not a concern for mainline - yet, I do have ideas here though.
On Sat, 12 May 2007, Peter Zijlstra wrote:
> On Sat, 2007-05-12 at 11:27 +0200, Esben Nielsen wrote:
>>
>> On Fri, 11 May 2007, Peter Zijlstra wrote:
>>
>>>
>>> I was toying with a scalable rw_mutex and found that it gives ~10% reduction in
>>> system time on ebizzy runs (without the MADV_FREE patch).
>>>
>>
>> You break priority inheritance on user-space futexes! :-(
>> The problem is that the futex waiter has to take the mmap_sem. And as
>> your rw_mutex isn't PI enabled you get priority inversions :-(
>
> Do note that rwsems have no PI either.
> PI is not a concern for mainline - yet, I do have ideas here though.
>
>
If PI wasn't a concern for mainline, why were PI futexes merged into the
mainline?
I notice that the rwsems used now aren't priority-inversion safe (thus
destroying the purpose of having PI futexes). We thus already have a bug
in the mainline.
I suggest making an rw_mutex which does read-side PI: a reader boosts the
writer, but a writer can't boost the readers, since there can be a large
number of those.
I don't have time to make such an rw_mutex, but I have a simple idea for
one, where the rt_mutex can be reused.
struct pi_rw_mutex {
        int count;      /*  0 -> unoccupied,
                         * >0 -> the number of current readers,
                         *       second-highest bit: a writer is waiting,
                         * -1 -> a writer holds it. */
        struct rt_mutex mutex;
};
Use atomic exchange on count.
When locking:
A writer checks if count <= 0. If so, it sets the value to -1 and takes
the mutex. When it gets the mutex it rechecks the count and proceeds.
If count > 0, the writer sets the second-highest bit, adds itself to
the wait-list in the mutex and sleeps. (The mutex will now be in a state
where owner == NULL but there are waiters. It must be checked whether the
rt_mutex code can handle this.)
A reader checks if count >= 0. If so, it does count++ and proceeds.
If count < 0, it takes the rt_mutex. When it gets the mutex it sets the
count to 1, unlocks the mutex and proceeds.
When unlocking:
The writer sets count to 0, or to just the waiting-writer bit if the
mutex still has waiters, and unlocks the mutex.
The reader checks whether it is the last reader with a writer waiting
(count == the waiting-writer bit + 1). If so, it sets count to 0 and
wakes up the first waiter on the mutex (if there is one). Otherwise it
just does count--.
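(A rough, untested sketch of just the reader paths of this scheme. The
flag value and names are guesses; the writer paths and the rt_mutex
wait-list handling are exactly the parts noted above as still needing to
be checked:)

#define PI_RWM_WRITER_WAITING 0x40000000  /* the "waiting writer" bit */

struct pi_rw_mutex {
        atomic_t count;                   /* encoding as described above */
        struct rt_mutex mutex;
};

static void pi_rw_mutex_read_lock(struct pi_rw_mutex *rwm)
{
        int old;

        for (;;) {
                old = atomic_read(&rwm->count);
                if (old < 0) {
                        /* a writer holds it: sleep on the rt_mutex and
                         * get PI-boosted the normal way */
                        rt_mutex_lock(&rwm->mutex);
                        /* "set count to 1" as described above; a real
                         * implementation would have to cope with several
                         * queued readers here */
                        atomic_set(&rwm->count, 1);
                        rt_mutex_unlock(&rwm->mutex);
                        return;
                }
                /* readers (possibly with a writer queued): just count++ */
                if (atomic_cmpxchg(&rwm->count, old, old + 1) == old)
                        return;
        }
}

static void pi_rw_mutex_read_unlock(struct pi_rw_mutex *rwm)
{
        /* last reader with a writer queued: reset count and hand over */
        if (atomic_cmpxchg(&rwm->count, PI_RWM_WRITER_WAITING | 1, 0) ==
            (PI_RWM_WRITER_WAITING | 1)) {
                /* wake the first waiter sleeping on rwm->mutex here */
                return;
        }
        atomic_dec(&rwm->count);
}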
Esben
* Esben Nielsen <[email protected]> wrote:
> I notice that the rwsems used now aren't priority-inversion safe (thus
> destroying the purpose of having PI futexes). We thus already have a
> bug in the mainline.
you see everything in black and white, ignoring all the grey scales!
Upstream PI futexes are perfectly fine as long as the mm semaphore is
not write-locked (by anyone) while the critical path is running. Given
that real-time tasks often use mlockall and other practices to simplify
their workload, this is not all that hard to achieve.
> I suggest making a rw_mutex which does read side PI: A reader boosts
> the writer, but a writer can't boost the readers, since there can be a
> large amount of those.
this happens automatically when you use Peter's stuff on -rt. But
mainline should not be bothered with this.
> I don't have time to make such a rw_mutex but I have a simple idea for
> one, where the rt_mutex can be reused.
Peter's stuff does this already if you remap all the mutex ops to
rt_mutex ops. Which is also what happens on -rt automatically. Ok?
[ for mainline it is totally pointless and unnecessary to slow down all
MM ops via an rt_mutex, because there are so many other, much larger
effects that make execution time unbounded (interrupts, for example). ]
Ingo
Esben Nielsen wrote:
>
>
> On Sat, 12 May 2007, Peter Zijlstra wrote:
>
>> On Sat, 2007-05-12 at 11:27 +0200, Esben Nielsen wrote:
>>>
>>> On Fri, 11 May 2007, Peter Zijlstra wrote:
>>>
>>>>
>>>> I was toying with a scalable rw_mutex and found that it gives ~10%
>>>> reduction in
>>>> system time on ebizzy runs (without the MADV_FREE patch).
>>>>
>>>
>>> You break priority inheritance on user-space futexes! :-(
>>> The problem is that the futex waiter has to take the mmap_sem. And as
>>> your rw_mutex isn't PI enabled you get priority inversions :-(
>>
>> Do note that rwsems have no PI either.
>> PI is not a concern for mainline - yet, I do have ideas here though.
>>
>>
> If PI wasn't a concern for mainline, why were PI futexes merged into the
> mainline?
If you really care about futexes and mmap_sem, just use private futexes,
since they don't use mmap_sem at all.
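(For reference, a minimal user-space sketch of a private futex wait/wake
via the raw syscall; FUTEX_PRIVATE_FLAG is the flag that keeps the kernel
off mmap_sem, defined here by hand in case the headers predate it:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128
#endif

/* block while *uaddr == val; private, so not shared across processes */
static long futex_wait_private(int *uaddr, int val)
{
        return syscall(SYS_futex, uaddr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                       val, NULL, NULL, 0);
}

static long futex_wake_private(int *uaddr, int nr)
{
        return syscall(SYS_futex, uaddr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                       nr, NULL, NULL, 0);
}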
On Sat, 12 May 2007, Ingo Molnar wrote:
>
> * Esben Nielsen <[email protected]> wrote:
>
>> I notice that the rwsems used now aren't priority-inversion safe (thus
>> destroying the purpose of having PI futexes). We thus already have a
>> bug in the mainline.
>
> you see everything in black and white, ignoring all the grey scales!
> Upstream PI futexes are perfectly fine as long as the mm semaphore is
> not write-locked (by anyone) while the critical path is running. Given
> that real-time tasks often use mlockall and other practices to simplify
> their workload so this is not all that hard to achieve.
>
Yeah, after sending that mail I realized I accepted this fact way back...
But I disagree that it is easy to avoid write-locking the mm
semaphore: a simple malloc() might lead to an mmap() call, creating
trouble. Am I right?
>> I suggest making a rw_mutex which does read side PI: A reader boosts
>> the writer, but a writer can't boost the readers, since there can be a
>> large amount of those.
>
> this happens automatically when you use Peter's stuff on -rt.
Because the rw_mutex is translated into a rt_mutex - as you say below.
> But
> mainline should not be bothered with this.
>
I disagree. You lay a large burden on the users of PI futexes to avoid
write-locking the mm semaphore. PI-boosting those writers would be a good
idea even in the mainline.
>> I don't have time to make such a rw_mutex but I have a simple idea for
>> one, where the rt_mutex can be reused.
>
> Peter's stuff does this already if you remap all the mutex ops to
> rt_mutex ops. Which is also what happens on -rt automatically. Ok?
>
That is how -rt works and should work. No disagreement there.
> [ for mainline it is totally pointless and unnecessary to slow down all
> MM ops via an rt_mutex, because there are so many other, much larger
> effects that make execution time unbound. (interrupts for example) ]
>
1) How much slower would the pi_rw_mutex I suggested really be? As far as
I see there is only an overhead when there is congestion. I cannot see
that overhead being much larger than that of a non-PI-boosting
implementation.
2) I know that execution time isn't bounded in the mainline - that is why
-rt is needed. But is it _that_ bad? How low can you get your latencies
with preemption on, on a really busy machine?
Basically, I think PI futexes should provide the same kind of latencies
as the basic thread latency. That is what the user would expect.
Priority inversions should be removed, but of course you can't do better
than the OS does otherwise.
> Ingo
>
Esben
* Esben Nielsen <[email protected]> wrote:
> Yeah, after sending that mail I realized I accepted this fact way
> back... But I disagree that it is easy to avoid write-locking
> the mm semaphore: a simple malloc() might lead to an mmap() call,
> creating trouble. Am I right?
yeah - that's why "hard RT" apps generally either preallocate all memory
in advance, or use special, deterministic allocators. And for "soft RT"
it's all a matter of degree.
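(The usual pattern, sketched purely for illustration, is to lock and
fault in all the memory up front, before anything time-critical runs:)

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define RT_POOL_SIZE (4 * 1024 * 1024)

static char *rt_pool;

/* called once during initialization, before any deadlines matter */
int rt_mem_init(void)
{
        /* pin every current and future mapping: no page faults later */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                return -1;
        /* grow the heap once; later allocations come out of this pool
         * (or a deterministic allocator) instead of malloc()/mmap() */
        rt_pool = malloc(RT_POOL_SIZE);
        if (rt_pool == NULL)
                return -1;
        memset(rt_pool, 0, RT_POOL_SIZE); /* touch every page now */
        return 0;
}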
> > But mainline should not be bothered with this.
>
> I disagree. You lay a large burden on the users of PI futexes to avoid
> write-locking the mm semaphore. PI-boosting those writers would be a
> good idea even in the mainline.
only if it can be done without slowing down all the much more important
uses of the MM semaphore.
> 1) How much slower would the pi_rw_mutex I suggested really be? As far
> as I see there is only an overhead when there is congestion. I can not
> see that that overhead is much larger than a non-PI boosting
> implementation.
it could be measured, but it's certainly not going to be zero.
> 2) I know that execution time isn't bounded in the main-line - that is
> why -rt is needed. But it is _that_ bad? How low can you get your
> latencies with preemption on on a really busy machine?
on mainline? It can get arbitrarily large (read: seconds) in essence.
Ingo
On Sat, 12 May 2007, Eric Dumazet wrote:
Esben Nielsen wrote:
>>
>>
>> On Sat, 12 May 2007, Peter Zijlstra wrote:
>>
>> > On Sat, 2007-05-12 at 11:27 +0200, Esben Nielsen wrote:
>> > >
>> > > On Fri, 11 May 2007, Peter Zijlstra wrote:
>> > >
>> > > >
>> > > > I was toying with a scalable rw_mutex and found that it gives ~10%
>> > > > reduction in
>> > > > system time on ebizzy runs (without the MADV_FREE patch).
>> > > >
>> > >
>> > > You break priority inheritance on user-space futexes! :-(
>> > > The problem is that the futex waiter has to take the mmap_sem. And
>> > > as
>> > > your rw_mutex isn't PI enabled you get priority inversions :-(
>> >
>> > Do note that rwsems have no PI either.
>> > PI is not a concern for mainline - yet, I do have ideas here though.
>> >
>> >
>> If PI wasn't a concern for mainline, why were PI futexes merged into the
>> mainline?
>
> If you really care about futexes and mmap_sem, just use private futexes,
> since they don't use mmap_sem at all.
>
futex_wait_pi() takes mmap_sem. So does futex_fd(). I can't see a code
path into the PI futexes which doesn't use mmap_sem.
There is another way to avoid problems with mmap_sem:
Put the data you need to share between the high-priority tasks and the
normal low-priority tasks in shared memory. The high-priority task(s)
run in an isolated high-priority process with its own mmap_sem. This
high-priority process is mlock'ed and doesn't do any operations that
write-lock mmap_sem.
But it would be nice if you could avoid such a cumbersome workaround...
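(A minimal sketch of that setup: both processes map the same named
shared-memory object, so they share the data without sharing an mm, and
hence without sharing an mmap_sem:)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* map (and create, if needed) a named shared region; link with -lrt */
static void *map_shared_region(const char *name, size_t size)
{
        void *p;
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);

        if (fd < 0)
                return NULL;
        if (ftruncate(fd, size) < 0) {
                close(fd);
                return NULL;
        }
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
}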
Esben
On Fri, May 11, 2007 at 05:56:21PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > I was toying with a scalable rw_mutex and found that it gives ~10%
> > reduction in system time on ebizzy runs (without the MADV_FREE patch).
> >
> > 2-way x86_64 pentium D box:
> >
> > 2.6.21
> >
> > /usr/bin/time ./ebizzy -m -P
> > 59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
> >
> > 2.6.21-rw_mutex
> >
> > /usr/bin/time ./ebizzy -m -P
> > 57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
>
> nice! This 6% runtime reduction on a 2-way box will i suspect get
> exponentially better on systems with more CPUs/cores.
Is this with the MADV_DONTNEED kernel and glibc work?
> i also like the design, a lot: instead of doing a full new lock type
> (with per-arch changes, extra lockdep support, etc.) you layered the
> new abstraction on top of mutexes. This makes this hard locking
> abstraction look really, really simple, while the percpu_counter trick
> makes it scale _perfectly_ for the reader case. Congratulations!
>
> given how nice this looks already, have you considered completely
> replacing rwsems with this? I suspect you could test the correctness of
Not to take anything away from this lock type (because it can have its
uses), but have you considered the size of this lock and its write side
performance?
On Fri, May 11, 2007 at 07:18:33PM +0200, Peter Zijlstra wrote:
> On Fri, 2007-05-11 at 18:52 +0200, Eric Dumazet wrote:
> >
> > But I personally find this new rw_mutex not scalable at all if you have some
> > writers around.
> >
> > percpu_counter_sum is just a L1 cache eater, and O(NR_CPUS)
>
> Yeah, that is true; there are two occurrences, and the one in
> rw_mutex_read_unlock() is not strictly needed for correctness.
>
> Write locks are indeed quite expensive. But given the ratio of
> reader:writer locks on mmap_sem (I'm not all that familiar with other
> rwsem users) this trade-off seems workable.
I guess the problem with that logic is assuming the mmap_sem read side
always needs to be scalable. Given the ratio of threaded:unthreaded
apps, maybe the trade-off swings away from favour?
On Mon, 2007-05-14 at 13:58 +0200, Nick Piggin wrote:
> On Fri, May 11, 2007 at 05:56:21PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > I was toying with a scalable rw_mutex and found that it gives ~10%
> > > reduction in system time on ebizzy runs (without the MADV_FREE patch).
> > >
> > > 2-way x86_64 pentium D box:
> > >
> > > 2.6.21
> > >
> > > /usr/bin/time ./ebizzy -m -P
> > > 59.49user 137.74system 1:49.22elapsed 180%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
> > >
> > > 2.6.21-rw_mutex
> > >
> > > /usr/bin/time ./ebizzy -m -P
> > > 57.85user 124.30system 1:42.99elapsed 176%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+33555877minor)pagefaults 0swaps
> >
> > nice! This 6% runtime reduction on a 2-way box will i suspect get
> > exponentially better on systems with more CPUs/cores.
>
> Is this with the MADV_DONTNEED kernel and glibc work?
No, just the old stuff.
>
> > i also like the design, a lot: instead of doing a full new lock type
> > (with per-arch changes, extra lockdep support, etc.) you layered the
> > new abstraction on top of mutexes. This makes this hard locking
> > abstraction look really, really simple, while the percpu_counter trick
> > makes it scale _perfectly_ for the reader case. Congratulations!
> >
> > given how nice this looks already, have you considered completely
> > replacing rwsems with this? I suspect you could test the correctness of
>
> Not to take anything away from this lock type (because it can have its
> uses), but have you considered the size of this lock and its write side
> performance?
Yeah, I did; I was initially just toying with the concept, to see
if it was doable at all. I only posted because I got these fairly good
numbers.
On Mon, 2007-05-14 at 14:07 +0200, Nick Piggin wrote:
> On Fri, May 11, 2007 at 07:18:33PM +0200, Peter Zijlstra wrote:
> > On Fri, 2007-05-11 at 18:52 +0200, Eric Dumazet wrote:
> > >
> > > But I personally find this new rw_mutex not scalable at all if you have some
> > > writers around.
> > >
> > > percpu_counter_sum is just a L1 cache eater, and O(NR_CPUS)
> >
> > Yeah, that is true; there are two occurrences, and the one in
> > rw_mutex_read_unlock() is not strictly needed for correctness.
> >
> > Write locks are indeed quite expensive. But given the ratio of
> > reader:writer locks on mmap_sem (I'm not all that familiar with other
> > rwsem users) this trade-off seems workable.
>
> I guess the problem with that logic is assuming the mmap_sem read side
> always needs to be scalable. Given the ratio of threaded:unthreaded
> apps, maybe the trade-off swings away from favour?
Could be; I've been bashing my head against the wall trying to find a
scalable write-side solution, but so far I've only got a massive dent in
my brain from the effort.
Perhaps I can do a similar optimistic locking for my rcu-btree as I did
for the radix tree. That way most of the trouble would be pushed onto
the vmas instead of the mm itself, and then it would be up to user-space
to ensure it has on the order of nr_cpu_ids arenas to work in.
Also, as Hugh pointed out in an earlier thread, mmap_sem's write side
also protects the page tables, so we'd need to fix that up too;
presumably the write-side equivalent of the vma lock would then protect
all the underlying page tables...
/me drifting away, rambling incoherently,..