2005-01-28 07:39:32

by Ingo Molnar

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)


* Esben Nielsen <[email protected]> wrote:

> I noticed that you changed rw-locks to behave quite differently under
> real-time preemption: they basically work like normal locks now, i.e.
> there can only be one reader task within each region. That task can,
> however, lock the region recursively. [...]

correct.

> [...] I wanted to start looking at fixing that because it ought to
> hurt scalability quite a bit - and even on UP create a few unneeded
> task switches. [...]

no, it's not a big scalability problem. rwlocks are really a mistake -
if you want scalability and spinlocks/semaphores are not enough then one
should either use per-CPU locks or lockless structures. rwlocks/rwsems
are very unlikely to help much.

> However, the more I think about it the bigger the problem:

yes, the complexity needed to get it to perform in a deterministic manner is why
i introduced this (major!) simplification of locking. It turns out that
most of the time the actual use of rwlocks matches this simplified
'owner-recursive exclusive lock' semantics, so we are lucky.

look at what kind of worst-case scenarios there may already be with
multiple spinlocks (blocker.c). With rwlocks that just gets insane.

Ingo


2005-01-28 11:57:18

by William Lee Irwin III

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

* Esben Nielsen <[email protected]> wrote:
>> [...] I wanted to start looking at fixing that because it ought to
>> hurt scalability quite a bit - and even on UP create a few unneeded
>> task switches. [...]

On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
> are very unlikely to help much.

I wouldn't be so sure about that. SGI is already implicitly relying on
the parallel holding of rwsems for the lockless pagefaulting, and
Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
while, both for large performance gains.


* Esben Nielsen <[email protected]> wrote:
>> However, the more I think about it the bigger the problem:

On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> yes, the complexity needed to get it to perform in a deterministic manner is why
> i introduced this (major!) simplification of locking. It turns out that
> most of the time the actual use of rwlocks matches this simplified
> 'owner-recursive exclusive lock' semantics, so we are lucky.
> look at what kind of worst-case scenarios there may already be with
> multiple spinlocks (blocker.c). With rwlocks that just gets insane.

tasklist_lock is one large exception; it's meant for concurrency there,
and it even gets sufficient concurrency to starve the write side.

Try test_remap.c on mainline vs. -mm to get a microbenchmark-level
notion of the importance of mapping->tree_lock being an rwlock (IIRC
you were cc:'d in at least some of those threads).

net/ has numerous rwlocks, which appear to frequently be associated
with hashtables, and at least some have some relevance to performance.

Are you suggesting that lockless alternatives to mapping->tree_lock,
mm->mmap_sem, and tasklist_lock should be pursued now?


-- wli

2005-01-28 15:28:55

by Ingo Molnar

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)


* William Lee Irwin III <[email protected]> wrote:

> On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> > no, it's not a big scalability problem. rwlocks are really a mistake -
> > if you want scalability and spinlocks/semaphores are not enough then one
> > should either use per-CPU locks or lockless structures. rwlocks/rwsems
> > are very unlikely to help much.
>
> I wouldn't be so sure about that. SGI is already implicitly relying on
> the parallel holding of rwsems for the lockless pagefaulting, and
> Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
> while, both for large performance gains.

i dont really buy it. Any rwlock-type locking causes global cacheline
bounces. It can make a positive scalability difference only if the
read-lock hold time is large, at which point RCU could likely have
significantly higher performance. There _may_ be an intermediate locking
pattern that is long-held but has a higher mix of write-locking, where
rwlocks/rwsems may have a performance advantage over RCU or spinlocks.

Also this is about PREEMPT_RT, mainly aimed towards embedded systems,
and at most aimed towards small (dual-CPU) SMP systems, not the really
big systems.

But, the main argument wrt. PREEMPT_RT stands and is independent of any
scalability properties: rwlocks/rwsems have such bad deterministic
behavior that they are almost impossible to implement in a sane way.

Ingo

2005-01-28 15:59:44

by William Lee Irwin III

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

* William Lee Irwin III <[email protected]> wrote:
>> I wouldn't be so sure about that. SGI is already implicitly relying on
>> the parallel holding of rwsems for the lockless pagefaulting, and
>> Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
>> while, both for large performance gains.

On Fri, Jan 28, 2005 at 04:28:02PM +0100, Ingo Molnar wrote:
> i dont really buy it. Any rwlock-type locking causes global cacheline
> bounces. It can make a positive scalability difference only if the
> read-lock hold time is large, at which point RCU could likely have
> significantly higher performance. There _may_ be an intermediate locking
> pattern that is long-held but has a higher mix of write-locking, where
> rwlocks/rwsems may have a performance advantage over RCU or spinlocks.

The performance relative to mutual exclusion is quantifiable and very
reproducible. These results have baffled people making arguments similar
to the one you make above. Systems as small as 4 logical cpus feel these
effects strongly, and the gain appears to scale almost linearly with the
number of cpus. It may be worth consulting an x86 processor architect
or similar to get an idea of why the counterargument fails. I'm rather
interested in hearing why as well, as I believed the cacheline-bounce
argument until presented with incontrovertible evidence to the contrary.

As far as performance relative to RCU goes, I suspect cases where
write-side latency is important will arise for these. Other lockless
methods are probably more appropriate, and are more likely to dominate
rwlocks as expected. For instance, a reimplementation of the radix
trees for lockless insertion and traversal (cf. the lockless pagetable
patches for examples of how that's carried out) is plausible, where RCU
memory overhead in struct page is not.


On Fri, Jan 28, 2005 at 04:28:02PM +0100, Ingo Molnar wrote:
> Also this is about PREEMPT_RT, mainly aimed towards embedded systems,
> and at most aimed towards small (dual-CPU) SMP systems, not the really
> big systems.
> But, the main argument wrt. PREEMPT_RT stands and is independent of any
> scalability properties: rwlocks/rwsems have such bad deterministic
> behavior that they are almost impossible to implement in a sane way.

I suppose if it's not headed toward mainline the counterexamples don't
really matter. I don't have much to say about the RT-related issues,
though I'm aware of priority inheritance's infeasibility for the
rwlock/rwsem case.


-- wli

2005-01-28 16:17:13

by Ingo Molnar

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)


* William Lee Irwin III <[email protected]> wrote:

> The performance relative to mutual exclusion is quantifiable and very
> reproducible. [...]

yes, i dont doubt the results - my point is that it's not proven that
the other, more read-friendly types of locking underperform rwlocks.
Obviously spinlocks and rwlocks have the same cache-bounce properties,
so rwlocks can outperform spinlocks if the read path overhead is higher
than that of a bounce, and reads are dominant. But it's still a poor
form of scalability. In fact, when the read path is really expensive
(larger than say 10-20 usecs) an rwlock can produce the appearance of
linear scalability, when compared to spinlocks.

> As far as performance relative to RCU goes, I suspect cases where
> write-side latency is important will arise for these. Other lockless
> methods are probably more appropriate, and are more likely to dominate
> rwlocks as expected. For instance, a reimplementation of the radix
> trees for lockless insertion and traversal (cf. the lockless pagetable
> patches for examples of how that's carried out) is plausible, where
> RCU memory overhead in struct page is not.

yeah.

Ingo

2005-01-28 19:24:33

by Trond Myklebust

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

On Friday 28.01.2005 at 08:38 (+0100), Ingo Molnar wrote:

> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
> are very unlikely to help much.

If you do have a rare, highest-priority event that causes all activity to
block, then rwsems may indeed fit the bill.

In the NFS client code we may use rwsems in order to protect stateful
operations against the (very infrequently used) server reboot recovery
code. The point is that when the server reboots, the server forces us to
block *all* requests that involve adding new state (e.g. opening an
NFSv4 file, or setting up a lock) while our client and others are
re-establishing their existing state on the server.

IOW: If you are planning on converting rwsems into semaphores, you will
screw us over most royally by converting the currently highly infrequent
scenario of only a single task being able to access the server into the
common case.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-01-28 19:59:31

by Ingo Molnar

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)


* Trond Myklebust <[email protected]> wrote:

> If you do have a highest interrupt case that causes all activity to
> block, then rwsems may indeed fit the bill.
>
> In the NFS client code we may use rwsems in order to protect stateful
> operations against the (very infrequently used) server reboot recovery
> code. The point is that when the server reboots, the server forces us
> to block *all* requests that involve adding new state (e.g. opening an
> NFSv4 file, or setting up a lock) while our client and others are
> re-establishing their existing state on the server.

it seems the most scalable solution for this would be a global flag plus
per-CPU spinlocks (or per-CPU mutexes) to make this totally scalable and
still support the requirements of this rare event. An rwsem really
bounces around on SMP, and it seems very unnecessary in the case you
described.

possibly this could be formalised as an rwlock/rwsem implementation
that scales better. brlocks were such an attempt.

> IOW: If you are planning on converting rwsems into semaphores, you
> will screw us over most royally by converting the currently highly
> infrequent scenario of only a single task being able to access the
> server into the common case.

nono, i have no such plans.

Ingo

2005-01-28 21:15:01

by Lee Revell

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

On Fri, 2005-01-28 at 11:18 -0800, Trond Myklebust wrote:
> In the NFS client code we may use rwsems in order to protect stateful
> operations against the (very infrequently used) server reboot recovery
> code. The point is that when the server reboots, the server forces us to
> block *all* requests that involve adding new state (e.g. opening an
> NFSv4 file, or setting up a lock) while our client and others are
> re-establishing their existing state on the server.

Hmm, when I was an ISP sysadmin I used to use this all the time. NFS
mounts from the BSD/OS clients would start to act up under heavy web
server load and the cleanest way to get them to recover was to simulate
a reboot on the NetApp. Of course Linux clients were unaffected, they
were just along for the ride ;-)

Lee

2005-01-28 23:28:38

by Bill Huey

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

On Fri, Jan 28, 2005 at 08:45:46PM +0100, Ingo Molnar wrote:
> * Trond Myklebust <[email protected]> wrote:
> > If you do have a highest interrupt case that causes all activity to
> > block, then rwsems may indeed fit the bill.
> >
> > In the NFS client code we may use rwsems in order to protect stateful
> > operations against the (very infrequently used) server reboot recovery
> > code. The point is that when the server reboots, the server forces us
> > to block *all* requests that involve adding new state (e.g. opening an
> > NFSv4 file, or setting up a lock) while our client and others are
> > re-establishing their existing state on the server.
>
> it seems the most scalable solution for this would be a global flag plus
> per-CPU spinlocks (or per-CPU mutexes) to make this totally scalable and
> still support the requirements of this rare event. An rwsem really
> bounces around on SMP, and it seems very unnecessary in the case you
> described.
>
> possibly this could be formalised as an rwlock/rwlock implementation
> that scales better. brlocks were such an attempt.

From how I understand it, you'll have to have a global structure to
denote an exclusive operation, and then take some additional cpumask_t
representing the set of spinlocks and iterate over it when doing a
PI chain operation.

Locking each individual parametric typed spinlock might require a
raw_spinlock to manipulate list structures, which, added up, is rather
heavyweight.

Not only that, you'd have to introduce a notion of it being counted,
since it could also be acquired/preempted by another higher priority
thread on that same processor. Not having this semantic would make the
thread in that specific circumstance effectively non-preemptible (PI
scheduler indeterminacy), where the multiple-readers portion of a real
read/write (shared-exclusive) lock would have permitted this.

http://people.lynuxworks.com/~bhuey/rt-share-exclusive-lock/rtsem.tgz.1208

This is our attempt at getting real shared-exclusive lock semantics in a
blocking lock; it may still be incomplete and buggy. Igor is still
working on it, and this is the latest that I have of his work. Getting
comments on this approach would be good, as I/we (me/Igor) have believed
from the start that this approach is correct.

Assuming that this is possible with the current approach, optimizing it
to avoid CPU ping-ponging is an important next step.

bill

2005-01-30 22:14:59

by Esben Nielsen

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

On Fri, 28 Jan 2005, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > I noticed that you changed rw-locks to behave quite differently under
> > real-time preemption: they basically work like normal locks now, i.e.
> > there can only be one reader task within each region. That task can,
> > however, lock the region recursively. [...]
>
> correct.
>
> > [...] I wanted to start looking at fixing that because it ought to
> > hurt scalability quite a bit - and even on UP create a few unneeded
> > task switches. [...]
>
> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
> are very unlikely to help much.
>
I agree that RCU ought to do the trick in a lot of cases. Unfortunately,
in a lot of code people haven't used RCU but an rwlock. I also like the
idea of rwlocks: many readers or just one writer. I don't see the need to
take that away from people. Here is an example which even on UP will
give problems without it:
You have a shared data structure, rarely updated, with many readers. A low
priority task is reading it. It is preempted by a high priority task, which
finds out it can't read it -> priority inheritance, task switch. The low
priority task finishes the job -> priority set back, task switch. If it
had been done with an rwlock, two task switches would have been saved.


> > However, the more I think about it the bigger the problem:
>
> yes, the complexity needed to get it to perform in a deterministic manner is why
> i introduced this (major!) simplification of locking. It turns out that
> most of the time the actual use of rwlocks matches this simplified
> 'owner-recursive exclusive lock' semantics, so we are lucky.
>
> look at what kind of worst-case scenarios there may already be with
> multiple spinlocks (blocker.c). With rwlocks that just gets insane.
>
Yes it does. But one could make a compromise: the up_write() should _not_
be deterministic. In that case it would be very simple to implement.
up_read() could still be deterministic, as it would only involve boosting
the one writer in the rare case such a writer exists. That kind of locking
would be very useful in many real-time systems. Of course, RCU can do the
job as well, but it puts a lot of constraints on the code.

However, as Linux is a general OS, there is no way to know whether a
specific lock needs to be deterministic wrt. writing or not, as the actual
application is not known at the time the lock type is specified.

> Ingo
>

Esben

2005-01-30 23:59:21

by Kyle Moffett

Subject: Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

For anybody who wants a good executive summary of RCU, see these:
http://lse.sourceforge.net/locking/rcupdate.html
http://lse.sourceforge.net/locking/rcu/HOWTO/intro.html#WHATIS

On Jan 30, 2005, at 17:03, Esben Nielsen wrote:
> I agree that RCU ought to do the trick in a lot of cases. Unfortunately,
> in a lot of code people haven't used RCU but an rwlock. I also like the
> idea of rwlocks: many readers or just one writer.

Well, RCU is nice because as long as there are no processes attempting to
modify the data, the performance is as though there were no locking at all,
which is better than the cacheline bouncing of rwlock read-acquires, which
must modify the rwlock data every time you acquire. It's only when you need
to modify the data that readers or other writers must repeat their
calculations when they find out that the data has changed. In the case of a
reader and a writer, the performance reduction is the same as a cmpxchg plus
the reader redoing its calculations (if necessary).

> You have a shared data structure, rarely updated, with many readers. A low
> priority task is reading it. It is preempted by a high priority task, which
> finds out it can't read it -> priority inheritance, task switch. The low
> priority task finishes the job -> priority set back, task switch. If it
> had been done with an rwlock, two task switches would have been saved.

With RCU the high priority task (unlikely to be preempted) gets to run all
the way through with its calculation, and any low priority tasks are the
ones that will probably need to redo their calculations.

> Yes it does. But one could make a compromise: the up_write() should _not_
> be deterministic. In that case it would be very simple to implement.
> up_read() could still be deterministic, as it would only involve boosting
> the one writer in the rare case such a writer exists. That kind of locking
> would be very useful in many real-time systems. Of course, RCU can do the
> job as well, but it puts a lot of constraints on the code.

Yeah, unfortunately it's harder to write good reliable RCU code than good
reliable rwlock code, because the semantics of RCU WRT memory access are
much more difficult, so more people write rwlock code that needs to be
cleaned up. It's not like normal locking is easily comprehensible either.
:-\

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------