2005-05-09 19:16:44

by Joe Seigh

Subject: RCU + SMR for preemptive kernel/user threads.

You can get rid of the requirement to disable interrupts or use K2-type
techniques if you combine RCU with SMR (hazard pointers). Basically,
you define the quiescent states for RCU to be anything that is
equivalent to a full memory barrier, e.g. interrupts, context switches,
syscalls, signals, etc. For read access you use SMR hazard pointers
without the usual store/load memory barrier (so they're really fast on
a pipelined machine *). For write access, you use the usual RCU
technique, but after every processor/thread has gone through a
quiescent state at least once, you then do a hazard pointer scan before
freeing the resource. Doing it this way ensures that the hazard pointer
guarded load is observed correctly even though there is no store/load
memory barrier. **

This method combines the load overhead of RCU with the granularity
of SMR hazard pointers. So you can have threads do long-term read
accesses without tying up any resources other than the one they are
referencing.
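
In outline, the two halves might look like this in modern C11 terms (a
minimal sketch, not any particular implementation; the fixed hazard[]
array and the rcu_wait_quiescent()/hazard_scan_finds() helpers are
illustrative stand-ins for the grace-period wait and the scan):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_THREADS 64

static _Atomic(void *) hazard[MAX_THREADS]; /* one slot per reader thread */

void rcu_wait_quiescent(void);     /* stand-in: every thread has passed a
                                      quiescent state at least once */
bool hazard_scan_finds(void *obj); /* stand-in: obj still in some hazard[i]? */

/* Read side: publish the hazard pointer with only a compiler barrier
 * between the store and the validating re-load; the store/load fence
 * classic SMR needs here is supplied by the write side's grace period. */
void *smr_acquire(_Atomic(void *) *global, int self)
{
    void *p;

    do {
        p = atomic_load_explicit(global, memory_order_relaxed);
        atomic_store_explicit(&hazard[self], p, memory_order_relaxed);
        atomic_signal_fence(memory_order_seq_cst); /* compiler barrier only */
    } while (p != atomic_load_explicit(global, memory_order_relaxed));

    return p;
}

/* Write side: after the object is unlinked, wait out a grace period,
 * then do the hazard pointer scan before freeing. */
void smr_retire(void *obj)
{
    rcu_wait_quiescent();
    while (hazard_scan_finds(obj))
        rcu_wait_quiescent();
    free(obj);
}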

If, besides the quiescent states mentioned above, you can query the
run state of a thread, you can use any not-running state as a quiescent
state, since that implies a context switch. This takes care of the
problem of low-priority threads not making forward progress and not
passing through quiescent states in a timely and periodic manner. This
is important not only for user threads, but for kernel threads as well.
One way you can implement this is by looking at per-processor, rather
than per-thread, counts of events that would be considered quiescent
states, e.g. interrupts, timer values, etc. While events of this sort
may be assumed to occur frequently and periodically, that's only true
if Linux is running on real hardware. If it's running on virtual
hardware, a processor may behave more like a user-level thread in that
respect. If the VM allows that, you may want an API that allows
querying the virtual processor's run state.

This is fairly portable. For user threads, it can even be implemented
on Windows using some old NT APIs that are still around, until Longhorn
arrives anyway. For OS X and Solaris there are APIs that let you query
various per-thread counters and run states. For Linux, there's
/proc/stat, which appears to have per-CPU times. *** This is more
efficient than querying threads, since there will likely be more
threads than processors for a while. Plus you don't have to worry about
thread run state with this method, at least for non-virtual processors.
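
For instance, a user-space monitor might snapshot the per-CPU jiffy
totals like this (a rough sketch; only the first four counter fields of
each "cpuN" line are used):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Snapshot the per-cpu jiffy totals from the "cpu0 ...", "cpu1 ..."
 * lines of /proc/stat.  If a later snapshot shows every cpu's total
 * has advanced, each cpu has taken at least one timer tick, which can
 * be treated as a quiescent state.  Returns cpus seen, -1 on error. */
int snapshot_cpu_jiffies(unsigned long long total[], int max)
{
    char line[256];
    int n = 0;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;
    while (n < max && fgets(line, sizeof line, f)) {
        unsigned long long user, nice, sys, idle;

        /* skip the aggregate "cpu " line; we want "cpu<N>" only */
        if (strncmp(line, "cpu", 3) || !isdigit((unsigned char)line[3]))
            continue;
        if (sscanf(line + 3, "%*d %llu %llu %llu %llu",
                   &user, &nice, &sys, &idle) == 4)
            total[n++] = user + nice + sys + idle;
    }
    fclose(f);
    return n;
}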

This virtualization poses interesting problems. What would you do about
a thread that's running on a non-running virtual processor? It's in a
quiescent state, and there needs to be a way to determine this. You
need to think about virtualization more, and about exposing this
information at the application level.

Anyway, the problems that user-level threads have will be the same
problems the kernel has with virtualization. Even if the VM
implementations don't currently support this, defining a public API
would send a message to the VM vendors and give Linux proactive control
over these issues.

====

* Did I say fast? On an iBook with a 1.2 GHz PPC and slow iBook memory
running OS X, running a testcase that has reader threads traversing a
linked list that is being concurrently modified by writer threads, and
pausing during the traversal using sched_yield just to live
dangerously, I get about 500,000 reads (traversals) per second per
thread with RCU+SMR, about 25,000 to 150,000 reads/sec/thread using my
older lock-free methods, and about 700 to 1000 reads/sec/thread using
mutexes or rwlocks.

On a single-processor Linux system, it's hard to get such dramatic
numbers, due to the scheduler artifacts that show up when trying this
kind of stuff on one processor. I get about 500 reads/sec/thread with
RCU+SMR, about 100 reads/sec/thread with the old lock-free stuff, and
the mutex and rwlock versions hang with writer starvation.

The main difference between the old lock-free stuff and the RCU+SMR is
that the old stuff uses memory barriers and/or interlocked instructions.

A raw timing loop shows the hazard pointer load sequence (a load,
store, load, compare, branch conditional) to be about 2x to 3x the cost
of a raw pointer load on both ia32 and ppc.


** If the quiescent state occurred before the SMR hazard pointer load,
then the thread will correctly observe the changed global pointer. If
the quiescent state occurred after the hazard pointer load, the hazard
pointer scan will see the value stored in the hazard pointer. If it
occurred in the middle of a hazard pointer load, any partitioning of
the hazard pointer's program-sequential operations into two sets, with
the first set executed before the second set in processor-allowed
order, will work correctly. I'm assuming some familiarity with SMR
hazard pointer logic and with RCU logic.


*** The per-CPU times are in jiffies, rounded off from some
higher-resolution value. This means that for n time counters, you may
have to wait up to n jiffies for at least one round-up to occur. It
would be nicer to have the higher-resolution value, or even better a
per-processor count of interrupts, to allow more flexibility and finer
granularity in polling intervals.


--
Joe Seigh


2005-05-10 01:11:35

by Paul E. McKenney

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Mon, May 09, 2005 at 03:09:33PM -0400, Joe Seigh wrote:

[ . . . ]

> *** The per-CPU times are in jiffies, rounded off from some
> higher-resolution value. This means that for n time counters, you may
> have to wait up to n jiffies for at least one round-up to occur. It
> would be nicer to have the higher-resolution value, or even better a
> per-processor count of interrupts, to allow more flexibility and finer
> granularity in polling intervals.

You can use smp_call_function() to invoke a function on each of the
other CPUs. The implementations of smp_call_function() use inter-processor
interrupts, which should suffice, but you can always make the called
function just execute a memory barrier.
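
For example (a sketch assuming the four-argument smp_call_function()
of 2.6-era kernels, where the final argument waits for completion):

#include <linux/smp.h>

/* runs in each other CPU's IPI handler */
static void force_mb(void *unused)
{
    smp_mb(); /* explicit, in case the interrupt alone is not enough */
}

static void synchronize_cpus(void)
{
    smp_call_function(force_mb, NULL, 0, 1); /* nonatomic=0, wait=1 */
    smp_mb();                                /* and order this CPU too */
}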

Thanx, Paul

2005-05-10 13:34:06

by Joe Seigh

Subject: Re: RCU + SMR for preemptive kernel/user threads.

I think you need a release memory barrier if you store into a hazard
pointer that is non-null, to prevent prior accesses from occurring after
the store. That's an extra memory barrier for hazard pointers that I
overlooked. One thing that could be done is to wait an extra RCU grace
period after the hazard pointer scan, to ensure prior accesses have
completed before freeing the resource. That would lengthen the delay
to approximately two RCU grace periods. Could be a problem if you have
a high write rate and are trying to keep that delay minimal.

There might be another way. I'd have to investigate it a little more.


--
Joe Seigh

2005-05-10 16:55:23

by Paul E. McKenney

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Tue, May 10, 2005 at 09:32:07AM -0400, Joe Seigh wrote:
> I think you need a release memory barrier if you store into a hazard
> pointer that is non-null, to prevent prior accesses from occurring after
> the store. That's an extra memory barrier for hazard pointers that I
> overlooked. One thing that could be done is to wait an extra RCU grace
> period after the hazard pointer scan, to ensure prior accesses have
> completed before freeing the resource. That would lengthen the delay
> to approximately two RCU grace periods. Could be a problem if you have
> a high write rate and are trying to keep that delay minimal.
>
> There might be another way. I'd have to investigate it a little more.

Here is what I thought you were suggesting for the updater:

    Remove item from list
    Send IPIs to all CPUs, relying on exact interrupts (which might
        not be present on all CPUs) to serialize the instruction
        streams of the other CPUs
    Wait for all IPIs to complete
    Wait until there are no hazard pointers referencing the item
        before freeing.

For the traverser:

1. allocate hazard pointer (SW engr problem: what to do if
   allocation fails? If deeply nested, holding locks, &c?)
2. retry:
3. Pick up pointer, store into hazard pointer.
4. Tell the compiler not to reorder memory references across this point.
5. If hazard pointer does not match the pointer, goto retry.
6. begin critical section
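
In C, steps 2-6 might look like the following (hp_slot standing for
whatever slot step 1 allocated; barrier() is the usual compiler-only
fence):

#include <linux/compiler.h>

void *hp_get(void * volatile *global, void * volatile *hp_slot)
{
    void *p;
retry:
    p = *global;      /* 3. pick up pointer...            */
    *hp_slot = p;     /*    ...store into hazard pointer  */
    barrier();        /* 4. no compiler reordering here   */
    if (p != *global) /* 5. recheck; retry if it moved    */
        goto retry;
    return p;         /* 6. begin critical section        */
}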

If the updater and traverser run concurrently, the interrupt forces
serialization -- look at all the possible interleavings to see this:
1. Before this point, the traverser cannot see the removed element.
2. Ditto.
3. Ditto.
4. Before this point, the traverser might have stored a pointer to
   the removed element into the hazard pointer, but will see the
   disappearance when it returns from interrupt.
5. Ditto.
6. At this point, the hazard pointer has been set, and the
   interrupt will force memory ordering.

Similar reasoning when the traverser NULLs the hazard pointer.

Or am I missing something? Or is there some CPU that Linux supports that
does inexact interrupts?

I must admit that I am not completely comfortable with this approach -- it
needs to be beaten up pretty thoroughly with both testing and theoretical
analysis. And there might well be a flaw in my reasoning above. ;-)

Thanx, Paul

2005-05-10 23:37:33

by Joe Seigh

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Tue, 10 May 2005 09:55:12 -0700, Paul E. McKenney <[email protected]>
wrote:

> Here is what I thought you were suggesting for the updater:
>
>     Remove item from list
>     Send IPIs to all CPUs, relying on exact interrupts (which might
>         not be present on all CPUs) to serialize the instruction
>         streams of the other CPUs
>     Wait for all IPIs to complete
>     Wait until there are no hazard pointers referencing the item
>         before freeing.

Well, anything that acts as a memory barrier, occurs frequently or
can be made to do so, and can be monitored, will work. For the initial
user thread implementations I used unix signals (pthread_kill). Worked
ok on OS X but pretty much tanked on Linux. So no signals as a
common unix user thread implementation of this. It would have to be
/proc type information that would let you infer that the threads in
question had done something that synchronized memory. I don't mess
around in the Linux kernel, so I can't make any specific suggestions
there.
You're the expert there.

>
> For the traverser:
>
> 1. allocate hazard pointer (SW engr problem: what to do if
>    allocation fails? If deeply nested, holding locks, &c?)
> 2. retry:
> 3. Pick up pointer, store into hazard pointer.
> 4. Tell the compiler not to reorder memory references across this point.
> 5. If hazard pointer does not match the pointer, goto retry.
> 6. begin critical section
>
> If the updater and traverser run concurrently, the interrupt forces
> serialization -- look at all the possible interleavings to see this:
> 1. Before this point, the traverser cannot see the removed element.
> 2. Ditto.
> 3. Ditto.
> 4. Before this point, the traverser might have stored a pointer to
>    the removed element into the hazard pointer, but will see the
>    disappearance when it returns from interrupt.
> 5. Ditto.
> 6. At this point, the hazard pointer has been set, and the
>    interrupt will force memory ordering.

Right, the memory ordering on 6 means the updater is guaranteed to see
the hazard pointer set if it has a valid value.

>
> Similar reasoning when the traverser NULLs the hazard pointer.

Setting the hazard pointer to NULL requires release semantics to ensure
prior accesses to the element complete before the element is deleted.
That's an MFENCE on Intel, membar #StoreStore+#LoadStore on sparc, and
lwsync on powerpc. Or you could wait an additional RCU grace period
to ensure those prior accesses complete before deleting the element if
you have RCU set up to track memory synchronization.

RCU "read" access is kind of like a lock. And when you release
a lock, you need release memory synchronization semantics to prevent
accesses occurring outside of the critical region, the end of which
in this case would be defined by the store of NULL into the hazard pointer.

Additionally, if you replace any non-NULL hazard pointer value, you
will need to use release semantics.
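
In C11 terms, the release on the read side would be something like
(the alternative being a plain store plus the extra grace period
described above):

#include <stdatomic.h>

/* Drop a hazard pointer: the release store keeps every prior access
 * to the element ordered before the NULL (or replacement) store. */
static inline void hp_release(_Atomic(void *) *hp_slot)
{
    atomic_store_explicit(hp_slot, NULL, memory_order_release);
}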

There might be something you can do to avoid the extra RCU wait but
I'd have to study it a little more to get a better handle on the
trade offs.

>
> Or am I missing something? Or is there some CPU that Linux supports that
> does inexact interrupts?

It's not the interrupts per se that are of interest here, it's the
memory synchronization. If interrupts are a problem, you could always
find something that does synchronization in a more orderly manner, or
put explicit memory synchronization in the interrupt handler to force
exactitude for this particular case.

>
> I must admit that I am not completely comfortable with this approach --
> it needs to be beaten up pretty thoroughly with both testing and
> theoretical analysis. And there might well be a flaw in my reasoning
> above. ;-)
>

I suppose I should do some kind of formal analysis of this. I'm
figuring out if this technique is interesting enough first, before I
go through all that work.


--
Joe Seigh

2005-05-11 15:06:22

by Paul E. McKenney

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Tue, May 10, 2005 at 06:40:20PM -0400, Joe Seigh wrote:
> On Tue, 10 May 2005 09:55:12 -0700, Paul E. McKenney <[email protected]>
> wrote:
>
> > Here is what I thought you were suggesting for the updater:
> >
> >     Remove item from list
> >     Send IPIs to all CPUs, relying on exact interrupts (which might
> >         not be present on all CPUs) to serialize the instruction
> >         streams of the other CPUs
> >     Wait for all IPIs to complete
> >     Wait until there are no hazard pointers referencing the item
> >         before freeing.
>
> Well, anything that acts as a memory barrier, occurs frequently or
> can be made to do so, and can be monitored, will work. For the initial
> user thread implementations I used unix signals (pthread_kill). Worked
> ok on OS X but pretty much tanked on Linux. So no signals as a
> common unix user thread implementation of this. It would have to be
> /proc type information that would let you infer that the threads in
> question had done something that synchronized memory. I don't mess
> around in the Linux kernel, so I can't make any specific suggestions
> there.

OK, I was thinking in terms of the kernel.

> You're the expert there.

"We", LKML, rather than "me".

> > For the traverser:
> >
> > 1. allocate hazard pointer (SW engr problem: what to do if
> >    allocation fails? If deeply nested, holding locks, &c?)
> > 2. retry:
> > 3. Pick up pointer, store into hazard pointer.
> > 4. Tell the compiler not to reorder memory references across this point.
> > 5. If hazard pointer does not match the pointer, goto retry.
> > 6. begin critical section
> >
> > If the updater and traverser run concurrently, the interrupt forces
> > serialization -- look at all the possible interleavings to see this:
> > 1. Before this point, the traverser cannot see the removed element.
> > 2. Ditto.
> > 3. Ditto.
> > 4. Before this point, the traverser might have stored a pointer to
> >    the removed element into the hazard pointer, but will see the
> >    disappearance when it returns from interrupt.
> > 5. Ditto.
> > 6. At this point, the hazard pointer has been set, and the
> >    interrupt will force memory ordering.
>
> Right, the memory ordering on 6 means the updater is guaranteed to see
> the hazard pointer set if it has a valid value.

OK. Note that the CPU would be free to reorder. The IPIs suppress
this reordering, but only when needed.

> > Similar reasoning when the traverser NULLs the hazard pointer.
>
> Setting the hazard pointer to NULL requires release semantics to ensure
> prior accesses to the element complete before the element is deleted.
> That's an MFENCE on Intel, membar #StoreStore+#LoadStore on sparc, and
> lwsync on powerpc. Or you could wait an additional RCU grace period
> to ensure those prior accesses complete before deleting the element if
> you have RCU set up to track memory synchronization.
>
> RCU "read" access is kind of like a lock. And when you release
> a lock, you need release memory synchronization semantics to prevent
> accesses occurring outside of the critical region, the end of which
> in this case would be defined by the store of NULL into the hazard pointer.

In classic RCU, the release is supplied by the context switch. In your
scheme, couldn't you do the following on the update side?

1. Gather up all the hazard pointers.
2. Send IPIs to all other CPUs.
3. Check the hazard pointers gathered in #1 against the
   blocks to be freed.

The read side would do the following when letting go of a hazard pointer:

1. Prevent the compiler from reordering memory references
   (the CPU would still be free to do so).
2. Set the hazard pointer to NULL.
3. begin non-critical-section code.

Checking where the IPI is received by the read side:

1. Before this point, the updater would have seen the non-NULL hazard
   pointer (if the hazard pointer referenced the data item that was
   previously removed).
2. Ditto.
3. Before this point, the hazard pointer could be seen as NULL, but
   the read-side CPU will also have stopped using the pointer (since
   we are assuming precise interrupts).

Again, not sure if all CPUs support precise interrupts. The ones that I
am familiar with do, at least for IPIs.

> Additionally, if you replace any non-NULL hazard pointer value, you
> will need to use release semantics.

The trick is that the IPI provides the release semantics, but only
when needed. Right?

> There might be something you can do to avoid the extra RCU wait but
> I'd have to study it a little more to get a better handle on the
> trade offs.

True, there will need to be either two RCU waits or two rounds of IPIs.

> > Or am I missing something? Or is there some CPU that Linux supports that
> > does inexact interrupts?
>
> It's not the interrupts per se that are of interest here, it's the memory
> synchronization. If interrupts are a problem, you could always find
> something
> that does synchronization in a more orderly manner, or put explicit memory
> synchronization in the interrupt handler to force exactitude for this
> particular
> case.

The interrupts provide the memory synchronization (if the CPU does not
cause a memory barrier on interrupt, one simply places a memory barrier
in the IPI handler). The trick is that the memory barriers are not
executed except when they are needed. This provides performance benefits
when updates are rare compared to read-side critical sections.

But again, this assumes that the CPU provides precise interrupts.

> > I must admit that I am not completely comfortable with this approach --
> > it needs to be beaten up pretty thoroughly with both testing and
> > theoretical analysis. And there might well be a flaw in my reasoning
> > above. ;-)
> >
>
> I suppose I should do some kind of formal analysis of this. I'm
> figuring out if this technique is interesting enough first, before I
> go through all that work.

Hard to say without some experimentation.

Thanx, Paul

> --
> Joe Seigh
>

2005-05-11 22:45:33

by Joe Seigh

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Wed, 11 May 2005 08:04:54 -0700, Paul E. McKenney <[email protected]>
wrote:

> On Tue, May 10, 2005 at 06:40:20PM -0400, Joe Seigh wrote:


> In classic RCU, the release is supplied by the context switch. In your
> scheme, couldn't you do the following on the update side?
>
> 1. Gather up all the hazard pointers.
> 2. Send IPIs to all other CPUs.
> 3. Check the hazard pointers gathered in #1 against the
>    blocks to be freed.

You need to do the IPIs before you look at the hazard pointers.

>
> The read side would do the following when letting go of a hazard pointer:
>
> 1. Prevent the compiler from reordering memory references
>    (the CPU would still be free to do so).
> 2. Set the hazard pointer to NULL.
> 3. begin non-critical-section code.
>
> Checking where the IPI is received by the read side:
>
> 1. Before this point, the updater would have seen the non-NULL hazard
>    pointer (if the hazard pointer referenced the data item that was
>    previously removed).
> 2. Ditto.
> 3. Before this point, the hazard pointer could be seen as NULL, but
>    the read-side CPU will also have stopped using the pointer (since
>    we are assuming precise interrupts).

The problem is you don't know when the hazard pointer was set to NULL.
It could have been set soon after the IPI interrupt was received, and
any outstanding accesses made since the IPI interrupt aren't synchronized
with respect to setting the hazard pointer to null.

But if you looked at the hazard pointer in the IPI interrupt handler,
you could use that information to decide whether you had to wait an
additional RCU interval. So the updater logic would be

1. Set global pointer to NULL. // make object unreachable
2. Send IPIs to all other CPUs
   (IPI interrupt handler will copy the CPU's hazard pointers)
3. Check objects to be freed against the copied hazard pointers.
4. There is no step 4. Even if an actual hazard pointer that
   pointed to the object is NULL by this point (but not its copy),
   you'd still have to wait an additional RCU interval, so you
   might as well leave it out as redundant.
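
A kernel-flavored sketch of that updater (this_cpu_hazard(),
copies_reference() and defer_free() are stand-ins, and the 2.6-era
four-argument smp_call_function() signature is assumed):

#include <linux/smp.h>
#include <linux/slab.h>

static void *hp_copy[NR_CPUS];

void *this_cpu_hazard(void);     /* stand-in: this CPU's hazard pointer */
int copies_reference(void *obj); /* stand-in: obj found in hp_copy[]?   */
void defer_free(void *obj);      /* stand-in: free after an additional
                                    RCU interval                        */

/* IPI handler: snapshot this CPU's hazard pointer */
static void copy_hazard(void *unused)
{
    hp_copy[smp_processor_id()] = this_cpu_hazard();
}

void updater(void * volatile *global, void *obj)
{
    *global = NULL;                             /* 1. make it unreachable */
    smp_call_function(copy_hazard, NULL, 0, 1); /* 2. IPIs snapshot       */
    copy_hazard(NULL);                          /*    ...ours included    */
    if (copies_reference(obj))                  /* 3. check the copies    */
        defer_free(obj);
    else
        kfree(obj);
}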

This is better. I may try that trick I used to make NPTL condvars
faster to see if I can keep the Linux user-space version of this from
tanking. It uses unix signals instead of IPIs.



>
> Again, not sure if all CPUs support precise interrupts. The ones that I
> am familiar with do, at least for IPIs.
>
>> Additionally, if you replace any non-NULL hazard pointer value, you
>> will need to use release semantics.
>
> The trick is that the IPI provides the release semantics, but only
> when needed. Right?
>
>> There might be something you can do to avoid the extra RCU wait but
>> I'd have to study it a little more to get a better handle on the
>> trade offs.
>
> True, there will need to be either two RCU waits or two rounds of IPIs.

Yes, it might better be called RCU+SMR+RCU in that case.


>>
>> I suppose I should do some kind of formal analysis of this. I'm
>> figuring out if this technique is interesting enough first, before I
>> go through all that work.
>
> Hard to say without some experimentation.


I've done plenty of that. I have some atomically thread-safe reference
counting implementations, and a proxy GC based on those, which I compare
to an RCU-for-user-threads implementation. Using lock-free in user space
gives you much more dramatic performance improvements than in the kernel.
It cuts down on unnecessary context switching, which can slow things down
considerably. Also, mutexes and rwlocks are prone to starvation. Making
them FIFO for guaranteed service order slows them down even further.

I usually use a semaphore to keep the updaters from running out of
resources. It slows down updater throughput, but then I'm more concerned
with reader throughput. If I want faster recovery of resources, I'll use
the atomic refcounted pointer or the proxy based on it. Slightly slower
updater performance, but resources are recovered more quickly.
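
In POSIX terms, the throttle is just the following (queue_deferred()
and the backlog limit being illustrative):

#include <semaphore.h>

#define MAX_PENDING 128     /* illustrative backlog limit */

static sem_t free_slots;    /* sem_init(&free_slots, 0, MAX_PENDING) */

void queue_deferred(void *obj); /* stand-in: deferred-free queue */

/* Updater side: block when too many retired objects are pending. */
void retire_throttled(void *obj)
{
    sem_wait(&free_slots);
    queue_deferred(obj);
}

/* The reclamation side does sem_post(&free_slots) per object freed. */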

--
Joe Seigh

2005-05-12 01:36:28

by Joe Seigh

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Wed, 11 May 2005 17:47:52 -0400, Joe Seigh <[email protected]> wrote:

>
> But if you looked at the hazard pointer in the IPI interrupt handler,
> you could use that information to decide whether you had to wait an
> additional RCU interval. So the updater logic would be
>
> 1. Set global pointer to NULL. // make object unreachable
> 2. Send IPIs to all other CPUs
>    (IPI interrupt handler will copy the CPU's hazard pointers)
> 3. Check objects to be freed against the copied hazard pointers.
> 4. There is no step 4. Even if an actual hazard pointer that
>    pointed to the object is NULL by this point (but not its copy),
>    you'd still have to wait an additional RCU interval, so you
>    might as well leave it out as redundant.
>
> This is better. I may try that trick I used to make NPTL condvars
> faster to see if I can keep the Linux user-space version of this from
> tanking. It uses unix signals instead of IPIs.
>

I should add that this is pretty close to deferred reference counting --
not the reference counting bit, but the way it differentiates between
local references and shared global references, like RCU does. It lets
them avoid having to stop the world to GC; they just stop one thread at
a time to examine its stack.

--
Joe Seigh

2005-05-12 01:57:14

by Paul E. McKenney

Subject: Re: RCU + SMR for preemptive kernel/user threads.

On Wed, May 11, 2005 at 05:47:52PM -0400, Joe Seigh wrote:
> On Wed, 11 May 2005 08:04:54 -0700, Paul E. McKenney <[email protected]>
> wrote:
>
> > On Tue, May 10, 2005 at 06:40:20PM -0400, Joe Seigh wrote:
>
>
> > In classic RCU, the release is supplied by the context switch. In your
> > scheme, couldn't you do the following on the update side?
> >
> > 1. Gather up all the hazard pointers.
> > 2. Send IPIs to all other CPUs.
> > 3. Check the hazard pointers gathered in #1 against the
> >    blocks to be freed.
>
> You need to do the IPIs before you look at the hazard pointers.

I do both -- once after removing the data item and before gathering up
all the hazard pointers (see previous email) and once after gathering
up the hazard pointers but before checking them. The combined update
procedure is thus:

1. Remove data item.
2. Send IPIs to all other CPUs.
3. Gather up all the hazard pointers.
4. Send IPIs to all other CPUs.
5. Check the hazard pointers gathered in #3 against
   the data items to be freed.
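
As a sketch (gather_hazard_pointers() and gathered_references() are
stand-ins, and the 2.6-era smp_call_function() signature is assumed):

#include <linux/smp.h>
#include <linux/slab.h>

void gather_hazard_pointers(void);  /* stand-in: snapshot every slot   */
int gathered_references(void *obj); /* stand-in: obj in the snapshot?  */

static void mb_ipi(void *unused)
{
    smp_mb();
}

void combined_update(void * volatile *global, void *obj)
{
    *global = NULL;                        /* 1. remove data item     */
    smp_call_function(mb_ipi, NULL, 0, 1); /* 2. first round of IPIs  */
    gather_hazard_pointers();              /* 3. snapshot every slot  */
    smp_call_function(mb_ipi, NULL, 0, 1); /* 4. second round of IPIs */
    if (!gathered_references(obj))         /* 5. check the snapshot   */
        kfree(obj);
}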

> > The read side would do the following when letting go of a hazard pointer:
> >
> > 1. Prevent the compiler from reordering memory references
> >    (the CPU would still be free to do so).
> > 2. Set the hazard pointer to NULL.
> > 3. begin non-critical-section code.
> >
> > Checking where the IPI is received by the read side:
> >
> > 1. Before this point, the updater would have seen the non-NULL hazard
> >    pointer (if the hazard pointer referenced the data item that was
> >    previously removed).
> > 2. Ditto.
> > 3. Before this point, the hazard pointer could be seen as NULL, but
> >    the read-side CPU will also have stopped using the pointer (since
> >    we are assuming precise interrupts).
>
> The problem is you don't know when the hazard pointer was set to NULL.
> It could have been set soon after the IPI interrupt was received, and
> any outstanding accesses made since the IPI interrupt aren't synchronized
> with respect to setting the hazard pointer to null.

If the hazard pointer was set to NULL after the IPI was received, then
the prior gather would have seen the hazard pointer as non-NULL, right?
This would prevent the data item from being freed, as desired in this
case.

> But if you looked at the hazard pointer in the IPI interrupt handler,
> you could use that information to decide whether you had to wait an
> additional RCU interval. So the updater logic would be
>
> 1. Set global pointer to NULL. // make object unreachable
> 2. Send IPIs to all other CPUs
>    (IPI interrupt handler will copy the CPU's hazard pointers)
> 3. Check objects to be freed against the copied hazard pointers.
> 4. There is no step 4. Even if an actual hazard pointer that
>    pointed to the object is NULL by this point (but not its copy),
>    you'd still have to wait an additional RCU interval, so you
>    might as well leave it out as redundant.
>
> This is better. I may try that trick I used to make NPTL condvars
> faster to see if I can keep the Linux user-space version of this from
> tanking. It uses unix signals instead of IPIs.

Need to think about this one -- it does seem like copying the hazard
pointers in the IPI should reduce the amount of synchronization required.

> > Again, not sure if all CPUs support precise interrupts. The ones that I
> > am familiar with do, at least for IPIs.
> >
> >> Additionally, if you replace any non-NULL hazard pointer value, you
> >> will need to use release semantics.
> >
> > The trick is that the IPI provides the release semantics, but only
> > when needed. Right?
> >
> >> There might be something you can do to avoid the extra RCU wait but
> >> I'd have to study it a little more to get a better handle on the
> >> trade offs.
> >
> > True, there will need to be either two RCU waits or two rounds of IPIs.
>
> Yes, it might better be called RCU+SMR+RCU in that case.
>
>
> >>
> >> I suppose I should do some kind of formal analysis of this. I'm
> >> figuring out if this technique is interesting enough first, before I
> >> go through all that work.
> >
> > Hard to say without some experimentation.
>
>
> I've done plenty of that. I have some atomically thread-safe reference
> counting implementations, and a proxy GC based on those, which I compare
> to an RCU-for-user-threads implementation. Using lock-free in user space
> gives you much more dramatic performance improvements than in the kernel.
> It cuts down on unnecessary context switching, which can slow things down
> considerably. Also, mutexes and rwlocks are prone to starvation. Making
> them FIFO for guaranteed service order slows them down even further.

True, user-level synchronization is often quite a bit more expensive
than in-kernel synchronization, so more beneficial to avoid.

> I usually use a semaphore to keep the updaters from running out of
> resources. It slows down updater throughput, but then I'm more concerned
> with reader throughput. If I want faster recovery of resources, I'll use
> the atomic refcounted pointer or the proxy based on it. Slightly slower
> updater performance, but resources are recovered more quickly.
>
> --
> Joe Seigh

Thanx, Paul