2005-12-21 18:37:45

by Joe Seigh

[permalink] [raw]
Subject: rcuref optimization

You can get rid of the requirement for atomic_inc_not_zero logic
if you use the logic I first proposed here in c.l.c++.m.
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=3E7C83DD.B126DE24%40xemaps.com
for weakptrs where the same kind of logic was required for the strong count.
This will allow you to use fetch_inc (e.g. LOCK INC on x86) instead of compare
and swap logic which might be more efficient on some processors. You might
even be able to get rid of the the "unincrement" if you are pretty sure the
maximum number of increments won't put the refcount to zero.

Summary for those who can't follow the link. Basically, if you decrement the
refcount to zero, you attempt to set the refcount to the minimum signed value
(e.g. 0x80000000 for 32 bits). If successful you can schedule the object
for deallocation using RCU. If unsuccessful, some other thread has incremented
the refcount and object is still in use and even deallocated by some other thread.
Incrementing of the refcount is only considered successful if the result is greater
than zero. If less than zero, object is being scheduled for deallocation.

--
Joe Seigh


2005-12-22 09:00:12

by Nick Piggin

[permalink] [raw]
Subject: Re: rcuref optimization

Joe Seigh wrote:
> You can get rid of the requirement for atomic_inc_not_zero logic
> if you use the logic I first proposed here in c.l.c++.m.
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=3E7C83DD.B126DE24%40xemaps.com
>
> for weakptrs where the same kind of logic was required for the strong
> count.
> This will allow you to use fetch_inc (e.g. LOCK INC on x86) instead of
> compare
> and swap logic which might be more efficient on some processors. You might
> even be able to get rid of the the "unincrement" if you are pretty sure the
> maximum number of increments won't put the refcount to zero.
>
> Summary for those who can't follow the link. Basically, if you
> decrement the
> refcount to zero, you attempt to set the refcount to the minimum signed
> value
> (e.g. 0x80000000 for 32 bits). If successful you can schedule the object
> for deallocation using RCU. If unsuccessful, some other thread has
> incremented
> the refcount and object is still in use and even deallocated by some
> other thread.
> Incrementing of the refcount is only considered successful if the result
> is greater
> than zero. If less than zero, object is being scheduled for deallocation.
>

Clever idea.

I don't know... atomic_inc_not_zero is implemented very easily on the
many architectures without SMP, and I think it *could* be implemented
very nicely on ll/sc based architectures without using cmpxchg.

Lastly, your InterlockedIncrement and InterlockedDecrement are not
actually atomic_inc (LOCK INC), but atomic_inc_return (XADD). Another
primitive like atomic_inc_return_negative or something could be added
to take advantages of status flags and use LOCK INC, but this will
probably not be worthwhile for any architecture other than i386/x86-64
(ie. it will be plain worse on most ll/sc and UP-only architectures
once they get around to implementing atomic_inc_not_zero properly)

Also, the extra logic and atomic op in the decrement-to-zero case
takes a bit of shine off it even for i386. I'd say we should stick
to what we have unless we see some really compelling numbers.

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-12-22 16:15:42

by Joe Seigh

[permalink] [raw]
Subject: Re: rcuref optimization

Nick Piggin wrote:
> Joe Seigh wrote:
>
>> You can get rid of the requirement for atomic_inc_not_zero logic
>> if you use the logic I first proposed here in c.l.c++.m.
>> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=3E7C83DD.B126DE24%40xemaps.com
>>
>> for weakptrs where the same kind of logic was required for the strong
>> count.
>> This will allow you to use fetch_inc (e.g. LOCK INC on x86) instead of
>> compare
>> and swap logic which might be more efficient on some processors. You
>> might
>> even be able to get rid of the the "unincrement" if you are pretty
>> sure the
>> maximum number of increments won't put the refcount to zero.
>>
>
> Clever idea.
>
> I don't know... atomic_inc_not_zero is implemented very easily on the
> many architectures without SMP, and I think it *could* be implemented
> very nicely on ll/sc based architectures without using cmpxchg.
>
> Lastly, your InterlockedIncrement and InterlockedDecrement are not
> actually atomic_inc (LOCK INC), but atomic_inc_return (XADD). Another
> primitive like atomic_inc_return_negative or something could be added
> to take advantages of status flags and use LOCK INC, but this will
> probably not be worthwhile for any architecture other than i386/x86-64
> (ie. it will be plain worse on most ll/sc and UP-only architectures
> once they get around to implementing atomic_inc_not_zero properly)

In other words, if you *don't* implement it, it will be plain worse
on architectures that do have native atomic_inc. (sorry, couldn't resist :) )
>
> Also, the extra logic and atomic op in the decrement-to-zero case
> takes a bit of shine off it even for i386. I'd say we should stick
> to what we have unless we see some really compelling numbers.

The extra atomic op should be on a somewhat infrequently executed path
if you're in a typical readers/writers situation, which I think is the
case. Probably RCU lookups that return by reference rather than by value.

On the read side, the atomic increment might make a difference with less
conditional branches needed but it probably matters more what the overall
contribution to performance is. If the refcount increment is only 1% of
overall cost then even if an atomic increment is 2 or 3 times faster than
the present scheme, it probably wouldn't matter that much.

That said, the refcount can be considered an opaque data type and if you
don't hard code the implementation into your api naming scheme, you can
use the present implementation and alter some platform implementations
later if it did look like there was an advantage. E.g., go back to
rcuref_inc/dec and have them return success/fail instead of the value
of the refcount so you can keep it as an opaque data type.

I may even go add GC guarded refcounting to my atomic-ptr-plus project.
I hadn't done it before since I didn't have a good RCU for preemptive
user threads implementation and besides I already had atomically threadsafe
refcounted smart pointers that didn't depend on some other form of GC.
But it occurs to me that it will work fine with the proxy GC and with
the RCU+SMR where you might want to keep even longer term references
than would be feasible to use a hazard pointer for.


--
Joe Seigh