In the article "The design of preemptible read-copy-update":
http://lwn.net/Articles/253651
Paul McKenney explains why the number of grace periods before executing
callbacks is set to 2:
#define GP_STAGES 2
The reasoning includes the following statements:
"Note that because rcu_read_lock() does not contain any memory barriers, the
contents of the critical section might be executed early by the CPU"
and:
"However, because rcu_read_unlock() contains no memory barriers, the contents of
the corresponding RCU read-side critical section (possibly including a reference
to the item deleted by CPU 0) can be executed late by CPU 1"
But on some architectures (IA-32, Intel 64, SPARC TSO) acquire and release
semantics are implied by every load and store (read: they cost nothing), so
isn't it possible to reduce the number of grace periods required before
executing callbacks on these architectures?
I.e. something like:
#ifdef ACQUIRE_RELEASE_FENCES_ARE_IMPLIED_ON_ARCH // defined for x86 etc
#define GP_STAGES 1
#else
#define GP_STAGES 2
#endif
Has anyone considered such a variant? Is it worth doing?
Thank you.
--
Best regards,
Dmitriy V'jukov
On Wed, Mar 11, 2009 at 10:58:41AM +0000, Dmitriy V'jukov wrote:
> In the article "The design of preemptible read-copy-update":
> http://lwn.net/Articles/253651
>
> Paul McKenney explains why the number of grace periods before executing
> callbacks is set to 2:
> #define GP_STAGES 2
>
> The reasoning includes the following statements:
> "Note that because rcu_read_lock() does not contain any memory barriers, the
> contents of the critical section might be executed early by the CPU"
> and:
> "However, because rcu_read_unlock() contains no memory barriers, the contents of
> the corresponding RCU read-side critical section (possibly including a reference
> to the item deleted by CPU 0) can be executed late by CPU 1"
>
> But on some architectures (IA-32, Intel 64, SPARC TSO) acquire and release
> semantics are implied by every load and store (read: they cost nothing), so
> isn't it possible to reduce the number of grace periods required before
> executing callbacks on these architectures?
> I.e. something like:
> #ifdef ACQUIRE_RELEASE_FENCES_ARE_IMPLIED_ON_ARCH // defined for x86 etc
> #define GP_STAGES 1
> #else
> #define GP_STAGES 2
> #endif
> Has anyone considered such a variant? Is it worth doing?
> Thank you.
Interesting thought -- but please keep in mind that acquire/release fences
still allow subsequent loads to be reordered to precede earlier stores.
This means that the first loads in the RCU critical section could be
reordered to precede the final store of the rcu_read_lock() primitive.
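
To make the reordering concrete, here is a minimal sketch of a toy model,
not kernel code. It assumes a single CPU with a one-entry FIFO store buffer,
which is a simplification of how TSO machines let a later load complete
before an earlier store becomes visible to other CPUs. All names (cpu_store,
cpu_load, run_scenario, FLIPCTR, GPTR) are hypothetical and chosen only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared locations: an RCU-like counter and a protected item. */
enum { FLIPCTR, GPTR, NR_LOCS };

/* Toy CPU with a one-entry store buffer in front of shared memory. */
struct cpu {
    bool pending;   /* is a store still waiting in the buffer? */
    int  addr;      /* location of the buffered store */
    int  val;       /* value of the buffered store */
};

static int memory[NR_LOCS];   /* what other CPUs can observe */

/* Make the buffered store (if any) visible in shared memory. */
static void cpu_drain(struct cpu *c)
{
    if (c->pending) {
        memory[c->addr] = c->val;
        c->pending = false;
    }
}

/* A store is buffered first; other CPUs do not see it until it drains. */
static void cpu_store(struct cpu *c, int addr, int val)
{
    assert(addr >= 0 && addr < NR_LOCS);
    cpu_drain(c);             /* keep stores in FIFO order */
    c->pending = true;
    c->addr = addr;
    c->val = val;
}

/* A load forwards from the CPU's own buffer, otherwise reads memory. */
static int cpu_load(const struct cpu *c, int addr)
{
    assert(addr >= 0 && addr < NR_LOCS);
    if (c->pending && c->addr == addr)
        return c->val;
    return memory[addr];
}

struct outcome {
    int reader_saw_item;        /* value the reader loaded from GPTR */
    int counter_seen_by_others; /* FLIPCTR in memory at that moment */
    int counter_after_drain;    /* FLIPCTR once the buffer has drained */
};

/* Reader: store to the counter (the "lock"), then load the protected item. */
struct outcome run_scenario(void)
{
    struct cpu reader = { false, 0, 0 };
    struct outcome o;

    memory[FLIPCTR] = 0;
    memory[GPTR] = 1;                        /* the old item is still published */

    cpu_store(&reader, FLIPCTR, 1);          /* final store of the lock */
    o.reader_saw_item = cpu_load(&reader, GPTR);  /* first load of the section */
    o.counter_seen_by_others = memory[FLIPCTR];   /* store not yet visible */
    cpu_drain(&reader);
    o.counter_after_drain = memory[FLIPCTR];
    return o;
}
```

In this model the reader has already loaded the item while every other CPU
still sees the counter as 0, which is the "load precedes earlier store"
outcome described above. Real hardware is more complicated; the sketch only
shows why implied acquire/release ordering alone does not rule this out.
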
My guess is there would be some resistance to the new #define, but if
there were enough uses, perhaps such resistance could be overcome.
So, have you tried running this through Relacy? If so, what happened?
Thanx, Paul
Paul E. McKenney <paulmck <at> linux.vnet.ibm.com> writes:
>
> Interesting thought -- but please keep in mind that acquire/release fences
> still allow subsequent loads to be reordered to precede earlier stores.
> This means that the first loads in the RCU critical section could be
> reordered to precede the final store of the rcu_read_lock() primitive.
Hmmm... Yes, I missed that point. I think you are right. The critical
synchronizing action of __rcu_read_lock() is the store to
__get_cpu_var(rcu_flipctr) (not a load!), so some code from the critical
section can be hoisted above that store.
However, a good question is how many grace periods are required if code can
be hoisted above read_lock() but cannot sink below read_unlock().
That is a good job for Relacy :) No, I have not tried applying it yet...
--
Best regards,
Dmitriy V'jukov