[PATCH] cpufreq: make the transition_notifier chain use SRCU
(b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9)
breaks cpu frequency notification users, which register the callback on
core_init level. Interestingly enough the registration survives the
uninitialized head, but the registered user is lost by:
static int __init init_cpufreq_transition_notifier_list(void)
{
srcu_init_notifier_head(&cpufreq_transition_notifier_list);
return 0;
}
core_initcall(init_cpufreq_transition_notifier_list);
This affects i386, x86_64 and sparc64 AFAICT, which call
register_notifier early in the arch code.
> The head of the notifier chain needs to be initialized before use;
> this is done by an __init routine at core_initcall time. If this turns
> out not to be a good choice, it can easily be changed.
Hmm, there are no static initializers for srcu and the only way to fix
this up is to move the arch calls to postcore_init.
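For illustration, the postcore variant would look roughly like this (all
names below are made up; the real arch hooks are in the patches further
down the thread):

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/cpufreq.h>

/*
 * Hypothetical notification user registered at postcore level:
 * postcore_initcall() runs after every core_initcall, so
 * init_cpufreq_transition_notifier_list() has already set up the
 * SRCU head by the time the registration happens.
 */
static int example_cpufreq_notifier(struct notifier_block *nb,
                                    unsigned long event, void *data)
{
        return NOTIFY_OK;       /* a real user would act on PRE/POSTCHANGE */
}

static struct notifier_block example_cpufreq_nb = {
        .notifier_call  = example_cpufreq_notifier,
};

static int __init example_cpufreq_register(void)
{
        return cpufreq_register_notifier(&example_cpufreq_nb,
                                         CPUFREQ_TRANSITION_NOTIFIER);
}
postcore_initcall(example_cpufreq_register);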
tglx
* Thomas Gleixner <[email protected]> wrote:
> [PATCH] cpufreq: make the transition_notifier chain use SRCU
> (b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9)
>
> breaks cpu frequency notification users, which register the callback
> on core_init level. Interestingly enough the registration survives the
> uninitialized head, but the registered user is lost by:
i have hit this bug in -rt (it caused a lockup) and have fixed it -
forgot to send it upstream. Find the patch below.
Ingo
---------------->
From: Ingo Molnar <[email protected]>
Subject: [patch] cpufreq: mark cpufreq_tsc() as core_initcall_sync
init_cpufreq_transition_notifier_list() should execute first, which is a
core_initcall, so mark cpufreq_tsc() core_initcall_sync.
Signed-off-by: Ingo Molnar <[email protected]>
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -138,7 +138,11 @@ static int __init cpufreq_tsc(void)
return 0;
}
-core_initcall(cpufreq_tsc);
+/*
+ * init_cpufreq_transition_notifier_list() should execute first,
+ * which is a core_initcall, so mark this one core_initcall_sync:
+ */
+core_initcall_sync(cpufreq_tsc);
#endif
/*
On Thu, 16 Nov 2006, Thomas Gleixner wrote:
> [PATCH] cpufreq: make the transition_notifier chain use SRCU
> (b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9)
>
> breaks cpu frequency notification users, which register the callback on
> core_init level. Interestingly enough the registration survives the
> uninitialized head, but the registered user is lost by:
>
> static int __init init_cpufreq_transition_notifier_list(void)
> {
> srcu_init_notifier_head(&cpufreq_transition_notifier_list);
> return 0;
> }
> core_initcall(init_cpufreq_transition_notifier_list);
>
> This affects i386, x86_64 and sparc64 AFAICT, which call
> register_notifier early in the arch code.
>
> > The head of the notifier chain needs to be initialized before use;
> > this is done by an __init routine at core_initcall time. If this turns
> > out not to be a good choice, it can easily be changed.
>
> Hmm, there are no static initializers for srcu and the only way to fix
> this up is to move the arch calls to postcore_init.
If you can find a way to invoke init_cpufreq_transition_notifier_list
earlier than core_initcall time, that would be okay. I did it this way
because it was easiest, but earlier should be just as good.
The only requirement is that alloc_percpu() has to be working, so that the
SRCU per-cpu data values can be set up. I don't know how early in the
boot process you can do per-cpu memory allocation.
As an alternative approach, initialization of srcu_notifiers could be
broken up into two pieces, one of which could be done statically. The
part that has to be done dynamically (the SRCU initialization) wouldn't
mess up the notifier chain. Provided the dynamic part is carried out
while the system is still single-threaded, it would be safe.
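For instance, something along these lines (the type and names below are
purely illustrative, not the actual srcu_notifier_head layout):

#include <linux/init.h>
#include <linux/mutex.h>
#include <linux/notifier.h>
#include <linux/srcu.h>

/*
 * Illustrative split: the chain pointer and mutex get a static
 * initializer, so registrations done early are never lost; only the
 * SRCU part still needs a runtime call, made once alloc_percpu()
 * works and while the system is still single-threaded.
 */
struct early_srcu_notifier_head {
        struct mutex mutex;
        struct srcu_struct srcu;        /* must be set up at run time */
        struct notifier_block *head;
};

#define EARLY_SRCU_NOTIFIER_INIT(name) {                \
        .mutex  = __MUTEX_INITIALIZER(name.mutex),      \
        .head   = NULL,                                 \
}

static struct early_srcu_notifier_head example_chain =
        EARLY_SRCU_NOTIFIER_INIT(example_chain);

/* The dynamic piece, done at core_initcall time (or earlier). */
static int __init example_chain_init_srcu(void)
{
        return init_srcu_struct(&example_chain.srcu);
}
core_initcall(example_chain_init_srcu);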
Alan Stern
On Thu, 2006-11-16 at 21:15 +0100, Ingo Molnar wrote:
> From: Ingo Molnar <[email protected]>
> Subject: [patch] cpufreq: mark cpufreq_tsc() as core_initcall_sync
>
> init_cpufreq_transition_notifier_list() should execute first, which is a
> core_initcall, so mark cpufreq_tsc() core_initcall_sync.
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> --- linux.orig/arch/x86_64/kernel/tsc.c
> +++ linux/arch/x86_64/kernel/tsc.c
> @@ -138,7 +138,11 @@ static int __init cpufreq_tsc(void)
> return 0;
> }
Here is the i386/sparc fixup
--------------->
From: Thomas Gleixner <[email protected]>
Subject: [patch] cpufreq: register notifiers after the head init
init_cpufreq_transition_notifier_list() must execute first, which is a
core_initcall, so move the registration to core_initcall_sync.
Signed-off-by: Thomas Gleixner <[email protected]>
diff --git a/arch/i386/kernel/tsc.c b/arch/i386/kernel/tsc.c
index fbc9582..4ebd903 100644
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -315,7 +315,7 @@ static int __init cpufreq_tsc(void)
return ret;
}
-core_initcall(cpufreq_tsc);
+core_initcall_sync(cpufreq_tsc);
#endif
diff --git a/arch/sparc64/kernel/time.c b/arch/sparc64/kernel/time.c
index 061e1b1..c6eadcc 100644
--- a/arch/sparc64/kernel/time.c
+++ b/arch/sparc64/kernel/time.c
@@ -973,6 +973,13 @@ static struct notifier_block sparc64_cpu
.notifier_call = sparc64_cpufreq_notifier
};
+static int __init sparc64_cpu_freq_init(void)
+{
+ return cpufreq_register_notifier(&sparc64_cpufreq_notifier_block,
+ CPUFREQ_TRANSITION_NOTIFIER);
+}
+core_initcall_sync(sparc64_cpu_freq_init);
+
#endif /* CONFIG_CPU_FREQ */
static struct time_interpolator sparc64_cpu_interpolator = {
@@ -999,10 +1006,6 @@ void __init time_init(void)
(((NSEC_PER_SEC << SPARC64_NSEC_PER_CYC_SHIFT) +
(clock / 2)) / clock);
-#ifdef CONFIG_CPU_FREQ
- cpufreq_register_notifier(&sparc64_cpufreq_notifier_block,
- CPUFREQ_TRANSITION_NOTIFIER);
-#endif
}
unsigned long long sched_clock(void)
On Thu, 2006-11-16 at 21:15 +0100, Ingo Molnar wrote:
> From: Ingo Molnar <[email protected]>
> Subject: [patch] cpufreq: mark cpufreq_tsc() as core_initcall_sync
>
> init_cpufreq_transition_notifier_list() should execute first, which is a
> core_initcall, so mark cpufreq_tsc() core_initcall_sync.
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> --- linux.orig/arch/x86_64/kernel/tsc.c
> +++ linux/arch/x86_64/kernel/tsc.c
it seems to want to be arch/x86_64/kernel/time.c
> @@ -138,7 +138,11 @@ static int __init cpufreq_tsc(void)
> return 0;
> }
>
> -core_initcall(cpufreq_tsc);
> +/*
> + * init_cpufreq_transition_notifier_list() should execute first,
> + * which is a core_initcall, so mark this one core_initcall_sync:
> + */
> +core_initcall_sync(cpufreq_tsc);
>
> #endif
> /*
On Thu, 2006-11-16 at 15:27 -0500, Alan Stern wrote:
> > Hmm, there are no static initializers for srcu and the only way to fix
> > this up is to move the arch calls to postcore_init.
>
> If you can find a way to invoke init_cpufreq_transition_notifier_list
> earlier than core_initcall time, that would be okay. I did it this way
> because it was easiest, but earlier should be just as good.
>
> The only requirement is that alloc_percpu() has to be working, so that the
> SRCU per-cpu data values can be set up. I don't know how early in the
> boot process you can do per-cpu memory allocation.
>
> As an alternative approach, initialization of srcu_notifiers could be
> broken up into two pieces, one of which could be done statically. The
> part that has to be done dynamically (the SRCU initialization) wouldn't
> mess up the notifier chain. Provided the dynamic part is carried out
> while the system is still single-threaded, it would be safe.
There is another issue with this SRCU change:
The notification actually comes after the real change, which is bad. We
try to make the TSC usable by backing it with pm_timer across such
states, but this behaviour breaks the safety code.
tglx
On Thu, 16 Nov 2006 21:15:32 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Thomas Gleixner <[email protected]> wrote:
>
> > [PATCH] cpufreq: make the transition_notifier chain use SRCU
> > (b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9)
> >
> > breaks cpu frequency notification users, which register the callback
> > on core_init level. Interestingly enough the registration survives the
> > uninitialized head, but the registered user is lost by:
>
> i have hit this bug in -rt (it caused a lockup) and have fixed it -
> forgot to send it upstream. Find the patch below.
>
> Ingo
>
> ---------------->
> From: Ingo Molnar <[email protected]>
> Subject: [patch] cpufreq: mark cpufreq_tsc() as core_initcall_sync
>
> init_cpufreq_transition_notifier_list() should execute first, which is a
> core_initcall, so mark cpufreq_tsc() core_initcall_sync.
That's not a terribly useful changelog. What bug is being fixed? What
does "first" mean?
> Signed-off-by: Ingo Molnar <[email protected]>
>
> --- linux.orig/arch/x86_64/kernel/tsc.c
> +++ linux/arch/x86_64/kernel/tsc.c
> @@ -138,7 +138,11 @@ static int __init cpufreq_tsc(void)
> return 0;
> }
>
> -core_initcall(cpufreq_tsc);
> +/*
> + * init_cpufreq_transition_notifier_list() should execute first,
> + * which is a core_initcall, so mark this one core_initcall_sync:
> + */
> +core_initcall_sync(cpufreq_tsc);
Would prefer that we not use the _sync levels. They're there as a
synchronisation for MULTITHREAD_PROBE and might disappear at any time.
On Thu, 2006-11-16 at 13:20 -0800, Andrew Morton wrote:
> >
> > -core_initcall(cpufreq_tsc);
> > +/*
> > + * init_cpufreq_transition_notifier_list() should execute first,
> > + * which is a core_initcall, so mark this one core_initcall_sync:
> > + */
> > +core_initcall_sync(cpufreq_tsc);
>
> Would prefer that we not use the _sync levels. They're there as a
> synchronisation for MULTITHREAD_PROBE and might disappear at any time.
Works also fine with postcore_init, but I'm more concerned about the
delayed notification (after the actual change) which is introduced by
this srcu change.
tglx
On Thu, 16 Nov 2006, Thomas Gleixner wrote:
> There is another issue with this SRCU change:
>
> The notification actually comes after the real change, which is bad. We
> try to make the TSC usable by backing it with pm_timer across such
> states, but this behaviour breaks the safety code.
I don't understand. Sending notifications is completely separate from
setting up the notifier chain's head. The patch you mentioned didn't
touch the code that sends the notifications.
Alan Stern
On Thu, 2006-11-16 at 16:26 -0500, Alan Stern wrote:
> On Thu, 16 Nov 2006, Thomas Gleixner wrote:
>
> > There is another issue with this SRCU change:
> >
> > The notification actually comes after the real change, which is bad. We
> > try to make the TSC usable by backing it with pm_timer across such
> > states, but this behaviour breaks the safety code.
>
> I don't understand. Sending notifications is completely separate from
> setting up the notifier chain's head. The patch you mentioned didn't
> touch the code that sends the notifications.
Yeah, my bad. It just uses RCU-based locking, but it's still synchronous.
I have to dig deeper, why the change of the frequency happens _before_
the notifier arrives.
tglx
On Thu, 16 Nov 2006, Thomas Gleixner wrote:
>
> Here is the i386/sparc fixup
Gag me with a volvo.
This is disgusting, but I would actually prefer the following version over
the patches I've seen, because
- it doesn't end up having any architecture-specific parts
- it doesn't use the new "xxx_sync()" thing that I'm not even sure we
should be using.
- it makes it clear that this should be fixed, preferably by just having
some way to initialize SRCU structs statically. If we get that, the fix
is to just replace the horrible "initialize by hand" with a static
initializer once and for all.
Hmm?
Totally untested, but it compiles and it _looks_ sane. The overhead of the
function call should be minimal, once things are initialized.
Paul, it would be _really_ nice to have some way to just initialize that
SRCU thing statically. This kind of crud is just crazy.
Comments?
Linus
----
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 86e69b7..02326b2 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -52,14 +52,39 @@ static void handle_update(void *data);
* The mutex locks both lists.
*/
static BLOCKING_NOTIFIER_HEAD(cpufreq_policy_notifier_list);
-static struct srcu_notifier_head cpufreq_transition_notifier_list;
-static int __init init_cpufreq_transition_notifier_list(void)
+/*
+ * This is horribly horribly ugly.
+ *
+ * We really want to initialize the transition notifier list
+ * statically and just once, but there is no static way to
+ * initialize a srcu lock, so we instead make up all this nasty
+ * infrastructure to make sure it's initialized when we use it.
+ *
+ * Bleaargh.
+ */
+static struct srcu_notifier_head *cpufreq_transition_notifier_list(void)
{
- srcu_init_notifier_head(&cpufreq_transition_notifier_list);
- return 0;
+ static struct srcu_notifier_head *initialized;
+ struct srcu_notifier_head *ret;
+
+ ret = initialized;
+ if (!ret) {
+ static DEFINE_MUTEX(init_lock);
+
+ mutex_lock(&init_lock);
+ ret = initialized;
+ if (!ret) {
+ static struct srcu_notifier_head list_head;
+ ret = &list_head;
+ srcu_init_notifier_head(ret);
+ smp_wmb();
+ initialized = ret;
+ }
+ mutex_unlock(&init_lock);
+ }
+ return ret;
}
-core_initcall(init_cpufreq_transition_notifier_list);
static LIST_HEAD(cpufreq_governor_list);
static DEFINE_MUTEX (cpufreq_governor_mutex);
@@ -268,14 +293,14 @@ void cpufreq_notify_transition(struct cp
freqs->old = policy->cur;
}
}
- srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
+ srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
CPUFREQ_PRECHANGE, freqs);
adjust_jiffies(CPUFREQ_PRECHANGE, freqs);
break;
case CPUFREQ_POSTCHANGE:
adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
- srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
+ srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
policy->cur = freqs->new;
@@ -1055,7 +1080,7 @@ static int cpufreq_suspend(struct sys_de
freqs.old = cpu_policy->cur;
freqs.new = cur_freq;
- srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
+ srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
CPUFREQ_SUSPENDCHANGE, &freqs);
adjust_jiffies(CPUFREQ_SUSPENDCHANGE, &freqs);
@@ -1137,7 +1162,7 @@ static int cpufreq_resume(struct sys_dev
freqs.new = cur_freq;
srcu_notifier_call_chain(
- &cpufreq_transition_notifier_list,
+ cpufreq_transition_notifier_list(),
CPUFREQ_RESUMECHANGE, &freqs);
adjust_jiffies(CPUFREQ_RESUMECHANGE, &freqs);
@@ -1183,7 +1208,7 @@ int cpufreq_register_notifier(struct not
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
ret = srcu_notifier_chain_register(
- &cpufreq_transition_notifier_list, nb);
+ cpufreq_transition_notifier_list(), nb);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_register(
@@ -1215,7 +1240,7 @@ int cpufreq_unregister_notifier(struct n
switch (list) {
case CPUFREQ_TRANSITION_NOTIFIER:
ret = srcu_notifier_chain_unregister(
- &cpufreq_transition_notifier_list, nb);
+ cpufreq_transition_notifier_list(), nb);
break;
case CPUFREQ_POLICY_NOTIFIER:
ret = blocking_notifier_chain_unregister(
On Thu, 16 Nov 2006, Thomas Gleixner wrote:
> On Thu, 2006-11-16 at 16:26 -0500, Alan Stern wrote:
> > On Thu, 16 Nov 2006, Thomas Gleixner wrote:
> >
> > > There is another issue with this SRCU change:
> > >
> > > The notification actually comes after the real change, which is bad. We
> > > try to make the TSC usable by backing it with pm_timer across such
> > > states, but this behaviour breaks the safety code.
> >
> > I don't understand. Sending notifications is completely separate from
> > setting up the notifier chain's head. The patch you mentioned didn't
> > touch the code that sends the notifications.
>
> Yeah, my bad. It just uses RCU-based locking, but it's still synchronous.
>
> I have to dig deeper, why the change of the frequency happens _before_
> the notifier arrives.
There are supposed to be _two_ notifier calls: one before the frequency
change and one after. Check the callers of cpufreq_notify_transition().
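They follow this general shape (sketch only -- the names here are
invented and real drivers differ in detail):

#include <linux/cpufreq.h>

/*
 * Sketch of a cpufreq driver's frequency-change path: one notification
 * before touching the hardware, one after.
 */
static int example_set_target(struct cpufreq_policy *policy,
                              unsigned int new_khz)
{
        struct cpufreq_freqs freqs;

        freqs.cpu = policy->cpu;
        freqs.old = policy->cur;
        freqs.new = new_khz;

        cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
        /* ... program the hardware to new_khz here ... */
        cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);

        return 0;
}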
Alan Stern
On Thu, 16 Nov 2006, Linus Torvalds wrote:
> - it makes it clear that this should be fixed, preferably by just having
> some way to initialize SRCU structs statically. If we get that, the fix
> is to just replace the horrible "initialize by hand" with a static
> initializer once and for all.
>
> Hmm?
>
> Totally untested, but it compiles and it _looks_ sane. The overhead of the
> function call should be minimal, once things are initialized.
>
> Paul, it would be _really_ nice to have some way to just initialize that
> SRCU thing statically. This kind of crud is just crazy.
I looked into this back when SRCU was first added. It's essentially
impossible to do it, because the per-cpu memory allocation & usage APIs
are completely different for the static and the dynamic cases. They are a
real mess. I couldn't think up a way to construct any sort of uniform
interface to per-cpu memory, not without completely changing the guts of
the per-cpu stuff.
If you or someone else can fix that problem, I will be happy to change the
SRCU-based notifiers to work both ways.
Alan Stern
On Thu, 16 Nov 2006, Alan Stern wrote:
> On Thu, 16 Nov 2006, Linus Torvalds wrote:
> >
> > Paul, it would be _really_ nice to have some way to just initialize
> > that SRCU thing statically. This kind of crud is just crazy.
>
> I looked into this back when SRCU was first added. It's essentially
> impossible to do it, because the per-cpu memory allocation & usage APIs
> are completely different for the static and the dynamic cases.
I don't think that's how you'd want to do it.
There's no way to do an initialization of a percpu allocation statically.
That's pretty obvious.
What I'd suggest instead, is to make the allocation dynamic, and make it
inside the srcu functions (kind of like I did now, but I did it at a
higher level).
Doing it at the high level was trivial right now, but we may well end up
hitting this problem again if people start using SRCU more. Right now I
suspect the cpufreq notifier is the only thing that uses SRCU, and it
already showed this problem with SRCU initializers.
So I was more thinking about moving my "one special case high level hack"
down lower, down to the SRCU level, so that we'll never see _more_ of
those horrible hacks. We'll still have the hacky thing, but at least it
will be limited to a single place - the SRCU code itself.
Linus
On Thu, Nov 16, 2006 at 01:47:48PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 16 Nov 2006, Thomas Gleixner wrote:
> >
> > Here is the i386/sparc fixup
>
> Gag me with a volvo.
No can do -- my wife drives a Ford and my car is a bicycle.
> This is disgusting, but I would actually prefer the following version over
> the patches I've seen, because
>
> - it doesn't end up having any architecture-specific parts
>
> - it doesn't use the new "xxx_sync()" thing that I'm not even sure we
> should be using.
>
> - it makes it clear that this should be fixed, preferably by just having
> some way to initialize SRCU structs statically. If we get that, the fix
> is to just replace the horrible "initialize by hand" with a static
> initializer once and for all.
>
> Hmm?
>
> Totally untested, but it compiles and it _looks_ sane. The overhead of the
> function call should be minimal, once things are initialized.
>
> Paul, it would be _really_ nice to have some way to just initialize that
> SRCU thing statically. This kind of crud is just crazy.
Static initialization is a bit of a tarpit for SRCU. Before this week,
I would have protested bitterly over the overhead of a dynamic runtime
check, but Jens is running into another issue that looks to require a
bit more read-side overhead as well (synchronize_srcu() is too expensive
for his situation). Now if I can get one of the local weak-memory model
torture-chamber boxes to deal with a recent kernel...
Hardware whines aside, shouldn't be too hard. Will put something
together...
Thanx, Paul
> Comments?
>
> Linus
>
> ----
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 86e69b7..02326b2 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -52,14 +52,39 @@ static void handle_update(void *data);
> * The mutex locks both lists.
> */
> static BLOCKING_NOTIFIER_HEAD(cpufreq_policy_notifier_list);
> -static struct srcu_notifier_head cpufreq_transition_notifier_list;
>
> -static int __init init_cpufreq_transition_notifier_list(void)
> +/*
> + * This is horribly horribly ugly.
> + *
> + * We really want to initialize the transition notifier list
> + * statically and just once, but there is no static way to
> + * initialize a srcu lock, so we instead make up all this nasty
> + * infrastructure to make sure it's initialized when we use it.
> + *
> + * Bleaargh.
> + */
> +static struct srcu_notifier_head *cpufreq_transition_notifier_list(void)
> {
> - srcu_init_notifier_head(&cpufreq_transition_notifier_list);
> - return 0;
> + static struct srcu_notifier_head *initialized;
> + struct srcu_notifier_head *ret;
> +
> + ret = initialized;
> + if (!ret) {
> + static DEFINE_MUTEX(init_lock);
> +
> + mutex_lock(&init_lock);
> + ret = initialized;
> + if (!ret) {
> + static struct srcu_notifier_head list_head;
> + ret = &list_head;
> + srcu_init_notifier_head(ret);
> + smp_wmb();
> + initialized = ret;
> + }
> + mutex_unlock(&init_lock);
> + }
> + return ret;
> }
> -core_initcall(init_cpufreq_transition_notifier_list);
>
> static LIST_HEAD(cpufreq_governor_list);
> static DEFINE_MUTEX (cpufreq_governor_mutex);
> @@ -268,14 +293,14 @@ void cpufreq_notify_transition(struct cp
> freqs->old = policy->cur;
> }
> }
> - srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> + srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
> CPUFREQ_PRECHANGE, freqs);
> adjust_jiffies(CPUFREQ_PRECHANGE, freqs);
> break;
>
> case CPUFREQ_POSTCHANGE:
> adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
> - srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> + srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
> CPUFREQ_POSTCHANGE, freqs);
> if (likely(policy) && likely(policy->cpu == freqs->cpu))
> policy->cur = freqs->new;
> @@ -1055,7 +1080,7 @@ static int cpufreq_suspend(struct sys_de
> freqs.old = cpu_policy->cur;
> freqs.new = cur_freq;
>
> - srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> + srcu_notifier_call_chain(cpufreq_transition_notifier_list(),
> CPUFREQ_SUSPENDCHANGE, &freqs);
> adjust_jiffies(CPUFREQ_SUSPENDCHANGE, &freqs);
>
> @@ -1137,7 +1162,7 @@ static int cpufreq_resume(struct sys_dev
> freqs.new = cur_freq;
>
> srcu_notifier_call_chain(
> - &cpufreq_transition_notifier_list,
> + cpufreq_transition_notifier_list(),
> CPUFREQ_RESUMECHANGE, &freqs);
> adjust_jiffies(CPUFREQ_RESUMECHANGE, &freqs);
>
> @@ -1183,7 +1208,7 @@ int cpufreq_register_notifier(struct not
> switch (list) {
> case CPUFREQ_TRANSITION_NOTIFIER:
> ret = srcu_notifier_chain_register(
> - &cpufreq_transition_notifier_list, nb);
> + cpufreq_transition_notifier_list(), nb);
> break;
> case CPUFREQ_POLICY_NOTIFIER:
> ret = blocking_notifier_chain_register(
> @@ -1215,7 +1240,7 @@ int cpufreq_unregister_notifier(struct n
> switch (list) {
> case CPUFREQ_TRANSITION_NOTIFIER:
> ret = srcu_notifier_chain_unregister(
> - &cpufreq_transition_notifier_list, nb);
> + cpufreq_transition_notifier_list(), nb);
> break;
> case CPUFREQ_POLICY_NOTIFIER:
> ret = blocking_notifier_chain_unregister(
On Thu, 16 Nov 2006, Linus Torvalds wrote:
>
>
> On Thu, 16 Nov 2006, Alan Stern wrote:
> > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> > >
> > > Paul, it would be _really_ nice to have some way to just initialize
> > > that SRCU thing statically. This kind of crud is just crazy.
> >
> > I looked into this back when SRCU was first added. It's essentially
> > impossible to do it, because the per-cpu memory allocation & usage APIs
> > are completely different for the static and the dynamic cases.
>
> I don't think that's how you'd want to do it.
>
> There's no way to do an initialization of a percpu allocation statically.
> That's pretty obvious.
Hmmm... What about DEFINE_PER_CPU in include/asm-generic/percpu.h
combined with setup_per_cpu_areas() in init/main.c? So long as you want
all the CPUs to start with the same initial values, it should work.
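Something like this, that is (variable and function names made up; it
assumes the two-counter srcu_struct_array type from srcu.h):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/srcu.h>

/*
 * Statically defined per-CPU counters: every CPU starts from the same
 * zeroed image that setup_per_cpu_areas() replicates at boot, so no
 * runtime allocation is needed.
 */
static DEFINE_PER_CPU(struct srcu_struct_array, example_srcu_counters);

static int example_readers_active_idx(int idx)
{
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
                sum += per_cpu(example_srcu_counters, cpu).c[idx];
        return sum;
}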
> What I'd suggest instead, is to make the allocation dynamic, and make it
> inside the srcu functions (kind of like I did now, but I did it at a
> higher level).
>
> Doing it at the high level was trivial right now, but we may well end up
> hitting this problem again if people start using SRCU more. Right now I
> suspect the cpufreq notifier is the only thing that uses SRCU, and it
> already showed this problem with SRCU initializers.
>
> So I was more thinking about moving my "one special case high level hack"
> down lower, down to the SRCU level, so that we'll never see _more_ of
> those horrible hacks. We'll still have the hacky thing, but at least it
> will be limited to a single place - the SRCU code itself.
Another possible approach (but equally disgusting) is to use this static
allocation approach, and have the SRCU structure include both a static and
a dynamic percpu pointer together with a flag indicating which should be
used.
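Concretely, something like the following (every name here is invented
for illustration; it also shows why the two cases can't share a single
access path):

#include <linux/cpumask.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/srcu.h>

/* One static per-CPU instance -- which is exactly why this only helps
 * statically-declared structures. */
static DEFINE_PER_CPU(struct srcu_struct_array, example_static_ref);

struct example_hybrid_srcu {
        int completed;
        int uses_static;                        /* which reference is live */
        struct srcu_struct_array *dyn_ref;      /* from alloc_percpu() */
        struct mutex mutex;
};

static int example_hybrid_readers_idx(struct example_hybrid_srcu *sp, int idx)
{
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
                sum += sp->uses_static ?
                        per_cpu(example_static_ref, cpu).c[idx] :
                        per_cpu_ptr(sp->dyn_ref, cpu)->c[idx];
        return sum;
}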
Alan Stern
On Thu, Nov 16, 2006 at 10:06:25PM -0500, Alan Stern wrote:
> On Thu, 16 Nov 2006, Linus Torvalds wrote:
>
> >
> >
> > On Thu, 16 Nov 2006, Alan Stern wrote:
> > > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> > > >
> > > > Paul, it would be _really_ nice to have some way to just initialize
> > > > that SRCU thing statically. This kind of crud is just crazy.
> > >
> > > I looked into this back when SRCU was first added. It's essentially
> > > impossible to do it, because the per-cpu memory allocation & usage APIs
> > > are completely different for the static and the dynamic cases.
> >
> > I don't think that's how you'd want to do it.
> >
> > There's no way to do an initialization of a percpu allocation statically.
> > That's pretty obvious.
>
> Hmmm... What about DEFINE_PER_CPU in include/asm-generic/percpu.h
> combined with setup_per_cpu_areas() in init/main.c? So long as you want
> all the CPUs to start with the same initial values, it should work.
>
> > What I'd suggest instead, is to make the allocation dynamic, and make it
> > inside the srcu functions (kind of like I did now, but I did it at a
> > higher level).
> >
> > Doing it at the high level was trivial right now, but we may well end up
> > hitting this problem again if people start using SRCU more. Right now I
> > suspect the cpufreq notifier is the only thing that uses SRCU, and it
> > already showed this problem with SRCU initializers.
> >
> > So I was more thinking about moving my "one special case high level hack"
> > down lower, down to the SRCU level, so that we'll never see _more_ of
> > those horrible hacks. We'll still have the hacky thing, but at least it
> > will be limited to a single place - the SRCU code itself.
>
> Another possible approach (but equally disgusting) is to use this static
> allocation approach, and have the SRCU structure include both a static and
> a dynamic percpu pointer together with a flag indicating which should be
> used.
I am actually taking some suggestions you made some months ago. At the
time, I rejected them because they injected extra branches into the
fastpath. However, recent experience indicates that you (Alan Stern)
were right and I was wrong -- turns out that the update-side overhead
cannot be so lightly disregarded, which forces memory barriers (but
neither atomics nor cache misses) into the fastpath. If some application
ends up being provably inconvenienced by the read-side overhead, the old
implementation can be re-introduced under a different name or some such.
So, here is my current plan:
o Add NULL checks on srcu_struct_array to srcu_read_lock(),
srcu_read_unlock(), and synchronize_srcu. These will
acquire the mutex and attempt to initialize. If out
of memory, they will use the new hardluckref field.
o Add memory barriers to srcu_read_lock() and srcu_read_unlock().
o Also add a memory barrier or two to synchronize_srcu(), which,
in combination with those in srcu_read_lock() and srcu_read_unlock(),
permit removing two of the three synchronize_sched() calls
in synchronize_srcu(), decreasing its latency by roughly
a factor of three.
This change should have the added benefit of making
synchronize_srcu() much easier to understand.
o I left out the super-fastpath synchronize_srcu() because
after sleeping on it, it scared me silly. Might be OK,
but needs careful thought. The fastpath is of the form:
if (srcu_readers_active(sp) == 0) {
smp_mb();
return;
}
prior to the mutex_lock() in synchronize_srcu().
Attached is a patch that compiles, but probably goes down in flames
otherwise.
Thoughts?
Thanx, Paul
include/linux/srcu.h | 7 ---
kernel/srcu.c | 111 +++++++++++++++++++++++----------------------------
2 files changed, 53 insertions(+), 65 deletions(-)
diff -urpNa -X dontdiff linux-2.6.19-rc5/include/linux/srcu.h linux-2.6.19-rc5-dsrcu/include/linux/srcu.h
--- linux-2.6.19-rc5/include/linux/srcu.h 2006-11-07 18:24:20.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/include/linux/srcu.h 2006-11-16 21:40:03.000000000 -0800
@@ -35,14 +35,9 @@ struct srcu_struct {
int completed;
struct srcu_struct_array *per_cpu_ref;
struct mutex mutex;
+ int hardluckref;
};
-#ifndef CONFIG_PREEMPT
-#define srcu_barrier() barrier()
-#else /* #ifndef CONFIG_PREEMPT */
-#define srcu_barrier()
-#endif /* #else #ifndef CONFIG_PREEMPT */
-
int init_srcu_struct(struct srcu_struct *sp);
void cleanup_srcu_struct(struct srcu_struct *sp);
int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
diff -urpNa -X dontdiff linux-2.6.19-rc5/kernel/srcu.c linux-2.6.19-rc5-dsrcu/kernel/srcu.c
--- linux-2.6.19-rc5/kernel/srcu.c 2006-11-07 18:24:20.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/kernel/srcu.c 2006-11-16 22:40:33.000000000 -0800
@@ -34,6 +34,14 @@
#include <linux/smp.h>
#include <linux/srcu.h>
+/*
+ * Initialize the per-CPU array, returning the pointer.
+ */
+static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
+{
+ return alloc_percpu(struct srcu_struct_array);
+}
+
/**
* init_srcu_struct - initialize a sleep-RCU structure
* @sp: structure to initialize.
@@ -46,7 +54,8 @@ int init_srcu_struct(struct srcu_struct
{
sp->completed = 0;
mutex_init(&sp->mutex);
- sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ sp->hardluckref = 0;
return (sp->per_cpu_ref ? 0 : -ENOMEM);
}
@@ -61,9 +70,10 @@ static int srcu_readers_active_idx(struc
int sum;
sum = 0;
- for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
- return sum;
+ if (likely(sp->per_cpu_ref != NULL))
+ for_each_possible_cpu(cpu)
+ sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
+ return sum + sp->hardluckref;
}
/**
@@ -76,7 +86,9 @@ static int srcu_readers_active_idx(struc
*/
int srcu_readers_active(struct srcu_struct *sp)
{
- return srcu_readers_active_idx(sp, 0) + srcu_readers_active_idx(sp, 1);
+ return srcu_readers_active_idx(sp, 0) +
+ srcu_readers_active_idx(sp, 1) -
+ sp->hardluckref; /* No one will ever care, but... */
}
/**
@@ -94,7 +106,8 @@ void cleanup_srcu_struct(struct srcu_str
WARN_ON(sum); /* Leakage unless caller handles error. */
if (sum != 0)
return;
- free_percpu(sp->per_cpu_ref);
+ if (sp->per_cpu_ref != NULL)
+ free_percpu(sp->per_cpu_ref);
sp->per_cpu_ref = NULL;
}
@@ -105,6 +118,7 @@ void cleanup_srcu_struct(struct srcu_str
* Counts the new reader in the appropriate per-CPU element of the
* srcu_struct. Must be called from process context.
* Returns an index that must be passed to the matching srcu_read_unlock().
+ * The index is -1 if the srcu_struct is not and cannot be initialized.
*/
int srcu_read_lock(struct srcu_struct *sp)
{
@@ -112,11 +126,24 @@ int srcu_read_lock(struct srcu_struct *s
preempt_disable();
idx = sp->completed & 0x1;
- barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
+ if (likely(sp->per_cpu_ref != NULL)) {
+ barrier(); /* ensure compiler looks -once- at sp->completed. */
+ per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
+ smp_processor_id())->c[idx]++;
+ smp_mb();
+ preempt_enable();
+ return idx;
+ }
preempt_enable();
- return idx;
+ mutex_lock(&sp->mutex);
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ if (sp->per_cpu_ref == NULL) {
+ sp->hardluckref++;
+ mutex_unlock(&sp->mutex);
+ return -1;
+ }
+ mutex_unlock(&sp->mutex);
+ return srcu_read_lock(sp);
}
/**
@@ -131,10 +158,16 @@ int srcu_read_lock(struct srcu_struct *s
*/
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
- preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
- preempt_enable();
+ if (likely(idx != -1)) {
+ preempt_disable();
+ smp_mb();
+ per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
+ preempt_enable();
+ return;
+ }
+ mutex_lock(&sp->mutex);
+ sp->hardluckref--;
+ mutex_unlock(&sp->mutex);
}
/**
@@ -173,65 +206,25 @@ void synchronize_srcu(struct srcu_struct
return;
}
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() ensures that any CPU that
- * sees the new value of sp->completed will also see any preceding
- * changes to data structures made by this CPU. This prevents
- * some other CPU from reordering the accesses in its SRCU
- * read-side critical section to precede the corresponding
- * srcu_read_lock() -- ensuring that such references will in
- * fact be protected.
- *
- * So it is now safe to do the flip.
- */
-
+ smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
idx = sp->completed & 0x1;
sp->completed++;
- synchronize_sched(); /* Force memory barrier on all CPUs. */
+ synchronize_sched();
/*
* At this point, because of the preceding synchronize_sched(),
* all srcu_read_lock() calls using the old counters have completed.
* Their corresponding critical sections might well be still
* executing, but the srcu_read_lock() primitives themselves
- * will have finished executing.
+ * will have finished executing. The "old" rank of counters
+ * can therefore only decrease, never increase in value.
*/
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() forces all srcu_read_unlock()
- * primitives that were executing concurrently with the preceding
- * for_each_possible_cpu() loop to have completed by this point.
- * More importantly, it also forces the corresponding SRCU read-side
- * critical sections to have also completed, and the corresponding
- * references to SRCU-protected data items to be dropped.
- *
- * Note:
- *
- * Despite what you might think at first glance, the
- * preceding synchronize_sched() -must- be within the
- * critical section ended by the following mutex_unlock().
- * Otherwise, a task taking the early exit can race
- * with a srcu_read_unlock(), which might have executed
- * just before the preceding srcu_readers_active() check,
- * and whose CPU might have reordered the srcu_read_unlock()
- * with the preceding critical section. In this case, there
- * is nothing preventing the synchronize_sched() task that is
- * taking the early exit from freeing a data structure that
- * is still being referenced (out of order) by the task
- * doing the srcu_read_unlock().
- *
- * Alternatively, the comparison with "2" on the early exit
- * could be changed to "3", but this increases synchronize_srcu()
- * latency for bulk loads. So the current code is preferred.
- */
+ smp_mb(); /* must see critical section prior to srcu_read_unlock() */
mutex_unlock(&sp->mutex);
}
On Thu, Nov 16 2006, Paul E. McKenney wrote:
> On Thu, Nov 16, 2006 at 10:06:25PM -0500, Alan Stern wrote:
> > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> >
> > >
> > >
> > > On Thu, 16 Nov 2006, Alan Stern wrote:
> > > > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> > > > >
> > > > > Paul, it would be _really_ nice to have some way to just initialize
> > > > > that SRCU thing statically. This kind of crud is just crazy.
> > > >
> > > > I looked into this back when SRCU was first added. It's essentially
> > > > impossible to do it, because the per-cpu memory allocation & usage APIs
> > > > are completely different for the static and the dynamic cases.
> > >
> > > I don't think that's how you'd want to do it.
> > >
> > > There's no way to do an initialization of a percpu allocation statically.
> > > That's pretty obvious.
> >
> > Hmmm... What about DEFINE_PER_CPU in include/asm-generic/percpu.h
> > combined with setup_per_cpu_areas() in init/main.c? So long as you want
> > all the CPUs to start with the same initial values, it should work.
> >
> > > What I'd suggest instead, is to make the allocation dynamic, and make it
> > > inside the srcu functions (kind of like I did now, but I did it at a
> > > higher level).
> > >
> > > Doing it at the high level was trivial right now, but we may well end up
> > > hitting this problem again if people start using SRCU more. Right now I
> > > suspect the cpufreq notifier is the only thing that uses SRCU, and it
> > > already showed this problem with SRCU initializers.
> > >
> > > So I was more thinking about moving my "one special case high level hack"
> > > down lower, down to the SRCU level, so that we'll never see _more_ of
> > > those horrible hacks. We'll still have the hacky thing, but at least it
> > > will be limited to a single place - the SRCU code itself.
> >
> > Another possible approach (but equally disgusting) is to use this static
> > allocation approach, and have the SRCU structure include both a static and
> > a dynamic percpu pointer together with a flag indicating which should be
> > used.
>
> I am actually taking some suggestions you made some months ago. At the
> time, I rejected them because they injected extra branches into the
> fastpath. However, recent experience indicates that you (Alan Stern)
> were right and I was wrong -- turns out that the update-side overhead
> cannot be so lightly disregarded, which forces memory barriers (but
> neither atomics nor cache misses) into the fastpath. If some application
> ends up being provably inconvenienced by the read-side overhead, the old
> implementation can be re-introduced under a different name or some such.
>
> So, here is my current plan:
>
> o Add NULL checks on srcu_struct_array to srcu_read_lock(),
> srcu_read_unlock(), and synchronize_srcu. These will
> acquire the mutex and attempt to initialize. If out
> of memory, they will use the new hardluckref field.
>
> o Add memory barriers to srcu_read_lock() and srcu_read_unlock().
>
> o Also add a memory barrier or two to synchronize_srcu(), which,
> in combination with those in srcu_read_lock() and srcu_read_unlock(),
> permit removing two of the three synchronize_sched() calls
> in synchronize_srcu(), decreasing its latency by roughly
> a factor of three.
>
> This change should have the added benefit of making
> synchronize_srcu() much easier to understand.
>
> o I left out the super-fastpath synchronize_srcu() because
> after sleeping on it, it scared me silly. Might be OK,
> but needs careful thought. The fastpath is of the form:
>
> if (srcu_readers_active(sp) == 0) {
> smp_mb();
> return;
> }
>
> prior to the mutex_lock() in synchronize_srcu().
It works for me, but the overhead is still large. Before it would take
8-12 jiffies for a synchronize_srcu() to complete without there actually
being any reader locks active, now it takes 2-3 jiffies. So it's
definitely faster, and as suspected the loss of two of three
synchronize_sched() cut down the overhead to a third.
It's still too heavy for me; by far most of the calls I make to
synchronize_srcu() don't have any reader locks pending. I'm still a
big advocate of the fastpath srcu_readers_active() check. I can
understand the reluctance to make it the default, but for my case it's
"safe enough", so if we could either export srcu_readers_active() or
export a synchronize_srcu_fast() (or something like that), then SRCU
would be a good fit for barrier vs plug rework.
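Something as simple as this would cover my case (the name follows the
suggestion above -- it is not an existing interface, and it assumes
srcu_readers_active() gets exported):

#include <linux/srcu.h>

/*
 * Hypothetical helper: skip the grace period when no reader currently
 * holds the SRCU lock. "Safe enough" for the barrier vs plug rework,
 * not a general replacement for synchronize_srcu().
 */
static inline void synchronize_srcu_fast(struct srcu_struct *sp)
{
        if (srcu_readers_active(sp))
                synchronize_srcu(sp);
        else
                smp_mb();       /* still order prior accesses */
}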
> Attached is a patch that compiles, but probably goes down in flames
> otherwise.
Works here :-)
--
Jens Axboe
Paul E. McKenney wrote:
>
> int srcu_read_lock(struct srcu_struct *sp)
> {
> @@ -112,11 +126,24 @@ int srcu_read_lock(struct srcu_struct *s
>
> preempt_disable();
> idx = sp->completed & 0x1;
> - barrier(); /* ensure compiler looks -once- at sp->completed. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> + if (likely(sp->per_cpu_ref != NULL)) {
> + barrier(); /* ensure compiler looks -once- at sp->completed. */
> + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> + smp_processor_id())->c[idx]++;
> + smp_mb();
> + preempt_enable();
> + return idx;
> + }
> preempt_enable();
> - return idx;
> + mutex_lock(&sp->mutex);
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
We should re-check sp->per_cpu_ref != NULL after taking sp->mutex,
it was probably allocated by another thread.
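I.e. the slow path would become something like:

        mutex_lock(&sp->mutex);
        if (sp->per_cpu_ref == NULL)    /* re-check under the lock */
                sp->per_cpu_ref = alloc_srcu_struct_percpu();
        mutex_unlock(&sp->mutex);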
> void srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> - preempt_disable();
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> - preempt_enable();
> + if (likely(idx != -1)) {
> + preempt_disable();
> + smp_mb();
> + per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> + preempt_enable();
> + return;
> + }
> + mutex_lock(&sp->mutex);
> + sp->hardluckref--;
> + mutex_unlock(&sp->mutex);
> }
I think this is deadlockable, synchronize_srcu() does
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
under sp->mutex, so the loop above may spin forever while the reader
waits for sp->mutex in srcu_read_unlock(sp, -1).
Oleg.
On Fri, Nov 17, 2006 at 10:29:25AM +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Paul E. McKenney wrote:
> > On Thu, Nov 16, 2006 at 10:06:25PM -0500, Alan Stern wrote:
> > > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> > >
> > > >
> > > >
> > > > On Thu, 16 Nov 2006, Alan Stern wrote:
> > > > > On Thu, 16 Nov 2006, Linus Torvalds wrote:
> > > > > >
> > > > > > Paul, it would be _really_ nice to have some way to just initialize
> > > > > > that SRCU thing statically. This kind of crud is just crazy.
> > > > >
> > > > > I looked into this back when SRCU was first added. It's essentially
> > > > > impossible to do it, because the per-cpu memory allocation & usage APIs
> > > > > are completely different for the static and the dynamic cases.
> > > >
> > > > I don't think that's how you'd want to do it.
> > > >
> > > > There's no way to do an initialization of a percpu allocation statically.
> > > > That's pretty obvious.
> > >
> > > Hmmm... What about DEFINE_PER_CPU in include/asm-generic/percpu.h
> > > combined with setup_per_cpu_areas() in init/main.c? So long as you want
> > > all the CPUs to start with the same initial values, it should work.
> > >
> > > > What I'd suggest instead, is to make the allocation dynamic, and make it
> > > > inside the srcu functions (kind of like I did now, but I did it at a
> > > > higher level).
> > > >
> > > > Doing it at the high level was trivial right now, but we may well end up
> > > > hitting this problem again if people start using SRCU more. Right now I
> > > > suspect the cpufreq notifier is the only thing that uses SRCU, and it
> > > > already showed this problem with SRCU initializers.
> > > >
> > > > So I was more thinking about moving my "one special case high level hack"
> > > > down lower, down to the SRCU level, so that we'll never see _more_ of
> > > > those horrible hacks. We'll still have the hacky thing, but at least it
> > > > will be limited to a single place - the SRCU code itself.
> > >
> > > Another possible approach (but equally disgusting) is to use this static
> > > allocation approach, and have the SRCU structure include both a static and
> > > a dynamic percpu pointer together with a flag indicating which should be
> > > used.
> >
> > I am actually taking some suggestions you made some months ago. At the
> > time, I rejected them because they injected extra branches into the
> > fastpath. However, recent experience indicates that you (Alan Stern)
> > were right and I was wrong -- turns out that the update-side overhead
> > cannot be so lightly disregarded, which forces memory barriers (but
> > neither atomics nor cache misses) into the fastpath. If some application
> > ends up being provably inconvenienced by the read-side overhead, the old
> > implementation can be re-introduced under a different name or some such.
> >
> > So, here is my current plan:
> >
> > o Add NULL checks on srcu_struct_array to srcu_read_lock(),
> > srcu_read_unlock(), and synchronize_srcu. These will
> > acquire the mutex and attempt to initialize. If out
> > of memory, they will use the new hardluckref field.
> >
> > o Add memory barriers to srcu_read_lock() and srcu_read_unlock().
> >
> > o Also add a memory barrier or two to synchronize_srcu(), which,
> > in combination with those in srcu_read_lock() and srcu_read_unlock(),
> > permit removing two of the three synchronize_sched() calls
> > in synchronize_srcu(), decreasing its latency by roughly
> > a factor of three.
> >
> > This change should have the added benefit of making
> > synchronize_srcu() much easier to understand.
> >
> > o I left out the super-fastpath synchronize_srcu() because
> > after sleeping on it, it scared me silly. Might be OK,
> > but needs careful thought. The fastpath is of the form:
> >
> > if (srcu_readers_active(sp) == 0) {
> > smp_mb();
> > return;
> > }
> >
> > prior to the mutex_lock() in synchronize_srcu().
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.
Good to hear, thank you for trying it out!
> It's still too heavy for me; by far most of the calls I make to
> synchronize_srcu() don't have any reader locks pending. I'm still a
> big advocate of the fastpath srcu_readers_active() check. I can
> understand the reluctance to make it the default, but for my case it's
> "safe enough", so if we could either export srcu_readers_active() or
> export a synchronize_srcu_fast() (or something like that), then SRCU
> would be a good fit for barrier vs plug rework.
OK, will export the interface. Do your queues have associated locking?
> > Attached is a patch that compiles, but probably goes down in flames
> > otherwise.
>
> Works here :-)
I have at least a couple bugs that would show up under low-memory
situations, will fix and post an update.
Thanx, Paul
On Fri, 17 Nov 2006, Paul E. McKenney wrote:
> > It works for me, but the overhead is still large. Before it would take
> > 8-12 jiffies for a synchronize_srcu() to complete without there actually
> > being any reader locks active, now it takes 2-3 jiffies. So it's
> > definitely faster, and as suspected the loss of two of three
> > synchronize_sched() cut down the overhead to a third.
>
> Good to hear, thank you for trying it out!
>
> > It's still too heavy for me; by far most of the calls I make to
> > synchronize_srcu() don't have any reader locks pending. I'm still a
> > big advocate of the fastpath srcu_readers_active() check. I can
> > understand the reluctance to make it the default, but for my case it's
> > "safe enough", so if we could either export srcu_readers_active() or
> > export a synchronize_srcu_fast() (or something like that), then SRCU
> > would be a good fit for barrier vs plug rework.
>
> OK, will export the interface. Do your queues have associated locking?
>
> > > Attached is a patch that compiles, but probably goes down in flames
> > > otherwise.
> >
> > Works here :-)
>
> I have at least a couple bugs that would show up under low-memory
> situations, will fix and post an update.
Perhaps a better approach to the initialization problem would be to assume
that either:
1. The srcu_struct will be initialized before it is used, or
2. When it is used before initialization, the system is running
only one thread.
In other words, statically allocated SRCU structures that get used during
system startup must be initialized before the system starts multitasking.
That seems like a reasonable requirement.
This eliminates worries about readers holding mutexes. It doesn't
solve the issues surrounding your hardluckref, but maybe it makes them
easier to think about.
Alan Stern
On Fri, Nov 17, 2006 at 09:39:45PM +0300, Oleg Nesterov wrote:
> Paul E. McKenney wrote:
> >
> > int srcu_read_lock(struct srcu_struct *sp)
> > {
> > @@ -112,11 +126,24 @@ int srcu_read_lock(struct srcu_struct *s
> >
> > preempt_disable();
> > idx = sp->completed & 0x1;
> > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > + if (likely(sp->per_cpu_ref != NULL)) {
> > + barrier(); /* ensure compiler looks -once- at sp->completed. */
> > + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> > + smp_processor_id())->c[idx]++;
> > + smp_mb();
> > + preempt_enable();
> > + return idx;
> > + }
> > preempt_enable();
> > - return idx;
> > + mutex_lock(&sp->mutex);
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
>
> We should re-check sp->per_cpu_ref != NULL after taking sp->mutex,
> it was probably allocated by another thread.
Good catch!!!
> > void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > {
> > - preempt_disable();
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> > - preempt_enable();
> > + if (likely(idx != -1)) {
> > + preempt_disable();
> > + smp_mb();
> > + per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> > + preempt_enable();
> > + return;
> > + }
> > + mutex_lock(&sp->mutex);
> > + sp->hardluckref--;
> > + mutex_unlock(&sp->mutex);
> > }
>
> I think this is deadlockable, synchronize_srcu() does
>
> while (srcu_readers_active_idx(sp, idx))
> schedule_timeout_interruptible(1);
>
> under sp->mutex, so the loop above may spin forever while the reader
> waits for sp->mutex in srcu_read_unlock(sp, -1).
Indeed it is! This requires a nested reader, so that the outer reader
blocks synchronize_srcu() and synchronize_srcu() blocks the inner
reader -- but that is legal.
So I made hardluckref an atomic_t, and changed the mutex_lock()
in srcu_read_lock() to a mutex_trylock() -- which cannot block, right?
I also added the srcu_readers_active() declaration to srcu.h for Jens.
Oleg, any thoughts about Jens's optimization? He would code something
like:
if (srcu_readers_active(&my_srcu))
synchronize_srcu();
else
smp_mb();
However, he is doing ordered I/O requests rather than protecting data
structures.
Changes:
o Make hardluckref be an atomic_t.
o Put the now-needed rcu_dereference()s for per_cpu_ref
(used to be constant...).
o Moved to mutex_trylock() in srcu_read_lock() to avoid Oleg's
deadlock scenario.
o Added per_cpu_ref NULL rechecks to avoid Oleg's memory
leak (and worse).
o Added srcu_readers_active() to srcu.h.
Still untested (aside from Jens's runs).
Signed-off-by: [email protected] (AKA [email protected])
---
include/linux/srcu.h | 8 ---
kernel/srcu.c | 130 +++++++++++++++++++++++++++------------------------
2 files changed, 73 insertions(+), 65 deletions(-)
diff -urpNa -X dontdiff linux-2.6.19-rc5/include/linux/srcu.h linux-2.6.19-rc5-dsrcu/include/linux/srcu.h
--- linux-2.6.19-rc5/include/linux/srcu.h 2006-11-17 13:54:15.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/include/linux/srcu.h 2006-11-17 15:14:07.000000000 -0800
@@ -35,19 +35,15 @@ struct srcu_struct {
int completed;
struct srcu_struct_array *per_cpu_ref;
struct mutex mutex;
+ atomic_t hardluckref;
};
-#ifndef CONFIG_PREEMPT
-#define srcu_barrier() barrier()
-#else /* #ifndef CONFIG_PREEMPT */
-#define srcu_barrier()
-#endif /* #else #ifndef CONFIG_PREEMPT */
-
int init_srcu_struct(struct srcu_struct *sp);
void cleanup_srcu_struct(struct srcu_struct *sp);
int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);
+int srcu_readers_active(struct srcu_struct *sp);
#endif
diff -urpNa -X dontdiff linux-2.6.19-rc5/kernel/srcu.c linux-2.6.19-rc5-dsrcu/kernel/srcu.c
--- linux-2.6.19-rc5/kernel/srcu.c 2006-11-17 13:54:17.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/kernel/srcu.c 2006-11-17 14:15:06.000000000 -0800
@@ -34,6 +34,18 @@
#include <linux/smp.h>
#include <linux/srcu.h>
+/*
+ * Initialize the per-CPU array, returning the pointer.
+ */
+static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
+{
+ struct srcu_struct_array *sap;
+
+ sap = alloc_percpu(struct srcu_struct_array);
+ smp_wmb();
+ return (sap);
+}
+
/**
* init_srcu_struct - initialize a sleep-RCU structure
* @sp: structure to initialize.
@@ -46,7 +58,8 @@ int init_srcu_struct(struct srcu_struct
{
sp->completed = 0;
mutex_init(&sp->mutex);
- sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ atomic_set(&sp->hardluckref, 0);
return (sp->per_cpu_ref ? 0 : -ENOMEM);
}
@@ -58,12 +71,15 @@ int init_srcu_struct(struct srcu_struct
static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
int cpu;
+ struct srcu_struct_array *sap;
int sum;
sum = 0;
- for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
- return sum;
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL))
+ for_each_possible_cpu(cpu)
+ sum += per_cpu_ptr(sap, cpu)->c[idx];
+ return sum + atomic_read(&sp->hardluckref);
}
/**
@@ -76,7 +92,9 @@ static int srcu_readers_active_idx(struc
*/
int srcu_readers_active(struct srcu_struct *sp)
{
- return srcu_readers_active_idx(sp, 0) + srcu_readers_active_idx(sp, 1);
+ return srcu_readers_active_idx(sp, 0) +
+ srcu_readers_active_idx(sp, 1) -
+ atomic_read(&sp->hardluckref); /* No one will care, but... */
}
/**
@@ -94,7 +112,8 @@ void cleanup_srcu_struct(struct srcu_str
WARN_ON(sum); /* Leakage unless caller handles error. */
if (sum != 0)
return;
- free_percpu(sp->per_cpu_ref);
+ if (sp->per_cpu_ref != NULL)
+ free_percpu(sp->per_cpu_ref);
sp->per_cpu_ref = NULL;
}
@@ -105,18 +124,39 @@ void cleanup_srcu_struct(struct srcu_str
* Counts the new reader in the appropriate per-CPU element of the
* srcu_struct. Must be called from process context.
* Returns an index that must be passed to the matching srcu_read_unlock().
+ * The index is -1 if the srcu_struct is not and cannot be initialized.
*/
int srcu_read_lock(struct srcu_struct *sp)
{
int idx;
+ struct srcu_struct_array *sap;
preempt_disable();
idx = sp->completed & 0x1;
- barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL)) {
+ barrier(); /* ensure compiler looks -once- at sp->completed. */
+ per_cpu_ptr(rcu_dereference(sap),
+ smp_processor_id())->c[idx]++;
+ smp_mb();
+ preempt_enable();
+ return idx;
+ }
+ if (mutex_trylock(&sp->mutex)) {
+ preempt_enable();
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ if (sp->per_cpu_ref == NULL) {
+ atomic_inc(&sp->hardluckref);
+ mutex_unlock(&sp->mutex);
+ return -1;
+ }
+ mutex_unlock(&sp->mutex);
+ return srcu_read_lock(sp);
+ }
preempt_enable();
- return idx;
+ atomic_inc(&sp->hardluckref);
+ return -1;
}
/**
@@ -131,10 +171,17 @@ int srcu_read_lock(struct srcu_struct *s
*/
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
- preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
- preempt_enable();
+ if (likely(idx != -1)) {
+ preempt_disable();
+ smp_mb();
+ per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
+ smp_processor_id())->c[idx]--;
+ preempt_enable();
+ return;
+ }
+ mutex_lock(&sp->mutex);
+ atomic_dec(&sp->hardluckref);
+ mutex_unlock(&sp->mutex);
}
/**
@@ -158,6 +205,11 @@ void synchronize_srcu(struct srcu_struct
idx = sp->completed;
mutex_lock(&sp->mutex);
+ /* Initialize if not already initialized. */
+
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+
/*
* Check to see if someone else did the work for us while we were
* waiting to acquire the lock. We need -two- advances of
@@ -173,65 +225,25 @@ void synchronize_srcu(struct srcu_struct
return;
}
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() ensures that any CPU that
- * sees the new value of sp->completed will also see any preceding
- * changes to data structures made by this CPU. This prevents
- * some other CPU from reordering the accesses in its SRCU
- * read-side critical section to precede the corresponding
- * srcu_read_lock() -- ensuring that such references will in
- * fact be protected.
- *
- * So it is now safe to do the flip.
- */
-
+ smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
idx = sp->completed & 0x1;
sp->completed++;
- synchronize_sched(); /* Force memory barrier on all CPUs. */
+ synchronize_sched();
/*
* At this point, because of the preceding synchronize_sched(),
* all srcu_read_lock() calls using the old counters have completed.
* Their corresponding critical sections might well be still
* executing, but the srcu_read_lock() primitives themselves
- * will have finished executing.
+ * will have finished executing. The "old" rank of counters
+ * can therefore only decrease, never increase in value.
*/
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() forces all srcu_read_unlock()
- * primitives that were executing concurrently with the preceding
- * for_each_possible_cpu() loop to have completed by this point.
- * More importantly, it also forces the corresponding SRCU read-side
- * critical sections to have also completed, and the corresponding
- * references to SRCU-protected data items to be dropped.
- *
- * Note:
- *
- * Despite what you might think at first glance, the
- * preceding synchronize_sched() -must- be within the
- * critical section ended by the following mutex_unlock().
- * Otherwise, a task taking the early exit can race
- * with a srcu_read_unlock(), which might have executed
- * just before the preceding srcu_readers_active() check,
- * and whose CPU might have reordered the srcu_read_unlock()
- * with the preceding critical section. In this case, there
- * is nothing preventing the synchronize_sched() task that is
- * taking the early exit from freeing a data structure that
- * is still being referenced (out of order) by the task
- * doing the srcu_read_unlock().
- *
- * Alternatively, the comparison with "2" on the early exit
- * could be changed to "3", but this increases synchronize_srcu()
- * latency for bulk loads. So the current code is preferred.
- */
+ smp_mb(); /* must see critical section prior to srcu_read_unlock() */
mutex_unlock(&sp->mutex);
}
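For callers, the read-side usage is unchanged by the patch above; the -1
fallback index is handled inside srcu_read_unlock() itself. An illustrative
sketch (my_srcu is just a placeholder srcu_struct):

        int idx;

        idx = srcu_read_lock(&my_srcu);
        /* read-side critical section; idx may be -1 on the fallback path */
        srcu_read_unlock(&my_srcu, idx);

So readers never have to check for allocation failure themselves.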
On Fri, Nov 17, 2006 at 02:27:15PM -0500, Alan Stern wrote:
> On Fri, 17 Nov 2006, Paul E. McKenney wrote:
>
> > > It works for me, but the overhead is still large. Before it would take
> > > 8-12 jiffies for a synchronize_srcu() to complete without there actually
> > > being any reader locks active, now it takes 2-3 jiffies. So it's
> > > definitely faster, and as suspected the loss of two of three
> > > synchronize_sched() cut down the overhead to a third.
> >
> > Good to hear, thank you for trying it out!
> >
> > > It's still too heavy for me, by far the most calls I do to
> > > synchronize_srcu() doesn't have any reader locks pending. I'm still a
> > > big advocate of the fastpath srcu_readers_active() check. I can
> > > understand the reluctance to make it the default, but for my case it's
> > > "safe enough", so if we could either export srcu_readers_active() or
> > > export a synchronize_srcu_fast() (or something like that), then SRCU
> > > would be a good fit for barrier vs plug rework.
> >
> > OK, will export the interface. Do your queues have associated locking?
> >
> > > > Attached is a patch that compiles, but probably goes down in flames
> > > > otherwise.
> > >
> > > Works here :-)
> >
> > I have at least a couple bugs that would show up under low-memory
> > situations, will fix and post an update.
>
> Perhaps a better approach to the initialization problem would be to assume
> that either:
>
> 1. The srcu_struct will be initialized before it is used, or
>
> 2. When it is used before initialization, the system is running
> only one thread.
Are these assumptions valid? If so, they would indeed simplify things
a bit.
> In other words, statically allocated SRCU structures that get used during
> system startup must be initialized before the system starts multitasking.
> That seems like a reasonable requirement.
>
> This eliminates worries about readers holding mutexes. It doesn't
> solve the issues surrounding your hardluckref, but maybe it makes them
> easier to think about.
For the moment, I cheaped out and used a mutex_trylock. If this can block,
I will need to add a separate spinlock to guard per_cpu_ref allocation.
Hmmm... How to test this? Time for the wrapper around alloc_percpu()
that randomly fails, I guess. ;-)
Thanx, Paul
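One possible shape for such a wrapper, purely as a debugging hack
(flaky_alloc_percpu and the one-in-four failure rate are invented here for
illustration, not an existing kernel facility):

        #include <linux/random.h>
        #include <linux/percpu.h>

        /* Fail roughly one call in four so the hardluckref paths get exercised. */
        static int percpu_alloc_should_fail(void)
        {
                unsigned char byte;

                get_random_bytes(&byte, 1);
                return (byte & 0x3) == 0;
        }

        #define flaky_alloc_percpu(type) \
                (percpu_alloc_should_fail() ? NULL : alloc_percpu(type))

A test build could then have alloc_srcu_struct_percpu() call
flaky_alloc_percpu() in place of alloc_percpu().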
On Fri, 17 Nov 2006, Paul E. McKenney wrote:
> > Perhaps a better approach to the initialization problem would be to assume
> > that either:
> >
> > 1. The srcu_struct will be initialized before it is used, or
> >
> > 2. When it is used before initialization, the system is running
> > only one thread.
>
> Are these assumptions valid? If so, they would indeed simplify things
> a bit.
I don't know. Maybe Andrew can tell us -- is it true that the kernel runs
only one thread up through the time the core_initcalls are finished?
If not, can we create another initcall level that is guaranteed to run
before any threads are spawned?
> For the moment, I cheaped out and used a mutex_trylock. If this can block,
> I will need to add a separate spinlock to guard per_cpu_ref allocation.
I haven't looked at your revised patch yet... But it's important to keep
things as simple as possible.
> Hmmm... How to test this? Time for the wrapper around alloc_percpu()
> that randomly fails, I guess. ;-)
Do you really want things to continue in a highly degraded mode when
percpu allocation fails? Maybe it would be better just to pass the
failure back to the caller.
Alan Stern
On Fri, 17 Nov 2006 23:33:45 -0500 (EST)
Alan Stern <[email protected]> wrote:
> On Fri, 17 Nov 2006, Paul E. McKenney wrote:
>
> > > Perhaps a better approach to the initialization problem would be to assume
> > > that either:
> > >
> > > 1. The srcu_struct will be initialized before it is used, or
> > >
> > > 2. When it is used before initialization, the system is running
> > > only one thread.
> >
> > Are these assumptions valid? If so, they would indeed simplify things
> > a bit.
>
> I don't know. Maybe Andrew can tell us -- is it true that the kernel runs
> only one thread up through the time the core_initcalls are finished?
I don't see why - a core_initcall could go off and do the
multithreaded-pci-probing thing, or it could call kernel_thread() or
anything. I doubt if any core_initcall functions _do_ do that, but there
are a lot of them.
> If not, can we create another initcall level that is guaranteed to run
> before any threads are spawned?
It's a simple and cheap matter to create a precore_initcall() - one would
need to document it carefully to be able to preserve whatever guarantees it
needs.
However by the time the initcalls get run, various things are already
happening: SMP is up, the keventd threads are running, the CPU scheduler
migration threads are running, ksoftirqd, softlockup-detector, etc.
keventd is the problematic one.
So I guess you'd need a new linker section and a call from
do_pre_smp_initcalls() or thereabouts.
On Fri, Nov 17, 2006 at 08:51:03PM -0800, Andrew Morton wrote:
> On Fri, 17 Nov 2006 23:33:45 -0500 (EST)
> Alan Stern <[email protected]> wrote:
>
> > On Fri, 17 Nov 2006, Paul E. McKenney wrote:
> >
> > > > Perhaps a better approach to the initialization problem would be to assume
> > > > that either:
> > > >
> > > > 1. The srcu_struct will be initialized before it is used, or
> > > >
> > > > 2. When it is used before initialization, the system is running
> > > > only one thread.
> > >
> > > Are these assumptions valid? If so, they would indeed simplify things
> > > a bit.
> >
> > I don't know. Maybe Andrew can tell us -- is it true that the kernel runs
> > only one thread up through the time the core_initcalls are finished?
>
> I don't see why - a core_initcall could go off and do the
> multithreaded-pci-probing thing, or it could call kernel_thread() or
> anything. I doubt if any core_initcall functions _do_ do that, but there
> are a lot of them.
>
> > If not, can we create another initcall level that is guaranteed to run
> > before any threads are spawned?
>
> It's a simple and cheap matter to create a precore_initcall() - one would
> need to document it carefully to be able to preserve whatever guarantees it
> needs.
>
> However by the time the initcalls get run, various things are already
> happening: SMP is up, the keventd threads are running, the CPU scheduler
> migration threads are running, ksoftirqd, softlockup-detector, etc.
> keventd is the problematic one.
>
> So I guess you'd need a new linker section and a call from
> do_pre_smp_initcalls() or thereabouts.
Hmmm... OK then, for the moment, I will stick with the current checks
in the primitives. Not that I particularly like the "bulking up" of
srcu_read_lock() and srcu_read_unlock() -- but if the super-fast
version is needed, it can easily be provided either within the
confines of the subsystem that needs it, or as yet another set of
RCU-like primitives. Hopefully this latter option can be avoided!
BTW, the reason for the hardluckref is that I don't want to inflict a
failure return from srcu_read_lock() on you guys. The non-blocking
synchronization community has repeatedly made that sort of mistake,
and I have no intention of letting it propagate any further. ;-)
Thanx, Paul
There are a few things I don't like about this patch.
On Fri, 17 Nov 2006, Paul E. McKenney wrote:
> diff -urpNa -X dontdiff linux-2.6.19-rc5/kernel/srcu.c linux-2.6.19-rc5-dsrcu/kernel/srcu.c
> --- linux-2.6.19-rc5/kernel/srcu.c 2006-11-17 13:54:17.000000000 -0800
> +++ linux-2.6.19-rc5-dsrcu/kernel/srcu.c 2006-11-17 14:15:06.000000000 -0800
> @@ -34,6 +34,18 @@
> #include <linux/smp.h>
> #include <linux/srcu.h>
>
> +/*
> + * Initialize the per-CPU array, returning the pointer.
> + */
> +static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
> +{
> + struct srcu_struct_array *sap;
> +
> + sap = alloc_percpu(struct srcu_struct_array);
> + smp_wmb();
> + return (sap);
Style: Don't use () here.
> +}
> +
> /**
> * init_srcu_struct - initialize a sleep-RCU structure
> * @sp: structure to initialize.
> @@ -94,7 +112,8 @@ void cleanup_srcu_struct(struct srcu_str
> WARN_ON(sum); /* Leakage unless caller handles error. */
> if (sum != 0)
> return;
> - free_percpu(sp->per_cpu_ref);
> + if (sp->per_cpu_ref != NULL)
> + free_percpu(sp->per_cpu_ref);
Now that Andrew has accepted the "allow free_percpu(NULL)" change, you can
remove the test here.
> sp->per_cpu_ref = NULL;
> }
>
> @@ -105,18 +124,39 @@ void cleanup_srcu_struct(struct srcu_str
> * Counts the new reader in the appropriate per-CPU element of the
> * srcu_struct. Must be called from process context.
> * Returns an index that must be passed to the matching srcu_read_unlock().
> + * The index is -1 if the srcu_struct is not and cannot be initialized.
> */
> int srcu_read_lock(struct srcu_struct *sp)
> {
> int idx;
> + struct srcu_struct_array *sap;
>
> preempt_disable();
> idx = sp->completed & 0x1;
> - barrier(); /* ensure compiler looks -once- at sp->completed. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> + sap = rcu_dereference(sp->per_cpu_ref);
> + if (likely(sap != NULL)) {
> + barrier(); /* ensure compiler looks -once- at sp->completed. */
Put this barrier() back where the old one was (outside the "if").
> + per_cpu_ptr(rcu_dereference(sap),
You don't need the rcu_dereference here, you already have it above.
> + smp_processor_id())->c[idx]++;
> + smp_mb();
> + preempt_enable();
> + return idx;
> + }
> + if (mutex_trylock(&sp->mutex)) {
> + preempt_enable();
Move the preempt_enable() before the "if", then get rid of the
preempt_enable() after the "if" block.
> + if (sp->per_cpu_ref == NULL)
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
It would be cleaner to put the mutex_unlock() and closing '}' right here.
> + if (sp->per_cpu_ref == NULL) {
> + atomic_inc(&sp->hardluckref);
> + mutex_unlock(&sp->mutex);
> + return -1;
> + }
> + mutex_unlock(&sp->mutex);
> + return srcu_read_lock(sp);
> + }
> preempt_enable();
> - return idx;
> + atomic_inc(&sp->hardluckref);
> + return -1;
> }
>
> /**
> @@ -131,10 +171,17 @@ int srcu_read_lock(struct srcu_struct *s
> */
> void srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> - preempt_disable();
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> - preempt_enable();
> + if (likely(idx != -1)) {
> + preempt_disable();
> + smp_mb();
> + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> + smp_processor_id())->c[idx]--;
> + preempt_enable();
> + return;
> + }
> + mutex_lock(&sp->mutex);
> + atomic_dec(&sp->hardluckref);
> + mutex_unlock(&sp->mutex);
You don't need the mutex to protect an atomic_dec.
> }
>
> /**
> @@ -158,6 +205,11 @@ void synchronize_srcu(struct srcu_struct
> idx = sp->completed;
> mutex_lock(&sp->mutex);
>
> + /* Initialize if not already initialized. */
> +
> + if (sp->per_cpu_ref == NULL)
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
What happens if a prior reader failed to allocate the memory but this call
succeeds? You need to check hardluckref before doing this. The same is
true in srcu_read_lock().
> +
> /*
> * Check to see if someone else did the work for us while we were
> * waiting to acquire the lock. We need -two- advances of
> @@ -173,65 +225,25 @@ void synchronize_srcu(struct srcu_struct
> return;
> }
>
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> -
> - /*
> - * The preceding synchronize_sched() ensures that any CPU that
> - * sees the new value of sp->completed will also see any preceding
> - * changes to data structures made by this CPU. This prevents
> - * some other CPU from reordering the accesses in its SRCU
> - * read-side critical section to precede the corresponding
> - * srcu_read_lock() -- ensuring that such references will in
> - * fact be protected.
> - *
> - * So it is now safe to do the flip.
> - */
> -
> + smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
> idx = sp->completed & 0x1;
> sp->completed++;
>
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> + synchronize_sched();
>
> /*
> * At this point, because of the preceding synchronize_sched(),
> * all srcu_read_lock() calls using the old counters have completed.
> * Their corresponding critical sections might well be still
> * executing, but the srcu_read_lock() primitives themselves
> - * will have finished executing.
> + * will have finished executing. The "old" rank of counters
> + * can therefore only decrease, never increase in value.
> */
>
> while (srcu_readers_active_idx(sp, idx))
> schedule_timeout_interruptible(1);
>
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> -
> - /*
> - * The preceding synchronize_sched() forces all srcu_read_unlock()
> - * primitives that were executing concurrently with the preceding
> - * for_each_possible_cpu() loop to have completed by this point.
> - * More importantly, it also forces the corresponding SRCU read-side
> - * critical sections to have also completed, and the corresponding
> - * references to SRCU-protected data items to be dropped.
> - *
> - * Note:
> - *
> - * Despite what you might think at first glance, the
> - * preceding synchronize_sched() -must- be within the
> - * critical section ended by the following mutex_unlock().
> - * Otherwise, a task taking the early exit can race
> - * with a srcu_read_unlock(), which might have executed
> - * just before the preceding srcu_readers_active() check,
> - * and whose CPU might have reordered the srcu_read_unlock()
> - * with the preceding critical section. In this case, there
> - * is nothing preventing the synchronize_sched() task that is
> - * taking the early exit from freeing a data structure that
> - * is still being referenced (out of order) by the task
> - * doing the srcu_read_unlock().
> - *
> - * Alternatively, the comparison with "2" on the early exit
> - * could be changed to "3", but this increases synchronize_srcu()
> - * latency for bulk loads. So the current code is preferred.
> - */
> + smp_mb(); /* must see critical section prior to srcu_read_unlock() */
>
> mutex_unlock(&sp->mutex);
> }
>
Alan Stern
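To make the review comments above concrete, the slow path of srcu_read_lock()
could be restructured roughly as follows (a sketch only, not a patch posted in
this thread; the fast path and the single hardluckref are unchanged):

        /* fast path (sap != NULL) as in the patch: increment, smp_mb(),
         * preempt_enable(), return idx.  Otherwise: */
        preempt_enable();
        if (mutex_trylock(&sp->mutex)) {
                if (sp->per_cpu_ref == NULL)
                        sp->per_cpu_ref = alloc_srcu_struct_percpu();
                mutex_unlock(&sp->mutex);
        }
        if (sp->per_cpu_ref != NULL)
                return srcu_read_lock(sp);      /* retry the fast path */
        atomic_inc(&sp->hardluckref);
        return -1;

Whichever of the trylock or the allocation succeeds, sp->per_cpu_ref is tested
in one place, and only the fast path keeps preemption disabled across its
counter update.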
On Sat, Nov 18, 2006 at 11:15:27AM -0500, Alan Stern wrote:
> There are a few things I don't like about this patch.
>
> On Fri, 17 Nov 2006, Paul E. McKenney wrote:
>
> > diff -urpNa -X dontdiff linux-2.6.19-rc5/kernel/srcu.c linux-2.6.19-rc5-dsrcu/kernel/srcu.c
> > --- linux-2.6.19-rc5/kernel/srcu.c 2006-11-17 13:54:17.000000000 -0800
> > +++ linux-2.6.19-rc5-dsrcu/kernel/srcu.c 2006-11-17 14:15:06.000000000 -0800
> > @@ -34,6 +34,18 @@
> > #include <linux/smp.h>
> > #include <linux/srcu.h>
> >
> > +/*
> > + * Initialize the per-CPU array, returning the pointer.
> > + */
> > +static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
> > +{
> > + struct srcu_struct_array *sap;
> > +
> > + sap = alloc_percpu(struct srcu_struct_array);
> > + smp_wmb();
> > + return (sap);
>
> Style: Don't use () here.
Touche!!!
> > +}
> > +
> > /**
> > * init_srcu_struct - initialize a sleep-RCU structure
> > * @sp: structure to initialize.
>
> > @@ -94,7 +112,8 @@ void cleanup_srcu_struct(struct srcu_str
> > WARN_ON(sum); /* Leakage unless caller handles error. */
> > if (sum != 0)
> > return;
> > - free_percpu(sp->per_cpu_ref);
> > + if (sp->per_cpu_ref != NULL)
> > + free_percpu(sp->per_cpu_ref);
>
> Now that Andrew has accepted the "allow free_percpu(NULL)" change, you can
> remove the test here.
OK. I thought that there was some sort of error-checking involved,
but if not, will fix.
> > sp->per_cpu_ref = NULL;
> > }
> >
> > @@ -105,18 +124,39 @@ void cleanup_srcu_struct(struct srcu_str
> > * Counts the new reader in the appropriate per-CPU element of the
> > * srcu_struct. Must be called from process context.
> > * Returns an index that must be passed to the matching srcu_read_unlock().
> > + * The index is -1 if the srcu_struct is not and cannot be initialized.
> > */
> > int srcu_read_lock(struct srcu_struct *sp)
> > {
> > int idx;
> > + struct srcu_struct_array *sap;
> >
> > preempt_disable();
> > idx = sp->completed & 0x1;
> > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > + sap = rcu_dereference(sp->per_cpu_ref);
> > + if (likely(sap != NULL)) {
> > + barrier(); /* ensure compiler looks -once- at sp->completed. */
>
> Put this barrier() back where the old one was (outside the "if").
Why? Outside this "if", I don't use "sap".
> > + per_cpu_ptr(rcu_dereference(sap),
>
> You don't need the rcu_dereference here, you already have it above.
Good point!!!
> > + smp_processor_id())->c[idx]++;
> > + smp_mb();
> > + preempt_enable();
> > + return idx;
> > + }
> > + if (mutex_trylock(&sp->mutex)) {
> > + preempt_enable();
>
> Move the preempt_enable() before the "if", then get rid of the
> preempt_enable() after the "if" block.
No can do. The preempt_enable() must follow the increment and
the memory barrier, otherwise the synchronize_sched() inside
synchronize_srcu() can't do its job.
> > + if (sp->per_cpu_ref == NULL)
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
>
> It would be cleaner to put the mutex_unlock() and closing '}' right here.
I can move the mutex_unlock() to this point, but I cannot otherwise
merge the two following pieces of code -- at least not without doing
an otherwise-gratuitous preempt_disable(). Which I suppose I could
do, but seems like it would be more confusing than would the
separate code. I will play with this a bit and see if I can eliminate
the duplication.
> > + if (sp->per_cpu_ref == NULL) {
> > + atomic_inc(&sp->hardluckref);
> > + mutex_unlock(&sp->mutex);
> > + return -1;
> > + }
> > + mutex_unlock(&sp->mutex);
> > + return srcu_read_lock(sp);
> > + }
OK, I suppose I could put the preempt_enable() in an "else" clause,
then maybe be able to merge things. Would that help?
> > preempt_enable();
> > - return idx;
> > + atomic_inc(&sp->hardluckref);
> > + return -1;
> > }
> >
> > /**
> > @@ -131,10 +171,17 @@ int srcu_read_lock(struct srcu_struct *s
> > */
> > void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > {
> > - preempt_disable();
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> > - preempt_enable();
> > + if (likely(idx != -1)) {
> > + preempt_disable();
> > + smp_mb();
> > + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> > + smp_processor_id())->c[idx]--;
> > + preempt_enable();
> > + return;
> > + }
> > + mutex_lock(&sp->mutex);
> > + atomic_dec(&sp->hardluckref);
> > + mutex_unlock(&sp->mutex);
>
> You don't need the mutex to protect an atomic_dec.
Good point!!!
> > }
> >
> > /**
> > @@ -158,6 +205,11 @@ void synchronize_srcu(struct srcu_struct
> > idx = sp->completed;
> > mutex_lock(&sp->mutex);
> >
> > + /* Initialize if not already initialized. */
> > +
> > + if (sp->per_cpu_ref == NULL)
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
>
> What happens if a prior reader failed to allocate the memory but this call
> succeeds? You need to check hardluckref before doing this. The same is
> true in srcu_read_lock().
All accounted for by the fact that hardluckref is unconditionally
added in by srcu_readers_active(). Right?
> > +
> > /*
> > * Check to see if someone else did the work for us while we were
> > * waiting to acquire the lock. We need -two- advances of
> > @@ -173,65 +225,25 @@ void synchronize_srcu(struct srcu_struct
> > return;
> > }
> >
> > - synchronize_sched(); /* Force memory barrier on all CPUs. */
> > -
> > - /*
> > - * The preceding synchronize_sched() ensures that any CPU that
> > - * sees the new value of sp->completed will also see any preceding
> > - * changes to data structures made by this CPU. This prevents
> > - * some other CPU from reordering the accesses in its SRCU
> > - * read-side critical section to precede the corresponding
> > - * srcu_read_lock() -- ensuring that such references will in
> > - * fact be protected.
> > - *
> > - * So it is now safe to do the flip.
> > - */
> > -
> > + smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
> > idx = sp->completed & 0x1;
> > sp->completed++;
> >
> > - synchronize_sched(); /* Force memory barrier on all CPUs. */
> > + synchronize_sched();
> >
> > /*
> > * At this point, because of the preceding synchronize_sched(),
> > * all srcu_read_lock() calls using the old counters have completed.
> > * Their corresponding critical sections might well be still
> > * executing, but the srcu_read_lock() primitives themselves
> > - * will have finished executing.
> > + * will have finished executing. The "old" rank of counters
> > + * can therefore only decrease, never increase in value.
> > */
> >
> > while (srcu_readers_active_idx(sp, idx))
> > schedule_timeout_interruptible(1);
> >
> > - synchronize_sched(); /* Force memory barrier on all CPUs. */
> > -
> > - /*
> > - * The preceding synchronize_sched() forces all srcu_read_unlock()
> > - * primitives that were executing concurrently with the preceding
> > - * for_each_possible_cpu() loop to have completed by this point.
> > - * More importantly, it also forces the corresponding SRCU read-side
> > - * critical sections to have also completed, and the corresponding
> > - * references to SRCU-protected data items to be dropped.
> > - *
> > - * Note:
> > - *
> > - * Despite what you might think at first glance, the
> > - * preceding synchronize_sched() -must- be within the
> > - * critical section ended by the following mutex_unlock().
> > - * Otherwise, a task taking the early exit can race
> > - * with a srcu_read_unlock(), which might have executed
> > - * just before the preceding srcu_readers_active() check,
> > - * and whose CPU might have reordered the srcu_read_unlock()
> > - * with the preceding critical section. In this case, there
> > - * is nothing preventing the synchronize_sched() task that is
> > - * taking the early exit from freeing a data structure that
> > - * is still being referenced (out of order) by the task
> > - * doing the srcu_read_unlock().
> > - *
> > - * Alternatively, the comparison with "2" on the early exit
> > - * could be changed to "3", but this increases synchronize_srcu()
> > - * latency for bulk loads. So the current code is preferred.
> > - */
> > + smp_mb(); /* must see critical section prior to srcu_read_unlock() */
> >
> > mutex_unlock(&sp->mutex);
> > }
Will spin a new patch...
Thanx, Paul
On 11/17, Paul E. McKenney wrote:
>
> Oleg, any thoughts about Jens's optimization? He would code something
> like:
>
>         if (srcu_readers_active(&my_srcu))
>                 synchronize_srcu();
>         else
>                 smp_mb();
Well, this is clearly racy, no? I am not sure, but maybe we can do

        smp_mb();
        if (srcu_readers_active(&my_srcu))
                synchronize_srcu();
in this case we also need to add 'smp_mb()' into srcu_read_lock() after
'atomic_inc(&sp->hardluckref)'.
> However, he is doing ordered I/O requests rather than protecting data
> structures.
Probably this makes a difference, but I don't understand this.
Oleg.
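For reference, the fastpath Jens is asking for would look something like the
following; synchronize_srcu_fast is only a name used in this thread,
srcu_readers_active() would have to be exported, and whether the bare smp_mb()
is sufficient is exactly what is being debated here:

        void synchronize_srcu_fast(struct srcu_struct *sp)
        {
                smp_mb();       /* pair with the smp_mb() in srcu_read_lock() */
                if (srcu_readers_active(sp))
                        synchronize_srcu(sp);
        }

As Oleg notes above, srcu_read_lock() would then also need an smp_mb() after
the atomic_inc(&sp->hardluckref).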
On 11/17, Paul E. McKenney wrote:
>
> int srcu_read_lock(struct srcu_struct *sp)
> {
> int idx;
> + struct srcu_struct_array *sap;
>
> preempt_disable();
> idx = sp->completed & 0x1;
> - barrier(); /* ensure compiler looks -once- at sp->completed. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> + sap = rcu_dereference(sp->per_cpu_ref);
> + if (likely(sap != NULL)) {
> + barrier(); /* ensure compiler looks -once- at sp->completed. */
> + per_cpu_ptr(rcu_dereference(sap),
> + smp_processor_id())->c[idx]++;
> + smp_mb();
> + preempt_enable();
> + return idx;
> + }
> + if (mutex_trylock(&sp->mutex)) {
> + preempt_enable();
> + if (sp->per_cpu_ref == NULL)
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> + if (sp->per_cpu_ref == NULL) {
> + atomic_inc(&sp->hardluckref);
> + mutex_unlock(&sp->mutex);
> + return -1;
> + }
> + mutex_unlock(&sp->mutex);
> + return srcu_read_lock(sp);
> + }
> preempt_enable();
> - return idx;
> + atomic_inc(&sp->hardluckref);
> + return -1;
> }
This is a real nitpick, but in theory there is a possibility of livelock.
Suppose that synchronize_srcu() takes sp->mutex and fails to allocate
sp->per_cpu_ref. If we have a flow of srcu_read_lock/srcu_read_unlock,
this loop in synchronize_srcu()
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
may spin unpredictably long, because we use the same sp->hardluckref for
accounting.
Oleg.
On 11/18, Paul E. McKenney wrote:
>
> On Sat, Nov 18, 2006 at 11:15:27AM -0500, Alan Stern wrote:
> > > + smp_processor_id())->c[idx]++;
> > > + smp_mb();
> > > + preempt_enable();
> > > + return idx;
> > > + }
> > > + if (mutex_trylock(&sp->mutex)) {
> > > + preempt_enable();
> >
> > Move the preempt_enable() before the "if", then get rid of the
> > preempt_enable() after the "if" block.
>
> No can do. The preempt_enable() must follow the increment and
> the memory barrier, otherwise the synchronize_sched() inside
> synchronize_srcu() can't do its job.
Given that srcu_read_lock() does smp_mb() after ->c[idx]++, what is the
purpose of the synchronize_sched() in synchronize_srcu()? It seems to me
it could be replaced by smp_mb().
synchronize_srcu:

        sp->completed++;

        mb();

        // if the reader did any memory access _after_
        // srcu_read_lock()->mb() we must see the changes.
        while (srcu_readers_active_idx(sp, idx))
                sleep();
No?
Oleg.
On Sat, 18 Nov 2006, Paul E. McKenney wrote:
> > > @@ -94,7 +112,8 @@ void cleanup_srcu_struct(struct srcu_str
> > > WARN_ON(sum); /* Leakage unless caller handles error. */
> > > if (sum != 0)
> > > return;
> > > - free_percpu(sp->per_cpu_ref);
> > > + if (sp->per_cpu_ref != NULL)
> > > + free_percpu(sp->per_cpu_ref);
> >
> > Now that Andrew has accepted the "allow free_percpu(NULL)" change, you can
> > remove the test here.
>
> OK. I thought that there was some sort of error-checking involved,
> but if not, will fix.
Just make sure that _you_ have the free_percpu(NULL) patch installed on
your machine before testing this -- otherwise you'll get a nice hard
crash!
> > > preempt_disable();
> > > idx = sp->completed & 0x1;
> > > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > > + sap = rcu_dereference(sp->per_cpu_ref);
> > > + if (likely(sap != NULL)) {
> > > + barrier(); /* ensure compiler looks -once- at sp->completed. */
> >
> > Put this barrier() back where the old one was (outside the "if").
>
> Why? Outside this "if", I don't use "sap".
Because it looks funny to see the comment here talking about sp->completed
when sp->completed hasn't been used for several lines. (Maybe it looks
less funny in the patched source than in the patch itself.) The best
place to prevent extra accesses of sp->completed is immediately after
the required access.
> > > + smp_processor_id())->c[idx]++;
> > > + smp_mb();
> > > + preempt_enable();
> > > + return idx;
> > > + }
> > > + if (mutex_trylock(&sp->mutex)) {
> > > + preempt_enable();
> >
> > Move the preempt_enable() before the "if", then get rid of the
> > preempt_enable() after the "if" block.
>
> No can do. The preempt_enable() must follow the increment and
> the memory barrier, otherwise the synchronize_sched() inside
> synchronize_srcu() can't do its job.
You misunderstood -- I was talking about the preempt_enable() in the last
line quoted above (not the one in the third line) and the "if
(mutex_trylock" (not the earlier "if (likely").
> > > + if (sp->per_cpu_ref == NULL)
> > > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> >
> > It would be cleaner to put the mutex_unlock() and closing '}' right here.
>
> I can move the mutex_unlock() to this point, but I cannot otherwise
> merge the two following pieces of code -- at least not without doing
> an otherwise-gratuitous preempt_disable(). Which I suppose I could
> do, but seems like it would be more confusing than would the
> separate code. I will play with this a bit and see if I can eliminate
> the duplication.
If you follow the advice above then you won't need to add a gratuitous
preempt_disable(). Try it and see how it comes out; the idea is that
you can use the same code for testing sp->per_cpu_ref regardless of
whether the mutex_trylock() or the call to alloc_srcu_struct_percpu()
succeeded.
> > What happens if a prior reader failed to allocate the memory but this call
> > succeeds? You need to check hardluckref before doing this. The same is
> > true in srcu_read_lock().
>
> All accounted for by the fact that hardluckref is unconditionally
> added in by srcu_readers_active(). Right?
Yes, you're right.
> Will spin a new patch...
Good -- it's getting pretty messy to look at this one!
By the way, I think the fastpath for synchronize_srcu() should be safe,
now that you have added the memory barriers into srcu_read_lock() and
srcu_read_unlock(). You might as well try putting it in.
Although now that I look at it again, you have forgotten to put smp_mb()
after the atomic_inc() call and before the atomic_dec(). In
srcu_read_unlock() you could just move the existing smp_mb() back before
the test of idx.
Alan Stern
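Pulling those last two points together (no mutex around the atomic_dec(), and
the smp_mb() hoisted back before the test of idx so it covers both paths),
srcu_read_unlock() might end up looking like this sketch, not a posted patch:

        void srcu_read_unlock(struct srcu_struct *sp, int idx)
        {
                smp_mb();       /* before either decrement */
                if (likely(idx != -1)) {
                        preempt_disable();
                        per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
                                    smp_processor_id())->c[idx]--;
                        preempt_enable();
                        return;
                }
                atomic_dec(&sp->hardluckref);
        }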
On 11/18, Alan Stern wrote:
>
> By the way, I think the fastpath for synchronize_srcu() should be safe,
> now that you have added the memory barriers into srcu_read_lock() and
> srcu_read_unlock(). You might as well try putting it in.
I still think the fastpath should do mb() unconditionally to be correct.
> Although now that I look at it again, you have forgotten to put smp_mb()
> after the atomic_inc() call and before the atomic_dec().
As I see it, currently we don't need this barrier because synchronize_srcu()
does synchronize_sched() before reading ->hardluckref.
But if we add the fastpath into synchronize_srcu() then yes, we need mb()
after atomic_inc().
Unless I totally confused :)
Oleg.
On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> On 11/18, Alan Stern wrote:
> >
> > By the way, I think the fastpath for synchronize_srcu() should be safe,
> > now that you have added the memory barriers into srcu_read_lock() and
> > srcu_read_unlock(). You might as well try putting it in.
>
> I still think the fastpath should do mb() unconditionally to be correct.
Yes, it definitely should.
> > Although now that I look at it again, you have forgotten to put smp_mb()
> > after the atomic_inc() call and before the atomic_dec().
>
> As I see it, currently we don't need this barrier because synchronize_srcu()
> does synchronize_sched() before reading ->hardluckref.
>
> But if we add the fastpath into synchronize_srcu() then yes, we need mb()
> after atomic_inc().
>
> Unless I totally confused :)
Put it this way: If the missing memory barrier in srcu_read_lock() after
the atomic_inc call isn't needed, then neither is the existing memory
barrier after the per-cpu counter gets incremented. Likewise, if a memory
barrier isn't needed before the atomic_dec in srcu_read_unlock(), then
neither is the memory barrier before the per-cpu counter gets decremented.
What you're ignoring is the synchronize_sched() call at the end of
synchronize_srcu(), which has been replaced with smp_mb(). The smp_mb()
needs to pair against a memory barrier on the read side, and that memory
barrier has to occur after srcu_read_lock() has incremented the counter
and before the read-side critical section begins. Otherwise code in the
critical section might leak out to before the counter is incremented.
Alan Stern
On 11/18, Alan Stern wrote:
>
> On Sun, 19 Nov 2006, Oleg Nesterov wrote:
>
> > On 11/18, Alan Stern wrote:
> > >
> > > By the way, I think the fastpath for synchronize_srcu() should be safe,
> > > now that you have added the memory barriers into srcu_read_lock() and
> > > srcu_read_unlock(). You might as well try putting it in.
> >
> > I still think the fastpath should do mb() unconditionally to be correct.
>
> Yes, it definitely should.
>
> > > Although now that I look at it again, you have forgotten to put smp_mb()
> > > after the atomic_inc() call and before the atomic_dec().
> >
> > As I see it, currently we don't need this barrier because synchronize_srcu()
> > does synchronize_sched() before reading ->hardluckref.
> >
> > But if we add the fastpath into synchronize_srcu() then yes, we need mb()
> > after atomic_inc().
> >
> > Unless I totally confused :)
>
> Put it this way: If the missing memory barrier in srcu_read_lock() after
> the atomic_inc call isn't needed, then neither is the existing memory
> barrier after the per-cpu counter gets incremented.
I disagree. There is another reason for mb() after the per-cpu counter gets
incremented. Without this barrier we can read the updated value of ->completed
(incremented by synchronize_srcu()), but then read stale data from the
RCU-protected memory.
> Likewise, if a memory
> barrier isn't needed before the atomic_dec in srcu_read_unlock(), then
> neither is the memory barrier before the per-cpu counter gets decremented.
Yes, you are right, I forgot about unlock(), it definitely needs mb().
> What you're ignoring is the synchronize_sched() call at the end of
> synchronize_srcu(), which has been replaced with smp_mb(). The smp_mb()
> needs to pair against a memory barrier on the read side, and that memory
> barrier has to occur after srcu_read_lock() has incremented the counter
> and before the read-side critical section begins. Otherwise code in the
> critical section might leak out to before the counter is incremented.
Still I am not sure you are right. It is ok (I think) if the code in the
critical section leaks out to before the atomic_inc(). In fact this doesn't
differ from the case when srcu_read_lock() happens before synchronize_srcu()
starts. In that case synchronize_srcu() will wait until the critical section
is closed via srcu_read_unlock(). Because of synchronize_sched() synchronize_srcu()
can't miss the fact that the critical section is in progress, so it doesn't
matter if it leaks _before_.
Oleg.
On 11/17, Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.
>
> It's still too heavy for me, by far the most calls I do to
> synchronize_srcu() doesn't have any reader locks pending. I'm still a
> big advocate of the fastpath srcu_readers_active() check. I can
> understand the reluctance to make it the default, but for my case it's
> "safe enough", so if we could either export srcu_readers_active() or
> export a synchronize_srcu_fast() (or something like that), then SRCU
> would be a good fit for barrier vs plug rework.
Just an idea. How about another variant of srcu which is more optimized
for writers?
struct xxx_struct {
        int completed;
        atomic_t ctr[2];
        struct mutex mutex;
        wait_queue_head_t wq;
};

void init_xxx_struct(struct xxx_struct *sp)
{
        sp->completed = 0;
        atomic_set(sp->ctr + 0, 1);
        atomic_set(sp->ctr + 1, 1);
        mutex_init(&sp->mutex);
        init_waitqueue_head(&sp->wq);
}

int xxx_read_lock(struct xxx_struct *sp)
{
        int idx;

        idx = sp->completed & 0x1;
        atomic_inc(sp->ctr + idx);
        smp_mb__after_atomic_inc();

        return idx;
}

void xxx_read_unlock(struct xxx_struct *sp, int idx)
{
        if (atomic_dec_and_test(sp->ctr + idx))
                wake_up(&sp->wq);
}

void synchronize_xxx(struct xxx_struct *sp)
{
        wait_queue_t wait;
        int idx;

        init_wait(&wait);
        mutex_lock(&sp->mutex);

        idx = sp->completed++ & 0x1;

        for (;;) {
                prepare_to_wait(&sp->wq, &wait, TASK_UNINTERRUPTIBLE);

                if (!atomic_add_unless(sp->ctr + idx, -1, 1))
                        break;

                schedule();
                atomic_inc(sp->ctr + idx);
        }
        finish_wait(&sp->wq, &wait);

        mutex_unlock(&sp->mutex);
}
Very simple. Note that synchronize_xxx() is O(1), doesn't poll, and could
be optimized further.
Oleg.
On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> > Put it this way: If the missing memory barrier in srcu_read_lock() after
> > the atomic_inc call isn't needed, then neither is the existing memory
> > barrier after the per-cpu counter gets incremented.
>
> I disagree. There is another reason for mb() after the per-cpu counter gets
> incremented. Without this barrier we can read the updated value of ->completed
> (incremented by synchronize_srcu()), but then read stale data from the
> RCU-protected memory.
You are right.
> > What you're ignoring is the synchronize_sched() call at the end of
> > synchronize_srcu(), which has been replaced with smp_mb(). The smp_mb()
> > needs to pair against a memory barrier on the read side, and that memory
> > barrier has to occur after srcu_read_lock() has incremented the counter
> > and before the read-side critical section begins. Otherwise code in the
> > critical section might leak out to before the counter is incremented.
>
> Still I am not sure you are right. It is ok (I think) if the code in the
> critical section leaks out to before the atomic_inc(). In fact this doesn't
> differ from the case when srcu_read_lock() happens before synchronize_srcu()
> starts. In that case synchronize_srcu() will wait until the critical section
> is closed via srcu_read_unlock(). Because of synchronize_sched() synchronize_srcu()
> can't miss the fact that the critical section is in progress, so it doesn't
> matter if it leaks _before_.
That's right. I was forgetting an important difference between the c[idx]
and the hardluckref paths: With c[idx] there has to be 2-way
communication (writer increments completed, then reader increments
c[idx]). With hardluckref there is only 1-way communication (reader
increments hardluckref). The synchronize_sched call takes care of the
reader->writer message; the memory barrier is needed for the
writer->reader message. Hence it isn't necessary after the atomic_inc.
But of course it _is_ needed for the fastpath to work. In fact, it might
not be good enough, depending on the architecture. Here's what the
fastpath ends up looking like (using c[idx] is essentially the same as
using hardluckref):
        WRITER                             READER
        ------                             ------
        dataptr = &(new data)              atomic_inc(&hardluckref)
        mb                                 mb
        while (hardluckref > 0) ;          access *dataptr
Notice the pattern: Each CPU does store-mb-load. It is known that on
some architectures each CPU can end up loading the old value (the value
from before the other CPU's store). This would mean the writer would see
hardluckref == 0 right away and the reader would see the old dataptr.
On architectures where the store-mb-load pattern works, the fastpath would
be safe. But on others it would not be.
Heh, Paul, this highlights the usefulness of our long discussion about
memory barrier semantics. :-)
Alan
On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> On 11/17, Jens Axboe wrote:
> >
> > It works for me, but the overhead is still large. Before it would take
> > 8-12 jiffies for a synchronize_srcu() to complete without there actually
> > being any reader locks active, now it takes 2-3 jiffies. So it's
> > definitely faster, and as suspected the loss of two of three
> > synchronize_sched() cut down the overhead to a third.
> >
> > It's still too heavy for me, by far the most calls I do to
> > synchronize_srcu() doesn't have any reader locks pending. I'm still a
> > big advocate of the fastpath srcu_readers_active() check. I can
> > understand the reluctance to make it the default, but for my case it's
> > "safe enough", so if we could either export srcu_readers_active() or
> > export a synchronize_srcu_fast() (or something like that), then SRCU
> > would be a good fit for barrier vs plug rework.
>
> Just an idea. How about another variant of srcu which is more optimized
> for writers?
>
> struct xxx_struct {
> int completed;
> atomic_t ctr[2];
> struct mutex mutex;
> wait_queue_head_t wq;
> };
>
> void init_xxx_struct(struct xxx_struct *sp)
> {
> sp->completed = 0;
> atomic_set(sp->ctr + 0, 1);
> atomic_set(sp->ctr + 1, 1);
> mutex_init(&sp->mutex);
> init_waitqueue_head(&sp->wq);
> }
>
> int xxx_read_lock(struct xxx_struct *sp)
> {
> int idx;
>
> idx = sp->completed & 0x1;
> atomic_inc(sp->ctr + idx);
> smp_mb__after_atomic_inc();
>
> return idx;
> }
>
> void xxx_read_unlock(struct xxx_struct *sp, int idx)
> {
> if (atomic_dec_and_test(sp->ctr + idx))
> wake_up(&sp->wq);
> }
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> wait_queue_t wait;
> int idx;
>
> init_wait(&wait);
> mutex_lock(&sp->mutex);
>
> idx = sp->completed++ & 0x1;
>
> for (;;) {
> prepare_to_wait(&sp->wq, &wait, TASK_UNINTERRUPTIBLE);
>
> if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> break;
>
> schedule();
> atomic_inc(sp->ctr + idx);
> }
> finish_wait(&sp->wq, &wait);
>
> mutex_unlock(&sp->mutex);
> }
>
> Very simple. Note that synchronize_xxx() is O(1), doesn't poll, and could
> be optimized further.
What happens if synchronize_xxx manages to execute inbetween
xxx_read_lock's
        idx = sp->completed & 0x1;
        atomic_inc(sp->ctr + idx);
statements? You see, there's no way around using synchronize_sched().
Alan Stern
On 11/19, Alan Stern wrote:
>
> On Sun, 19 Nov 2006, Oleg Nesterov wrote:
>
> > int xxx_read_lock(struct xxx_struct *sp)
> > {
> > int idx;
> >
> > idx = sp->completed & 0x1;
> > atomic_inc(sp->ctr + idx);
> > smp_mb__after_atomic_inc();
> >
> > return idx;
> > }
> >
> > void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > {
> > if (atomic_dec_and_test(sp->ctr + idx))
> > wake_up(&sp->wq);
> > }
> >
> > void synchronize_xxx(struct xxx_struct *sp)
> > {
> > wait_queue_t wait;
> > int idx;
> >
> > init_wait(&wait);
> > mutex_lock(&sp->mutex);
> >
> > idx = sp->completed++ & 0x1;
> >
> > for (;;) {
> > prepare_to_wait(&sp->wq, &wait, TASK_UNINTERRUPTIBLE);
> >
> > if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> > break;
> >
> > schedule();
> > atomic_inc(sp->ctr + idx);
> > }
> > finish_wait(&sp->wq, &wait);
> >
> > mutex_unlock(&sp->mutex);
> > }
> >
> > Very simple. Note that synchronize_xxx() is O(1), doesn't poll, and could
> > be optimized further.
>
> What happens if synchronize_xxx manages to execute inbetween
> xxx_read_lock's
>
> idx = sp->completed & 0x1;
> atomic_inc(sp->ctr + idx);
>
> statements?
Oops. I forgot about explicit mb() before sp->completed++ in synchronize_xxx().
So synchronize_xxx() should do
        smp_mb();
        idx = sp->completed++ & 0x1;

        for (;;) { ... }
> You see, there's no way around using synchronize_sched().
With this change I think we are safe.
If synchronize_xxx() increments ->completed in between, the caller of
xxx_read_lock() will see all memory ops (started before synchronize_xxx())
completed. It is ok that synchronize_xxx() returns immediately.
Thanks!
Oleg.
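Assembled in one place, the revised routine with that barrier would read as
follows (the rest is unchanged from the earlier posting; the next messages go
on to question whether even this is enough without a synchronize_sched()):

        void synchronize_xxx(struct xxx_struct *sp)
        {
                wait_queue_t wait;
                int idx;

                init_wait(&wait);
                mutex_lock(&sp->mutex);

                smp_mb();       /* the barrier added above */
                idx = sp->completed++ & 0x1;

                for (;;) {
                        prepare_to_wait(&sp->wq, &wait, TASK_UNINTERRUPTIBLE);

                        if (!atomic_add_unless(sp->ctr + idx, -1, 1))
                                break;

                        schedule();
                        atomic_inc(sp->ctr + idx);
                }
                finish_wait(&sp->wq, &wait);

                mutex_unlock(&sp->mutex);
        }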
On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> > What happens if synchronize_xxx manages to execute inbetween
> > xxx_read_lock's
> >
> > idx = sp->completed & 0x1;
> > atomic_inc(sp->ctr + idx);
> >
> > statements?
>
> Oops. I forgot about explicit mb() before sp->completed++ in synchronize_xxx().
>
> So synchronize_xxx() should do
>
> smp_mb();
> idx = sp->completed++ & 0x1;
>
> for (;;) { ... }
>
> > You see, there's no way around using synchronize_sched().
>
> With this change I think we are safe.
>
> If synchronize_xxx() increments ->completed in between, the caller of
> xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> completed. It is ok that synchronize_xxx() returns immediately.
Yes, the reader will see a consistent picture, but it will have
incremented the wrong element of sp->ctr[]. What happens if another
synchronize_xxx() occurs while the reader is still running?
Alan
On Sat, Nov 18, 2006 at 09:46:24PM +0300, Oleg Nesterov wrote:
> On 11/17, Paul E. McKenney wrote:
> >
> > Oleg, any thoughts about Jens's optimization? He would code something
> > like:
> >
> >         if (srcu_readers_active(&my_srcu))
> >                 synchronize_srcu();
> >         else
> >                 smp_mb();
>
> Well, this is clearly racy, no? I am not sure, but maybe we can do
>
>         smp_mb();
>         if (srcu_readers_active(&my_srcu))
>                 synchronize_srcu();
>
> in this case we also need to add 'smp_mb()' into srcu_read_lock() after
> 'atomic_inc(&sp->hardluckref)'.
>
> > However, he is doing ordered I/O requests rather than protecting data
> > structures.
>
> Probably this makes a difference, but I don't understand this.
OK, one hypothesis here...
The I/Os must be somehow explicitly ordered to qualify
for I/O-barrier separation. If two independent processes
issue I/Os concurrently with a third process doing an
I/O barrier, the I/O barrier is free to separate the
two concurrent I/Os or not, on its whim.
Jens, is the above correct? If so, what would the two processes
need to do in order to ensure that their I/O was considered to be
ordered with respect to the I/O barrier? Here are some possibilities:
1. I/O barriers apply only to preceding and following I/Os from
the process issuing the I/O barrier.
2. As for #1 above, but restricted to task rather than process.
3. I/O system calls that have completed are ordered by the
barrier to precede I/O system calls that have not yet
started, but I/O system calls still in flight could legally
land on either side of the concurrently executing I/O
barrier.
4. Something else entirely?
Given some restriction like one of the above, it is entirely possible
that we don't even need the memory barrier...
Thanx, Paul
On 11/19, Alan Stern wrote:
> On Sun, 19 Nov 2006, Oleg Nesterov wrote:
>
> > > What happens if synchronize_xxx manages to execute inbetween
> > > xxx_read_lock's
> > >
> > > idx = sp->completed & 0x1;
> > > atomic_inc(sp->ctr + idx);
> > >
> > > statements?
> >
> > Oops. I forgot about explicit mb() before sp->completed++ in synchronize_xxx().
> >
> > So synchronize_xxx() should do
> >
> > smp_mb();
> > idx = sp->completed++ & 0x1;
> >
> > for (;;) { ... }
> >
> > > You see, there's no way around using synchronize_sched().
> >
> > With this change I think we are safe.
> >
> > If synchronize_xxx() increments ->completed in between, the caller of
> > xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> > completed. It is ok that synchronize_xxx() returns immediately.
>
> Yes, the reader will see a consistent picture, but it will have
> incremented the wrong element of sp->ctr[]. What happens if another
> synchronize_xxx() occurs while the reader is still running?
It will wait for xxx_read_unlock() on reader's side. And for this reason
this idx in fact is not exactly wrong :)
Oleg.
On Sun, Nov 19, 2006 at 11:55:16PM +0300, Oleg Nesterov wrote:
> On 11/19, Alan Stern wrote:
> >
> > On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> >
> > > int xxx_read_lock(struct xxx_struct *sp)
> > > {
> > > int idx;
> > >
> > > idx = sp->completed & 0x1;
> > > atomic_inc(sp->ctr + idx);
> > > smp_mb__after_atomic_inc();
> > >
> > > return idx;
> > > }
> > >
> > > void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > > {
> > > if (atomic_dec_and_test(sp->ctr + idx))
> > > wake_up(&sp->wq);
> > > }
> > >
> > > void synchronize_xxx(struct xxx_struct *sp)
> > > {
> > > wait_queue_t wait;
> > > int idx;
> > >
> > > init_wait(&wait);
> > > mutex_lock(&sp->mutex);
> > >
> > > idx = sp->completed++ & 0x1;
> > >
> > > for (;;) {
> > > prepare_to_wait(&sp->wq, &wait, TASK_UNINTERRUPTIBLE);
> > >
> > > if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> > > break;
> > >
> > > schedule();
> > > atomic_inc(sp->ctr + idx);
> > > }
> > > finish_wait(&sp->wq, &wait);
> > >
> > > mutex_unlock(&sp->mutex);
> > > }
> > >
> > > Very simple. Note that synchronize_xxx() is O(1), doesn't poll, and could
> > > be optimized further.
> >
> > What happens if synchronize_xxx manages to execute inbetween
> > xxx_read_lock's
> >
> > idx = sp->completed & 0x1;
> > atomic_inc(sp->ctr + idx);
> >
> > statements?
>
> Oops. I forgot about explicit mb() before sp->completed++ in synchronize_xxx().
>
> So synchronize_xxx() should do
>
> smp_mb();
> idx = sp->completed++ & 0x1;
>
> for (;;) { ... }
>
> > You see, there's no way around using synchronize_sched().
>
> With this change I think we are safe.
>
> If synchronize_xxx() increments ->completed in between, the caller of
> xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> completed. It is ok that synchronize_xxx() returns immediately.
Let me take Alan's example one step further:
o CPU 0 starts executing xxx_read_lock(), but is interrupted
(or whatever) just before the atomic_inc().
o CPU 1 executes synchronize_xxx() to completion, which it
can because CPU 0 has not yet incremented the counter.
o CPU 0 returns from interrupt and completes xxx_read_lock(),
but has incremented the wrong counter.
o CPU 0 continues into its critical section, picking up a
pointer to an xxx-protected data structure (or, in Jens's
case starting an xxx-protected I/O).
o CPU 1 executes another synchronize_xxx(). This completes
immediately because CPU 0 incremented the wrong counter.
o CPU 1 continues, either freeing a data structure while
CPU 0 is still referencing it, or, in Jens's case, completing
an I/O barrier while there is still outstanding I/O.
I agree with Alan -- unless I am missing something, we need a
synchronize_sched() in synchronize_xxx(). One thing missing in
the I/O-barrier case might be the possible restrictions I call
out in my earlier email.
Thanx, Paul
On Sat, Nov 18, 2006 at 10:02:17PM +0300, Oleg Nesterov wrote:
> On 11/17, Paul E. McKenney wrote:
> >
> > int srcu_read_lock(struct srcu_struct *sp)
> > {
> > int idx;
> > + struct srcu_struct_array *sap;
> >
> > preempt_disable();
> > idx = sp->completed & 0x1;
> > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > + sap = rcu_dereference(sp->per_cpu_ref);
> > + if (likely(sap != NULL)) {
> > + barrier(); /* ensure compiler looks -once- at sp->completed. */
> > + per_cpu_ptr(rcu_dereference(sap),
> > + smp_processor_id())->c[idx]++;
> > + smp_mb();
> > + preempt_enable();
> > + return idx;
> > + }
> > + if (mutex_trylock(&sp->mutex)) {
> > + preempt_enable();
> > + if (sp->per_cpu_ref == NULL)
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > + if (sp->per_cpu_ref == NULL) {
> > + atomic_inc(&sp->hardluckref);
> > + mutex_unlock(&sp->mutex);
> > + return -1;
> > + }
> > + mutex_unlock(&sp->mutex);
> > + return srcu_read_lock(sp);
> > + }
> > preempt_enable();
> > - return idx;
> > + atomic_inc(&sp->hardluckref);
> > + return -1;
> > }
>
> This is a real nitpick, but in theory there is a possibility of livelock.
>
> Suppose that synchronize_srcu() takes sp->mutex and fails to allocate
> sp->per_cpu_ref. If we have a flow of srcu_read_lock/srcu_read_unlock,
> this loop in synchronize_srcu()
>
> while (srcu_readers_active_idx(sp, idx))
> schedule_timeout_interruptible(1);
>
> may spin unpredictably long, because we use the same sp->hardluckref for
> accounting.
Excellent point -- hardluckref also needs to be a two-element array.
Thanx, Paul
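One way the two-element fallback might look (an assumption for illustration,
not a patch from this thread; in particular the encoding of the rank into the
negative return value is invented here):

        /* in struct srcu_struct */
        atomic_t hardluckref[2];

        /* srcu_read_lock() fallback: charge the current rank and encode it */
        idx = sp->completed & 0x1;
        atomic_inc(&sp->hardluckref[idx]);
        return -1 - idx;

        /* srcu_read_unlock() fallback */
        if (idx < 0) {
                atomic_dec(&sp->hardluckref[-1 - idx]);
                return;
        }

        /* srcu_readers_active_idx() */
        return sum + atomic_read(&sp->hardluckref[idx]);

With the fallback counter split per rank, the wait loop in synchronize_srcu()
only sees hard-luck readers that started before the counter flip, so a steady
stream of new readers can no longer keep it spinning.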
On Sat, Nov 18, 2006 at 10:34:26PM +0300, Oleg Nesterov wrote:
> On 11/18, Paul E. McKenney wrote:
> >
> > On Sat, Nov 18, 2006 at 11:15:27AM -0500, Alan Stern wrote:
> > > > + smp_processor_id())->c[idx]++;
> > > > + smp_mb();
> > > > + preempt_enable();
> > > > + return idx;
> > > > + }
> > > > + if (mutex_trylock(&sp->mutex)) {
> > > > + preempt_enable();
> > >
> > > Move the preempt_enable() before the "if", then get rid of the
> > > preempt_enable() after the "if" block.
> >
> > No can do. The preempt_enable() must follow the increment and
> > the memory barrier, otherwise the synchronize_sched() inside
> > synchronize_srcu() can't do its job.
>
> Given that srcu_read_lock() does smp_mb() after ->c[idx]++, what is the
> purpose of the synchronize_sched() in synchronize_srcu()? It seems to me
> it could be replaced by smp_mb().
>
> synchronize_srcu:
>
> sp->completed++;
>
> mb();
>
> // if the reader did any memory access _after_
> // srcu_read_lock()->mb() we must see the changes.
> while (srcu_readers_active_idx(sp, idx))
> sleep();
>
> No?
I believe that this could run afoul of the example I sent out earlier
(based on Alan's example). In my mind, the key difference between
this and Jens's suggestion is that in Jens's case, we check for -all-
the counters being zero, not just the old ones. (But I still don't
trust Jen's optimization -- I just have not yet come up with an example
showing breakage, possibly because there isn't one, but...)
Thanx, Paul
On Sat, Nov 18, 2006 at 04:00:35PM -0500, Alan Stern wrote:
> On Sat, 18 Nov 2006, Paul E. McKenney wrote:
>
> > > > @@ -94,7 +112,8 @@ void cleanup_srcu_struct(struct srcu_str
> > > > WARN_ON(sum); /* Leakage unless caller handles error. */
> > > > if (sum != 0)
> > > > return;
> > > > - free_percpu(sp->per_cpu_ref);
> > > > + if (sp->per_cpu_ref != NULL)
> > > > + free_percpu(sp->per_cpu_ref);
> > >
> > > Now that Andrew has accepted the "allow free_percpu(NULL)" change, you can
> > > remove the test here.
> >
> > OK. I thought that there was some sort of error-checking involved,
> > but if not, will fix.
>
> Just make sure that _you_ have the free_percpu(NULL) patch installed on
> your machine before testing this -- otherwise you'll get a nice hard
> crash!
'Nuff said -- will leave this fixup till later. ;-)
> > > > preempt_disable();
> > > > idx = sp->completed & 0x1;
> > > > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > > > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > > > + sap = rcu_dereference(sp->per_cpu_ref);
> > > > + if (likely(sap != NULL)) {
> > > > + barrier(); /* ensure compiler looks -once- at sp->completed. */
> > >
> > > Put this barrier() back where the old one was (outside the "if").
> >
> > Why? Outside this "if", I don't use "sap".
>
> Because it looks funny to see the comment here talking about sp->completed
> when sp->completed hasn't been used for several lines. (Maybe it looks
> less funny in the patched source than in the patch itself.) The best
> place to prevent extra accesses of sp->completed is immediately after
> the required access.
Good point -- and with the addition of the second element of hardluckref,
it has to be hoisted out of the "if" in any case.
> > > > + smp_processor_id())->c[idx]++;
> > > > + smp_mb();
> > > > + preempt_enable();
> > > > + return idx;
> > > > + }
> > > > + if (mutex_trylock(&sp->mutex)) {
> > > > + preempt_enable();
> > >
> > > Move the preempt_enable() before the "if", then get rid of the
> > > preempt_enable() after the "if" block.
> >
> > No can do. The preempt_enable() must follow the increment and
> > the memory barrier, otherwise the synchronize_sched() inside
> > synchronize_srcu() can't do its job.
>
> You misunderstood -- I was talking about the preempt_enable() in the last
> line quoted above (not the one in the third line) and the "if
> (mutex_trylock" (not the earlier "if (likely").
OK, I see your point -- but this has changed thoroughly with the
addition of the second element of hardluckref.
> > > > + if (sp->per_cpu_ref == NULL)
> > > > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > >
> > > It would be cleaner to put the mutex_unlock() and closing '}' right here.
> >
> > I can move the mutex_unlock() to this point, but I cannot otherwise
> > merge the two following pieces of code -- at least not without doing
> > an otherwise-gratuitous preempt_disable(). Which I suppose I could
> > do, but seems like it would be more confusing than would the
> > separate code. I will play with this a bit and see if I can eliminate
> > the duplication.
>
> If you follow the advice above then you won't need to add a gratuitous
> preempt_disable(). Try it and see how it comes out; the idea is that
> you can use the same code for testing sp->per_cpu_ref regardless of
> whether the mutex_trylock() or the call to alloc_srcu_struct_percpu()
> succeeded.
Understood, finally -- but the two-element hardluckref now requires
greater preempt_disable() coverage.
> > > What happens if a prior reader failed to allocate the memory but this call
> > > succeeds? You need to check hardluckref before doing this. The same is
> > > true in srcu_read_lock().
> >
> > All accounted for by the fact that hardluckref is unconditionally
> > added in by srcu_readers_active(). Right?
>
> Yes, you're right.
>
> > Will spin a new patch...
>
> Good -- it's getting pretty messy to look at this one!
>
> By the way, I think the fastpath for synchronize_srcu() should be safe,
> now that you have added the memory barriers into srcu_read_lock() and
> srcu_read_unlock(). You might as well try putting it in.
>
> Although now that I look at it again, you have forgotten to put smp_mb()
> after the atomic_inc() call and before the atomic_dec(). In
> srcu_read_unlock() you could just move the existing smp_mb() back before
> the test of idx.
Good catch again -- added smp_mb__before_atomic_dec() and
smp_mb__after_atomic_inc(). The reason for avoiding moving the smp_mb()
is that atomic_dec() implies a memory barrier on some architectures,
such as x86. In these cases, smp_mb__before_atomic_dec() is a no-op.
Signed-off-by: Paul E. McKenney <[email protected]> (was @us.ibm.com)
---
include/linux/srcu.h | 8 ---
kernel/srcu.c | 126 +++++++++++++++++++++++++++------------------------
2 files changed, 71 insertions(+), 63 deletions(-)
diff -urpNa -X dontdiff linux-2.6.19-rc5/include/linux/srcu.h linux-2.6.19-rc5-dsrcu/include/linux/srcu.h
--- linux-2.6.19-rc5/include/linux/srcu.h 2006-11-17 15:44:40.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/include/linux/srcu.h 2006-11-19 13:33:35.000000000 -0800
@@ -35,19 +35,15 @@ struct srcu_struct {
int completed;
struct srcu_struct_array *per_cpu_ref;
struct mutex mutex;
+ atomic_t hardluckref[2];
};
-#ifndef CONFIG_PREEMPT
-#define srcu_barrier() barrier()
-#else /* #ifndef CONFIG_PREEMPT */
-#define srcu_barrier()
-#endif /* #else #ifndef CONFIG_PREEMPT */
-
int init_srcu_struct(struct srcu_struct *sp);
void cleanup_srcu_struct(struct srcu_struct *sp);
int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);
+int srcu_readers_active(struct srcu_struct *sp);
#endif
diff -urpNa -X dontdiff linux-2.6.19-rc5/kernel/srcu.c linux-2.6.19-rc5-dsrcu/kernel/srcu.c
--- linux-2.6.19-rc5/kernel/srcu.c 2006-11-17 15:44:40.000000000 -0800
+++ linux-2.6.19-rc5-dsrcu/kernel/srcu.c 2006-11-19 13:40:33.000000000 -0800
@@ -34,6 +34,18 @@
#include <linux/smp.h>
#include <linux/srcu.h>
+/*
+ * Initialize the per-CPU array, returning the pointer.
+ */
+static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
+{
+ struct srcu_struct_array *sap;
+
+ sap = alloc_percpu(struct srcu_struct_array);
+ smp_wmb();
+ return (sap);
+}
+
/**
* init_srcu_struct - initialize a sleep-RCU structure
* @sp: structure to initialize.
@@ -46,7 +58,9 @@ int init_srcu_struct(struct srcu_struct
{
sp->completed = 0;
mutex_init(&sp->mutex);
- sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ atomic_set(&sp->hardluckref[0], 0);
+ atomic_set(&sp->hardluckref[1], 0);
return (sp->per_cpu_ref ? 0 : -ENOMEM);
}
@@ -58,12 +72,15 @@ int init_srcu_struct(struct srcu_struct
static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
int cpu;
+ struct srcu_struct_array *sap;
int sum;
sum = 0;
- for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
- return sum;
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL))
+ for_each_possible_cpu(cpu)
+ sum += per_cpu_ptr(sap, cpu)->c[idx];
+ return sum + atomic_read(&sp->hardluckref[idx]);
}
/**
@@ -94,7 +111,8 @@ void cleanup_srcu_struct(struct srcu_str
WARN_ON(sum); /* Leakage unless caller handles error. */
if (sum != 0)
return;
- free_percpu(sp->per_cpu_ref);
+ if (sp->per_cpu_ref != NULL)
+ free_percpu(sp->per_cpu_ref);
sp->per_cpu_ref = NULL;
}
@@ -105,18 +123,41 @@ void cleanup_srcu_struct(struct srcu_str
* Counts the new reader in the appropriate per-CPU element of the
* srcu_struct. Must be called from process context.
* Returns an index that must be passed to the matching srcu_read_unlock().
+ * The index is mapped to negative numbers if the srcu_struct is not and
+ * cannot be initialized.
*/
int srcu_read_lock(struct srcu_struct *sp)
{
int idx;
+ struct srcu_struct_array *sap;
preempt_disable();
idx = sp->completed & 0x1;
barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL)) {
+ per_cpu_ptr(sap, smp_processor_id())->c[idx]++;
+ smp_mb();
+ preempt_enable();
+ return idx;
+ }
+ if (mutex_trylock(&sp->mutex)) {
+ preempt_enable();
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ if (sp->per_cpu_ref == NULL) {
+ mutex_unlock(&sp->mutex);
+ preempt_disable();
+ idx = sp->completed & 0x1;
+ } else {
+ mutex_unlock(&sp->mutex);
+ return srcu_read_lock(sp);
+ }
+ }
+ atomic_inc(&sp->hardluckref[idx]);
+ smp_mb__after_atomic_inc();
preempt_enable();
- return idx;
+ return -1 - idx;
}
/**
@@ -131,10 +172,16 @@ int srcu_read_lock(struct srcu_struct *s
*/
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
- preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
- preempt_enable();
+ if (likely(idx <= 0)) {
+ preempt_disable();
+ smp_mb();
+ per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
+ smp_processor_id())->c[idx]--;
+ preempt_enable();
+ return;
+ }
+ smp_mb__before_atomic_dec();
+ atomic_dec(&sp->hardluckref[-1 - idx]);
}
/**
@@ -158,6 +205,11 @@ void synchronize_srcu(struct srcu_struct
idx = sp->completed;
mutex_lock(&sp->mutex);
+ /* Initialize if not already initialized. */
+
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+
/*
* Check to see if someone else did the work for us while we were
* waiting to acquire the lock. We need -two- advances of
@@ -173,65 +225,25 @@ void synchronize_srcu(struct srcu_struct
return;
}
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() ensures that any CPU that
- * sees the new value of sp->completed will also see any preceding
- * changes to data structures made by this CPU. This prevents
- * some other CPU from reordering the accesses in its SRCU
- * read-side critical section to precede the corresponding
- * srcu_read_lock() -- ensuring that such references will in
- * fact be protected.
- *
- * So it is now safe to do the flip.
- */
-
+ smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
idx = sp->completed & 0x1;
sp->completed++;
- synchronize_sched(); /* Force memory barrier on all CPUs. */
+ synchronize_sched();
/*
* At this point, because of the preceding synchronize_sched(),
* all srcu_read_lock() calls using the old counters have completed.
* Their corresponding critical sections might well be still
* executing, but the srcu_read_lock() primitives themselves
- * will have finished executing.
+ * will have finished executing. The "old" rank of counters
+ * can therefore only decrease, never increase in value.
*/
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() forces all srcu_read_unlock()
- * primitives that were executing concurrently with the preceding
- * for_each_possible_cpu() loop to have completed by this point.
- * More importantly, it also forces the corresponding SRCU read-side
- * critical sections to have also completed, and the corresponding
- * references to SRCU-protected data items to be dropped.
- *
- * Note:
- *
- * Despite what you might think at first glance, the
- * preceding synchronize_sched() -must- be within the
- * critical section ended by the following mutex_unlock().
- * Otherwise, a task taking the early exit can race
- * with a srcu_read_unlock(), which might have executed
- * just before the preceding srcu_readers_active() check,
- * and whose CPU might have reordered the srcu_read_unlock()
- * with the preceding critical section. In this case, there
- * is nothing preventing the synchronize_sched() task that is
- * taking the early exit from freeing a data structure that
- * is still being referenced (out of order) by the task
- * doing the srcu_read_unlock().
- *
- * Alternatively, the comparison with "2" on the early exit
- * could be changed to "3", but this increases synchronize_srcu()
- * latency for bulk loads. So the current code is preferred.
- */
+ smp_mb(); /* must see critical section prior to srcu_read_unlock() */
mutex_unlock(&sp->mutex);
}
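As a usage note (a sketch, not part of the patch; my_srcu and my_data_ptr are made-up names): callers are unaffected by the negative-index encoding as long as they hand the value returned by srcu_read_lock() back to srcu_read_unlock() unchanged.
	int idx;
	struct my_data *p;

	idx = srcu_read_lock(&my_srcu);		/* may be negative in hardluck mode */
	p = rcu_dereference(my_data_ptr);
	/* ... SRCU read-side critical section ... */
	srcu_read_unlock(&my_srcu, idx);	/* pass idx back unchanged */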
On 11/19, Paul E. McKenney wrote:
>
> On Sun, Nov 19, 2006 at 11:55:16PM +0300, Oleg Nesterov wrote:
> > So synchronize_xxx() should do
> >
> > smp_mb();
> > idx = sp->completed++ & 0x1;
> >
> > for (;;) { ... }
> >
> > > You see, there's no way around using synchronize_sched().
> >
> > With this change I think we are safe.
> >
> > If synchronize_xxx() increments ->completed in between, the caller of
> > xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> > completed. It is ok that synchronize_xxx() returns immediately.
>
> Let me take Alan's example one step further:
>
> o CPU 0 starts executing xxx_read_lock(), but is interrupted
> (or whatever) just before the atomic_inc().
>
> o CPU 1 executes synchronize_xxx() to completion, which it
> can because CPU 0 has not yet incremented the counter.
Let's suppose for simplicity that CPU 1 does "classical"
old = global_ptr;
global_ptr = new_value();
before synchronize_xxx(), and ->completed == 0.
Now, synchronize_xxx() sets ->completed == 1. Because of mb()
'global_ptr = new_value()' is completed.
> o CPU 0 returns from interrupt and completes xxx_read_lock(),
> but has incremented the wrong counter.
->completed == 1, it is not so wrong, see below
> o CPU 0 continues into its critical section, picking up a
> pointer to an xxx-protected data structure (or, in Jens's
> case starting an xxx-protected I/O).
it sees the new value in global_ptr, we are safe.
> o CPU 1 executes another synchronize_xxx(). This completes
> immediately because CPU 1 has the wrong counter incremented.
No, it will notice .ctr[1] != 1 and wait.
> o CPU 1 continues, either freeing a data structure while
> CPU 0 is still referencing it, or, in Jens's case, completing
> an I/O barrier while there is still outstanding I/O.
CPU 1 continues only when CPU 0 does read_unlock(/*completed*/ 1),
we are safe.
Safe?
Oleg.
On Mon, Nov 20, 2006 at 12:17:31AM +0300, Oleg Nesterov wrote:
> On 11/19, Alan Stern wrote:
> > On Sun, 19 Nov 2006, Oleg Nesterov wrote:
> >
> > > > What happens if synchronize_xxx manages to execute inbetween
> > > > xxx_read_lock's
> > > >
> > > > idx = sp->completed & 0x1;
> > > > atomic_inc(sp->ctr + idx);
> > > >
> > > > statements?
> > >
> > > Oops. I forgot about explicit mb() before sp->completed++ in synchronize_xxx().
> > >
> > > So synchronize_xxx() should do
> > >
> > > smp_mb();
> > > idx = sp->completed++ & 0x1;
> > >
> > > for (;;) { ... }
> > >
> > > > You see, there's no way around using synchronize_sched().
> > >
> > > With this change I think we are safe.
> > >
> > > If synchronize_xxx() increments ->completed in between, the caller of
> > > xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> > > completed. It is ok that synchronize_xxx() returns immediately.
> >
> > Yes, the reader will see a consistent picture, but it will have
> > incremented the wrong element of sp->ctr[]. What happens if another
> > synchronize_xxx() occurs while the reader is still running?
>
> It will wait for xxx_read_unlock() on reader's side. And for this reason
> this idx in fact is not exactly wrong :)
I am not seeing this.
Let's assume sp->completed starts out zero.
o CPU 0 starts executing xxx_read_lock(), but is interrupted
(or whatever) just before the atomic_inc(). Upon return,
it will increment sp->ctr[0].
o CPU 1 executes synchronize_xxx() to completion, which it
can because CPU 0 has not yet incremented the counter.
It waited on sp->ctr[0], and incremented sp->completed to 1.
o CPU 0 returns from interrupt and completes xxx_read_lock(),
but has incremented sp->ctr[0].
o CPU 0 continues into its critical section, picking up a
pointer to an xxx-protected data structure (or, in Jens's
case starting an xxx-protected I/O).
o CPU 1 executes another synchronize_xxx(). This completes
immediately because it is waiting for sp->ctr[1] to go
to zero, but CPU 0 incremented sp->ctr[0]. (Right?)
o CPU 1 continues, either freeing a data structure while
CPU 0 is still referencing it, or, in Jens's case, completing
an I/O barrier while there is still outstanding I/O.
Or am I missing something?
Thanx, Paul
On Mon, Nov 20, 2006 at 12:50:53AM +0300, Oleg Nesterov wrote:
> On 11/19, Paul E. McKenney wrote:
> >
> > On Sun, Nov 19, 2006 at 11:55:16PM +0300, Oleg Nesterov wrote:
> > > So synchronize_xxx() should do
> > >
> > > smp_mb();
> > > idx = sp->completed++ & 0x1;
> > >
> > > for (;;) { ... }
> > >
> > > > You see, there's no way around using synchronize_sched().
> > >
> > > With this change I think we are safe.
> > >
> > > If synchronize_xxx() increments ->completed in between, the caller of
> > > xxx_read_lock() will see all memory ops (started before synchronize_xxx())
> > > completed. It is ok that synchronize_xxx() returns immediately.
> >
> > Let me take Alan's example one step further:
> >
> > o CPU 0 starts executing xxx_read_lock(), but is interrupted
> > (or whatever) just before the atomic_inc().
> >
> > o CPU 1 executes synchronize_xxx() to completion, which it
> > can because CPU 0 has not yet incremented the counter.
>
> Let's suppose for simplicity that CPU 1 does "classical"
>
> old = global_ptr;
> global_ptr = new_value();
>
> before synchronize_xxx(), and ->completed == 0.
OK. But there are two of these in this example -- one such update
per execution of synchronize_xxx(), right?
> Now, synchronize_xxx() sets ->completed == 1. Because of mb()
> 'global_ptr = new_value()' is completed.
>
> > o CPU 0 returns from interrupt and completes xxx_read_lock(),
> > but has incremented the wrong counter.
>
> ->completed == 1, it is not so wrong, see below
But CPU 0 kept idx==0 in xxx_read_lock() in the earlier steps, right?
Therefore, CPU 0 increments sp->ctr[0] rather than sp->ctr[1].
> > o CPU 0 continues into its critical section, picking up a
> > pointer to an xxx-protected data structure (or, in Jens's
> > case starting an xxx-protected I/O).
>
> it sees the new value in global_ptr, we are safe.
It -does- see the new value corresponding to the -first- call to
synchronize_xxx(), but gets in trouble due to the change just
before the -second- call to synchronize_xxx().
> > o CPU 1 executes another synchronize_xxx(). This completes
> > immediately because CPU 1 has the wrong counter incremented.
>
> No, it will notice .ctr[1] != 1 and wait.
Unless I am missing something, we have incremented .ctr[0] rather
than .ctr[1], so I do not believe that it will wait.
> > o CPU 1 continues, either freeing a data structure while
> > CPU 0 is still referencing it, or, in Jens's case, completing
> > an I/O barrier while there is still outstanding I/O.
>
> CPU 1 continues only when CPU 0 does read_unlock(/*completed*/ 1),
> we are safe.
>
> Safe?
I have my doubts...
Thanx, Paul
On 11/19, Paul E. McKenney wrote:
>
> On Mon, Nov 20, 2006 at 12:17:31AM +0300, Oleg Nesterov wrote:
> >
> > It will wait for xxx_read_unlock() on reader's side. And for this reason
> > this idx in fact is not exactly wrong :)
>
> I am not seeing this.
>
> Let's assume sp->completed starts out zero.
>
> o CPU 0 starts executing xxx_read_lock(), but is interrupted
> (or whatever) just before the atomic_inc(). Upon return,
> it will increment sp->ctr[0].
Right.
> o CPU 1 executes synchronize_xxx() to completion, which it
> can because CPU 0 has not yet incremented the counter.
> It waited on sp->ctr[0], and incremented sp->completed to 1.
>
> o CPU 0 returns from interrupt and completes xxx_read_lock(),
> but has incremented sp->ctr[0].
>
> o CPU 0 continues into its critical section, picking up a
> pointer to an xxx-protected data structure (or, in Jens's
> case starting an xxx-protected I/O).
>
> o CPU 1 executes another synchronize_xxx(). This completes
> immediately because it is waiting for sp->ctr[1] to go
> to zero, but CPU 0 incremented sp->ctr[0]. (Right?)
Right!
> o CPU 1 continues, either freeing a data structure while
> CPU 0 is still referencing it, or, in Jens's case, completing
> an I/O barrier while there is still outstanding I/O.
>
> Or am I missing something?
No, it is me.
Alan, Paul, thanks a lot for your patience!
Oleg.
On Mon, Nov 20, 2006 at 01:28:47AM +0300, Oleg Nesterov wrote:
> On 11/19, Paul E. McKenney wrote:
> >
> > On Mon, Nov 20, 2006 at 12:17:31AM +0300, Oleg Nesterov wrote:
> > >
> > > It will wait for xxx_read_unlock() on reader's side. And for this reason
> > > this idx in fact is not exactly wrong :)
> >
> > I am not seeing this.
> >
> > Let's assume sp->completed starts out zero.
> >
> > o CPU 0 starts executing xxx_read_lock(), but is interrupted
> > (or whatever) just before the atomic_inc(). Upon return,
> > it will increment sp->ctr[0].
>
> Right.
>
> > o CPU 1 executes synchronize_xxx() to completion, which it
> > can because CPU 0 has not yet incremented the counter.
> > It waited on sp->ctr[0], and incremented sp->completed to 1.
> >
> > o CPU 0 returns from interrupt and completes xxx_read_lock(),
> > but has incremented sp->ctr[0].
> >
> > o CPU 0 continues into its critical section, picking up a
> > pointer to an xxx-protected data structure (or, in Jens's
> > case starting an xxx-protected I/O).
> >
> > o CPU 1 executes another synchronize_xxx(). This completes
> > immediately because it is waiting for sp->ctr[1] to go
> > to zero, but CPU 0 incremented sp->ctr[0]. (Right?)
>
> Right!
>
> > o CPU 1 continues, either freeing a data structure while
> > CPU 0 is still referencing it, or, in Jens's case, completing
> > an I/O barrier while there is still outstanding I/O.
> >
> > Or am I missing something?
>
> No, it is me.
>
> Alan, Paul, thanks a lot for your patience!
No problem -- now we just need to work out if Jens's optimization is
safe, either in his situation or in general. And I need to chase down
any remaining bugs in the patch...
Thanx, Paul
On Sun, Nov 19 2006, Paul E. McKenney wrote:
> On Sat, Nov 18, 2006 at 09:46:24PM +0300, Oleg Nesterov wrote:
> > On 11/17, Paul E. McKenney wrote:
> > >
> > > Oleg, any thoughts about Jens's optimization? He would code something
> > > like:
> > >
> > > if (srcu_readers_active(&my_srcu))
> > > synchronize_srcu();
> > > else
> > > smp_mb();
> >
> > Well, this is clearly racy, no? I am not sure, but maybe we can do
> >
> > smp_mb();
> > if (srcu_readers_active(&my_srcu))
> > synchronize_srcu();
> >
> > in this case we also need to add 'smp_mb()' into srcu_read_lock() after
> > 'atomic_inc(&sp->hardluckref)'.
> >
> > > However, he is doing ordered I/O requests rather than protecting data
> > > structures.
> >
> > Probably this makes a difference, but I don't understand this.
>
> OK, one hypothesis here...
>
> The I/Os must be somehow explicitly ordered to qualify
> for I/O-barrier separation. If two independent processes
> issue I/Os concurrently with a third process doing an
> I/O barrier, the I/O barrier is free to separate the
> two concurrent I/Os or not, on its whim.
>
> Jens, is the above correct? If so, what would the two processes
That's completely correct, hence my somewhat relaxed approach with SRCU.
> need to do in order to ensure that their I/O was considered to be
> ordered with respect to the I/O barrier? Here are some possibilities:
If we consider the barrier a barrier in a certain stream of requests,
it is the responsibility of the issuer of that barrier to ensure that
the queueing is ordered. So if two "unrelated" streams of requests with
barriers hit __make_request() at the same time, we don't go to great
lengths to ensure who gets there first.
> 1. I/O barriers apply only to preceding and following I/Os from
> the process issuing the I/O barrier.
>
> 2. As for #1 above, but restricted to task rather than process.
>
> 3. I/O system calls that have completed are ordered by the
> barrier to precede I/O system calls that have not yet
> started, but I/O system calls still in flight could legally
> land on either side of the concurrently executing I/O
> barrier.
>
> 4. Something else entirely?
>
> Given some restriction like one of the above, it is entirely possible
> that we don't even need the memory barrier...
3 is the closest. The request queue doesn't really know the scope of the
barrier, it has to rely on the issuer getting it right. If you have two
competing processes issuing io and process A relies on process B issuing
a barrier, they have to synchronize that between them. Normally that is
not a problem, since that's how the file systems always did io before
barriers on items that need to be on disk (it was a serialization point
anyway, it's just a stronger one now).
That said, I think the
smp_mb();
if (srcu_readers_active(sp))
synchronize_srcu();
makes the most sense.
--
Jens Axboe
On Mon, Nov 20, 2006 at 08:15:14AM +0100, Jens Axboe wrote:
> On Sun, Nov 19 2006, Paul E. McKenney wrote:
> > On Sat, Nov 18, 2006 at 09:46:24PM +0300, Oleg Nesterov wrote:
> > > On 11/17, Paul E. McKenney wrote:
> > > >
> > > > Oleg, any thoughts about Jens's optimization? He would code something
> > > > like:
> > > >
> > > > if (srcu_readers_active(&my_srcu))
> > > > synchronize_srcu();
> > > > else
> > > > smp_mb();
> > >
> > > Well, this is clearly racy, no? I am not sure, but maybe we can do
> > >
> > > smp_mb();
> > > if (srcu_readers_active(&my_srcu))
> > > synchronize_srcu();
> > >
> > > in this case we also need to add 'smp_mb()' into srcu_read_lock() after
> > > 'atomic_inc(&sp->hardluckref)'.
> > >
> > > > However, he is doing ordered I/O requests rather than protecting data
> > > > structures.
> > >
> > > Probably this makes a difference, but I don't understand this.
> >
> > OK, one hypothesis here...
> >
> > The I/Os must be somehow explicitly ordered to qualify
> > for I/O-barrier separation. If two independent processes
> > issue I/Os concurrently with a third process doing an
> > I/O barrier, the I/O barrier is free to separate the
> > two concurrent I/Os or not, on its whim.
> >
> > Jens, is the above correct? If so, what would the two processes
>
> That's completely correct, hence my somewhat relaxed approach with SRCU.
OK, less scary in that case. ;-)
> > need to do in order to ensure that their I/O was considered to be
> > ordered with respect to the I/O barrier? Here are some possibilities:
>
> If we consider the barrier a barrier in a certain stream of requests,
> it is the responsibility of the issuer of that barrier to ensure that
> the queueing is ordered. So if two "unrelated" streams of requests with
> barriers hit __make_request() at the same time, we don't go to great
> lengths to ensure who gets there first.
So the "preceding" requests have to have completed their I/O system
calls? If this is the case, does this include normal (non-direct/raw)
writes and asynchronous reads? My guess is that it would include
asynchronous I/O, but not buffered writes.
> > 1. I/O barriers apply only to preceding and following I/Os from
> > the process issuing the I/O barrier.
> >
> > 2. As for #1 above, but restricted to task rather than process.
> >
> > 3. I/O system calls that have completed are ordered by the
> > barrier to precede I/O system calls that have not yet
> > started, but I/O system calls still in flight could legally
> > land on either side of the concurrently executing I/O
> > barrier.
> >
> > 4. Something else entirely?
> >
> > Given some restriction like one of the above, it is entirely possible
> > that we don't even need the memory barrier...
>
> 3 is the closest. The request queue doesn't really know the scope of the
> barrier, it has to rely on the issuer getting it right. If you have two
> competing processes issuing io and process A relies on process B issuing
> a barrier, they have to synchronize that between them. Normally that is
> not a problem, since that's how the file systems always did io before
> barriers on items that need to be on disk (it was a serialization point
> anyway, it's just a stronger one now).
So something like a user-level mutex or atomic instructions must be used
by the tasks doing the pre-barrier I/Os to announce that these I/Os have
been started in the kernel.
> That said, I think the
>
> smp_mb();
> if (srcu_readers_active(sp))
> synchronize_srcu();
>
> makes the most sense.
If the user-level tasks/threads/processes must explicitly synchronize,
and if the pre-barrier I/O-initation syscalls have to have completed,
then I am not sure that the smp_mb() is needed. Seems like the queuing
mechanisms in the syscall and the user-level synchronization would have
supplied the needed memory barriers. Or are you using some extremely
lightweight user-level synchronization?
Thanx, Paul
Paul:
Here's my version of your patch from yesterday. It's basically the same,
but I cleaned up the code in a few places and fixed a bug (the sign of idx
in srcu_read_unlock). Also I changed the init routine back to void, since
it's no longer an error if the per-cpu allocation fails.
More importantly, I added a static initializer and included the fast-path
in synchronize_srcu. It's protected by the new symbol
SMP__STORE_MB_LOAD_WORKS, which should be defined in arch-specific headers
for those architectures where the store-mb-load pattern is safe.
Alan
Signed-off-by: Alan Stern <[email protected]>
---
Index: usb-2.6/include/linux/srcu.h
===================================================================
--- usb-2.6.orig/include/linux/srcu.h
+++ usb-2.6/include/linux/srcu.h
@@ -35,19 +35,19 @@ struct srcu_struct {
int completed;
struct srcu_struct_array *per_cpu_ref;
struct mutex mutex;
+ atomic_t hardluckref[2];
};
-#ifndef CONFIG_PREEMPT
-#define srcu_barrier() barrier()
-#else /* #ifndef CONFIG_PREEMPT */
-#define srcu_barrier()
-#endif /* #else #ifndef CONFIG_PREEMPT */
+#define SRCU_INITIALIZER(srcu_name) { \
+ .mutex = __MUTEX_INITIALIZER(srcu_name.mutex), \
+ .hardluckref = {ATOMIC_INIT(0), ATOMIC_INIT(0)} }
-int init_srcu_struct(struct srcu_struct *sp);
-void cleanup_srcu_struct(struct srcu_struct *sp);
-int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
-void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
-void synchronize_srcu(struct srcu_struct *sp);
-long srcu_batches_completed(struct srcu_struct *sp);
+extern void init_srcu_struct(struct srcu_struct *sp);
+extern void cleanup_srcu_struct(struct srcu_struct *sp);
+extern int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
+extern void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
+extern void synchronize_srcu(struct srcu_struct *sp);
+extern long srcu_batches_completed(struct srcu_struct *sp);
+extern int srcu_readers_active(struct srcu_struct *sp);
#endif
Index: usb-2.6/kernel/srcu.c
===================================================================
--- usb-2.6.orig/kernel/srcu.c
+++ usb-2.6/kernel/srcu.c
@@ -34,6 +34,18 @@
#include <linux/smp.h>
#include <linux/srcu.h>
+/*
+ * Initialize the per-CPU array, returning the pointer.
+ */
+static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
+{
+ struct srcu_struct_array *sap;
+
+ sap = alloc_percpu(struct srcu_struct_array);
+ smp_wmb();
+ return sap;
+}
+
/**
* init_srcu_struct - initialize a sleep-RCU structure
* @sp: structure to initialize.
@@ -42,12 +54,13 @@
* to any other function. Each srcu_struct represents a separate domain
* of SRCU protection.
*/
-int init_srcu_struct(struct srcu_struct *sp)
+void init_srcu_struct(struct srcu_struct *sp)
{
sp->completed = 0;
mutex_init(&sp->mutex);
- sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
- return (sp->per_cpu_ref ? 0 : -ENOMEM);
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ atomic_set(&sp->hardluckref[0], 0);
+ atomic_set(&sp->hardluckref[1], 0);
}
/*
@@ -58,11 +71,14 @@ int init_srcu_struct(struct srcu_struct
static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
int cpu;
+ struct srcu_struct_array *sap;
int sum;
- sum = 0;
- for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
+ sum = atomic_read(&sp->hardluckref[idx]);
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL))
+ for_each_possible_cpu(cpu)
+ sum += per_cpu_ptr(sap, cpu)->c[idx];
return sum;
}
@@ -94,7 +110,8 @@ void cleanup_srcu_struct(struct srcu_str
WARN_ON(sum); /* Leakage unless caller handles error. */
if (sum != 0)
return;
- free_percpu(sp->per_cpu_ref);
+ if (sp->per_cpu_ref != NULL)
+ free_percpu(sp->per_cpu_ref);
sp->per_cpu_ref = NULL;
}
@@ -102,19 +119,37 @@ void cleanup_srcu_struct(struct srcu_str
* srcu_read_lock - register a new reader for an SRCU-protected structure.
* @sp: srcu_struct in which to register the new reader.
*
- * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct. Must be called from process context.
+ * Counts the new reader in the appropriate per-CPU element of @sp.
+ * Must be called from process context.
+ *
* Returns an index that must be passed to the matching srcu_read_unlock().
+ * The index is mapped to negative numbers if the per-cpu array in @sp
+ * is not and cannot be allocated.
*/
int srcu_read_lock(struct srcu_struct *sp)
{
int idx;
+ struct srcu_struct_array *sap;
+
+ if (unlikely(sp->per_cpu_ref == NULL &&
+ mutex_trylock(&sp->mutex))) {
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+ mutex_unlock(&sp->mutex);
+ }
preempt_disable();
idx = sp->completed & 0x1;
barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
+ sap = rcu_dereference(sp->per_cpu_ref);
+ if (likely(sap != NULL)) {
+ per_cpu_ptr(sap, smp_processor_id())->c[idx]++;
+ smp_mb();
+ } else {
+ atomic_inc(&sp->hardluckref[idx]);
+ smp_mb__after_atomic_inc();
+ idx = -1 - idx;
+ }
preempt_enable();
return idx;
}
@@ -131,10 +166,16 @@ int srcu_read_lock(struct srcu_struct *s
*/
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
- preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
- preempt_enable();
+ if (likely(idx >= 0)) {
+ smp_mb();
+ preempt_disable();
+ per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
+ smp_processor_id())->c[idx]--;
+ preempt_enable();
+ } else {
+ smp_mb__before_atomic_dec();
+ atomic_dec(&sp->hardluckref[-1 - idx]);
+ }
}
/**
@@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
idx = sp->completed;
mutex_lock(&sp->mutex);
+ /* Initialize if not already initialized. */
+
+ if (sp->per_cpu_ref == NULL)
+ sp->per_cpu_ref = alloc_srcu_struct_percpu();
+
/*
* Check to see if someone else did the work for us while we were
* waiting to acquire the lock. We need -two- advances of
@@ -168,71 +214,35 @@ void synchronize_srcu(struct srcu_struct
* either (1) wait for two or (2) supply the second ourselves.
*/
- if ((sp->completed - idx) >= 2) {
- mutex_unlock(&sp->mutex);
- return;
- }
+ if ((sp->completed - idx) >= 2)
+ goto done;
- synchronize_sched(); /* Force memory barrier on all CPUs. */
+ smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
+ idx = sp->completed & 0x1;
- /*
- * The preceding synchronize_sched() ensures that any CPU that
- * sees the new value of sp->completed will also see any preceding
- * changes to data structures made by this CPU. This prevents
- * some other CPU from reordering the accesses in its SRCU
- * read-side critical section to precede the corresponding
- * srcu_read_lock() -- ensuring that such references will in
- * fact be protected.
- *
- * So it is now safe to do the flip.
- */
+#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
+ if (srcu_readers_active_idx(sp, idx) == 0)
+ goto done;
+#endif
- idx = sp->completed & 0x1;
sp->completed++;
-
- synchronize_sched(); /* Force memory barrier on all CPUs. */
+ synchronize_sched();
/*
* At this point, because of the preceding synchronize_sched(),
* all srcu_read_lock() calls using the old counters have completed.
* Their corresponding critical sections might well be still
* executing, but the srcu_read_lock() primitives themselves
- * will have finished executing.
+ * will have finished executing. The "old" rank of counters
+ * can therefore only decrease, never increase in value.
*/
while (srcu_readers_active_idx(sp, idx))
schedule_timeout_interruptible(1);
- synchronize_sched(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() forces all srcu_read_unlock()
- * primitives that were executing concurrently with the preceding
- * for_each_possible_cpu() loop to have completed by this point.
- * More importantly, it also forces the corresponding SRCU read-side
- * critical sections to have also completed, and the corresponding
- * references to SRCU-protected data items to be dropped.
- *
- * Note:
- *
- * Despite what you might think at first glance, the
- * preceding synchronize_sched() -must- be within the
- * critical section ended by the following mutex_unlock().
- * Otherwise, a task taking the early exit can race
- * with a srcu_read_unlock(), which might have executed
- * just before the preceding srcu_readers_active() check,
- * and whose CPU might have reordered the srcu_read_unlock()
- * with the preceding critical section. In this case, there
- * is nothing preventing the synchronize_sched() task that is
- * taking the early exit from freeing a data structure that
- * is still being referenced (out of order) by the task
- * doing the srcu_read_unlock().
- *
- * Alternatively, the comparison with "2" on the early exit
- * could be changed to "3", but this increases synchronize_srcu()
- * latency for bulk loads. So the current code is preferred.
- */
+ smp_mb(); /* must see critical section prior to srcu_read_unlock() */
+done:
mutex_unlock(&sp->mutex);
}
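With the SRCU_INITIALIZER() above, a statically-defined SRCU domain would look something like the following sketch (my_srcu is a made-up name); the per-CPU array stays NULL until the first srcu_read_lock() or synchronize_srcu() manages to allocate it.
	/* Usable before any initcalls run; readers fall back to
	 * hardluckref until the per-CPU array gets allocated. */
	static struct srcu_struct my_srcu = SRCU_INITIALIZER(my_srcu);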
On Mon, Nov 20 2006, Paul E. McKenney wrote:
> On Mon, Nov 20, 2006 at 08:15:14AM +0100, Jens Axboe wrote:
> > On Sun, Nov 19 2006, Paul E. McKenney wrote:
> > > On Sat, Nov 18, 2006 at 09:46:24PM +0300, Oleg Nesterov wrote:
> > > > On 11/17, Paul E. McKenney wrote:
> > > > >
> > > > > Oleg, any thoughts about Jens's optimization? He would code something
> > > > > like:
> > > > >
> > > > > if (srcu_readers_active(&my_srcu))
> > > > > synchronize_srcu();
> > > > > else
> > > > > smp_mb();
> > > >
> > > > Well, this is clearly racy, no? I am not sure, but maybe we can do
> > > >
> > > > smp_mb();
> > > > if (srcu_readers_active(&my_srcu))
> > > > synchronize_srcu();
> > > >
> > > > in this case we also need to add 'smp_mb()' into srcu_read_lock() after
> > > > 'atomic_inc(&sp->hardluckref)'.
> > > >
> > > > > However, he is doing ordered I/O requests rather than protecting data
> > > > > structures.
> > > >
> > > > Probably this makes a difference, but I don't understand this.
> > >
> > > OK, one hypothesis here...
> > >
> > > The I/Os must be somehow explicitly ordered to qualify
> > > for I/O-barrier separation. If two independent processes
> > > issue I/Os concurrently with a third process doing an
> > > I/O barrier, the I/O barrier is free to separate the
> > > two concurrent I/Os or not, on its whim.
> > >
> > > Jens, is the above correct? If so, what would the two processes
> >
> > That's completely correct, hence my somewhat relaxed approach with SRCU.
>
> OK, less scary in that case. ;-)
Yep, it's really not scary in any ordering sense!
> > > need to do in order to ensure that their I/O was considered to be
> > > ordered with respect to the I/O barrier? Here are some possibilities:
> >
> > If we consider the barrier a barrier in a certain stream of requests,
> > it is the responsibility of the issuer of that barrier to ensure that
> > the queueing is ordered. So if two "unrelated" streams of requests with
> > barriers hit __make_request() at the same time, we don't go to great
> > lengths to ensure who gets there first.
>
> So the "preceding" requests have to have completed their I/O system
> calls? If this is the case, does this include normal (non-direct/raw)
> writes and asynchronous reads? My guess is that it would include
> asynchronous I/O, but not buffered writes.
They need not have completed, but they must have been queued at the
block layer level. IOW, the io scheduler must know about them. Since
it's a block layer device property, we really don't care about system
calls since any of them could amount to 1 or lots more individual io
requests.
But now we have taken a detour from the original problem. As I wrote
above, the io scheduler must know about the requests. When the plug list
ends up in the private process context, the io scheduler doesn't know
about it yet. When a barrier is queued, the block layer does not care
about io that hasn't been issued yet (dirty data in the page cache
perhaps), since if it hasn't been seen, it's by definition not
interesting. But if some of the requests reside in a different process
private request list, then that is a violation of this rule since it
should technically belong to the block layer / io scheduler at that
point. This is where I wanted to use SRCU.
> > > 1. I/O barriers apply only to preceding and following I/Os from
> > > the process issuing the I/O barrier.
> > >
> > > 2. As for #1 above, but restricted to task rather than process.
> > >
> > > 3. I/O system calls that have completed are ordered by the
> > > barrier to precede I/O system calls that have not yet
> > > started, but I/O system calls still in flight could legally
> > > land on either side of the concurrently executing I/O
> > > barrier.
> > >
> > > 4. Something else entirely?
> > >
> > > Given some restriction like one of the above, it is entirely possible
> > > that we don't even need the memory barrier...
> >
> > 3 is the closest. The request queue doesn't really know the scope of the
> > barrier, it has to rely on the issuer getting it right. If you have two
> > competing processes issuing io and process A relies on process B issuing
> > a barrier, they have to synchronize that between them. Normally that is
> > not a problem, since that's how the file systems always did io before
> > barriers on items that need to be on disk (it was a serialization point
> > anyway, it's just a stronger one now).
>
> So something like a user-level mutex or atomic instructions must be used
> by the tasks doing the pre-barrier I/Os to announce that these I/Os have
> been started in the kernel.
We don't do barriers from user space, it's purely a feature available to
file systems to ensure ordering of writes even at the disk platter
level.
> > That said, I think the
> >
> > smp_mb();
> > if (srcu_readers_active(sp))
> > synchronize_srcu();
> >
> > makes the most sense.
>
> If the user-level tasks/threads/processes must explicitly synchronize,
> and if the pre-barrier I/O-initation syscalls have to have completed,
> then I am not sure that the smp_mb() is needed. Seems like the queuing
> mechanisms in the syscall and the user-level synchronization would have
> supplied the needed memory barriers. Or are you using some extremely
> lightweight user-level synchronization?
Once a process holds a queue plug, any write issued to that plug list
will do an srcu_read_lock(). So as far as I can tell, the smp_mb() is
needed to ensure that an immediately following synchronize_srcu() from a
barrier write queued on a different CPU will see that srcu_read_lock().
There is no syscall or user-level synchronization involved.
--
Jens Axboe
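A rough sketch of the usage pattern described above (all of the plug-related names are made up for illustration; this is not actual block-layer code): each write added to a process-private plug list sits inside an SRCU read-side section until it is flushed to the request queue, and the barrier path only pays for synchronize_srcu() when readers might be active.
	/* queueing a write onto the process-private plug list */
	plug->srcu_idx = srcu_read_lock(&plug_srcu);
	list_add_tail(&rq->queuelist, &plug->list);

	/* later, when the plug is flushed into the request queue */
	flush_plug_list(q, plug);
	srcu_read_unlock(&plug_srcu, plug->srcu_idx);

	/* barrier issuer: no requests may be hiding in private plug lists */
	smp_mb();
	if (srcu_readers_active(&plug_srcu))
		synchronize_srcu(&plug_srcu);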
On Mon, Nov 20 2006, Alan Stern wrote:
> Paul:
>
> Here's my version of your patch from yesterday. It's basically the same,
> but I cleaned up the code in a few places and fixed a bug (the sign of idx
> in srcu_read_unlock). Also I changed the init routine back to void, since
> it's no longer an error if the per-cpu allocation fails.
>
> More importantly, I added a static initializer and included the fast-path
> in synchronize_srcu. It's protected by the new symbol
> SMP__STORE_MB_LOAD_WORKS, which should be defined in arch-specific headers
> for those architectures where the store-mb-load pattern is safe.
Must we introduce memory allocations in srcu_read_lock()? It makes it
much harder and nastier for me to use. I'd much prefer a failing
init_srcu(); that seems like a much better API.
--
Jens Axboe
On 11/20, Alan Stern wrote:
>
> @@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
>
> [... snip ...]
>
> +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> + if (srcu_readers_active_idx(sp, idx) == 0)
> + goto done;
> +#endif
I guess this is connected to another message from you,
> But of course it _is_ needed for the fastpath to work. In fact, it might
> not be good enough, depending on the architecture. Here's what the
> fastpath ends up looking like (using c[idx] is essentially the same as
> using hardluckref):
>
> 	WRITER				READER
> 	------				------
> 	dataptr = &(new data)		atomic_inc(&hardluckref)
> 	mb				mb
> 	while (hardluckref > 0) ;	access *dataptr
>
> Notice the pattern: Each CPU does store-mb-load. It is known that on
> some architectures each CPU can end up loading the old value (the value
> from before the other CPU's store). This would mean the writer would see
> hardluckref == 0 right away and the reader would see the old dataptr.
So, if we have global A == B == 0,
	CPU_0		CPU_1
	A = 1;		B = 2;
	mb();		mb();
	b = B;		a = A;
It could happen that a == b == 0, yes? Doesn't this contradict the definition
of mb()?
By definition, when CPU_0 issues 'b = B', 'A = 1' should be visible to other
CPUs, yes? Now, b == 0 means that CPU_1 did not read 'a = A' yet, otherwise
'B = 2' should be visible to all CPUs (by definition again).
Could you please clarify this?
Btw, this is funny, but I was going to suggest _exactly_ the same cleanup for
srcu_read_lock :)
Oleg.
On Mon, Nov 20, 2006 at 12:19:19PM -0500, Alan Stern wrote:
> Paul:
>
> Here's my version of your patch from yesterday. It's basically the same,
> but I cleaned up the code in a few places and fixed a bug (the sign of idx
> in srcu_read_unlock). Also I changed the init routine back to void, since
> it's no longer an error if the per-cpu allocation fails.
I don't see any changes in the sign of idx.
> More importantly, I added a static initializer and included the fast-path
> in synchronize_srcu. It's protected by the new symbol
> SMP__STORE_MB_LOAD_WORKS, which should be defined in arch-specific headers
> for those architectures where the store-mb-load pattern is safe.
I did take most of your changes. Comments interspersed below...
I will send out an updated patch after a bit of testing.
> Alan
>
> Signed-off-by: Alan Stern <[email protected]>
>
> ---
>
> Index: usb-2.6/include/linux/srcu.h
> ===================================================================
> --- usb-2.6.orig/include/linux/srcu.h
> +++ usb-2.6/include/linux/srcu.h
> @@ -35,19 +35,19 @@ struct srcu_struct {
> int completed;
> struct srcu_struct_array *per_cpu_ref;
> struct mutex mutex;
> + atomic_t hardluckref[2];
> };
>
> -#ifndef CONFIG_PREEMPT
> -#define srcu_barrier() barrier()
> -#else /* #ifndef CONFIG_PREEMPT */
> -#define srcu_barrier()
> -#endif /* #else #ifndef CONFIG_PREEMPT */
> +#define SRCU_INITIALIZER(srcu_name) { \
> + .mutex = __MUTEX_INITIALIZER(srcu_name.mutex), \
> + .hardluckref = {ATOMIC_INIT(0), ATOMIC_INIT(0)} }
Took this, very good.
> -int init_srcu_struct(struct srcu_struct *sp);
> -void cleanup_srcu_struct(struct srcu_struct *sp);
> -int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
> -void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
> -void synchronize_srcu(struct srcu_struct *sp);
> -long srcu_batches_completed(struct srcu_struct *sp);
> +extern void init_srcu_struct(struct srcu_struct *sp);
> +extern void cleanup_srcu_struct(struct srcu_struct *sp);
> +extern int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
> +extern void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
> +extern void synchronize_srcu(struct srcu_struct *sp);
> +extern long srcu_batches_completed(struct srcu_struct *sp);
> +extern int srcu_readers_active(struct srcu_struct *sp);
Took this as well.
> #endif
> Index: usb-2.6/kernel/srcu.c
> ===================================================================
> --- usb-2.6.orig/kernel/srcu.c
> +++ usb-2.6/kernel/srcu.c
> @@ -34,6 +34,18 @@
> #include <linux/smp.h>
> #include <linux/srcu.h>
>
> +/*
> + * Initialize the per-CPU array, returning the pointer.
> + */
> +static inline struct srcu_struct_array *alloc_srcu_struct_percpu(void)
> +{
> + struct srcu_struct_array *sap;
> +
> + sap = alloc_percpu(struct srcu_struct_array);
> + smp_wmb();
> + return sap;
> +}
> +
> /**
> * init_srcu_struct - initialize a sleep-RCU structure
> * @sp: structure to initialize.
> @@ -42,12 +54,13 @@
> * to any other function. Each srcu_struct represents a separate domain
> * of SRCU protection.
> */
> -int init_srcu_struct(struct srcu_struct *sp)
> +void init_srcu_struct(struct srcu_struct *sp)
> {
> sp->completed = 0;
> mutex_init(&sp->mutex);
> - sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
> - return (sp->per_cpu_ref ? 0 : -ENOMEM);
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> + atomic_set(&sp->hardluckref[0], 0);
> + atomic_set(&sp->hardluckref[1], 0);
> }
Nack -- the caller is free to ignore the error return, but should be
able to detect it in case the caller is unable to tolerate the overhead
of running in hardluckref mode, perhaps instead choosing to fail a
user-level request in order to try to reduce memory pressure.
I did update the comment to say that you can use SRCU_INITIALIZER()
instead.
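In other words (a sketch of the intended caller behaviour, not code from the patch; my_srcu is a made-up name): a caller that cannot tolerate hardluckref mode checks the return value, while purely static users rely on SRCU_INITIALIZER().
	if (init_srcu_struct(&my_srcu) != 0) {
		/* Per-CPU allocation failed; refuse to run in the slower
		 * hardluckref mode and shed load instead. */
		return -ENOMEM;
	}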
> /*
> @@ -58,11 +71,14 @@ int init_srcu_struct(struct srcu_struct
> static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> + struct srcu_struct_array *sap;
> int sum;
>
> - sum = 0;
> - for_each_possible_cpu(cpu)
> - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> + sum = atomic_read(&sp->hardluckref[idx]);
> + sap = rcu_dereference(sp->per_cpu_ref);
> + if (likely(sap != NULL))
> + for_each_possible_cpu(cpu)
> + sum += per_cpu_ptr(sap, cpu)->c[idx];
> return sum;
> }
Took the initialization based on hardluckref rather than adding it in
at the end, good!
> @@ -94,7 +110,8 @@ void cleanup_srcu_struct(struct srcu_str
> WARN_ON(sum); /* Leakage unless caller handles error. */
> if (sum != 0)
> return;
> - free_percpu(sp->per_cpu_ref);
> + if (sp->per_cpu_ref != NULL)
> + free_percpu(sp->per_cpu_ref);
> sp->per_cpu_ref = NULL;
> }
>
> @@ -102,19 +119,37 @@ void cleanup_srcu_struct(struct srcu_str
> * srcu_read_lock - register a new reader for an SRCU-protected structure.
> * @sp: srcu_struct in which to register the new reader.
> *
> - * Counts the new reader in the appropriate per-CPU element of the
> - * srcu_struct. Must be called from process context.
> + * Counts the new reader in the appropriate per-CPU element of @sp.
> + * Must be called from process context.
Good -- took the "@sp".
> * Returns an index that must be passed to the matching srcu_read_unlock().
> + * The index is mapped to negative numbers if the per-cpu array in @sp
> + * is not and cannot be allocated.
> */
> int srcu_read_lock(struct srcu_struct *sp)
> {
> int idx;
> + struct srcu_struct_array *sap;
> +
> + if (unlikely(sp->per_cpu_ref == NULL &&
> + mutex_trylock(&sp->mutex))) {
> + if (sp->per_cpu_ref == NULL)
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> + mutex_unlock(&sp->mutex);
> + }
>
> preempt_disable();
> idx = sp->completed & 0x1;
> barrier(); /* ensure compiler looks -once- at sp->completed. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> + sap = rcu_dereference(sp->per_cpu_ref);
> + if (likely(sap != NULL)) {
> + per_cpu_ptr(sap, smp_processor_id())->c[idx]++;
> + smp_mb();
> + } else {
> + atomic_inc(&sp->hardluckref[idx]);
> + smp_mb__after_atomic_inc();
> + idx = -1 - idx;
> + }
> preempt_enable();
> return idx;
> }
Good restructuring -- took this, though I restricted the unlikely() to
cover only the comparison to NULL, since the mutex_trylock() has a
reasonable chance of success assuming that the read path has substantial
overhead from other sources.
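Presumably the restricted form looks roughly like this (a sketch, not the exact code that was queued):
	if (unlikely(sp->per_cpu_ref == NULL) && mutex_trylock(&sp->mutex)) {
		if (sp->per_cpu_ref == NULL)
			sp->per_cpu_ref = alloc_srcu_struct_percpu();
		mutex_unlock(&sp->mutex);
	}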
> @@ -131,10 +166,16 @@ int srcu_read_lock(struct srcu_struct *s
> */
> void srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> - preempt_disable();
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> - preempt_enable();
> + if (likely(idx >= 0)) {
> + smp_mb();
> + preempt_disable();
> + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> + smp_processor_id())->c[idx]--;
> + preempt_enable();
> + } else {
> + smp_mb__before_atomic_dec();
> + atomic_dec(&sp->hardluckref[-1 - idx]);
> + }
> }
I took the moving smp_mb() out from under preempt_disable().
I left the "return" -- same number of lines either way.
I don't see any changes in sign compared to what I had, FWIW.
> /**
> @@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
> idx = sp->completed;
> mutex_lock(&sp->mutex);
>
> + /* Initialize if not already initialized. */
> +
> + if (sp->per_cpu_ref == NULL)
> + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> +
> /*
> * Check to see if someone else did the work for us while we were
> * waiting to acquire the lock. We need -two- advances of
> @@ -168,71 +214,35 @@ void synchronize_srcu(struct srcu_struct
> * either (1) wait for two or (2) supply the second ourselves.
> */
>
> - if ((sp->completed - idx) >= 2) {
> - mutex_unlock(&sp->mutex);
> - return;
> - }
> + if ((sp->completed - idx) >= 2)
> + goto done;
I don't see the benefit of gathering all the mutex_unlock()s together.
If the unwinding was more complicated, I would take this change, however.
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> + smp_mb(); /* ensure srcu_read_lock() sees prior change first! */
> + idx = sp->completed & 0x1;
>
> - /*
> - * The preceding synchronize_sched() ensures that any CPU that
> - * sees the new value of sp->completed will also see any preceding
> - * changes to data structures made by this CPU. This prevents
> - * some other CPU from reordering the accesses in its SRCU
> - * read-side critical section to precede the corresponding
> - * srcu_read_lock() -- ensuring that such references will in
> - * fact be protected.
> - *
> - * So it is now safe to do the flip.
> - */
> +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> + if (srcu_readers_active_idx(sp, idx) == 0)
> + goto done;
> +#endif
I still don't trust this one. I would trust it a bit more if it were
srcu_readers_active() rather than srcu_readers_active_idx(), but even
then I suspect that external synchronization is required. Or is there
something that I am missing that makes it safe in the face of the sequence
of events that you, Oleg, and I were discussing?
For the moment, this optimization should be done by the caller, and
should be prominently commented.
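For example, a caller-side version of the optimization might look like the sketch below (my_srcu is a made-up name, and whether the store-mb-load pattern it relies on is safe is exactly what is still being debated in this thread):
	/*
	 * Fastpath: skip synchronize_srcu() when no readers are active.
	 * This depends on the store-mb-load pattern being safe on the
	 * architecture in question -- see the discussion above.
	 */
	smp_mb();
	if (srcu_readers_active(&my_srcu))
		synchronize_srcu(&my_srcu);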
Thanx, Paul
> - idx = sp->completed & 0x1;
> sp->completed++;
> -
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> + synchronize_sched();
>
> /*
> * At this point, because of the preceding synchronize_sched(),
> * all srcu_read_lock() calls using the old counters have completed.
> * Their corresponding critical sections might well be still
> * executing, but the srcu_read_lock() primitives themselves
> - * will have finished executing.
> + * will have finished executing. The "old" rank of counters
> + * can therefore only decrease, never increase in value.
> */
>
> while (srcu_readers_active_idx(sp, idx))
> schedule_timeout_interruptible(1);
>
> - synchronize_sched(); /* Force memory barrier on all CPUs. */
> -
> - /*
> - * The preceding synchronize_sched() forces all srcu_read_unlock()
> - * primitives that were executing concurrently with the preceding
> - * for_each_possible_cpu() loop to have completed by this point.
> - * More importantly, it also forces the corresponding SRCU read-side
> - * critical sections to have also completed, and the corresponding
> - * references to SRCU-protected data items to be dropped.
> - *
> - * Note:
> - *
> - * Despite what you might think at first glance, the
> - * preceding synchronize_sched() -must- be within the
> - * critical section ended by the following mutex_unlock().
> - * Otherwise, a task taking the early exit can race
> - * with a srcu_read_unlock(), which might have executed
> - * just before the preceding srcu_readers_active() check,
> - * and whose CPU might have reordered the srcu_read_unlock()
> - * with the preceding critical section. In this case, there
> - * is nothing preventing the synchronize_sched() task that is
> - * taking the early exit from freeing a data structure that
> - * is still being referenced (out of order) by the task
> - * doing the srcu_read_unlock().
> - *
> - * Alternatively, the comparison with "2" on the early exit
> - * could be changed to "3", but this increases synchronize_srcu()
> - * latency for bulk loads. So the current code is preferred.
> - */
> + smp_mb(); /* must see critical section prior to srcu_read_unlock() */
>
> +done:
> mutex_unlock(&sp->mutex);
> }
>
>
On Mon, 20 Nov 2006, Jens Axboe wrote:
> On Mon, Nov 20 2006, Alan Stern wrote:
> > Paul:
> >
> > Here's my version of your patch from yesterday. It's basically the same,
> > but I cleaned up the code in a few places and fixed a bug (the sign of idx
> > in srcu_read_unlock). Also I changed the init routine back to void, since
> > it's no longer an error if the per-cpu allocation fails.
> >
> > More importantly, I added a static initializer and included the fast-path
> > in synchronize_srcu. It's protected by the new symbol
> > SMP__STORE_MB_LOAD_WORKS, which should be defined in arch-specific headers
> > for those architectures where the store-mb-load pattern is safe.
>
> Must we introduce memory allocations in srcu_read_lock()? It makes it
> much harder and nastier for me to use. I'd much prefer a failing
> init_srcu(), seems like a much better API.
Paul agrees with you that allocation failures in init_srcu() should be
passed back to the caller, and I certainly don't mind doing so.
However we can't remove the memory allocation in srcu_read_lock(). That
was the point which started this whole thread: the per-cpu allocation
cannot be done statically, and some users of a static SRCU structure can't
easily call init_srcu() early enough.
Once the allocation succeeds, the overhead in srcu_read_lock() is minimal.
Alan
I decided to go with this instead, for now.
Please holler if you see problems with it.
Linus
---
commit b3438f8266cb1f5010085ac47d7ad6a36a212164
Author: Linus Torvalds <[email protected]>
Date: Mon Nov 20 11:47:18 2006 -0800
Add "pure_initcall" for static variable initialization
This is a quick hack to overcome the fact that SRCU currently does not
allow static initializers, and we need to sometimes initialize those
things before any other initializers (even "core" ones) can do so.
Currently we don't allow this at all for modules, and the only user that
needs it right now is cpufreq. As reported by Thomas Gleixner:
"Commit b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9 ("[PATCH] cpufreq:
make the transition_notifier chain use SRCU breaks cpu frequency
notification users, which register the callback > on core_init
level."
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 86e69b7..dd0c262 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -59,7 +59,7 @@ static int __init init_cpufreq_transitio
srcu_init_notifier_head(&cpufreq_transition_notifier_list);
return 0;
}
-core_initcall(init_cpufreq_transition_notifier_list);
+pure_initcall(init_cpufreq_transition_notifier_list);
static LIST_HEAD(cpufreq_governor_list);
static DEFINE_MUTEX (cpufreq_governor_mutex);
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 9d87316..e60d6f2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -215,6 +215,8 @@
.notes : { *(.note.*) } :note
#define INITCALLS \
+ *(.initcall0.init) \
+ *(.initcall0s.init) \
*(.initcall1.init) \
*(.initcall1s.init) \
*(.initcall2.init) \
diff --git a/include/linux/init.h b/include/linux/init.h
index ff40ea1..5eb5d24 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -93,6 +93,14 @@ extern void setup_arch(char **);
static initcall_t __initcall_##fn##id __attribute_used__ \
__attribute__((__section__(".initcall" level ".init"))) = fn
+/*
+ * A "pure" initcall has no dependencies on anything else, and purely
+ * initializes variables that couldn't be statically initialized.
+ *
+ * This only exists for built-in code, not for modules.
+ */
+#define pure_initcall(fn) __define_initcall("0",fn,1)
+
#define core_initcall(fn) __define_initcall("1",fn,1)
#define core_initcall_sync(fn) __define_initcall("1s",fn,1s)
#define postcore_initcall(fn) __define_initcall("2",fn,2)
On Mon, 20 Nov 2006, Oleg Nesterov wrote:
> On 11/20, Alan Stern wrote:
> >
> > @@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
> >
> > [... snip ...]
> >
> > +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> > + if (srcu_readers_active_idx(sp, idx) == 0)
> > + goto done;
> > +#endif
>
> I guess this is connected to another message from you,
Yes.
> > But of course it _is_ needed for the fastpath to work. In fact, it might
> > not be good enough, depending on the architecture. Here's what the
> > fastpath ends up looking like (using c[idx] is essentially the same as
> > using hardluckref):
> >
> > WRITER READER
> > ------ ------
> > dataptr = &(new data) atomic_inc(&hardluckref)
> > mb mb
> > while (hardluckref > 0) ; access *dataptr
> >
> > Notice the pattern: Each CPU does store-mb-load. It is known that on
> > some architectures each CPU can end up loading the old value (the value
> > from before the other CPU's store). This would mean the writer would see
> > hardluckref == 0 right away and the reader would see the old dataptr.
>
> So, if we have global A == B == 0,
>
> CPU_0 CPU_1
>
> A = 1; B = 2;
> mb(); mb();
> b = B; a = A;
>
> It could happen that a == b == 0, yes?
Exactly.
> Doesn't this contradict the definition
> of mb?
One might think so, at first. But if you do a careful search, you'll find
that there _is_ no definition of mb! People state in vague terms what
it's supposed to do, but they are almost never specific enough to tell
whether the example above should work.
> By definition, when CPU_0 issues 'b = B', 'A = 1' should be visible to other
> CPUs, yes?
No. Memory barriers don't guarantee that any particular store will become
visible to other CPUs at any particular time. They guarantee only that a
certain sequence of stores will become visible in a particular order
(provided the other CPUs also use the correct memory barriers).
> Now, b == 0 means that CPU_1 did not read 'a = A' yet, otherwise
> 'B = 2' should be visible to all CPUs (by definition again).
>
> Could you please clarify this?
Here's an example showing how the code can fail. (Paul can correct me if
I get this wrong.)
"A = 1" and "B = 2" are issued simultaneously. The two caches
send out Invalidate messages, and each queues the message it
receives for later processing.
Both CPUs execute their "mb" instructions. The mb forces each
cache to wait until it receives an Acknowledgement for the
Invalidate it sent.
Both caches send an Acknowledgement message to the other. The
mb instructions complete.
"b = B" and "a = A" execute. The caches return A==0 and B==0
because they haven't yet invalidated their cache lines.
Cache 0 invalidates its line for B and Cache 1 invalidates its
line for A. But it's already too late.
The reason the code failed is because the mb instructions didn't force
the caches to wait until the Invalidate messages in their queues had been
fully carried out (i.e., the lines had actually been invalidated). This
is because at the time the mb started, those messages had not been
acknowledged -- they were just sitting in the cache's invalidate queue.
> Btw, this is funny, but I was going to suggest _exactly_ same cleanup for
> srcu_read_lock :)
I guess we think alike!
Alan
On Mon, Nov 20, 2006 at 06:55:54PM +0100, Jens Axboe wrote:
> On Mon, Nov 20 2006, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2006 at 08:15:14AM +0100, Jens Axboe wrote:
> > > On Sun, Nov 19 2006, Paul E. McKenney wrote:
> > > > On Sat, Nov 18, 2006 at 09:46:24PM +0300, Oleg Nesterov wrote:
> > > > > On 11/17, Paul E. McKenney wrote:
> > > > > >
> > > > > > Oleg, any thoughts about Jens's optimization? He would code something
> > > > > > like:
> > > > > >
> > > > > > if (srcu_readers_active(&my_srcu))
> > > > > > synchronize_srcu();
> > > > > > else
> > > > > > smp_mb();
> > > > >
> > > > > Well, this is clearly racy, no? I am not sure, but may be we can do
> > > > >
> > > > > smp_mb();
> > > > > if (srcu_readers_active(&my_srcu))
> > > > > synchronize_srcu();
> > > > >
> > > > > in this case we also need to add 'smp_mb()' into srcu_read_lock() after
> > > > > 'atomic_inc(&sp->hardluckref)'.
> > > > >
> > > > > > However, he is doing ordered I/O requests rather than protecting data
> > > > > > structures.
> > > > >
> > > > > Probably this makes a difference, but I don't understand this.
> > > >
> > > > OK, one hypothesis here...
> > > >
> > > > The I/Os must be somehow explicitly ordered to qualify
> > > > for I/O-barrier separation. If two independent processes
> > > > issue I/Os concurrently with a third process doing an
> > > > I/O barrier, the I/O barrier is free to separate the
> > > > two concurrent I/Os or not, on its whim.
> > > >
> > > > Jens, is the above correct? If so, what would the two processes
> > >
> > > That's completely correct, hence my somewhat relaxed approach with SRCU.
> >
> > OK, less scary in that case. ;-)
>
> Yep, it's really not scary in any ordering sense!
>
> > > > need to do in order to ensure that their I/O was considered to be
> > > > ordered with respect to the I/O barrier? Here are some possibilities:
> > >
> > > If we consider the barrier a barrier in a certain stream of requests,
> > > it is the responsibility of the issuer of that barrier to ensure that
> > > the queueing is ordered. So if two "unrelated" streams of requests with
> > > barriers hit __make_request() at the same time, we don't go to great
> > > lengths to ensure who gets there first.
> >
> > So the "preceding" requests have to have completed their I/O system
> > calls? If this is the case, does this include normal (non-direct/raw)
> > writes and asynchronous reads? My guess is that it would include
> > asynchronous I/O, but not buffered writes.
>
> They need not have completed, but they must have been queued at the
> block layer level. IOW, the io scheduler must know about them. Since
> it's a block layer device property, we really don't care about system
> calls since any of them could amount to 1 or lots more individual io
> requests.
>
> But now we have taken a detour from the original problem. As I wrote
> above, the io scheduler must know about the requests. When the plug list
> ends up in the private process context, the io scheduler doesn't know
> about it yet. When a barrier is queued, the block layer does not care
> about io that hasn't been issued yet (dirty data in the page cache
> perhaps), since if it hasn't been seen, it's by definition not
> interesting. But if some of the requests reside in a different process
> private request list, then that is a violation of this rule since it
> should technically belong to the block layer / io scheduler at that
> point. This is where I wanted to use SRCU.
OK. Beyond a certain point, I would need to see the code using SRCU.
> > > > 1. I/O barriers apply only to preceding and following I/Os from
> > > > the process issuing the I/O barrier.
> > > >
> > > > 2. As for #1 above, but restricted to task rather than process.
> > > >
> > > > 3. I/O system calls that have completed are ordered by the
> > > > barrier to precede I/O system calls that have not yet
> > > > started, but I/O system calls still in flight could legally
> > > > land on either side of the concurrently executing I/O
> > > > barrier.
> > > >
> > > > 4. Something else entirely?
> > > >
> > > > Given some restriction like one of the above, it is entirely possible
> > > > that we don't even need the memory barrier...
> > >
> > > 3 is the closest. The request queue doesn't really know the scope of the
> > > barrier, it has to rely on the issuer getting it right. If you have two
> > > competing processes issuing io and process A relies on process B issuing
> > > a barrier, they have to synchronize that between them. Normally that is
> > > not a problem, since that's how the file systems always did io before
> > > barriers on items that need to be on disk (it was a serialization point
> > > anyway, it's just a stronger one now).
> >
> > So something like a user-level mutex or atomic instructions must be used
> > by the tasks doing the pre-barrier I/Os to announce that these I/Os have
> > been started in the kernel.
>
> We don't do barriers from user space, it's purely a feature available to
> file systems to ensure ordering of writes even at the disk platter
> level.
Got it. The question then becomes whether the barriers involved in
locking/whatever for the queues enter into the picture.
> > > That said, I think the
> > >
> > > smp_mb();
> > > if (srcu_readers_active(sp))
> > > synchronize_srcu();
> > >
> > > makes the most sense.
> >
> > If the user-level tasks/threads/processes must explicitly synchronize,
> > and if the pre-barrier I/O-initation syscalls have to have completed,
> > then I am not sure that the smp_mb() is needed. Seems like the queuing
> > mechanisms in the syscall and the user-level synchronization would have
> > supplied the needed memory barriers. Or are you using some extremely
> > lightweight user-level synchronization?
>
> Once a process holds a queue plug, any write issued to that plug list
> will do an srcu_read_lock(). So as far as I can tell, the smp_mb() is
> needed to ensure that an immediately following synchronize_srcu() from a
> barrier write queued on a different CPU will see that srcu_read_lock().
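For concreteness, the interplay being described is roughly the following
(a hypothetical sketch: plug_ctx and the function names are invented here
for illustration, and the real code lives in the 'plug' branch of the
block repo):

/* all names below are invented for illustration */
struct plug_ctx {
	struct srcu_struct srcu;
};

/* taken when a write is queued onto the private, per-process plug list */
static int plug_write_begin(struct plug_ctx *ctx)
{
	return srcu_read_lock(&ctx->srcu);
}

/* dropped once that request has been handed to the io scheduler */
static void plug_write_end(struct plug_ctx *ctx, int idx)
{
	srcu_read_unlock(&ctx->srcu, idx);
}

/* barrier path: wait until such in-flight queueings are visible */
static void plug_sync_for_barrier(struct plug_ctx *ctx)
{
	smp_mb();	/* see srcu_read_lock()s just done on other CPUs */
	if (srcu_readers_active(&ctx->srcu))
		synchronize_srcu(&ctx->srcu);
}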
Is the srcu_read_lock() invoked before actually queueing the I/O?
Is there any interaction with the queues before calling synchronize_srcu()?
If yes to both, it might be possible to remove the memory barrier.
That said, the overhead of the smp_mb() is so small compared to that of
the I/O path that it might not be worth it.
> There are no syscall or user-level synchronization.
Got it, thank you for the explanation!
Thanx, Paul
On Mon, Nov 20 2006, Alan Stern wrote:
> On Mon, 20 Nov 2006, Jens Axboe wrote:
>
> > On Mon, Nov 20 2006, Alan Stern wrote:
> > > Paul:
> > >
> > > Here's my version of your patch from yesterday. It's basically the same,
> > > but I cleaned up the code in a few places and fixed a bug (the sign of idx
> > > in srcu_read_unlock). Also I changed the init routine back to void, since
> > > it's no longer an error if the per-cpu allocation fails.
> > >
> > > More importantly, I added a static initializer and included the fast-path
> > > in synchronize_srcu. It's protected by the new symbol
> > > SMP__STORE_MB_LOAD_WORKS, which should be defined in arch-specific headers
> > > for those architectures where the store-mb-load pattern is safe.
> >
> > Must we introduce memory allocations in srcu_read_lock()? It makes it
> > much harder and nastier for me to use. I'd much prefer a failing
> > init_srcu(), seems like a much better API.
>
> Paul agrees with you that allocation failures in init_srcu() should be
> passed back to the caller, and I certainly don't mind doing so.
>
> However we can't remove the memory allocation in srcu_read_lock(). That
> was the point which started this whole thread: the per-cpu allocation
> cannot be done statically, and some users of a static SRCU structure can't
> easily call init_srcu() early enough.
>
> Once the allocation succeeds, the overhead in srcu_read_lock() is minimal.
It's not about the overhead, it's about a potentially problematic
allocation.
--
Jens Axboe
On Mon, Nov 20 2006, Paul E. McKenney wrote:
> > > So the "preceding" requests have to have completed their I/O system
> > > calls? If this is the case, does this include normal (non-direct/raw)
> > > writes and asynchronous reads? My guess is that it would include
> > > asynchronous I/O, but not buffered writes.
> >
> > They need not have completed, but they must have been queued at the
> > block layer level. IOW, the io scheduler must know about them. Since
> > it's a block layer device property, we really don't care about system
> > calls since any of them could amount to 1 or lots more individual io
> > requests.
> >
> > But now we have taken a detour from the original problem. As I wrote
> > above, the io scheduler must know about the requests. When the plug list
> > ends up in the private process context, the io scheduler doesn't know
> > about it yet. When a barrier is queued, the block layer does not care
> > about io that hasn't been issued yet (dirty data in the page cache
> > perhaps), since if it hasn't been seen, it's by definition not
> > interesting. But if some of the requests reside in a different process
> > private request list, then that is a violation of this rule since it
> > should technically belong to the block layer / io scheduler at that
> > point. This is where I wanted to use SRCU.
>
> OK. Beyond a certain point, I would need to see the code using SRCU.
Yeah, of course. It's actually in the 'plug' branch of the block repo
(and has been for some weeks), so it's public. I'll post the updated
patch soon again.
> > > If the user-level tasks/threads/processes must explicitly synchronize,
> > > and if the pre-barrier I/O-initation syscalls have to have completed,
> > > then I am not sure that the smp_mb() is needed. Seems like the queuing
> > > mechanisms in the syscall and the user-level synchronization would have
> > > supplied the needed memory barriers. Or are you using some extremely
> > > lightweight user-level synchronization?
> >
> > Once a process holds a queue plug, any write issued to that plug list
> > will do an srcu_read_lock(). So as far as I can tell, the smp_mb() is
> > needed to ensure that an immediately following synchronize_srcu() from a
> > barrier write queued on a different CPU will see that srcu_read_lock().
>
> Is the srcu_read_lock() invoked before actually queueing the I/O?
Yes,
> Is there any interaction with the queues before calling synchronize_srcu()?
> If yes to both, it might be possible to remove the memory barrier.
There might, there might not be. Currently it replugs the queue to
prevent a livelock if the same process currently holds a read_lock on
the queue. And I guess that will stay, so that's a yes here as well.
> That said, the overhead of the smp_mb() is so small compared to that of
> the I/O path that it might not be worth it.
I could not convince myself that it wasn't always needed, so I'd agree.
--
Jens Axboe
On Mon, 20 Nov 2006, Paul E. McKenney wrote:
> On Mon, Nov 20, 2006 at 12:19:19PM -0500, Alan Stern wrote:
> > Paul:
> >
> > Here's my version of your patch from yesterday. It's basically the same,
> > but I cleaned up the code in a few places and fixed a bug (the sign of idx
> > in srcu_read_unlock). Also I changed the init routine back to void, since
> > it's no longer an error if the per-cpu allocation fails.
>
> I don't see any changes in the sign of idx.
Your earlier patch had srcu_read_unlock doing:
+ if (likely(idx <= 0)) {
+ preempt_disable();
+ smp_mb();
+ per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
+ smp_processor_id())->c[idx]--;
+ preempt_enable();
+ return;
+ }
Obviously you meant the test to be "idx >= 0".
> > -int init_srcu_struct(struct srcu_struct *sp)
> > +void init_srcu_struct(struct srcu_struct *sp)
> > {
> > sp->completed = 0;
> > mutex_init(&sp->mutex);
> > - sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
> > - return (sp->per_cpu_ref ? 0 : -ENOMEM);
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > + atomic_set(&sp->hardluckref[0], 0);
> > + atomic_set(&sp->hardluckref[1], 0);
> > }
>
> Nack -- the caller is free to ignore the error return, but should be
> able to detect it in case the caller is unable to tolerate the overhead
> of running in hardluckref mode, perhaps instead choosing to fail a
> user-level request in order to try to reduce memory pressure.
Okay. Remember to change back the declaration in srcu.h as well.
> I did update the comment to say that you can use SRCU_INITIALIZER()
> instead.
Good.
> > int srcu_read_lock(struct srcu_struct *sp)
> > {
> > int idx;
> > + struct srcu_struct_array *sap;
> > +
> > + if (unlikely(sp->per_cpu_ref == NULL &&
> > + mutex_trylock(&sp->mutex))) {
> > + if (sp->per_cpu_ref == NULL)
> > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > + mutex_unlock(&sp->mutex);
> > + }
> >
> > preempt_disable();
> > idx = sp->completed & 0x1;
> > barrier(); /* ensure compiler looks -once- at sp->completed. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > + sap = rcu_dereference(sp->per_cpu_ref);
> > + if (likely(sap != NULL)) {
> > + per_cpu_ptr(sap, smp_processor_id())->c[idx]++;
> > + smp_mb();
> > + } else {
> > + atomic_inc(&sp->hardluckref[idx]);
> > + smp_mb__after_atomic_inc();
> > + idx = -1 - idx;
> > + }
> > preempt_enable();
> > return idx;
> > }
>
> Good restructuring -- took this, though I restricted the unlikely() to
> cover only the comparison to NULL, since the mutex_trylock() has a
> reasonable chance of success assuming that the read path has substantial
> overhead from other sources.
I vacillated over that. It's not clear what difference it will make to
the compiler, but I guess you would make it a little clearer to a reader
that the unlikely part is the test for NULL.
> > @@ -131,10 +166,16 @@ int srcu_read_lock(struct srcu_struct *s
> > */
> > void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > {
> > - preempt_disable();
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> > - preempt_enable();
> > + if (likely(idx >= 0)) {
> > + smp_mb();
> > + preempt_disable();
> > + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> > + smp_processor_id())->c[idx]--;
> > + preempt_enable();
> > + } else {
> > + smp_mb__before_atomic_dec();
> > + atomic_dec(&sp->hardluckref[-1 - idx]);
> > + }
> > }
>
> I took the moving smp_mb() out from under preempt_disable().
> I left the "return" -- same number of lines either way.
> I don't see any changes in sign compared to what I had, FWIW.
Compare your "if" statement to mine.
> > @@ -168,71 +214,35 @@ void synchronize_srcu(struct srcu_struct
> > * either (1) wait for two or (2) supply the second ourselves.
> > */
> >
> > - if ((sp->completed - idx) >= 2) {
> > - mutex_unlock(&sp->mutex);
> > - return;
> > - }
> > + if ((sp->completed - idx) >= 2)
> > + goto done;
>
> I don't see the benefit of gathering all the mutex_unlock()s together.
> If the unwinding was more complicated, I would take this change, however.
It's not a big deal. A couple of extra function calls won't add much
code.
> > +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> > + if (srcu_readers_active_idx(sp, idx) == 0)
> > + goto done;
> > +#endif
>
> I still don't trust this one. I would trust it a bit more if it were
> srcu_readers_active() rather than srcu_readers_active_idx(), but even
> then I suspect that external synchronization is required.
There should be some sort of assertion at the start of synchronize_srcu
(just after the mutex is acquired) that srcu_readers_active_idx() yields 0
for the inactive idx, and a similar assertion at the end (just before the
mutex is released). Hence there's no need to check both indices.
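Something like the following, purely as an illustration (it assumes the
srcu_struct layout and srcu_readers_active_idx() from the patches earlier
in this thread):

/* illustrative debug check; call with sp->mutex held */
static void srcu_assert_inactive_idx_idle(struct srcu_struct *sp)
{
	int inactive = (sp->completed & 0x1) ^ 1;

	WARN_ON(srcu_readers_active_idx(sp, inactive) != 0);
}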
> Or is there
> something that I am missing that makes it safe in face of the sequence
> of events that you, Oleg, and I were discussing?
Consider the sequence of events we were discussing.
srcu_read_lock and synchronize_srcu are invoked at the same
time. The writer sees that the count is 0 and takes the fast
path, leaving sp->completed unchanged. The reader sees the
new value of the data pointer and starts using it.
While the reader is still running, synchronize_srcu is called
again. This time it sees that the count is > 0, so it doesn't
take the fast path. Everything follows according to your
original code.
The underlying problem with Oleg's code was that the reader could end up
incrementing the wrong counter. That can't happen with the fast path,
because sp->completed doesn't change.
> For the moment, this optimization should be done by the caller, and
> should be prominently commented.
Prominent commenting is certainly a good idea. However I don't see how
the caller can make this optimization without getting into the guts of the
SRCU implementation. After all, a major part of the optimization is
to avoid calling synchronize_sched().
Alan
On Mon, Nov 20, 2006 at 09:57:12PM +0300, Oleg Nesterov wrote:
> On 11/20, Alan Stern wrote:
> >
> > @@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
> >
> > [... snip ...]
> >
> > +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> > + if (srcu_readers_active_idx(sp, idx) == 0)
> > + goto done;
> > +#endif
>
> I guess this is connected to another message from you,
>
> > But of course it _is_ needed for the fastpath to work. In fact, it might
> > not be good enough, depending on the architecture. Here's what the
> > fastpath ends up looking like (using c[idx] is essentially the same as
> > using hardluckref):
> >
> > WRITER READER
> > ------ ------
> > dataptr = &(new data) atomic_inc(&hardluckref)
> > mb mb
> > while (hardluckref > 0) ; access *dataptr
> >
> > Notice the pattern: Each CPU does store-mb-load. It is known that on
> > some architectures each CPU can end up loading the old value (the value
> > from before the other CPU's store). This would mean the writer would see
> > hardluckref == 0 right away and the reader would see the old dataptr.
>
> So, if we have global A == B == 0,
>
> CPU_0 CPU_1
>
> A = 1; B = 2;
> mb(); mb();
> b = B; a = A;
>
> It could happen that a == b == 0, yes? Doesn't this contradict the definition
> of mb?
It can and does happen. -Which- definition of mb()? ;-)
To see how this can happen, think of the SMP system as a message-passing
system, and consider the following sequence of events:
o The cache line for A is initially in CPU 1's cache, and the
cache line for B is initially in CPU 0's cache (backwards of
what you would want knowing about the upcoming writes).
o CPU 0 stores to A, but because A is not in cache, places it in
CPU 0's store queue. It also puts out a request for ownership
of the cache line containing A.
o CPU 1 stores to B, with the same situation as for CPU 0's store
to A.
o Both CPUs execute an mb(), which ensures that any subsequent writes
follow the writes to A and B, respectively. Since neither CPU
has yet received the other CPU's request for ownership, there are
no ordering effects on subsequent reads.
o CPU 0 executes "b = B", and since B is in CPU 0's cache, it loads
the current value, which is zero.
o Ditto for CPU 1 and A.
o CPUs 0 and 1 now receive each other's requests for ownership, so
exchange the cache lines containing A and B.
o Once CPUs 0 and 1 receive ownership of the respective cache lines,
they complete their writes to A and B (moving the values from the
store buffers to the cache lines).
> By definition, when CPU_0 issues 'b = B', 'A = 1' should be visible to other
> CPUs, yes? Now, b == 0 means that CPU_1 did not read 'a = A' yet, otherwise
> 'B = 2' should be visible to all CPUs (by definition again).
>
> Could you please clarify this?
See above...
> Btw, this is funny, but I was going to suggest _exactly_ same cleanup for
> srcu_read_lock :)
;-)
Thanx, Paul
On Mon, Nov 20, 2006 at 03:01:59PM -0500, Alan Stern wrote:
> On Mon, 20 Nov 2006, Oleg Nesterov wrote:
>
> > On 11/20, Alan Stern wrote:
> > >
> > > @@ -158,6 +199,11 @@ void synchronize_srcu(struct srcu_struct
> > >
> > > [... snip ...]
> > >
> > > +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> > > + if (srcu_readers_active_idx(sp, idx) == 0)
> > > + goto done;
> > > +#endif
> >
> > I guess this is connected to another message from you,
>
> Yes.
>
> > > But of course it _is_ needed for the fastpath to work. In fact, it might
> > > not be good enough, depending on the architecture. Here's what the
> > > fastpath ends up looking like (using c[idx] is essentially the same as
> > > using hardluckref):
> > >
> > > WRITER READER
> > > ------ ------
> > > dataptr = &(new data) atomic_inc(&hardluckref)
> > > mb mb
> > > while (hardluckref > 0) ; access *dataptr
> > >
> > > Notice the pattern: Each CPU does store-mb-load. It is known that on
> > > some architectures each CPU can end up loading the old value (the value
> > > from before the other CPU's store). This would mean the writer would see
> > > hardluckref == 0 right away and the reader would see the old dataptr.
> >
> > So, if we have global A == B == 0,
> >
> > CPU_0 CPU_1
> >
> > A = 1; B = 2;
> > mb(); mb();
> > b = B; a = A;
> >
> > It could happen that a == b == 0, yes?
>
> Exactly.
>
> > Doesn't this contradict the definition
> > of mb?
>
> One might think so, at first. But if you do a careful search, you'll find
> that there _is_ no definition of mb! People state in vague terms what
> it's supposed to do, but they are almost never specific enough to tell
> whether the example above should work.
Yep -- mb() is currently defined only for specific CPUs. :-/
Some Linux kernel code has been written by considering each SMP-capable
CPU in turn, but that does not scale with increasing numbers of SMP-capable
CPUs.
> > By definition, when CPU_0 issues 'b = B', 'A = 1' should be visible to other
> > CPUs, yes?
>
> No. Memory barriers don't guarantee that any particular store will become
> visible to other CPUs at any particular time. They guarantee only that a
> certain sequence of stores will become visible in a particular order
> (provided the other CPUs also use the correct memory barriers).
>
> > Now, b == 0 means that CPU_1 did not read 'a = A' yet, otherwise
> > 'B = 2' should be visible to all CPUs (by definition again).
> >
> > Could you please clarify this?
>
> Here's an example showing how the code can fail. (Paul can correct me if
> I get this wrong.)
Looks good to me!
Thanx, Paul
On Mon, 20 Nov 2006, Jens Axboe wrote:
> > > Must we introduce memory allocations in srcu_read_lock()? It makes it
> > > much harder and nastier for me to use. I'd much prefer a failing
> > > init_srcu(), seems like a much better API.
> >
> > Paul agrees with you that allocation failures in init_srcu() should be
> > passed back to the caller, and I certainly don't mind doing so.
> >
> > However we can't remove the memory allocation in srcu_read_lock(). That
> > was the point which started this whole thread: the per-cpu allocation
> > cannot be done statically, and some users of a static SRCU structure can't
> > easily call init_srcu() early enough.
> >
> > Once the allocation succeeds, the overhead in srcu_read_lock() is minimal.
>
> It's not about the overhead, it's about a potentially problematic
> allocation.
I'm not sure what you mean by "problematic allocation". If you
successfully call init_srcu_struct then the allocation will be taken care
of. Later calls to srcu_read_lock won't experience any slowdowns or
problems.
If your call to init_srcu_struct isn't successful then you have to decide
how to handle it. You can ignore the failure and live with degraded
performance (caused by cache-line contention and repeated attempts to do
the per-cpu allocation), or you can give up entirely.
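In caller terms, that choice is no more than this (a minimal sketch,
assuming init_srcu_struct() keeps its int return and hands back -ENOMEM
on allocation failure; the my_subsys names are made up):

static struct srcu_struct my_srcu;

static int __init my_subsys_init(void)
{
	int err = init_srcu_struct(&my_srcu);	/* assumes int return */

	if (err)
		return err;	/* give up entirely... */
	/* ...or ignore err and accept the slower hardluckref mode */
	return 0;
}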
Does this answer your objection? If not, can you explain in more detail
what other features you would like?
Alan Stern
On Mon, Nov 20 2006, Alan Stern wrote:
> On Mon, 20 Nov 2006, Jens Axboe wrote:
>
> > > > Must we introduce memory allocations in srcu_read_lock()? It makes it
> > > > much harder and nastier for me to use. I'd much prefer a failing
> > > > init_srcu(), seems like a much better API.
> > >
> > > Paul agrees with you that allocation failures in init_srcu() should be
> > > passed back to the caller, and I certainly don't mind doing so.
> > >
> > > However we can't remove the memory allocation in srcu_read_lock(). That
> > > was the point which started this whole thread: the per-cpu allocation
> > > cannot be done statically, and some users of a static SRCU structure can't
> > > easily call init_srcu() early enough.
> > >
> > > Once the allocation succeeds, the overhead in srcu_read_lock() is minimal.
> >
> > It's not about the overhead, it's about a potentially problematic
> > allocation.
>
> I'm not sure what you mean by "problematic allocation". If you
> successfully call init_srcu_struct then the allocation will be taken care
> of. Later calls to srcu_read_lock won't experience any slowdowns or
> problems.
That requires init_srcu_struct() to return the error. If it does that,
I'm fine with it.
> Does this answer your objection? If not, can you explain in more detail
> what other features you would like?
It does, if the allocation failure in init_srcu_struct() is signalled.
--
Jens Axboe
On 11/20, Paul E. McKenney wrote:
>
> On Mon, Nov 20, 2006 at 09:57:12PM +0300, Oleg Nesterov wrote:
> > >
> > So, if we have global A == B == 0,
> >
> > CPU_0 CPU_1
> >
> > A = 1; B = 2;
> > mb(); mb();
> > b = B; a = A;
> >
> > It could happen that a == b == 0, yes? Doesn't this contradict the definition
> > of mb?
>
> It can and does happen. -Which- definition of mb()? ;-)
I had a somewhat similar understanding before this discussion
[PATCH] Fix RCU race in access of nohz_cpu_mask
http://marc.theaimsgroup.com/?t=113378060600003
Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ]
http://marc.theaimsgroup.com/?t=113432312600001
Could you please explain me again why that fix was correct? What we have now is:
CPU_0 CPU_1
rcu_start_batch: stop_hz_timer:
rcp->cur++; STORE nohz_cpu_mask |= cpu
smp_mb(); mb(); // missed actually
->cpumask = ~nohz_cpu_mask; LOAD if (rcu_pending()) // reads rcp->cur
nohz_cpu_mask &= ~cpu
So, it is possible that CPU_0 reads an empty nohz_cpu_mask and starts a grace
period with CPU_1 included in rcp->cpumask. CPU_1 in turn reads an old value
of rcp->cur (so rcu_pending() returns 0) and becomes CPU_IDLE.
Take another patch,
Re: Oops on 2.6.18
http://marc.theaimsgroup.com/?l=linux-kernel&m=116266392016286
switch_uid: __sigqueue_alloc:
STORE 'new_user' to ->user STORE "locked" to ->siglock
mb(); "mb()"; // sort of, wrt loads/stores above
LOAD ->siglock LOAD ->siglock
Again, it is possible that switch_uid() doesn't notice that ->siglock is locked
and frees ->user. __sigqueue_alloc() in turn reads an old (freed) value of ->user
and does get_uid() on it.
> To see how this can happen, think of the SMP system as a message-passing
> system, and consider the following sequence of events:
>
> o The cache line for A is initially in CPU 1's cache, and the
> cache line for B is initially in CPU 0's cache (backwards of
> what you would want knowing about the upcoming writes).
>
> o CPU 0 stores to A, but because A is not in cache, places it in
> CPU 0's store queue. It also puts out a request for ownership
> of the cache line containing A.
>
> o CPU 1 stores to B, with the same situation as for CPU 0's store
> to A.
>
> o Both CPUs execute an mb(), which ensures that any subsequent writes
> follow the writes to A and B, respectively. Since neither CPU
> has yet received the other CPU's request for ownership, there are
> no ordering effects on subsequent reads.
>
> o CPU 0 executes "b = B", and since B is in CPU 0's cache, it loads
> the current value, which is zero.
>
> o Ditto for CPU 1 and A.
>
> o CPUs 0 and 1 now receive each other's requests for ownership, so
> exchange the cache lines containing A and B.
>
> o Once CPUs 0 and 1 receive ownership of the respective cache lines,
> they complete their writes to A and B (moving the values from the
> store buffers to the cache lines).
Paul, Alan, in case it was not clear: I am not arguing, just trying to
understand, and I appreciate very much your time and your explanations.
Oleg.
On Mon, Nov 20, 2006 at 03:22:58PM -0500, Alan Stern wrote:
> On Mon, 20 Nov 2006, Paul E. McKenney wrote:
>
> > On Mon, Nov 20, 2006 at 12:19:19PM -0500, Alan Stern wrote:
> > > Paul:
> > >
> > > Here's my version of your patch from yesterday. It's basically the same,
> > > but I cleaned up the code in a few places and fixed a bug (the sign of idx
> > > in srcu_read_unlock). Also I changed the init routine back to void, since
> > > it's no longer an error if the per-cpu allocation fails.
> >
> > I don't see any changes in the sign of idx.
>
> Your earlier patch had srcu_read_unlock doing:
>
> + if (likely(idx <= 0)) {
> + preempt_disable();
> + smp_mb();
> + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> + smp_processor_id())->c[idx]--;
> + preempt_enable();
> + return;
> + }
>
> Obviously you meant the test to be "idx >= 0".
That would explain the oops upon kicking off rcutorture. ;-) Good catch --
must be time for me to visit the optometrist or something...
> > > -int init_srcu_struct(struct srcu_struct *sp)
> > > +void init_srcu_struct(struct srcu_struct *sp)
> > > {
> > > sp->completed = 0;
> > > mutex_init(&sp->mutex);
> > > - sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
> > > - return (sp->per_cpu_ref ? 0 : -ENOMEM);
> > > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > > + atomic_set(&sp->hardluckref[0], 0);
> > > + atomic_set(&sp->hardluckref[1], 0);
> > > }
> >
> > Nack -- the caller is free to ignore the error return, but should be
> > able to detect it in case the caller is unable to tolerate the overhead
> > of running in hardluckref mode, perhaps instead choosing to fail a
> > user-level request in order to try to reduce memory pressure.
>
> Okay. Remember to change back the declaration in srcu.h as well.
I did manage to get this one right. ;-)
> > I did update the comment to say that you can use SRCU_INITIALIZER()
> > instead.
>
> Good.
>
> > > int srcu_read_lock(struct srcu_struct *sp)
> > > {
> > > int idx;
> > > + struct srcu_struct_array *sap;
> > > +
> > > + if (unlikely(sp->per_cpu_ref == NULL &&
> > > + mutex_trylock(&sp->mutex))) {
> > > + if (sp->per_cpu_ref == NULL)
> > > + sp->per_cpu_ref = alloc_srcu_struct_percpu();
> > > + mutex_unlock(&sp->mutex);
> > > + }
> > >
> > > preempt_disable();
> > > idx = sp->completed & 0x1;
> > > barrier(); /* ensure compiler looks -once- at sp->completed. */
> > > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > > + sap = rcu_dereference(sp->per_cpu_ref);
> > > + if (likely(sap != NULL)) {
> > > + per_cpu_ptr(sap, smp_processor_id())->c[idx]++;
> > > + smp_mb();
> > > + } else {
> > > + atomic_inc(&sp->hardluckref[idx]);
> > > + smp_mb__after_atomic_inc();
> > > + idx = -1 - idx;
> > > + }
> > > preempt_enable();
> > > return idx;
> > > }
> >
> > Good restructuring -- took this, though I restricted the unlikely() to
> > cover only the comparison to NULL, since the mutex_trylock() has a
> > reasonable chance of success assuming that the read path has substantial
> > overhead from other sources.
>
> I vacillated over that. It's not clear what difference it will make to
> the compiler, but I guess you would make it a little clearer to a reader
> that the unlikely part is the test for NULL.
Agreed, probably mostly for the benefit of the human reader.
> > > @@ -131,10 +166,16 @@ int srcu_read_lock(struct srcu_struct *s
> > > */
> > > void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > > {
> > > - preempt_disable();
> > > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
> > > - preempt_enable();
> > > + if (likely(idx >= 0)) {
> > > + smp_mb();
> > > + preempt_disable();
> > > + per_cpu_ptr(rcu_dereference(sp->per_cpu_ref),
> > > + smp_processor_id())->c[idx]--;
> > > + preempt_enable();
> > > + } else {
> > > + smp_mb__before_atomic_dec();
> > > + atomic_dec(&sp->hardluckref[-1 - idx]);
> > > + }
> > > }
> >
> > I took the moving smp_mb() out from under preempt_disable().
> > I left the "return" -- same number of lines either way.
> > I don't see any changes in sign compared to what I had, FWIW.
>
> Compare your "if" statement to mine.
<red face>
> > > @@ -168,71 +214,35 @@ void synchronize_srcu(struct srcu_struct
> > > * either (1) wait for two or (2) supply the second ourselves.
> > > */
> > >
> > > - if ((sp->completed - idx) >= 2) {
> > > - mutex_unlock(&sp->mutex);
> > > - return;
> > > - }
> > > + if ((sp->completed - idx) >= 2)
> > > + goto done;
> >
> > I don't see the benefit of gathering all the mutex_unlock()s together.
> > If the unwinding was more complicated, I would take this change, however.
>
> It's not a big deal. A couple of extra function calls won't add much
> code.
And some compilers do this kind of gathering automatically anyway.
> > > +#ifdef SMP__STORE_MB_LOAD_WORKS /* The fast path */
> > > + if (srcu_readers_active_idx(sp, idx) == 0)
> > > + goto done;
> > > +#endif
> >
> > I still don't trust this one. I would trust it a bit more if it were
> > srcu_readers_active() rather than srcu_readers_active_idx(), but even
> > then I suspect that external synchronization is required.
>
> There should be some sort of assertion at the start of synchronize_srcu
> (just after the mutex is acquired) that srcu_readers_active_idx() yields 0
> for the inactive idx, and a similar assertion at the end (just before the
> mutex is released). Hence there's no need to check both indices.
OK, good point -- the non-active set is indeed supposed to sum to zero
unless someone is in synchronize_srcu().
> > Or is there
> > something that I am missing that makes it safe in face of the sequence
> > of events that you, Oleg, and I were discussing?
>
> Consider the sequence of events we were discussing.
>
> srcu_read_lock and synchronize_srcu are invoked at the same
> time. The writer sees that the count is 0 and takes the fast
> path, leaving sp->completed unchanged. The reader sees the
> new value of the data pointer and starts using it.
>
> While the reader is still running, synchronize_srcu is called
> again. This time it sees that the count is > 0, so it doesn't
> take the fast path. Everything follows according to your
> original code.
>
> The underlying problem with Oleg's code was that the reader could end up
> incrementing the wrong counter. That can't happen with the fast path,
> because sp->completed doesn't change.
OK, then the remaining area of my paranoia centers around unfortunate
sequences of preemptions, increments, and decrements of the counters
while synchronize_srcu() is reading them out (and also being preempted).
Intuitively, it seems like this should always be safe, but I am worried
about it nonetheless.
> > For the moment, this optimization should be done by the caller, and
> > should be prominently commented.
>
> Prominent commenting is certainly a good idea. However I don't see how
> the caller can make this optimization without getting into the guts of the
> SRCU implementation. After all, a major part of the optimization is
> to avoid calling synchronize_sched().
In general, I agree. There may be special cases where the caller knows
it is safe due to things unknown to synchronize_srcu() -- Jens's example
might be a case in point.
Thanx, Paul
Here's another potential problem with the fast path approach. It's not
very serious, but you might want to keep it in mind.
The idea is that a reader can start up on one CPU and finish on another,
and a writer might see the finish event but not the start event. For
example:
Reader A enters the critical section on CPU 0 and starts
accessing the old data area.
Writer B updates the data pointer and starts executing
srcu_readers_active_idx() to check if the fast path can be
used. It sees per_cpu_ptr(0)->c[idx] == 1 because of
Reader A.
Reader C runs srcu_read_lock() on CPU 0, setting
per_cpu_ptr[0]->c[idx] to 2.
Reader C migrates to CPU 1 and leaves the critical section;
srcu_read_unlock() sets per_cpu_ptr(1)->c[idx] to -1.
Writer B finishes the cpu loop in srcu_readers_active_idx(),
seeing per_cpu_ptr(1)->c[idx] == -1. It computes sum =
1 + -1 == 0, takes the fast path, and exits immediately
from synchronize_srcu().
Writer B deallocates the old data area while Reader A is still
using it.
This requires two context switches to take place while the cpu loop in
srcu_readers_active_idx() runs, so perhaps it isn't realistic. Is it
worth worrying about?
Alan
On Tue, Nov 21, 2006 at 12:56:21PM -0500, Alan Stern wrote:
> Here's another potential problem with the fast path approach. It's not
> very serious, but you might want to keep it in mind.
>
> The idea is that a reader can start up on one CPU and finish on another,
> and a writer might see the finish event but not the start event. For
> example:
>
> Reader A enters the critical section on CPU 0 and starts
> accessing the old data area.
>
> Writer B updates the data pointer and starts executing
> srcu_readers_active_idx() to check if the fast path can be
> used. It sees per_cpu_ptr(0)->c[idx] == 1 because of
> Reader A.
>
> Reader C runs srcu_read_lock() on CPU 0, setting
> per_cpu_ptr[0]->c[idx] to 2.
>
> Reader C migrates to CPU 1 and leaves the critical section;
> srcu_read_unlock() sets per_cpu_ptr(1)->c[idx] to -1.
>
> Writer B finishes the cpu loop in srcu_readers_active_idx(),
> seeing per_cpu_ptr(1)->c[idx] == -1. It computes sum =
> 1 + -1 == 0, takes the fast path, and exits immediately
> from synchronize_srcu().
>
> Writer B deallocates the old data area while Reader A is still
> using it.
>
> This requires two context switches to take place while the cpu loop in
> srcu_readers_active_idx() runs, so perhaps it isn't realistic. Is it
> worth worrying about?
Thank you -very- -much- for finding the basis behind my paranoia!
I guess my intuition is still in good working order. ;-)
It might be unlikely, but that makes it even worse -- a strange memory
corruption problem that happens only under heavy load, and even then only
sometimes. No thank you!!!
I suspect that this affects Jens as well, though I don't claim to
completely understand his usage.
One approach to get around this would be for the "idx" returned from
srcu_read_lock() to keep track of the CPU as well as the index within
the CPU. This would require atomic_inc()/atomic_dec() on the fast path,
but would not add much to the overhead on x86 because the smp_mb() imposes
an atomic operation anyway. There would be little cache thrashing in the
case where there is no preemption -- but if the readers almost always sleep,
and where it is common for the srcu_read_unlock() to run on a different CPU
than the srcu_read_lock(), then the additional cache thrashing could add
significant overhead.
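For illustration, encoding the CPU into the cookie might look something
like this (a sketch that assumes the per-cpu counters become atomic_t, as
noted above; it is not meant as a worked-out patch):

int srcu_read_lock(struct srcu_struct *sp)
{
	int cpu, idx;

	preempt_disable();
	cpu = smp_processor_id();
	idx = sp->completed & 0x1;
	/* assumes ->c[] has been switched over to atomic_t */
	atomic_inc(&per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
	smp_mb__after_atomic_inc();
	preempt_enable();
	return (cpu << 1) | idx;	/* cookie records both CPU and index */
}

void srcu_read_unlock(struct srcu_struct *sp, int cookie)
{
	int cpu = cookie >> 1;
	int idx = cookie & 0x1;

	smp_mb__before_atomic_dec();
	/* decrement on the CPU where the increment was done */
	atomic_dec(&per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
}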
Thoughts?
Thanx, Paul
On Tue, Nov 21, 2006 at 07:44:20PM +0300, Oleg Nesterov wrote:
> On 11/20, Paul E. McKenney wrote:
> >
> > On Mon, Nov 20, 2006 at 09:57:12PM +0300, Oleg Nesterov wrote:
> > > >
> > > So, if we have global A == B == 0,
> > >
> > > CPU_0 CPU_1
> > >
> > > A = 1; B = 2;
> > > mb(); mb();
> > > b = B; a = A;
> > >
> > > It could happen that a == b == 0, yes? Doesn't this contradict the definition
> > > of mb?
> >
> > It can and does happen. -Which- definition of mb()? ;-)
>
> I had a somewhat similar understanding before this discussion
>
> [PATCH] Fix RCU race in access of nohz_cpu_mask
> http://marc.theaimsgroup.com/?t=113378060600003
>
> Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ]
> http://marc.theaimsgroup.com/?t=113432312600001
>
> Could you please explain me again why that fix was correct? What we have now is:
>
> CPU_0 CPU_1
> rcu_start_batch: stop_hz_timer:
>
> rcp->cur++; STORE nohz_cpu_mask |= cpu
>
> smp_mb(); mb(); // missed actually
>
> ->cpumask = ~nohz_cpu_mask; LOAD if (rcu_pending()) // reads rcp->cur
> nohz_cpu_mask &= ~cpu
>
> So, it is possible that CPU_0 reads an empty nohz_cpu_mask and starts a grace
> period with CPU_1 included in rcp->cpumask. CPU_1 in turn reads an old value
> of rcp->cur (so rcu_pending() returns 0) and becomes CPU_IDLE.
At this point, I am not certain that it is in fact correct. :-/
> Take another patch,
>
> Re: Oops on 2.6.18
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116266392016286
>
> switch_uid: __sigqueue_alloc:
>
> STORE 'new_user' to ->user STORE "locked" to ->siglock
>
> mb(); "mb()"; // sort of, wrt loads/stores above
>
> LOAD ->siglock LOAD ->siglock
>
> Again, it is possible that switch_uid() doesn't notice that ->siglock is locked
> and frees ->user. __sigqueue_alloc() in turn reads an old (freed) value of ->user
> and does get_uid() on it.
Ditto.
> > To see how this can happen, think of the SMP system as a message-passing
> > system, and consider the following sequence of events:
> >
> > o The cache line for A is initially in CPU 1's cache, and the
> > cache line for B is initially in CPU 0's cache (backwards of
> > what you would want knowing about the upcoming writes).
> >
> > o CPU 0 stores to A, but because A is not in cache, places it in
> > CPU 0's store queue. It also puts out a request for ownership
> > of the cache line containing A.
> >
> > o CPU 1 stores to B, with the same situation as for CPU 0's store
> > to A.
> >
> > o Both CPUs execute an mb(), which ensures that any subsequent writes
> > follow the writes to A and B, respectively. Since neither CPU
> > has yet received the other CPU's request for ownership, there are
> > no ordering effects on subsequent reads.
> >
> > o CPU 0 executes "b = B", and since B is in CPU 0's cache, it loads
> > the current value, which is zero.
> >
> > o Ditto for CPU 1 and A.
> >
> > o CPUs 0 and 1 now receive each other's requests for ownership, so
> > exchange the cache lines containing A and B.
> >
> > o Once CPUs 0 and 1 receive ownership of the respective cache lines,
> > they complete their writes to A and B (moving the values from the
> > store buffers to the cache lines).
>
> Paul, Alan, in case it was not clear: I am not arguing, just trying to
> understand, and I appreciate very much your time and your explanations.
Either way, we clearly need better definitions of what the memory barriers
actually do! And I expect that we will need your help.
Thanx, Paul
On 11/20, Alan Stern wrote:
>
> On Mon, 20 Nov 2006, Oleg Nesterov wrote:
>
> > So, if we have global A == B == 0,
> >
> > CPU_0 CPU_1
> >
> > A = 1; B = 2;
> > mb(); mb();
> > b = B; a = A;
> >
> > It could happen that a == b == 0, yes?
>
> Both CPUs execute their "mb" instructions. The mb forces each
> cache to wait until it receives an Acknowledgement for the
> Invalidate it sent.
>
> Both caches send an Acknowledgement message to the other. The
> mb instructions complete.
>
> "b = B" and "a = A" execute. The caches return A==0 and B==0
> because they haven't yet invalidated their cache lines.
>
> The reason the code failed is because the mb instructions didn't force
> the caches to wait until the Invalidate messages in their queues had been
> fully carried out (i.e., the lines had actually been invalidated).
However, from
http://marc.theaimsgroup.com/?l=linux-kernel&m=113435711112941
Paul E. McKenney wrote:
>
> 2. rmb() guarantees that any changes seen by the interconnect
> preceding the rmb() will be seen by any reads following the
> rmb().
>
> 3. mb() combines the guarantees made by rmb() and wmb().
Confused :(
Oleg.
On Tue, 21 Nov 2006, Paul E. McKenney wrote:
> On Tue, Nov 21, 2006 at 07:44:20PM +0300, Oleg Nesterov wrote:
> > On 11/20, Paul E. McKenney wrote:
> > >
> > > On Mon, Nov 20, 2006 at 09:57:12PM +0300, Oleg Nesterov wrote:
> > > > >
> > > > So, if we have global A == B == 0,
> > > >
> > > > CPU_0 CPU_1
> > > >
> > > > A = 1; B = 2;
> > > > mb(); mb();
> > > > b = B; a = A;
> > > >
> > > > It could happen that a == b == 0, yes? Doesn't this contradict the definition
> > > > of mb?
> > >
> > > It can and does happen. -Which- definition of mb()? ;-)
> >
> > I had a somewhat similar understanding before this discussion
> >
> > [PATCH] Fix RCU race in access of nohz_cpu_mask
> > http://marc.theaimsgroup.com/?t=113378060600003
> >
> > Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ]
> > http://marc.theaimsgroup.com/?t=113432312600001
> >
> > Could you please explain me again why that fix was correct? What we have now is:
> >
> > CPU_0 CPU_1
> > rcu_start_batch: stop_hz_timer:
> >
> > rcp->cur++; STORE nohz_cpu_mask |= cpu
> >
> > smp_mb(); mb(); // missed actually
> >
> > ->cpumask = ~nohz_cpu_mask; LOAD if (rcu_pending()) // reads rcp->cur
> > nohz_cpu_mask &= ~cpu
> >
> > So, it is possible that CPU_0 reads an empty nohz_cpu_mask and starts a grace
> > period with CPU_1 included in rcp->cpumask. CPU_1 in turn reads an old value
> > of rcp->cur (so rcu_pending() returns 0) and becomes CPU_IDLE.
>
> At this point, I am not certain that it is in fact correct. :-/
>
> > Take another patch,
> >
> > Re: Oops on 2.6.18
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=116266392016286
> >
> > switch_uid: __sigqueue_alloc:
> >
> > STORE 'new_user' to ->user STORE "locked" to ->siglock
> >
> > mb(); "mb()"; // sort of, wrt loads/stores above
> >
> > LOAD ->siglock LOAD ->siglock
> >
> > Again, it is possible that switch_uid() doesn't notice that ->siglock is locked
> > and frees ->user. __sigqueue_alloc() in turn reads an old (freed) value of ->user
> > and does get_uid() on it.
>
> Ditto.
> > Paul, Alan, in case it was not clear: I am not arguing, just trying to
> > understand, and I appreciate very much your time and your explanations.
>
> Either way, we clearly need better definitions of what the memory barriers
> actually do! And I expect that we will need your help.
Things may not be quite as bad as they appear. On many architectures the
store-mb-load pattern will work as expected. (In fact, I don't know which
architectures it might fail on.)
Furthermore this is a very difficult race to trigger. You couldn't force
it to happen, for example, by adding a delay somewhere.
Alan
On Tue, 21 Nov 2006, Paul E. McKenney wrote:
> On Tue, Nov 21, 2006 at 12:56:21PM -0500, Alan Stern wrote:
> > Here's another potential problem with the fast path approach. It's not
> > very serious, but you might want to keep it in mind.
> >
> > The idea is that a reader can start up on one CPU and finish on another,
> > and a writer might see the finish event but not the start event. For
> > example:
...
> > This requires two context switches to take place while the cpu loop in
> > srcu_readers_active_idx() runs, so perhaps it isn't realistic. Is it
> > worth worrying about?
>
> Thank you -very- -much- for finding the basis behind my paranoia!
> I guess my intuition is still in good working order. ;-)
Are you sure _this_ was the basis behind your paranoia? Maybe it had
something else in mind... :-)
> It might be unlikely, but that makes it even worse -- a strange memory
> corruption problem that happens only under heavy load, and even then only
> sometimes. No thank you!!!
>
> I suspect that this affects Jens as well, though I don't claim to
> completely understand his usage.
>
> One approach to get around this would be for the "idx" returned from
> srcu_read_lock() to keep track of the CPU as well as the index within
> the CPU. This would require atomic_inc()/atomic_dec() on the fast path,
> but would not add much to the overhead on x86 because the smp_mb() imposes
> an atomic operation anyway. There would be little cache thrashing in the
> case where there is no preemption -- but if the readers almost always sleep,
> and where it is common for the srcu_read_unlock() to run on a different CPU
> than the srcu_read_lock(), then the additional cache thrashing could add
> significant overhead.
>
> Thoughts?
I don't like the thought of extra overhead from cache thrashing. Also it
seems silly to allocate per-cpu data and then write to another CPU's
element.
How about making srcu_readers_active_idx() so fast that there isn't time
for 2 context switches? Disabling interrupts ought to be good enough
(except in virtualized environments perhaps).
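A rough sketch of that idea (illustrative only, using the per-cpu layout
from the earlier patches and ignoring the hardluckref path; note that
disabling interrupts only keeps the local scan tight, so it narrows rather
than provably closes the window for the two context switches):

static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
	int cpu;
	int sum = 0;
	unsigned long flags;

	local_irq_save(flags);	/* keep the scan as short as possible */
	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
	local_irq_restore(flags);
	return sum;
}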
Alan
On Tue, 21 Nov 2006, Oleg Nesterov wrote:
> On 11/20, Alan Stern wrote:
> >
> > Both CPUs execute their "mb" instructions. The mb forces each
> > cache to wait until it receives an Acknowledgement for the
> > Invalidate it sent.
> >
> > Both caches send an Acknowledgement message to the other. The
> > mb instructions complete.
> >
> > "b = B" and "a = A" execute. The caches return A==0 and B==0
> > because they haven't yet invalidated their cache lines.
> >
> > The reason the code failed is because the mb instructions didn't force
> > the caches to wait until the Invalidate messages in their queues had been
> > fully carried out (i.e., the lines had actually been invalidated).
>
> However, from
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113435711112941
>
> Paul E. McKenney wrote:
> >
> > 2. rmb() guarantees that any changes seen by the interconnect
> > preceding the rmb() will be seen by any reads following the
> > rmb().
> >
> > 3. mb() combines the guarantees made by rmb() and wmb().
>
> Confused :(
I'm not certain the odd behavior can occur on systems that use an
interconnect like Paul described. In the context I was describing, rmb()
guarantees only that any changes seen _and acknowledged_ by the cache
preceding the rmb() will be seen by any reads following the rmb(). It's a
weaker guarantee, but it still suffices to show that
A = 1 b = B
wmb rmb
B = 2 a = A
will work as expected.
Alan
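For reference, the pattern Alan describes above, written with the kernel's barrier primitives; variable names follow the example. This is a sketch of the guarantee being relied on, not code from any patch in this thread.

int A, B;

/* writer: order the store to A before the store to B */
void writer(void)
{
	A = 1;
	smp_wmb();
	B = 2;
}

/* reader: order the load of B before the load of A */
void reader(void)
{
	int a, b;

	b = B;
	smp_rmb();
	a = A;
	/* if b == 2 was observed, then a == 1 is guaranteed */
}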
On 11/21, Paul E. McKenney wrote:
>
> On Tue, Nov 21, 2006 at 12:56:21PM -0500, Alan Stern wrote:
> > Here's another potential problem with the fast path approach. It's not
> > very serious, but you might want to keep it in mind.
> >
> > The idea is that a reader can start up on one CPU and finish on another,
> > and a writer might see the finish event but not the start event. For
> > example:
>
> One approach to get around this would be for the "idx" returned from
> srcu_read_lock() to keep track of the CPU as well as the index within
> the CPU. This would require atomic_inc()/atomic_dec() on the fast path,
> but would not add much to the overhead on x86 because the smp_mb() imposes
> an atomic operation anyway. There would be little cache thrashing in the
> case where there is no preemption -- but if the readers almost always sleep,
> and where it is common for the srcu_read_unlock() to run on a different CPU
> than the srcu_read_lock(), then the additional cache thrashing could add
> significant overhead.
If you are going to do this, it seems better to just forget about ->per_cpu_ref,
and use only ->hardluckref[]. This also allows us to avoid the polling in
synchronize_srcu().
Oleg.
On Tue, Nov 21, 2006 at 11:04:41PM +0300, Oleg Nesterov wrote:
> On 11/20, Alan Stern wrote:
> >
> > On Mon, 20 Nov 2006, Oleg Nesterov wrote:
> >
> > > So, if we have global A == B == 0,
> > >
> > > CPU_0 CPU_1
> > >
> > > A = 1; B = 2;
> > > mb(); mb();
> > > b = B; a = A;
> > >
> > > It could happen that a == b == 0, yes?
> >
> > Both CPUs execute their "mb" instructions. The mb forces each
> > cache to wait until it receives an Acknowledgement for the
> > Invalidate it sent.
> >
> > Both caches send an Acknowledgement message to the other. The
> > mb instructions complete.
> >
> > "b = B" and "a = A" execute. The caches return A==0 and B==0
> > because they haven't yet invalidated their cache lines.
> >
> > The reason the code failed is because the mb instructions didn't force
> > the caches to wait until the Invalidate messages in their queues had been
> > fully carried out (i.e., the lines had actually been invalidated).
>
> However, from
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113435711112941
>
> Paul E. McKenney wrote:
> >
> > 2. rmb() guarantees that any changes seen by the interconnect
> > preceding the rmb() will be seen by any reads following the
> > rmb().
> >
> > 3. mb() combines the guarantees made by rmb() and wmb().
>
> Confused :(
There are the weasel words "seen by the interconnect". Alan is
pointing out that the stores to A and B might not have been "seen by the
interconnect" at the time that the pair of mb() instructions execute,
since the other function of the mb() instructions is to ensure that
any stores prior to each mb() are "seen by the interconnect" before any
subsequent stores are "seen by the interconnect".
Why wouldn't the store to A be seen by the interconnect at the time of
CPU 1's mb()? Because the cacheline containing A is still residing at
CPU 1. CPU 0's store to A cannot possibly be seen by the interconnect
until after CPU 0 receives the corresponding cacheline.
Yes, it is confusing. Memory barriers work a bit more straightforwardly
on MMIO accesses, thankfully. But it would probably be good to strive
for minimal numbers of memory barriers, especially in common code. :-/
Thanx, Paul
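For comparison, the disputed store-mb-load pattern from the example above, written with kernel primitives. Whether a == b == 0 is possible here is exactly the question under debate, so this sketch deliberately takes no position on the outcome.

int A, B;

void cpu0(void)
{
	int b;

	A = 1;		/* store */
	smp_mb();	/* full barrier */
	b = B;		/* load */
}

void cpu1(void)
{
	int a;

	B = 2;		/* store */
	smp_mb();	/* full barrier */
	a = A;		/* load */
}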
On Tue, Nov 21, 2006 at 03:26:44PM -0500, Alan Stern wrote:
> On Tue, 21 Nov 2006, Paul E. McKenney wrote:
>
> > On Tue, Nov 21, 2006 at 07:44:20PM +0300, Oleg Nesterov wrote:
> > > On 11/20, Paul E. McKenney wrote:
> > > >
> > > > On Mon, Nov 20, 2006 at 09:57:12PM +0300, Oleg Nesterov wrote:
> > > > > >
> > > > > So, if we have global A == B == 0,
> > > > >
> > > > > CPU_0 CPU_1
> > > > >
> > > > > A = 1; B = 2;
> > > > > mb(); mb();
> > > > > b = B; a = A;
> > > > >
> > > > > It could happen that a == b == 0, yes? Doesn't this contradict the definition
> > > > > of mb?
> > > >
> > > > It can and does happen. -Which- definition of mb()? ;-)
> > >
> > > I had a somewhat similar understanding before this discussion
> > >
> > > [PATCH] Fix RCU race in access of nohz_cpu_mask
> > > http://marc.theaimsgroup.com/?t=113378060600003
> > >
> > > Semantics of smp_mb() [was : Re: [PATCH] Fix RCU race in access of nohz_cpu_mask ]
> > > http://marc.theaimsgroup.com/?t=113432312600001
> > >
> > > Could you please explain me again why that fix was correct? What we have now is:
> > >
> > > CPU_0 CPU_1
> > > rcu_start_batch: stop_hz_timer:
> > >
> > > rcp->cur++; STORE nohz_cpu_mask |= cpu
> > >
> > > smp_mb(); mb(); // missed actually
> > >
> > > ->cpumask = ~nohz_cpu_mask; LOAD if (rcu_pending()) // reads rcp->cur
> > > nohz_cpu_mask &= ~cpu
> > >
> > > So, it is possible that CPU_0 reads an empty nohz_cpu_mask and starts a grace
> > > period with CPU_1 included in rcp->cpumask. CPU_1 in turn reads an old value
> > > of rcp->cur (so rcu_pending() returns 0) and becomes CPU_IDLE.
> >
> > At this point, I am not certain that it is in fact correct. :-/
> >
> > > Take another patch,
> > >
> > > Re: Oops on 2.6.18
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=116266392016286
> > >
> > > switch_uid: __sigqueue_alloc:
> > >
> > > STORE 'new_user' to ->user STORE "locked" to ->siglock
> > >
> > > mb(); "mb()"; // sort of, wrt loads/stores above
> > >
> > > LOAD ->siglock LOAD ->siglock
> > >
> > > Again, it is possible that switch_uid() doesn't notice that ->siglock is locked
> > > and frees ->user. __sigqueue_alloc() in turn reads an old (freed) value of ->user
> > > and does get_uid() on it.
> >
> > Ditto.
>
> > > Paul, Alan, in case it was not clear: I am not arguing, just trying to
> > > understand, and I appreciate very much your time and your explanations.
> >
> > Either way, we clearly need better definitions of what the memory barriers
> > actually do! And I expect that we will need your help.
>
> Things may not be quite as bad as they appear. On many architectures the
> store-mb-load pattern will work as expected. (In fact, I don't know which
> architectures it might fail on.)
Several weak-memory-ordering CPUs. :-/
> Furthermore this is a very difficult race to trigger. You couldn't force
> it to happen, for example, by adding a delay somewhere.
I have only seen it when explicitly forcing it, and even then it is not
easy to make happen. But how would you know whether or not it happened
in a kernel or large multithreaded application?
I am gaining increasing sympathy with anyone who might wish to reduce
the number of non-MMIO-related memory barriers in the Linux kernel!
Thanx, Paul
On Wed, Nov 22, 2006 at 12:01:05AM +0300, Oleg Nesterov wrote:
> On 11/21, Paul E. McKenney wrote:
> >
> > On Tue, Nov 21, 2006 at 12:56:21PM -0500, Alan Stern wrote:
> > > Here's another potential problem with the fast path approach. It's not
> > > very serious, but you might want to keep it in mind.
> > >
> > > The idea is that a reader can start up on one CPU and finish on another,
> > > and a writer might see the finish event but not the start event. For
> > > example:
> >
> > One approach to get around this would be for the "idx" returned from
> > srcu_read_lock() to keep track of the CPU as well as the index within
> > the CPU. This would require atomic_inc()/atomic_dec() on the fast path,
> > but would not add much to the overhead on x86 because the smp_mb() imposes
> > an atomic operation anyway. There would be little cache thrashing in the
> > case where there is no preemption -- but if the readers almost always sleep,
> > and where it is common for the srcu_read_unlock() to run on a different CPU
> > than the srcu_read_lock(), then the additional cache thrashing could add
> > significant overhead.
>
> If you are going to do this, it seems better to just forget about ->per_cpu_ref,
> and use only ->hardluckref[]. This also allows us to avoid the polling in
> synchronize_srcu().
If the readers are reasonably rare, that could work. If readers are
common, you get memory contention (as well as cache thrashing) on the
->hardluckref[] elements. But putting this degree of cache thrashing
into SRCU certainly does not feel right.
Thanx, Paul
On Tue, 21 Nov 2006, Paul E. McKenney wrote:
> > Things may not be quite as bad as they appear. On many architectures the
> > store-mb-load pattern will work as expected. (In fact, I don't know which
> > architectures it might fail on.)
>
> Several weak-memory-ordering CPUs. :-/
Of the CPUs supported by Linux, do you know which ones will work with
store-mb-load and which ones won't?
Alan
On Tue, Nov 21, 2006 at 09:17:59PM -0500, Alan Stern wrote:
> On Tue, 21 Nov 2006, Paul E. McKenney wrote:
>
> > > Things may not be quite as bad as they appear. On many architectures the
> > > store-mb-load pattern will work as expected. (In fact, I don't know which
> > > architectures it might fail on.)
> >
> > Several weak-memory-ordering CPUs. :-/
>
> Of the CPUs supported by Linux, do you know which ones will work with
> store-mb-load and which ones won't?
I have partial lists at this point. I confess to not having made
much progress porting my memory-barrier torture tests to the relevant
architectures over the past few weeks (handling the lack of synchronized
lightweight fine-grained timers being the current obstacle), but will
let people know once I have gotten the tests working on the machines
that I have access to.
I don't have access to SMP Alpha or ARM machines (or UP either, for that
matter), so won't be able to test those.
Thanx, Paul
On Tue, Nov 21, 2006 at 03:40:50PM -0500, Alan Stern wrote:
> On Tue, 21 Nov 2006, Paul E. McKenney wrote:
>
> > On Tue, Nov 21, 2006 at 12:56:21PM -0500, Alan Stern wrote:
> > > Here's another potential problem with the fast path approach. It's not
> > > very serious, but you might want to keep it in mind.
> > >
> > > The idea is that a reader can start up on one CPU and finish on another,
> > > and a writer might see the finish event but not the start event. For
> > > example:
> ...
> > > This requires two context switches to take place while the cpu loop in
> > > srcu_readers_active_idx() runs, so perhaps it isn't realistic. Is it
> > > worth worrying about?
> >
> > Thank you -very- -much- for finding the basis behind my paranoia!
> > I guess my intuition is still in good working order. ;-)
>
> Are you sure _this_ was the basis behind your paranoia? Maybe it had
> something else in mind... :-)
OK, I stand corrected, you found -one- basis for my paranoia. There might
indeed be others. However, only -one- counter-example is required to
invalidate a proposed algorithm. ;-)
> > It might be unlikely, but that makes it even worse -- a strange memory
> > corruption problem that happens only under heavy load, and even then only
> > sometimes. No thank you!!!
> >
> > I suspect that this affects Jens as well, though I don't claim to
> > completely understand his usage.
> >
> > One approach to get around this would be for the "idx" returned from
> > srcu_read_lock() to keep track of the CPU as well as the index within
> > the CPU. This would require atomic_inc()/atomic_dec() on the fast path,
> > but would not add much to the overhead on x86 because the smp_mb() imposes
> > an atomic operation anyway. There would be little cache thrashing in the
> > case where there is no preemption -- but if the readers almost always sleep,
> > and where it is common for the srcu_read_unlock() to run on a different CPU
> > than the srcu_read_lock(), then the additional cache thrashing could add
> > significant overhead.
> >
> > Thoughts?
>
> I don't like the thought of extra overhead from cache thrashing. Also it
> seems silly to allocate per-cpu data and then write to another CPU's
> element.
I am concerned about this as well, and am beginning to suspect that I
need to make a special-purpose primitive specifically for Jens that he
can include with his code.
That said, some potential advantages of per-CPU elements that might see
cache thrashing are:
1. the cross-CPU references might be rare.
2. memory contention is reduced compared to a single variable that
all CPUs are modifying.
Unfortunately, #1 seems unlikely in Jens's case -- why would the completion
be so lucky as to show up on the same CPU as did the request most of the
time? #2 could be important in I/O heavy workloads with fast devices.
> How about making srcu_readers_active_idx() so fast that there isn't time
> for 2 context switches? Disabling interrupts ought to be good enough
> (except in virtualized environments perhaps).
NMIs? ECC errors? Cache misses? And, as you say, virtualized
environments.
Thanx, Paul
(Sorry, responding to the wrong message)
Paul E. McKenney wrote:
>
> I am concerned about this as well, and am beginning to suspect that I
> need to make a special-purpose primitive specifically for Jens that he
> can include with his code.
How about this?
struct xxx_struct {
int completed;
atomic_t ctr[2];
struct mutex mutex;
wait_queue_head_t wq;
};
void init_xxx_struct(struct xxx_struct *sp)
{
sp->completed = 0;
atomic_set(sp->ctr + 0, 1); // active
atomic_set(sp->ctr + 1, 0); // inactive
mutex_init(&sp->mutex);
init_waitqueue_head(&sp->wq);
}
int xxx_read_lock(struct xxx_struct *sp)
{
for (;;) {
int idx = sp->completed & 0x1;
if (likely(atomic_inc_not_zero(sp->ctr + idx)))
return idx;
}
}
void xxx_read_unlock(struct xxx_struct *sp, int idx)
{
if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
wake_up(&sp->wq);
}
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
idx = ++sp->completed & 0x1;
smp_mb__before_atomic_inc();
atomic_inc(sp->ctr + idx);
idx = !idx;
if (!atomic_dec_and_test(sp->ctr + idx))
__wait_event(sp->wq, !atomic_read(sp->ctr + idx));
mutex_unlock(&sp->mutex);
}
Yes, cache thrashing... But I think this is hard to avoid if we want the writer
to be fast.
I do not claim this is the best solution, but for some reason I'd like to
suggest something that doesn't need synchronize_sched(). What do you think
about correctness at least?
Oleg.
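A minimal usage sketch for the interface proposed above, assuming the xxx_* names as given; my_xxx, shared_data and use() are hypothetical, and publication ordering for shared_data (rcu_assign_pointer-style barriers) is elided to keep the sketch short.

static struct xxx_struct my_xxx;	/* init_xxx_struct(&my_xxx) at boot */
static void *shared_data;

static void reader(void)
{
	int idx = xxx_read_lock(&my_xxx);

	use(shared_data);		/* read-side critical section */

	xxx_read_unlock(&my_xxx, idx);	/* must pass back the same idx */
}

static void writer(void *new)
{
	void *old = shared_data;

	shared_data = new;		/* publish the new version */
	synchronize_xxx(&my_xxx);	/* wait for pre-existing readers */
	kfree(old);			/* no reader can still see 'old' */
}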
On Thu, Nov 23, 2006 at 05:59:10PM +0300, Oleg Nesterov wrote:
> (Sorry, responding to the wrong message)
>
> Paul E. McKenney wrote:
> >
> > I am concerned about this as well, and am beginning to suspect that I
> > need to make a special-purpose primitive specifically for Jens that he
> > can include with his code.
>
> How about this?
For Jens, it might be OK. For general use, I believe that this has
difficulties with the sequence of events I sent out on November 20th, see:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116397154808901&w=2
Might also be missing a few memory barriers, see below.
> struct xxx_struct {
> int completed;
> atomic_t ctr[2];
> struct mutex mutex;
> wait_queue_head_t wq;
> };
>
> void init_xxx_struct(struct xxx_struct *sp)
> {
> sp->completed = 0;
> atomic_set(sp->ctr + 0, 1); // active
> atomic_set(sp->ctr + 1, 0); // inactive
> mutex_init(&sp->mutex);
> init_waitqueue_head(&sp->wq);
> }
>
> int xxx_read_lock(struct xxx_struct *sp)
> {
> for (;;) {
> int idx = sp->completed & 0x1;
> if (likely(atomic_inc_not_zero(sp->ctr + idx)))
Need an after-atomic-inc memory barrier here?
> return idx;
> }
> }
>
> void xxx_read_unlock(struct xxx_struct *sp, int idx)
> {
Need a before-atomic-dec memory barrier here?
> if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
> wake_up(&sp->wq);
> }
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> mutex_lock(&sp->mutex);
>
> idx = ++sp->completed & 0x1;
> smp_mb__before_atomic_inc();
> atomic_inc(&sp->ctr + idx);
>
> idx = !idx;
> if (!atomic_dec_and_test(&sp->ctr + idx))
> __wait_event(&sp->wq, !atomic_read(&sp->ctr + idx));
I don't understand why an unlucky sequence of events mightn't be able
to hang this __wait_event(). Suppose we did the atomic_dec_and_test(),
then some other CPU executed xxx_read_unlock(), finding no one to awaken,
then we execute the __wait_event()? What am I missing here?
>
> mutex_unlock(&sp->mutex);
> }
>
> Yes, cache thrashing... But I think this is hard to avoid if we want writer
> to be fast.
>
> I do not claim this is the best solution, but for some reason I'd like to
> suggest something that doesn't need synchronize_sched(). What do you think
> about correctness at least?
The general approach seems reasonable, but I do have the concerns above.
Thanx, Paul
On 11/23, Paul E. McKenney wrote:
>
> On Thu, Nov 23, 2006 at 05:59:10PM +0300, Oleg Nesterov wrote:
> > (Sorry, responding to the wrong message)
> >
> > Paul E. McKenney wrote:
> > >
> > > I am concerned about this as well, and am beginning to suspect that I
> > > need to make a special-purpose primitive specifically for Jens that he
> > > can include with his code.
> >
> > How about this?
>
> For Jens, it might be OK. For general use, I believe that this has
> difficulties with the sequence of events I sent out on November 20th, see:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116397154808901&w=2
Oh. I guess I'd better sleep before answering, but I hope that this version
doesn't have those problems. Note the 'atomic_inc_not_zero()' in read_lock():
it does not seem possible for synchronize_xxx() to return while xxx_read_lock()
increments the "wrong" element.
Just in case: this interface is in no way meant to replace the current srcu code;
it is another variant, optimized for writers, in the hope that it is OK for Jens.
> Might also be missing a few memory barriers, see below.
>
> > struct xxx_struct {
> > int completed;
> > atomic_t ctr[2];
> > struct mutex mutex;
> > wait_queue_head_t wq;
> > };
> >
> > void init_xxx_struct(struct xxx_struct *sp)
> > {
> > sp->completed = 0;
> > atomic_set(sp->ctr + 0, 1); // active
> > atomic_set(sp->ctr + 1, 0); // inactive
> > mutex_init(&sp->mutex);
> > init_waitqueue_head(&sp->wq);
> > }
> >
> > int xxx_read_lock(struct xxx_struct *sp)
> > {
> > for (;;) {
> > int idx = sp->completed & 0x1;
> > if (likely(atomic_inc_not_zero(sp->ctr + idx)))
>
> Need an after-atomic-inc memory barrier here?
From Documentation/atomic_ops.txt:
"atomic_add_unless requires explicit memory barriers around the operation."
>
> > return idx;
> > }
> > }
> >
> > void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > {
>
> Need a before-atomic-dec memory barrier here?
The same, Documentation/atomic_ops.txt states
"It requires explicit memory barrier semantics"
> > if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
> > wake_up(&sp->wq);
> > }
> >
> > void synchronize_xxx(struct xxx_struct *sp)
> > {
> > int idx;
> >
> > mutex_lock(&sp->mutex);
> >
> > idx = ++sp->completed & 0x1;
> > smp_mb__before_atomic_inc();
> > atomic_inc(&sp->ctr + idx);
> >
> > idx = !idx;
> > if (!atomic_dec_and_test(&sp->ctr + idx))
> > __wait_event(&sp->wq, !atomic_read(&sp->ctr + idx));
>
> I don't understand why an unlucky sequence of events mightn't be able
> to hang this __wait_event(). Suppose we did the atomic_dec_and_test(),
... so atomic_read() >= 0 ...
> then some other CPU executed xxx_read_unlock(), finding no one to awaken,
... it does atomic_dec(), but sp->wq is empty, yes?
> then we execute the __wait_event()?
__wait_event() will notice !atomic_read() and return.
Note that this is just an optimization. We can do
atomic_dec(sp->ctr + idx);
__wait_event(sp->wq, !atomic_read(sp->ctr + idx));
instead. Also, I think synchronize_xxx() could be optimized further.
> What am I missing here?
Probably it is me again who missed something... Please say no!
> >
> > mutex_unlock(&sp->mutex);
> > }
> >
> > Yes, cache thrashing... But I think this is hard to avoid if we want writer
> > to be fast.
> >
> > I do not claim this is the best solution, but for some reason I'd like to
> > suggest something that doesn't need synchronize_sched(). What do you think
> > about correctness at least?
>
> The general approach seems reasonable, but I do have the concerns above.
Thanks!
Oleg.
On 11/23, Paul E. McKenney wrote:
>
> For general use, I believe that this has
> difficulties with the sequence of events I sent out on November 20th, see:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116397154808901&w=2
>
> ...
>
> I don't understand why an unlucky sequence of events mightn't be able
> to hang this __wait_event(). Suppose we did the atomic_dec_and_test(),
> then some other CPU executed xxx_read_unlock(), finding no one to awaken,
> then we execute the __wait_event()?
Please note how ->ctr[] is initialized,
atomic_set(sp->ctr + 0, 1); <---- 1, not 0
atomic_set(sp->ctr + 1, 0);
atomic_read(sp->ctr + idx) == 0 means that this counter is inactive,
nobody use it.
Oleg.
Ok, synchronize_xxx() passed a 1-hour rcutorture test on a dual P-III.
It behaves the same as srcu but is optimized for writers. The fast path
for synchronize_xxx() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.
Jens, is it ok for you? Alan, Paul, what is your opinion?
Oleg.
--- 19-rc6/kernel/__rcutorture.c 2006-11-17 19:42:31.000000000 +0300
+++ 19-rc6/kernel/rcutorture.c 2006-11-24 21:00:00.000000000 +0300
@@ -464,6 +464,120 @@ static struct rcu_torture_ops srcu_ops =
.name = "srcu"
};
+//-----------------------------------------------------------------------------
+struct xxx_struct {
+ int completed;
+ atomic_t ctr[2];
+ struct mutex mutex;
+ wait_queue_head_t wq;
+};
+
+void init_xxx_struct(struct xxx_struct *sp)
+{
+ sp->completed = 0;
+ atomic_set(sp->ctr + 0, 1);
+ atomic_set(sp->ctr + 1, 0);
+ mutex_init(&sp->mutex);
+ init_waitqueue_head(&sp->wq);
+}
+
+int xxx_read_lock(struct xxx_struct *sp)
+{
+ for (;;) {
+ int idx = sp->completed & 0x1;
+ if (likely(atomic_inc_not_zero(sp->ctr + idx)))
+ return idx;
+ }
+}
+
+void xxx_read_unlock(struct xxx_struct *sp, int idx)
+{
+ if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
+ wake_up(&sp->wq);
+}
+
+void synchronize_xxx(struct xxx_struct *sp)
+{
+ int idx;
+
+ mutex_lock(&sp->mutex);
+
+ idx = sp->completed & 0x1;
+ if (!atomic_add_unless(sp->ctr + idx, -1, 1))
+ goto out;
+
+ atomic_inc(sp->ctr + (idx ^ 0x1));
+ sp->completed++;
+
+ __wait_event(sp->wq, !atomic_read(sp->ctr + idx));
+out:
+ mutex_unlock(&sp->mutex);
+}
+
+//-----------------------------------------------------------------------------
+static struct xxx_struct xxx_ctl;
+
+static void xxx_torture_init(void)
+{
+ init_xxx_struct(&xxx_ctl);
+ rcu_sync_torture_init();
+}
+
+static void xxx_torture_cleanup(void)
+{
+ synchronize_xxx(&xxx_ctl);
+}
+
+static int xxx_torture_read_lock(void)
+{
+ return xxx_read_lock(&xxx_ctl);
+}
+
+static void xxx_torture_read_unlock(int idx)
+{
+ xxx_read_unlock(&xxx_ctl, idx);
+}
+
+static int xxx_torture_completed(void)
+{
+ return xxx_ctl.completed;
+}
+
+static void xxx_torture_synchronize(void)
+{
+ synchronize_xxx(&xxx_ctl);
+}
+
+static int xxx_torture_stats(char *page)
+{
+ int cnt = 0;
+ int idx = xxx_ctl.completed & 0x1;
+
+ cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
+ torture_type, TORTURE_FLAG, idx);
+
+ cnt += sprintf(&page[cnt], " (%d,%d)",
+ atomic_read(xxx_ctl.ctr + 0),
+ atomic_read(xxx_ctl.ctr + 1));
+
+ cnt += sprintf(&page[cnt], "\n");
+ return cnt;
+}
+
+static struct rcu_torture_ops xxx_ops = {
+ .init = xxx_torture_init,
+ .cleanup = xxx_torture_cleanup,
+ .readlock = xxx_torture_read_lock,
+ .readdelay = srcu_read_delay,
+ .readunlock = xxx_torture_read_unlock,
+ .completed = xxx_torture_completed,
+ .deferredfree = rcu_sync_torture_deferred_free,
+ .sync = xxx_torture_synchronize,
+ .stats = xxx_torture_stats,
+ .name = "xxx"
+};
+//-----------------------------------------------------------------------------
+
/*
* Definitions for sched torture testing.
*/
@@ -503,8 +617,8 @@ static struct rcu_torture_ops sched_ops
};
static struct rcu_torture_ops *torture_ops[] =
- { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &srcu_ops,
- &sched_ops, NULL };
+ { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+ &srcu_ops, &xxx_ops, &sched_ops, NULL };
/*
* RCU torture writer kthread. Repeatedly substitutes a new structure
On Fri, Nov 24 2006, Oleg Nesterov wrote:
> Ok, synchronize_xxx() passed 1 hour rcutorture test on dual P-III.
>
> It behaves the same as srcu but optimized for writers. The fast path
> for synchronize_xxx() is mutex_lock() + atomic_read() + mutex_unlock().
> The slow path is __wait_event(), no polling. However, the reader does
> atomic inc/dec on lock/unlock, and the counters are not per-cpu.
>
> Jens, is it ok for you? Alan, Paul, what is your opinion?
This looks good from my end, much more appropriate than the current SRCU
code. Even if I could avoid synchronize_srcu() for most cases, when I
did have to issue it, the 3x synchronize_sched() was a performance
killer.
Thanks Oleg! And Alan and Paul for your excellent ideas.
--
Jens Axboe
On Fri, 24 Nov 2006, Oleg Nesterov wrote:
> Ok, synchronize_xxx() passed 1 hour rcutorture test on dual P-III.
>
> It behaves the same as srcu but optimized for writers. The fast path
> for synchronize_xxx() is mutex_lock() + atomic_read() + mutex_unlock().
> The slow path is __wait_event(), no polling. However, the reader does
> atomic inc/dec on lock/unlock, and the counters are not per-cpu.
>
> Jens, is it ok for you? Alan, Paul, what is your opinion?
Given that you aren't using per-cpu data, why not just rely on a spinlock?
Then everything will be simple and easy to verify, with no need to worry
about atomic instructions or memory barriers.
Alan
On 11/24, Alan Stern wrote:
>
> On Fri, 24 Nov 2006, Oleg Nesterov wrote:
>
> > Ok, synchronize_xxx() passed 1 hour rcutorture test on dual P-III.
> >
> > It behaves the same as srcu but optimized for writers. The fast path
> > for synchronize_xxx() is mutex_lock() + atomic_read() + mutex_unlock().
> > The slow path is __wait_event(), no polling. However, the reader does
> > atomic inc/dec on lock/unlock, and the counters are not per-cpu.
> >
> > Jens, is it ok for you? Alan, Paul, what is your opinion?
>
> Given that you aren't using per-cpu data, why not just rely on a spinlock?
I thought about this too, and we can re-use sp->wq.lock,
> Then everything will be simple and easy to verify,
xxx_read_lock() will be simpler, but not by much. synchronize_xxx() needs
some additional complication.
> with no need to worry
> about atomic instructions or memory barriers.
spin_lock() + spin_unlock() doesn't imply mb(); it allows subsequent loads
to move into the critical region.
I personally prefer this way, but may be you are right.
Oleg.
On Sat, 25 Nov 2006, Oleg Nesterov wrote:
> > Given that you aren't using per-cpu data, why not just rely on a spinlock?
>
> I thought about this too, and we can re-use sp->wq.lock,
Yes, although it would be a layering violation.
> > Then everything will be simple and easy to verify,
>
> xxx_read_lock() will be simpler, but not too much. synchronize_xxx() needs
> some complication.
Look at the (untested) example below. The code may be a little bit
longer, but it's a lot easier to understand and verify.
> spin_lock() + spin_unlock() doesn't imply mb(); it allows subsequent loads
> to move into the critical region.
No, that's wrong. Subsequent loads are allowed to move into the region
protected by the spinlock, but not past it (into the xxx critical
section).
> I personally prefer this way, but may be you are right.
See what you think...
Alan
//-----------------------------------------------------------------------------
struct xxx_struct {
int completed;
int ctr[2];
struct mutex mutex;
spinlock_t lock;
wait_queue_head_t wq;
};
void init_xxx_struct(struct xxx_struct *sp)
{
sp->completed = 0;
sp->ctr[0] = 1;
sp->ctr[1] = 0;
spin_lock_init(&sp->lock);
mutex_init(&sp->mutex);
init_waitqueue_head(&sp->wq);
}
int xxx_read_lock(struct xxx_struct *sp)
{
int idx;
spin_lock(&sp->lock);
idx = sp->completed & 0x1;
++sp->ctr[idx];
spin_unlock(&sp->lock);
return idx;
}
void xxx_read_unlock(struct xxx_struct *sp, int idx)
{
spin_lock(&sp->lock);
if (--sp->ctr[idx] == 0)
wake_up(&sp->wq);
spin_unlock(&sp->lock);
}
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
spin_lock(&sp->lock);
idx = sp->completed & 0x1;
++sp->completed;
--sp->ctr[idx];
sp->ctr[idx ^ 1] = 1;
spin_unlock(&sp->lock);
wait_event(sp->wq, sp->ctr[idx] == 0);
mutex_unlock(&sp->mutex);
}
On 11/24, Alan Stern wrote:
>
> On Sat, 25 Nov 2006, Oleg Nesterov wrote:
>
> > spin_lock() + spin_unlock() doesn't imply mb(); it allows subsequent loads
> > to move into the critical region.
>
> No, that's wrong. Subsequent loads are allowed to move into the region
> protected by the spinlock, but not past it (into the xxx critical
> section).
Yes, you are right, but see below what I meant.
> > I personally prefer this way, but may be you are right.
>
> See what you think...
>
> Alan
>
> //-----------------------------------------------------------------------------
> struct xxx_struct {
> int completed;
> int ctr[2];
> struct mutex mutex;
> spinlock_t lock;
> wait_queue_head_t wq;
> };
>
> void init_xxx_struct(struct xxx_struct *sp)
> {
> sp->completed = 0;
> sp->ctr[0] = 1;
> sp->ctr[1] = 0;
> spin_lock_init(&sp->lock);
> mutex_init(&sp->mutex);
> init_waitqueue_head(&sp->wq);
> }
>
> int xxx_read_lock(struct xxx_struct *sp)
> {
> int idx;
>
> spin_lock(&sp->lock);
> idx = sp->completed & 0x1;
> ++sp->ctr[idx];
> spin_unlock(&sp->lock);
> return idx;
> }
>
> void xxx_read_unlock(struct xxx_struct *sp, int idx)
> {
> spin_lock(&sp->lock);
It is possible that the memory ops that occur before spin_lock() are not yet
complete,
> if (--sp->ctr[idx] == 0)
suppose that synchronize_xxx() just unlocked sp->lock. It sees sp->ctr[idx] == 0
and returns.
> wake_up(&sp->wq);
> spin_unlock(&sp->lock);
This is a one-way barrier, yes. But it is too late.
Actually, synchronize_xxx() may sleep on sp->wq and we still have a race.
synchronize_xxx() can return before ->wake_up() unlocks sp->wq.lock (finish_wait()
doesn't take sp->wq.lock due to autoremove_wake_function()).
> }
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> mutex_lock(&sp->mutex);
>
> spin_lock(&sp->lock);
> idx = sp->completed & 0x1;
> ++sp->completed;
> --sp->ctr[idx];
> sp->ctr[idx ^ 1] = 1;
> spin_unlock(&sp->lock);
>
> wait_event(sp->wq, sp->ctr[idx] == 0);
> mutex_unlock(&sp->mutex);
> }
This is more or less equivalent to
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
idx = sp->completed & 0x1;
atomic_dec(sp->ctr + idx);
smp_mb__before_atomic_inc();
atomic_inc(sp->ctr + (idx ^ 0x1));
sp->completed++;
wait_event(sp->wq, !atomic_read(sp->ctr + idx));
mutex_unlock(&sp->mutex);
}
and lacks an optimization.
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
spin_lock(&sp->lock);
idx = sp->completed & 0x1;
if (sp->ctr[idx] == 1) {
spin_unlock(&sp->lock);
goto out;
}
++sp->completed;
--sp->ctr[idx];
sp->ctr[idx ^ 1] = 1;
spin_unlock(&sp->lock);
wait_event(sp->wq, sp->ctr[idx] == 0);
out:
mutex_unlock(&sp->mutex);
}
Honestly, I don't see why it is better, but maybe this is just me.
In any case, the spinlock-based implementation shouldn't be faster, yes?
Jens, Paul, what do you think?
Note also that 'atomic_add_unless' in synchronize_xxx() is not strictly
necessary, it is just for "symmetry", we can do
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
idx = sp->completed & 0x1;
if (!atomic_read(sp->ctr + idx)
goto out;
atomic_dec(sp->ctr + idx);
atomic_inc(sp->ctr + (idx ^ 0x1));
sp->completed++;
wait_event(sp->wq, !atomic_read(sp->ctr + idx));
out:
mutex_unlock(&sp->mutex);
}
instead. So the only complication I can see is the 'for' loop in
xxx_read_lock(). Is it worth adding sp->lock?
Anyway, s/xxx/WHAT ???/ ?
Oleg.
On Sat, 25 Nov 2006, Oleg Nesterov wrote:
> > void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > {
> > spin_lock(&sp->lock);
>
> It is possible that the memory ops that occur before spin_lock() is not yet
> completed,
>
> > if (--sp->ctr[idx] == 0)
>
> suppose that synchronize_xxx() just unlocked sp->lock. It sees sp->ctr[idx] == 0
> and returns.
>
> > wake_up(&sp->wq);
> > spin_unlock(&sp->lock);
>
> This is a one-way barrier, yes. But it is too late.
Yes, you are right. The corrected routine (including your little
optimization) looks like this:
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
spin_lock(&sp->lock);
idx = sp->completed & 0x1;
if (sp->ctr[idx] == 1)
goto done;
++sp->completed;
--sp->ctr[idx];
sp->ctr[idx ^ 1] = 1;
spin_unlock(&sp->lock);
__wait_event(sp->wq, sp->ctr[idx] == 0);
spin_lock(&sp->lock);
done:
spin_unlock(&sp->lock);
mutex_unlock(&sp->mutex);
}
> Actually, synchronize_xxx() may sleep on sp->wq and we still have a race.
> synchronize_xxx() can return before ->wake_up() unlocks sp->wq.lock (finish_wait()
> doesn't take sp->wq.lock due to autoremove_wake_function()).
The version above doesn't suffer from that race.
> This is more or less equivalent to
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> mutex_lock(&sp->mutex);
>
> idx = sp->completed & 0x1;
> atomic_dec(sp->ctr + idx);
> smp_mb__before_atomic_inc();
> atomic_inc(sp->ctr + (idx ^ 0x1));
> sp->completed++;
>
> wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> mutex_unlock(&sp->mutex);
> }
It may indeed be equivalent. But _proving_ it is equivalent is certainly
not easy. The advantage of spinlocks is that they remove the necessity
for outrageous mental contortions to verify that all possible execution
paths will work correctly.
> Honestly, I don't see why it is better, but may be this is just me.
> In any case, spinlock based implementation shouldn't be faster, yes?
It will generally be somewhat slower. In addition to cache-line
contention it suffers from lock contention. The difference shouldn't be
enough to matter unless a lot of threads are trying to acquire a read lock
simultaneously.
Alan Stern
On 11/25, Alan Stern wrote:
>
> Yes, you are right. The corrected routine (including your little
> optimization) looks like this:
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> mutex_lock(&sp->mutex);
> spin_lock(&sp->lock);
> idx = sp->completed & 0x1;
> if (sp->ctr[idx] == 1)
> goto done;
Actually, this optimization doesn't make sense with spinlocks. If we use
atomic_t the fast path is just atomic_read(), no stores at all. But if we
are doing lock/unlock anyway, it is silly to optimize out '++sp->completed'.
> ++sp->completed;
> --sp->ctr[idx];
> sp->ctr[idx ^ 1] = 1;
>
> spin_unlock(&sp->lock);
> __wait_event(sp->wq, sp->ctr[idx] == 0);
> spin_lock(&sp->lock);
>
> done:
> spin_unlock(&sp->lock);
Oh, please no. The empty critical section is silly. Also, the spinlock-based
implementation doesn't need the additional "reference" in ->ctr[completed].
struct xxx_struct {
int completed;
int ctr[2];
struct mutex mutex;
spinlock_t lock;
wait_queue_head_t wq;
};
void init_xxx_struct(struct xxx_struct *sp)
{
sp->completed = 0;
sp->ctr[0] = sp->ctr[1] = 0;
mutex_init(&sp->mutex);
spin_lock_init(&sp->lock);
init_waitqueue_head(&sp->wq);
}
int xxx_read_lock(struct xxx_struct *sp)
{
int idx;
spin_lock(&sp->lock);
idx = sp->completed & 0x1;
sp->ctr[idx]++;
spin_unlock(&sp->lock);
return idx;
}
void xxx_read_unlock(struct xxx_struct *sp, int idx)
{
spin_lock(&sp->lock);
if (!--sp->ctr[idx])
wake_up(&sp->wq);
spin_unlock(&sp->lock);
}
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
spin_lock(&sp->lock);
idx = sp->completed++ & 0x1;
spin_unlock(&sp->lock);
wait_event(sp->wq, !sp->ctr[idx]);
spin_unlock_wait(&sp->lock);
mutex_unlock(&sp->mutex);
}
> It may indeed be equivalent. But _proving_ it is equivalent is certainly
> not easy. The advantage of spinlocks is that they remove the necessity
> for outrageous mental contortions to verify that all possible execution
> paths will work correctly.
> ...
> It will generally be somewhat slower.
I still like atomic_t more :) Let's wait for Paul's opinion.
What about naming? synchronize_qrcu?
Oleg
On 11/20, Oleg Nesterov wrote:
>
> So, if we have global A == B == 0,
>
> CPU_0 CPU_1
>
> A = 1; B = 2;
> mb(); mb();
> b = B; a = A;
>
> It could happen that a == b == 0, yes? Doesn't this contradict the definition
> of mb?
I still can't relax; here is another attempt to "prove" this should not be
possible on CPUs supported by Linux :)
Let's suppose it is possible; then it should also be possible if CPU_1
does spin_lock() instead of mb() (spin_lock can't be "stronger"), yes?
Now,
int COND;
wait_queue_head_t wq;
my_wait()
{
add_wait_queue(&wq);
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (COND)
break;
schedule();
}
remove_wait_queue(&wq);
}
my_wake()
{
COND = 1;
wake_up(&wq);
}
this should be correct, but it is not!
my_wait:
task->state = TASK_UNINTERRUPTIBLE; // STORE
mb();
if (COND) break; // LOAD
my_wake:
COND = 1; // STORE
spin_lock(WQ.lock);
spin_lock(runqueue.lock);
// try_to_wake_up()
if (!(task->state & TASK_UNINTERRUPTIBLE)) // LOAD
goto out;
So, my_wait() gets COND == 0, and goes to schedule in 'D' state.
try_to_wake_up() reads ->state == TASK_RUNNING, and does nothing.
Oleg.
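For readers following along, the same wait/wake pattern written with the standard wait-queue helpers; whether the waker needs an explicit barrier between the COND store and wake_up() is precisely the open question raised above, so this sketch only marks the spot rather than answering it.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int COND;

static void my_wait(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (COND)
			break;
		schedule();
	}
	finish_wait(&wq, &wait);
}

static void my_wake(void)
{
	COND = 1;
	/* smp_mb(); -- needed or not?  That is the question raised above. */
	wake_up(&wq);
}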
On Fri, Nov 24, 2006 at 12:49:08AM +0300, Oleg Nesterov wrote:
> On 11/23, Paul E. McKenney wrote:
> >
> > For general use, I believe that this has
> > difficulties with the sequence of events I sent out on November 20th, see:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=116397154808901&w=2
> >
> > ...
> >
> > I don't understand why an unlucky sequence of events mightn't be able
> > to hang this __wait_event(). Suppose we did the atomic_dec_and_test(),
> > then some other CPU executed xxx_read_unlock(), finding no one to awaken,
> > then we execute the __wait_event()?
>
> Please note how ->ctr[] is initialized,
>
> atomic_set(sp->ctr + 0, 1); <---- 1, not 0
> atomic_set(sp->ctr + 1, 0);
>
> atomic_read(sp->ctr + idx) == 0 means that this counter is inactive,
> nobody use it.
I definitely should have slept before commenting on the earlier version,
it would appear. ;-) Please accept my apologies for my confusion!
I will take a look at your later patch.
Thanx, Paul
On Fri, Nov 24, 2006 at 09:21:53PM +0300, Oleg Nesterov wrote:
> Ok, synchronize_xxx() passed 1 hour rcutorture test on dual P-III.
>
> It behaves the same as srcu but optimized for writers. The fast path
> for synchronize_xxx() is mutex_lock() + atomic_read() + mutex_unlock().
> The slow path is __wait_event(), no polling. However, the reader does
> atomic inc/dec on lock/unlock, and the counters are not per-cpu.
>
> Jens, is it ok for you? Alan, Paul, what is your opinion?
Looks pretty good, actually. A few quibbles below. I need to review
again after sleeping on it.
Thanx, Paul
> Oleg.
>
> --- 19-rc6/kernel/__rcutorture.c 2006-11-17 19:42:31.000000000 +0300
> +++ 19-rc6/kernel/rcutorture.c 2006-11-24 21:00:00.000000000 +0300
> @@ -464,6 +464,120 @@ static struct rcu_torture_ops srcu_ops =
> .name = "srcu"
> };
>
> +//-----------------------------------------------------------------------------
> +struct xxx_struct {
> + int completed;
> + atomic_t ctr[2];
> + struct mutex mutex;
> + wait_queue_head_t wq;
> +};
> +
> +void init_xxx_struct(struct xxx_struct *sp)
> +{
> + sp->completed = 0;
> + atomic_set(sp->ctr + 0, 1);
> + atomic_set(sp->ctr + 1, 0);
> + mutex_init(&sp->mutex);
> + init_waitqueue_head(&sp->wq);
> +}
> +
> +int xxx_read_lock(struct xxx_struct *sp)
> +{
> + for (;;) {
> + int idx = sp->completed & 0x1;
Might need a comment saying why no rcu_dereference() needed on the
preceding line. The reason (as I understand it) is that we are
only doing atomic operations on the element being indexed. Is there
an Alpha architect in the house? ;-)
> + if (likely(atomic_inc_not_zero(sp->ctr + idx)))
> + return idx;
> + }
> +}
The loop seems absolutely necessary if one wishes to avoid a
synchronize_sched() in synchronize_xxx() below (and was one of the things
I was missing earlier). However, isn't there a possibility that a pile
of synchronize_xxx() calls might indefinitely delay an unlucky reader?
> +
> +void xxx_read_unlock(struct xxx_struct *sp, int idx)
> +{
> + if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
> + wake_up(&sp->wq);
> +}
> +
> +void synchronize_xxx(struct xxx_struct *sp)
> +{
> + int idx;
> +
> + mutex_lock(&sp->mutex);
> +
> + idx = sp->completed & 0x1;
> + if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> + goto out;
> +
> + atomic_inc(sp->ctr + (idx ^ 0x1));
> + sp->completed++;
> +
> + __wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> +out:
> + mutex_unlock(&sp->mutex);
> +}
Test code!!! Very good!!! (This is added to rcutorture, right?)
> +//-----------------------------------------------------------------------------
> +static struct xxx_struct xxx_ctl;
> +
> +static void xxx_torture_init(void)
> +{
> + init_xxx_struct(&xxx_ctl);
> + rcu_sync_torture_init();
> +}
> +
> +static void xxx_torture_cleanup(void)
> +{
> + synchronize_xxx(&xxx_ctl);
> +}
> +
> +static int xxx_torture_read_lock(void)
> +{
> + return xxx_read_lock(&xxx_ctl);
> +}
> +
> +static void xxx_torture_read_unlock(int idx)
> +{
> + xxx_read_unlock(&xxx_ctl, idx);
> +}
> +
> +static int xxx_torture_completed(void)
> +{
> + return xxx_ctl.completed;
> +}
> +
> +static void xxx_torture_synchronize(void)
> +{
> + synchronize_xxx(&xxx_ctl);
> +}
> +
> +static int xxx_torture_stats(char *page)
> +{
> + int cnt = 0;
> + int idx = xxx_ctl.completed & 0x1;
> +
> + cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
> + torture_type, TORTURE_FLAG, idx);
> +
> + cnt += sprintf(&page[cnt], " (%d,%d)",
> + atomic_read(xxx_ctl.ctr + 0),
> + atomic_read(xxx_ctl.ctr + 1));
> +
> + cnt += sprintf(&page[cnt], "\n");
> + return cnt;
> +}
> +
> +static struct rcu_torture_ops xxx_ops = {
> + .init = xxx_torture_init,
> + .cleanup = xxx_torture_cleanup,
> + .readlock = xxx_torture_read_lock,
> + .readdelay = srcu_read_delay,
> + .readunlock = xxx_torture_read_unlock,
> + .completed = xxx_torture_completed,
> + .deferredfree = rcu_sync_torture_deferred_free,
> + .sync = xxx_torture_synchronize,
> + .stats = xxx_torture_stats,
> + .name = "xxx"
> +};
> +//-----------------------------------------------------------------------------
> +
> /*
> * Definitions for sched torture testing.
> */
> @@ -503,8 +617,8 @@ static struct rcu_torture_ops sched_ops
> };
>
> static struct rcu_torture_ops *torture_ops[] =
> - { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &srcu_ops,
> - &sched_ops, NULL };
> + { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
> + &srcu_ops, &xxx_ops, &sched_ops, NULL };
>
> /*
> * RCU torture writer kthread. Repeatedly substitutes a new structure
>
On 11/26, Paul E. McKenney wrote:
>
> Looks pretty good, actually. A few quibbles below. I need to review
> again after sleeping on it.
Thanks! Please also look at spinlock-based implementation,
http://marc.theaimsgroup.com/?l=linux-kernel&m=116457701231964
I must admit, Alan was right: it really looks simpler and the performance
penalty should be very low. I personally hate this additional spinlock
per se as an "unneeded complication", but probably you will like it more.
> > +int xxx_read_lock(struct xxx_struct *sp)
> > +{
> > + for (;;) {
> > + int idx = sp->completed & 0x1;
>
> Might need a comment saying why no rcu_dereference() needed on the
> preceding line. The reason (as I understand it) is that we are
> only doing atomic operations on the element being indexed.
My understanding is the same. Actually, smp_read_barrier_depends() can't
help because 'atomic_inc' and '->completed++' in synchronize_xxx() could
be re-ordered anyway, so we should rely on correctness of atomic_t.
>
> > + if (likely(atomic_inc_not_zero(sp->ctr + idx)))
> > + return idx;
> > + }
> > +}
>
> The loop seems absolutely necessary if one wishes to avoid a
> synchronize_sched() in synchronize_xxx() below (and was one of the things
> I was missing earlier). However, isn't there a possibility that a pile
> of synchronize_xxx() calls might indefinitely delay an unlucky reader?
Note that synchronize_xxx() does nothing when there are no readers under
xxx_read_lock(), so
for (;;)
synchronize_xxx();
can't suspend xxx_read_lock(). atomic_inc_not_zero() fails when something like
the events below happen between 'idx = sp->completed' and 'atomic_inc_not_zero'
- another reader does xxx_read_lock(), increments ->ctr.
- synchronize_xxx() notices it, goes to __wait_event()
- both the reader and writer manage to do atomic_dec()
This is possible in theory, but an indefinite delay... Look, we have the same
"problem" with spinlocks: in theory synchronize_xxx() calls might indefinitely
delay an unlucky reader because synchronize_xxx() always wins the race for
spin_lock(&sp->lock).
> > +
> > +void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > +{
> > + if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
> > + wake_up(&sp->wq);
> > +}
> > +
> > +void synchronize_xxx(struct xxx_struct *sp)
> > +{
> > + int idx;
> > +
> > + mutex_lock(&sp->mutex);
> > +
> > + idx = sp->completed & 0x1;
> > + if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> > + goto out;
> > +
> > + atomic_inc(sp->ctr + (idx ^ 0x1));
> > + sp->completed++;
> > +
> > + __wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> > +out:
> > + mutex_unlock(&sp->mutex);
> > +}
>
> Test code!!! Very good!!! (This is added to rcutorture, right?)
Yes, the whole patch goes into kernel/rcutorture.c; it is only for
testing/review.
Note: I suspect that Documentation/ lies about atomic_add_unless(), see
http://marc.theaimsgroup.com/?l=linux-kernel&m=116448966030359
so synchronize_xxx() should be
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
smp_mb();
mutex_lock(&sp->mutex);
idx = sp->completed & 0x1;
if (atomic_read(sp->ctr + idx) == 1)
goto out;
atomic_inc(sp->ctr + (idx ^ 0x1));
sp->completed++;
atomic_dec(sp->ctr + idx);
wait_event(sp->wq, !atomic_read(sp->ctr + idx));
out:
mutex_unlock(&sp->mutex);
}
Yes, Alan was right, spinlock_t makes the code simpler.
Oleg.
On 11/27, Oleg Nesterov wrote:
>
> so synchronize_xxx() should be
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> smp_mb();
> mutex_lock(&sp->mutex);
>
> idx = sp->completed & 0x1;
> if (atomic_read(sp->ctr + idx) == 1)
> goto out;
>
> atomic_inc(sp->ctr + (idx ^ 0x1));
> sp->completed++;
>
> atomic_dec(sp->ctr + idx);
> wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> out:
> mutex_unlock(&sp->mutex);
> }
>
> Yes, Alan was right, spinlock_t makes the code simpler.
Damn, it needs another mb() at the end,
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
smp_mb();
mutex_lock(&sp->mutex);
idx = sp->completed & 0x1;
if (atomic_read(sp->ctr + idx) == 1)
goto out;
atomic_inc(sp->ctr + (idx ^ 0x1));
sp->completed++;
atomic_dec(sp->ctr + idx);
wait_event(sp->wq, !atomic_read(sp->ctr + idx));
out:
mutex_unlock(&sp->mutex);
smp_mb();
}
Oleg.
On Mon, 27 Nov 2006, Oleg Nesterov wrote:
> I still can't relax, another attempt to "prove" this should not be
> possible on CPUs supported by Linux :)
>
> Let's suppose it is possible, then it should also be possible if CPU_1
> does spin_lock() instead of mb() (spin_lock can't be "stronger"), yes?
>
> Now,
>
> int COND;
> wait_queue_head_t wq;
>
> my_wait()
> {
> add_wait_queue(&wq);
> for (;;) {
> set_current_state(TASK_UNINTERRUPTIBLE);
>
> if (COND)
> break;
>
> schedule();
> }
> remove_wait_queue(&wq);
> }
>
> my_wake()
> {
> COND = 1;
> wake_up(&wq);
> }
>
> this should be correct, but it is not!
>
> my_wait:
>
> task->state = TASK_UNINTERRUPTIBLE; // STORE
>
> mb();
>
> if (COND) break; // LOAD
>
>
> my_wake:
>
> COND = 1; // STORE
>
> spin_lock(WQ.lock);
> spin_lock(runqueue.lock);
>
> // try_to_wake_up()
> if (!(task->state & TASK_UNINTERRUPTIBLE)) // LOAD
> goto out;
>
>
> So, my_wait() gets COND == 0, and goes to schedule in 'D' state.
> try_to_wake_up() reads ->state == TASK_RUNNING, and does nothing.
This is a very good point. I don't know what the resolution is; Paul will
have to explain the situation.
Alan
On Mon, Nov 27, 2006 at 04:10:27PM -0500, Alan Stern wrote:
> On Mon, 27 Nov 2006, Oleg Nesterov wrote:
>
> > I still can't relax, another attempt to "prove" this should not be
> > possible on CPUs supported by Linux :)
> >
> > Let's suppose it is possible, then it should also be possible if CPU_1
> > does spin_lock() instead of mb() (spin_lock can't be "stronger"), yes?
> >
> > Now,
> >
> > int COND;
> > wait_queue_head_t wq;
> >
> > my_wait()
> > {
> > add_wait_queue(&wq);
> > for (;;) {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> >
> > if (COND)
> > break;
> >
> > schedule();
> > }
> > remove_wait_queue(&wq);
> > }
> >
> > my_wake()
> > {
> > COND = 1;
> > wake_up(&wq);
> > }
> >
> > this should be correct, but it is not!
> >
> > my_wait:
> >
> > task->state = TASK_UNINTERRUPTIBLE; // STORE
> >
> > mb();
> >
> > if (COND) break; // LOAD
> >
> >
> > my_wake:
> >
> > COND = 1; // STORE
> >
> > spin_lock(WQ.lock);
> > spin_lock(runqueue.lock);
> >
> > // try_to_wake_up()
> > if (!(task->state & TASK_UNINTERRUPTIBLE)) // LOAD
> > goto out;
> >
> >
> > So, my_wait() gets COND == 0, and goes to schedule in 'D' state.
> > try_to_wake_up() reads ->state == TASK_RUNNING, and does nothing.
>
> This is a very good point. I don't know what the resolution is; Paul will
> have to explain the situation.
I am revisiting this, and will let you know what I learn.
Thanx, Paul
On Mon, Nov 27, 2006 at 07:11:06PM +0300, Oleg Nesterov wrote:
> On 11/26, Paul E. McKenney wrote:
> >
> > Looks pretty good, actually. A few quibbles below. I need to review
> > again after sleeping on it.
>
> Thanks! Please also look at spinlock-based implementation,
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116457701231964
>
> I must admit, Alan was right: it really looks simpler and the perfomance
> penalty should be very low. I personally hate this additional spinlock
> per se as a "unneeded complication", but probably you will like it more.
The two have different advantages and disadvantages, but nothing really
overwhelming either way. Here is my take:
1. The spinlock version will be easier for most people to understand.
2. The atomic version has better read-side overhead -- probably
roughly twice as fast on most machines.
3. The atomic version will have better worst-case latency under
heavy read-side load -- at least assuming that the underlying
hardware is fair.
4. The spinlock version would have better fairness in face of
synchronize_xxx() overload.
5. Neither version can be used from irq (but the same is true of
SRCU as well).
If I was to choose, I would probably go with the easy-to-understand
case, which would push me towards the spinlocks. If there is a
read-side performance problem, then the atomic version can be easily
resurrected from the LKML archives. Maybe have a URL in a comment
pointing to the atomic implementation? ;-)
All this assuming that the spinlock version passes rcutorture, of course!!!
> > > +int xxx_read_lock(struct xxx_struct *sp)
> > > +{
> > > + for (;;) {
> > > + int idx = sp->completed & 0x1;
> >
> > Might need a comment saying why no rcu_dereference() needed on the
> > preceding line. The reason (as I understand it) is that we are
> > only doing atomic operations on the element being indexed.
>
> My understanding is the same. Actually, smp_read_barrier_depends() can't
> help because 'atomic_inc' and '->completed++' in synchronize_xxx() could
> be re-ordered anyway, so we should rely on correctness of atomic_t.
Fair enough!
> > > + if (likely(atomic_inc_not_zero(sp->ctr + idx)))
> > > + return idx;
> > > + }
> > > +}
> >
> > The loop seems absolutely necessary if one wishes to avoid a
> > synchronize_sched() in synchronize_xxx() below (and was one of the things
> > I was missing earlier). However, isn't there a possibility that a pile
> > of synchronize_xxx() calls might indefinitely delay an unlucky reader?
>
> Note that synchronize_xxx() does nothing when there are no readers under
> xxx_read_lock(), so
>
> for (;;)
> synchronize_xxx();
>
> can't suspend xxx_read_lock(). atomic_inc_not_zero() fails when something like
> the events below happen between 'idx = sp->completed' and 'atomic_inc_not_zero'
>
> - another reader does xxx_read_lock(), increments ->ctr.
>
> - synchronize_xxx() notices it, goes to __wait_event()
>
> - both the reader and writer manage to do atomic_dec()
>
> This is possible in theory, but indefinite delay... Look, we have the same
> "problem" with spinlocks: in theory synchronize_xxx() calls might indefinitely
> delay an unlucky reader because synchronize_xxx() always wins spin_lock(&sp->lock);
True enough! Again, the only way I can see to avoid the possibility of
indefinite delay is to make the updater do synchronize_sched(), which
is what we were trying to avoid in the first place. ;-)
> > > +
> > > +void xxx_read_unlock(struct xxx_struct *sp, int idx)
> > > +{
> > > + if (unlikely(atomic_dec_and_test(sp->ctr + idx)))
> > > + wake_up(&sp->wq);
> > > +}
> > > +
> > > +void synchronize_xxx(struct xxx_struct *sp)
> > > +{
> > > + int idx;
> > > +
> > > + mutex_lock(&sp->mutex);
> > > +
> > > + idx = sp->completed & 0x1;
> > > + if (!atomic_add_unless(sp->ctr + idx, -1, 1))
> > > + goto out;
> > > +
> > > + atomic_inc(sp->ctr + (idx ^ 0x1));
> > > + sp->completed++;
> > > +
> > > + __wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> > > +out:
> > > + mutex_unlock(&sp->mutex);
> > > +}
> >
> > Test code!!! Very good!!! (This is added to rcutorture, right?)
>
> Yes, the whole patch goes to kernel/rcutorture.c, it is only for
> testing/review.
>
> Note: I suspect that Documentation/ lies about atomic_add_unless(), see
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116448966030359
Hmmm... Some do and some don't:
Alpha: Does a memory barrier after (but not before) via cmpxchg().
arm: No clue. http://www.arm.com/pdfs/DUI0204G_rvct_assembler_guide.pdf
does not seem to say much about memory-ordering issues.
There are no obvious memory-barrier instructions, but I
don't see what (if any) ordering effects that ldrex or
strexeq might or might not have.
arm26: No SMP support, so no problem.
cris: Uses hashed spinlocks, so probably OK. (Are cris spinlocks
"leaky" in the ia64 sense? If so, then -not- OK.)
frv: No SMP support, so no problem.
h8300: No SMP support, despite having code under CONFIG_SMP.
In any case, local_irq_save() doesn't do much for SMP.
(Right? Or does h8300 do something special here?)
i386: The x86 semantics, as I understand them, are in fact equivalent
to having a memory barrier before and after the operation.
However, the documentation I have is not as clear as it might be.
ia64: "Acquire" semantics, so that earlier operations may be reordered
after the atomic_add_unless(), but later operations may -not-
be reordered before atomic_add_unless().
m32r: Don't know much about m32r, but it does have a CONFIG_SMP
implementation.
m68k: I don't know what memory-barrier semantics the "cas" instructions
provide. A quick Google search was not helpful.
mips: Seems to have a memory barrier only after, not before.
http://www.mips.com/content/PressRoom/TechLibrary/WhitePapers/multi_cpu
seem to indicate that the semantics of the "sync" instruction
depend on hardware external to the CPU.
parisc: Implements atomic_add_unless() with a spinlock, so probably
does memory barrier before and after (I believe parisc does
not have "leaky" spinlock primitives like ia64 does).
powerpc: lwsync before and isync after.
s390: The "cs" (compare-and-swap) instruction does serialization
equivalent to memory barrier before and after.
sh: No SMP support, despite having code under CONFIG_SMP.
In any case, local_irq_save() doesn't do much for SMP.
(Right? Or does "sh" do something special here?)
sh64: No SMP support, despite having code under CONFIG_SMP.
In any case, local_irq_save() doesn't do much for SMP.
(Right? Or does "sh64" do something special here?)
sparc: Uses spinlocks, so similar to parisc.
sparc64: Does have explicit memory barriers before and after. ;-)
v850: No SMP support, so no problem.
x86_64: Same as for i386.
xtensa: No SMP support, so no problem.
---
So either the docs or several of the architectures need fixing.
And it would be -really- nice if more architectures posted complete
instruction reference manuals on the web!!! (Or maybe I need to
be better at searching for them?)
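Given the mixed picture above, here is a hedged sketch of the conservative fallback: bracket the atomic_add_unless() with explicit smp_mb() calls instead of relying on any implicit ordering. xxx_try_dec is a hypothetical helper name, not from any posted patch.

/*
 * Conservative sketch: do not rely on atomic_add_unless() implying
 * full barriers on every architecture; supply them explicitly.
 */
static int xxx_try_dec(struct xxx_struct *sp, int idx)
{
	int done;

	smp_mb();	/* order earlier accesses before the atomic op */
	done = atomic_add_unless(sp->ctr + idx, -1, 1);
	smp_mb();	/* order the atomic op before later accesses */

	return done;
}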
> so synchronize_xxx() should be
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> smp_mb();
> mutex_lock(&sp->mutex);
>
> idx = sp->completed & 0x1;
> if (atomic_read(sp->ctr + idx) == 1)
> goto out;
>
> atomic_inc(sp->ctr + (idx ^ 0x1));
> sp->completed++;
>
> atomic_dec(sp->ctr + idx);
> wait_event(sp->wq, !atomic_read(sp->ctr + idx));
> out:
> mutex_unlock(&sp->mutex);
> }
>
> Yes, Alan was right, spinlock_t makes the code simpler.
;-)
Thanx, Paul
On 11/29, Paul E. McKenney wrote:
>
> 1. The spinlock version will be easier for most people to understand.
>
> 2. The atomic version has better read-side overhead -- probably
> roughly twice as fast on most machines.
synchronize_xxx() should be a little bit faster too
> 3. The atomic version will have better worst-case latency under
> heavy read-side load -- at least assuming that the underlying
> hardware is fair.
>
> 4. The spinlock version would have better fairness in face of
> synchronize_xxx() overload.
Not sure I understand (both 3 and 4) ...
> 5. Neither version can be used from irq (but the same is true of
> SRCU as well).
Hmm... SRCU can't be used from irq, yes. But I think that both versions
(the spinlock one needs _irqsave) can?
> If I was to choose, I would probably go with the easy-to-understand
> case, which would push me towards the spinlocks. If there is a
> read-side performance problem, then the atomic version can be easily
> resurrected from the LKML archives. Maybe have a URL in a comment
> pointing to the atomic implementation? ;-)
But it is so ugly to use a spinlock to implement the memory barrier semantics!
Look,
void synchronize_xxx(struct xxx_struct *sp)
{
int idx;
mutex_lock(&sp->mutex);
spin_lock();
idx = sp->completed++ & 0x1;
spin_unlock();
wait_event(sp->wq, !sp->ctr[idx]);
spin_lock();
spin_unlock();
mutex_unlock(&sp->mutex);
}
Yes, it looks simpler. But why do we need an empty critical section? It is
a memory barrier; we can (should?) instead do
/* for wait_event() above */
smp_rmb();
spin_unlock_wait(&sp->lock);
smp_mb();
Then,
spin_lock();
idx = sp->completed++ & 0x1;
spin_unlock();
means
idx = sp->completed & 0x1;
spin_lock();
sp->completed++;
spin_unlock();
Again, this is a barrier, not a lock! ->completed is protected by ->mutex,
so this could just as well be
sp->completed++;
smp_mb();
spin_unlock_wait(&sp->lock);
/* for wait_event() below */
smp_rmb();
So in fact the spinlock_t is just used to make inc/dec of ->ctr atomic. Don't
we have atomic_t for that?
That said, if you both think it is better - please send a patch. This is
a matter of taste, and I am far from sure my taste is the best :)
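For concreteness, the read side of the spinlock variant would presumably look something like the following (illustrative only; ->lock is the spinlock implied by spin_unlock_wait(&sp->lock) above, and ->ctr[] is a plain int array here). The lock really is doing nothing beyond making the ++/-- atomic:

int xxx_read_lock(struct xxx_struct *sp)
{
	int idx;

	spin_lock(&sp->lock);
	idx = sp->completed & 0x1;
	sp->ctr[idx]++;
	spin_unlock(&sp->lock);
	return idx;
}

void xxx_read_unlock(struct xxx_struct *sp, int idx)
{
	spin_lock(&sp->lock);
	if (!--sp->ctr[idx])
		wake_up(&sp->wq);	/* wait_event() in synchronize_xxx() */
	spin_unlock(&sp->lock);
}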
> > Note: I suspect that Documentation/ lies about atomic_add_unless(), see
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=116448966030359
>
> Hmmm... Some do and some don't:
>
> i386: The x86 semantics, as I understand them, are in fact equivalent
> to having a memory barrier before and after the operation.
> However, the documentation I have is not as clear as it might be.
Even i386 has non-empty mb(), but atomic_read() is a plain LOAD.
> So either the docs or several of the architectures need fixing.
I think it's better to fix the docs.
Oleg.
On Wed, Nov 29, 2006 at 11:16:46PM +0300, Oleg Nesterov wrote:
> On 11/29, Paul E. McKenney wrote:
> >
> > 1. The spinlock version will be easier for most people to understand.
> >
> > 2. The atomic version has better read-side overhead -- probably
> > roughly twice as fast on most machines.
>
> synchronize_xxx() should be a little bit faster too
Good point.
> > 3. The atomic version will have better worst-case latency under
> > heavy read-side load -- at least assuming that the underlying
> > hardware is fair.
> >
> > 4. The spinlock version would have better fairness in face of
> > synchronize_xxx() overload.
>
> Not sure I understand (both 3 and 4) ...
The differences will be slight, and so hardware-dependent that they
don't mean much.
> > 5. Neither version can be used from irq (but the same is true of
> > SRCU as well).
>
> Hmm... SRCU can't be used from irq, yes. But I think that both versions
> (the spinlock one needs _irqsave) can?
I didn't think you could call wait_event() from irq.
For the locked version, you would also need spin_lock_irqsave() or some
such to avoid self-deadlock.
For the atomic version, the fact that synchronize_qrcu() increments
the new counter before decrementing the old one should mean that
qrcu_read_lock() and qrcu_read_unlock() can be called from irq.
But synchronize_qrcu() must be called from process context, since it
can block.
This might well be important.
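Illustrative usage only, with a hypothetical my_qrcu/my_data and the qrcu primitives proposed in this thread, just to show which side may run where:

static struct qrcu_struct my_qrcu;	/* hypothetical instance */
static void *my_data;

/* Reader: safe even from an irq handler -- nothing here can block. */
static void reader_side(void)
{
	int idx = qrcu_read_lock(&my_qrcu);
	/* ... use rcu_dereference(my_data) ... */
	qrcu_read_unlock(&my_qrcu, idx);
}

/* Updater: process context only, since synchronize_qrcu() may sleep. */
static void updater_side(void *newp)
{
	void *old = my_data;

	rcu_assign_pointer(my_data, newp);
	synchronize_qrcu(&my_qrcu);	/* wait for pre-existing readers */
	kfree(old);
}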
> > If I was to choose, I would probably go with the easy-to-understand
> > case, which would push me towards the spinlocks. If there is a
> > read-side performance problem, then the atomic version can be easily
> > resurrected from the LKML archives. Maybe have a URL in a comment
> > pointing to the atomic implementation? ;-)
>
> But it is so ugly to use a spinlock to implement the memory barrier semantics!
>
> Look,
>
> void synchronize_xxx(struct xxx_struct *sp)
> {
> int idx;
>
> mutex_lock(&sp->mutex);
>
> spin_lock();
> idx = sp->completed++ & 0x1;
> spin_unlock();
>
> wait_event(sp->wq, !sp->ctr[idx]);
>
> spin_lock();
> spin_unlock();
>
> mutex_unlock(&sp->mutex);
> }
>
> Yes, it looks simpler. But why do we need an empty critical section? It is
> a memory barrier; we can (should?) instead do
>
> /* for wait_event() above */
> smp_rmb();
> spin_unlock_wait(&sp->lock);
> smp_mb();
>
> Then,
>
> spin_lock();
> idx = sp->completed++ & 0x1;
> spin_unlock();
>
> means
> idx = sp->completed & 0x1;
> spin_lock();
> sp->completed++;
> spin_unlock();
>
> Again, this is a barrier, not a lock! ->completed is protected by ->mutex,
> so this could just as well be
>
> sp->completed++;
> smp_mb();
> spin_unlock_wait(&sp->lock);
> /* for wait_event() below */
> smp_rmb();
>
> So in fact the spinlock_t is just used to make inc/dec of ->ctr atomic. Don't
> we have atomic_t for that?
>
> That said, if you both think it is better - please send a patch. This is
> a matter of taste, and I am far from sure my taste is the best :)
Heck no! I was under the mistaken impression that you had coded and
tested both patches. If you have the atomic one working but the
spinlock one is not to that stage, I am all for going with the atomic
version. ;-)
> > > Note: I suspect that Documentation/ lies about atomic_add_unless(), see
> > >
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=116448966030359
> >
> > Hmmm... Some do and some don't:
> >
> > i386: The x86 semantics, as I understand them, are in fact equivalent
> > to having a memory barrier before and after the operation.
> > However, the documentation I have is not as clear as it might be.
>
> Even i386 has non-empty mb(), but atomic_read() is a plain LOAD.
Ah -- I was forgetting the failure path. You are quite correct.
> > So either the docs or several of the architectures need fixing.
>
> I think it's better to fix the docs.
I must defer to the two Davids...
Thanx, Paul
On 11/29, Paul E. McKenney wrote:
>
> On Wed, Nov 29, 2006 at 11:16:46PM +0300, Oleg Nesterov wrote:
> >
> > Hmm... SRCU can't be used from irq, yes. But I think that both versions
> > (the spinlock one needs _irqsave) can?
>
> I didn't think you could call wait_event() from irq.
Ah, sorry for the confusion, I was talking only about read lock/unlock, of course.
Just in case, it is not safe to do srcu_read_{,un}lock() from irq:
per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++
^^^^^^^^
we would need local_t for that.
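A sketch of the kind of change being alluded to (purely illustrative, not a real patch; the field names follow the existing srcu_struct/srcu_struct_array, the helper name is made up): switching the per-CPU counters to local_t makes the increment a single operation that is safe against interrupts on the owning CPU:

#include <asm/local.h>

struct srcu_struct_array {
	local_t c[2];			/* was: int c[2]; */
};

/* the read-side increment, now irq-safe on the local CPU */
static inline void srcu_read_lock_inc(struct srcu_struct *sp, int idx)
{
	preempt_disable();
	local_inc(&per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]);
	preempt_enable();
}

(The grace-period side would then sum the counters with local_read() instead of plain loads.)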
> For the locked version, you would also need spin_lock_irqsave() or some
> such to avoid self-deadlock.
>
> For the atomic version, the fact that synchronize_qrcu() increments
> the new counter before decrementing the old one should mean that
> qrcu_read_lock() and qrcu_read_unlock() can be called from irq.
Yes, exactly! There is another reason: suppose we did
qp->completed++;
atomic_inc(qp->ctr + (idx ^ 0x1));
In that case a reader could be stalled if synchronize_qrcu() were
preempted in between.
> But synchronize_qrcu() must be called from process context, since it
> can block.
Surely.
Oleg.