2012-02-13 02:10:00

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-iteration-length HPC workloads. Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.

This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach. This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.

The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.
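
As a rough user-space model of the counter arithmetic behind this recheck
(an illustration only, not kernel code; it uses the SRCU_USAGE_* macros
defined in the patch below):

#include <limits.h>
#include <stdio.h>

#define SRCU_USAGE_BITS 2
#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)

int main(void)
{
        unsigned long c = 0;    /* one per-CPU counter, modeled in isolation */
        unsigned long snap;

        c += SRCU_USAGE_COUNT + 1;      /* srcu_read_lock(): one active reader */
        snap = c;                       /* update side snapshots the raw value */

        c += SRCU_USAGE_COUNT + 1;      /* a second reader enters... */
        c += SRCU_USAGE_COUNT - 1;      /* ...and leaves again */

        /* The masked reader count is unchanged (still 1)... */
        printf("readers = %lu\n", c & SRCU_REF_MASK);
        /* ...but the raw value differs, so the recheck notices the traffic. */
        printf("changed = %d\n", c != snap);
        return 0;
}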

This implementation has passed moderate rcutorture testing on both 32-bit
x86 and 64-bit Power. A call_srcu() function will be present in a later
version of this patch.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index d3d5fa5..a478c8e 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -31,13 +31,19 @@
#include <linux/rcupdate.h>

struct srcu_struct_array {
- int c[2];
+ unsigned long c[2];
};

+/* Bit definitions for field ->c above and ->snap below. */
+#define SRCU_USAGE_BITS 2
+#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
+#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
+
struct srcu_struct {
- int completed;
+ unsigned completed;
struct srcu_struct_array __percpu *per_cpu_ref;
struct mutex mutex;
+ unsigned long snap[NR_CPUS];
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index e0fe148..3d99162 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -620,7 +620,7 @@ static int srcu_torture_stats(char *page)
cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
torture_type, TORTURE_FLAG, idx);
for_each_possible_cpu(cpu) {
- cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
+ cnt += sprintf(&page[cnt], " %d(%lu,%lu)", cpu,
per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
}
diff --git a/kernel/srcu.c b/kernel/srcu.c
index ba35f3a..540671e 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -73,19 +73,102 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */

/*
- * srcu_readers_active_idx -- returns approximate number of readers
- * active on the specified rank of per-CPU counters.
+ * Returns approximate number of readers active on the specified rank
+ * of per-CPU counters. Also snapshots each counter's value in the
+ * corresponding element of sp->snap[] for later use validating
+ * the sum.
*/
+static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+{
+ int cpu;
+ unsigned long sum = 0;
+ unsigned long t;

-static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+ for_each_possible_cpu(cpu) {
+ t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
+ sum += t;
+ sp->snap[cpu] = t;
+ }
+ return sum & SRCU_REF_MASK;
+}
+
+/*
+ * To be called from the update side after an index flip. Returns true
+ * if the modulo sum of the counters is stably zero, false if there is
+ * some possibility of non-zero.
+ */
+static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
{
int cpu;
- int sum;

- sum = 0;
+ /*
+ * Note that srcu_readers_active_idx() can incorrectly return
+ * zero even though there is a pre-existing reader throughout.
+ * To see this, suppose that task A is in a very long SRCU
+ * read-side critical section that started on CPU 0, and that
+ * no other reader exists, so that the modulo sum of the counters
+ * is equal to one. Then suppose that task B starts executing
+ * srcu_readers_active_idx(), summing up to CPU 1, and then that
+ * task C starts reading on CPU 0, so that its increment is not
+ * summed, but finishes reading on CPU 2, so that its decrement
+ * -is- summed. Then when task B completes its sum, it will
+ * incorrectly get zero, despite the fact that task A has been
+ * in its SRCU read-side critical section the whole time.
+ *
+ * We therefore do a validation step should srcu_readers_active_idx()
+ * return zero.
+ */
+ if (srcu_readers_active_idx(sp, idx) != 0)
+ return false;
+
+ /*
+ * Since the caller recently flipped ->completed, we can see at
+ * most one increment of each CPU's counter from this point
+ * forward. The reason for this is that the reader CPU must have
+ * fetched the index before srcu_readers_active_idx checked
+ * that CPU's counter, but not yet incremented its counter.
+ * Its eventual counter increment will follow the read in
+ * srcu_readers_active_idx(), and that increment is immediately
+ * followed by smp_mb() B. Because smp_mb() D is between
+ * the ->completed flip and srcu_readers_active_idx()'s read,
+ * that CPU's subsequent load of ->completed must see the new
+ * value, and therefore increment the counter in the other rank.
+ */
+ smp_mb(); /* A */
+
+ /*
+ * Now, we check the ->snap array that srcu_readers_active_idx()
+ * filled in from the per-CPU counter values. Since both
+ * __srcu_read_lock() and __srcu_read_unlock() increment the
+ * upper bits of the per-CPU counter, an increment/decrement
+ * pair will change the value of the counter. Since there is
+ * only one possible increment, the only way to wrap the counter
+ * is to have a huge number of counter decrements, which requires
+ * a huge number of tasks and huge SRCU read-side critical-section
+ * nesting levels, even on 32-bit systems.
+ *
+ * All of the ways of confusing the readings require that the scan
+ * in srcu_readers_active_idx() see the read-side task's decrement,
+ * but not its increment. However, between that decrement and
+ * increment are smp_mb() B and C. Either or both of these pair
+ * with smp_mb() A above to ensure that the scan below will see
+ * the read-side task's increment, thus noting a difference in
+ * the counter values between the two passes.
+ *
+ * Therefore, if srcu_readers_active_idx() returned zero, and
+ * none of the counters changed, we know that the zero was the
+ * correct sum.
+ *
+ * Of course, it is possible that a task might be delayed
+ * for a very long time in __srcu_read_lock() after fetching
+ * the index but before incrementing its counter. This
+ * possibility will be dealt with in __synchronize_srcu().
+ */
for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
- return sum;
+ if (sp->snap[cpu] !=
+ ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
+ return false; /* False zero reading! */
+ return true;
}

/**
@@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
int idx;

preempt_disable();
- idx = sp->completed & 0x1;
- barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
+ idx = rcu_dereference_index_check(sp->completed,
+ rcu_read_lock_sched_held()) & 0x1;
+ ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+ SRCU_USAGE_COUNT + 1;
+ smp_mb(); /* B */ /* Avoid leaking the critical section. */
preempt_enable();
return idx;
}
@@ -149,8 +233,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
+ smp_mb(); /* C */ /* Avoid leaking the critical section. */
+ ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+ SRCU_USAGE_COUNT - 1;
preempt_enable();
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);
@@ -163,12 +248,65 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
* we repeatedly block for 1-millisecond time periods. This approach
* has done well in testing, so there is no need for a config parameter.
*/
-#define SYNCHRONIZE_SRCU_READER_DELAY 10
+#define SYNCHRONIZE_SRCU_READER_DELAY 5
+
+/*
+ * Flip the readers' index by incrementing ->completed, then wait
+ * until there are no more readers using the counters referenced by
+ * the old index value. (Recall that the index is the bottom bit
+ * of ->completed.)
+ *
+ * Of course, it is possible that a reader might be delayed for the
+ * full duration of flip_idx_and_wait() between fetching the
+ * index and incrementing its counter. This possibility is handled
+ * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
+ */
+static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+{
+ int idx;
+ int trycount = 0;
+
+ idx = sp->completed++ & 0x1;
+
+ /*
+ * If a reader fetches the index before the above increment,
+ * but increments its counter after srcu_readers_active_idx_check()
+ * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
+ * smp_mb() B to ensure that the SRCU read-side critical section
+ * will see any updates that the current task performed before its
+ * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
+ * as the case may be.
+ */
+ smp_mb(); /* D */
+
+ /*
+ * SRCU read-side critical sections are normally short, so wait
+ * a small amount of time before possibly blocking.
+ */
+ if (!srcu_readers_active_idx_check(sp, idx)) {
+ udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+ while (!srcu_readers_active_idx_check(sp, idx)) {
+ if (expedited && ++ trycount < 10)
+ udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+ else
+ schedule_timeout_interruptible(1);
+ }
+ }
+
+ /*
+ * The following smp_mb() E pairs with srcu_read_unlock()'s
+ * smp_mb() C to ensure that if srcu_readers_active_idx_check()
+ * sees srcu_read_unlock()'s counter decrement, then any
+ * of the current task's subsequent code will happen after
+ * that SRCU read-side critical section.
+ */
+ smp_mb(); /* E */
+}

/*
* Helper function for synchronize_srcu() and synchronize_srcu_expedited().
*/
-static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
+static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
{
int idx;

@@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
!lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");

- idx = sp->completed;
+ idx = ACCESS_ONCE(sp->completed);
mutex_lock(&sp->mutex);

/*
* Check to see if someone else did the work for us while we were
- * waiting to acquire the lock. We need -two- advances of
+ * waiting to acquire the lock. We need -three- advances of
* the counter, not just one. If there was but one, we might have
* shown up -after- our helper's first synchronize_sched(), thus
* having failed to prevent CPU-reordering races with concurrent
- * srcu_read_unlock()s on other CPUs (see comment below). So we
- * either (1) wait for two or (2) supply the second ourselves.
+ * srcu_read_unlock()s on other CPUs (see comment below). If there
+ * were only two, we are guaranteed to have waited through only one
+ * full index-flip phase. So we either (1) wait for three or
+ * (2) supply the additional ones we need.
*/

- if ((sp->completed - idx) >= 2) {
+ if (sp->completed == idx + 2)
+ idx = 1;
+ else if (sp->completed == idx + 3) {
mutex_unlock(&sp->mutex);
return;
- }
-
- sync_func(); /* Force memory barrier on all CPUs. */
+ } else
+ idx = 0;

/*
- * The preceding synchronize_sched() ensures that any CPU that
- * sees the new value of sp->completed will also see any preceding
- * changes to data structures made by this CPU. This prevents
- * some other CPU from reordering the accesses in its SRCU
- * read-side critical section to precede the corresponding
- * srcu_read_lock() -- ensuring that such references will in
- * fact be protected.
+ * If there were no helpers, then we need to do two flips of
+ * the index. The first flip is required if there are any
+ * outstanding SRCU readers even if there are no new readers
+ * running concurrently with the first counter flip.
*
- * So it is now safe to do the flip.
- */
-
- idx = sp->completed & 0x1;
- sp->completed++;
-
- sync_func(); /* Force memory barrier on all CPUs. */
-
- /*
- * At this point, because of the preceding synchronize_sched(),
- * all srcu_read_lock() calls using the old counters have completed.
- * Their corresponding critical sections might well be still
- * executing, but the srcu_read_lock() primitives themselves
- * will have finished executing. We initially give readers
- * an arbitrarily chosen 10 microseconds to get out of their
- * SRCU read-side critical sections, then loop waiting 1/HZ
- * seconds per iteration. The 10-microsecond value has done
- * very well in testing.
- */
-
- if (srcu_readers_active_idx(sp, idx))
- udelay(SYNCHRONIZE_SRCU_READER_DELAY);
- while (srcu_readers_active_idx(sp, idx))
- schedule_timeout_interruptible(1);
-
- sync_func(); /* Force memory barrier on all CPUs. */
-
- /*
- * The preceding synchronize_sched() forces all srcu_read_unlock()
- * primitives that were executing concurrently with the preceding
- * for_each_possible_cpu() loop to have completed by this point.
- * More importantly, it also forces the corresponding SRCU read-side
- * critical sections to have also completed, and the corresponding
- * references to SRCU-protected data items to be dropped.
+ * The second flip is required when a new reader picks up
+ * the old value of the index, but does not increment its
+ * counter until after its counter is summed/rechecked by
+ * srcu_readers_active_idx_check(). In this case, the current SRCU
+ * grace period would be OK because the SRCU read-side critical
+ * section started after this SRCU grace period started, so the
+ * grace period is not required to wait for the reader.
*
- * Note:
- *
- * Despite what you might think at first glance, the
- * preceding synchronize_sched() -must- be within the
- * critical section ended by the following mutex_unlock().
- * Otherwise, a task taking the early exit can race
- * with a srcu_read_unlock(), which might have executed
- * just before the preceding srcu_readers_active() check,
- * and whose CPU might have reordered the srcu_read_unlock()
- * with the preceding critical section. In this case, there
- * is nothing preventing the synchronize_sched() task that is
- * taking the early exit from freeing a data structure that
- * is still being referenced (out of order) by the task
- * doing the srcu_read_unlock().
- *
- * Alternatively, the comparison with "2" on the early exit
- * could be changed to "3", but this increases synchronize_srcu()
- * latency for bulk loads. So the current code is preferred.
+ * However, the next SRCU grace period would be waiting for the
+ * other set of counters to go to zero, and therefore would not
+ * wait for the reader, which would be very bad. To avoid this
+ * bad scenario, we flip and wait twice, clearing out both sets
+ * of counters.
*/
-
+ for (; idx < 2; idx++)
+ flip_idx_and_wait(sp, expedited);
mutex_unlock(&sp->mutex);
}

@@ -281,7 +380,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
*/
void synchronize_srcu(struct srcu_struct *sp)
{
- __synchronize_srcu(sp, synchronize_sched);
+ __synchronize_srcu(sp, 0);
}
EXPORT_SYMBOL_GPL(synchronize_srcu);

@@ -289,18 +388,11 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
* synchronize_srcu_expedited - Brute-force SRCU grace period
* @sp: srcu_struct with which to synchronize.
*
- * Wait for an SRCU grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly. This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code. In fact,
- * if you are using synchronize_srcu_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_srcu() instead.
+ * Wait for an SRCU grace period to elapse, but be more aggressive about
+ * spinning rather than blocking when waiting.
*
* Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier. And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier. Failing to observe
- * these restriction will result in deadlock. It is also illegal to call
+ * that is acquired by a CPU-hotplug notifier. It is also illegal to call
* synchronize_srcu_expedited() from the corresponding SRCU read-side
* critical section; doing so will result in deadlock. However, it is
* perfectly legal to call synchronize_srcu_expedited() on one srcu_struct
@@ -309,7 +401,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
*/
void synchronize_srcu_expedited(struct srcu_struct *sp)
{
- __synchronize_srcu(sp, synchronize_sched_expedited);
+ __synchronize_srcu(sp, 1);
}
EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
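
For readers less familiar with SRCU, here is a minimal sketch of how the
primitives touched by this patch are typically used. The names my_srcu,
gp, struct foo, and do_something_with() are invented for illustration,
and my_srcu is assumed to have been set up with init_srcu_struct():

#include <linux/srcu.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int a;
};

static struct srcu_struct my_srcu;      /* init_srcu_struct(&my_srcu) at init time */
static struct foo __rcu *gp;            /* SRCU-protected pointer */

static void reader(void)
{
        int idx;
        struct foo *p;

        idx = srcu_read_lock(&my_srcu);
        p = srcu_dereference(gp, &my_srcu);
        if (p)
                do_something_with(p->a);        /* hypothetical helper */
        srcu_read_unlock(&my_srcu, idx);
}

static void updater(struct foo *newp)
{
        struct foo *oldp = rcu_dereference_protected(gp, 1);

        rcu_assign_pointer(gp, newp);
        synchronize_srcu(&my_srcu);     /* wait for pre-existing readers */
        kfree(oldp);
}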


2012-02-15 13:00:40

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> The current implementation of synchronize_srcu_expedited() can cause
> severe OS jitter due to its use of synchronize_sched(), which in turn
> invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> This can result in severe performance degradation for real-time workloads
> and especially for short-iteration-length HPC workloads. Furthermore,
> because only one instance of try_stop_cpus() can be making forward progress
> at a given time, only one instance of synchronize_srcu_expedited() can
> make forward progress at a time, even if they are all operating on
> distinct srcu_struct structures.
>
> This commit, inspired by an earlier implementation by Peter Zijlstra
> (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> takes a strictly algorithmic bits-in-memory approach. This has the
> disadvantage of requiring one explicit memory-barrier instruction in
> each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> completely dispenses with OS jitter and furthermore allows SRCU to be
> used freely by CPUs that RCU believes to be idle or offline.
>
> The update-side implementation handles the single read-side memory
> barrier by rechecking the per-CPU counters after summing them and
> by running through the update-side state machine twice.

Yeah, getting rid of that second memory barrier in srcu_read_lock() is
pure magic :-)

> This implementation has passed moderate rcutorture testing on both 32-bit
> x86 and 64-bit Power. A call_srcu() function will be present in a later
> version of this patch.

Goodness ;-)

> @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> preempt_disable();
> - idx = sp->completed & 0x1;
> - barrier(); /* ensure compiler looks -once- at sp->completed. */
> - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> + idx = rcu_dereference_index_check(sp->completed,
> + rcu_read_lock_sched_held()) & 0x1;
> + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> + SRCU_USAGE_COUNT + 1;
> + smp_mb(); /* B */ /* Avoid leaking the critical section. */
> preempt_enable();
> return idx;
> }

You could use __this_cpu_* muck to shorten some of that.

Acked-by: Peter Zijlstra <[email protected]>

2012-02-15 14:41:12

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Paul E. McKenney ([email protected]) wrote:
[...]
> +/*
> + * To be called from the update side after an index flip. Returns true
> + * if the modulo sum of the counters is stably zero, false if there is
> + * some possibility of non-zero.
> + */
> +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> {
> int cpu;
> - int sum;
>
> - sum = 0;
> + /*
> + * Note that srcu_readers_active_idx() can incorrectly return
> + * zero even though there is a pre-existing reader throughout.
> + * To see this, suppose that task A is in a very long SRCU
> + * read-side critical section that started on CPU 0, and that
> + * no other reader exists, so that the modulo sum of the counters
> + * is equal to one. Then suppose that task B starts executing
> + * srcu_readers_active_idx(), summing up to CPU 1, and then that
> + * task C starts reading on CPU 0, so that its increment is not
> + * summed, but finishes reading on CPU 2, so that its decrement
> + * -is- summed. Then when task B completes its sum, it will
> + * incorrectly get zero, despite the fact that task A has been
> + * in its SRCU read-side critical section the whole time.
> + *
> + * We therefore do a validation step should srcu_readers_active_idx()
> + * return zero.
> + */
> + if (srcu_readers_active_idx(sp, idx) != 0)
> + return false;
> +
> + /*
> + * Since the caller recently flipped ->completed, we can see at
> + * most one increment of each CPU's counter from this point
> + * forward. The reason for this is that the reader CPU must have
> + * fetched the index before srcu_readers_active_idx checked
> + * that CPU's counter, but not yet incremented its counter.
> + * Its eventual counter increment will follow the read in
> + * srcu_readers_active_idx(), and that increment is immediately
> + * followed by smp_mb() B. Because smp_mb() D is between
> + * the ->completed flip and srcu_readers_active_idx()'s read,
> + * that CPU's subsequent load of ->completed must see the new
> + * value, and therefore increment the counter in the other rank.
> + */
> + smp_mb(); /* A */
> +
> + /*
> + * Now, we check the ->snap array that srcu_readers_active_idx()
> + * filled in from the per-CPU counter values. Since both
> + * __srcu_read_lock() and __srcu_read_unlock() increment the
> + * upper bits of the per-CPU counter, an increment/decrement
> + * pair will change the value of the counter. Since there is
> + * only one possible increment, the only way to wrap the counter
> + * is to have a huge number of counter decrements, which requires
> + * a huge number of tasks and huge SRCU read-side critical-section
> + * nesting levels, even on 32-bit systems.
> + *
> + * All of the ways of confusing the readings require that the scan
> + * in srcu_readers_active_idx() see the read-side task's decrement,
> + * but not its increment. However, between that decrement and
> + * increment are smp_mb() B and C. Either or both of these pair
> + * with smp_mb() A above to ensure that the scan below will see
> + * the read-side task's increment, thus noting a difference in
> + * the counter values between the two passes.

Hi Paul,

I think the implementation is correct, but the explanation above might
be improved. Let's consider the following a scenario, where a reader is
migrated between increment of the counter and issuing the memory barrier
in the read lock:

A,B,C are readers
D is synchronize_rcu (one flip'n'wait)

CPU A CPU B CPU C CPU D
c[1]++
smp_mb(1)
read c[0] -> 0
c[0]++
(implicit smp_mb (2))
-> migrated ->
(implicit smp_mb (3))
smp_mb (4)
smp_mb (5)
c[1]--
read c[1] -> -1
read c[2] -> 1
(false 0 sum)
smp_mb (6)
re-check each.
c[1]--

re-check: because we observed c[1] == -1, thanks to the implicit memory
barriers within thread migration (2 and 3), we can assume that we _will_
observe the updated value of c[0] after smp_mb (6).

The current explanation states that memory barriers 4 and 5, along with
6, are responsible for ensuring that the increment will be observed by
the re-check. However, I doubt they have anything to do with it: it's
rather the implicit memory barriers in thread migration, along with
program order guarantees on writes to the same address, that seems to be
the reason why we can do this ordering assumption.

Does it make sense, or shall I get another coffee to wake myself up ?
;)

Thanks,

Mathieu

> + *
> + * Therefore, if srcu_readers_active_idx() returned zero, and
> + * none of the counters changed, we know that the zero was the
> + * correct sum.
> + *
> + * Of course, it is possible that a task might be delayed
> + * for a very long time in __srcu_read_lock() after fetching
> + * the index but before incrementing its counter. This
> + * possibility will be dealt with in __synchronize_srcu().
> + */
> for_each_possible_cpu(cpu)
> - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> - return sum;
> + if (sp->snap[cpu] !=
> + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> + return false; /* False zero reading! */
> + return true;
> }


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-15 14:51:52

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Mathieu Desnoyers ([email protected]) wrote:
> * Paul E. McKenney ([email protected]) wrote:
> [...]
> > +/*
> > + * To be called from the update side after an index flip. Returns true
> > + * if the modulo sum of the counters is stably zero, false if there is
> > + * some possibility of non-zero.
> > + */
> > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > {
> > int cpu;
> > - int sum;
> >
> > - sum = 0;
> > + /*
> > + * Note that srcu_readers_active_idx() can incorrectly return
> > + * zero even though there is a pre-existing reader throughout.
> > + * To see this, suppose that task A is in a very long SRCU
> > + * read-side critical section that started on CPU 0, and that
> > + * no other reader exists, so that the modulo sum of the counters
> > + * is equal to one. Then suppose that task B starts executing
> > + * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > + * task C starts reading on CPU 0, so that its increment is not
> > + * summed, but finishes reading on CPU 2, so that its decrement
> > + * -is- summed. Then when task B completes its sum, it will
> > + * incorrectly get zero, despite the fact that task A has been
> > + * in its SRCU read-side critical section the whole time.
> > + *
> > + * We therefore do a validation step should srcu_readers_active_idx()
> > + * return zero.
> > + */
> > + if (srcu_readers_active_idx(sp, idx) != 0)
> > + return false;
> > +
> > + /*
> > + * Since the caller recently flipped ->completed, we can see at
> > + * most one increment of each CPU's counter from this point
> > + * forward. The reason for this is that the reader CPU must have
> > + * fetched the index before srcu_readers_active_idx checked
> > + * that CPU's counter, but not yet incremented its counter.
> > + * Its eventual counter increment will follow the read in
> > + * srcu_readers_active_idx(), and that increment is immediately
> > + * followed by smp_mb() B. Because smp_mb() D is between
> > + * the ->completed flip and srcu_readers_active_idx()'s read,
> > + * that CPU's subsequent load of ->completed must see the new
> > + * value, and therefore increment the counter in the other rank.
> > + */
> > + smp_mb(); /* A */
> > +
> > + /*
> > + * Now, we check the ->snap array that srcu_readers_active_idx()
> > + * filled in from the per-CPU counter values. Since both
> > + * __srcu_read_lock() and __srcu_read_unlock() increment the
> > + * upper bits of the per-CPU counter, an increment/decrement
> > + * pair will change the value of the counter. Since there is
> > + * only one possible increment, the only way to wrap the counter
> > + * is to have a huge number of counter decrements, which requires
> > + * a huge number of tasks and huge SRCU read-side critical-section
> > + * nesting levels, even on 32-bit systems.
> > + *
> > + * All of the ways of confusing the readings require that the scan
> > + * in srcu_readers_active_idx() see the read-side task's decrement,
> > + * but not its increment. However, between that decrement and
> > + * increment are smp_mb() B and C. Either or both of these pair
> > + * with smp_mb() A above to ensure that the scan below will see
> > + * the read-side task's increment, thus noting a difference in
> > + * the counter values between the two passes.
>
> Hi Paul,
>
> I think the implementation is correct, but the explanation above might
> be improved. Let's consider the following scenario, where a reader is
> migrated between increment of the counter and issuing the memory barrier
> in the read lock:
>
> A,B,C are readers
> D is synchronize_rcu (one flip'n'wait)
>
> CPU A CPU B CPU C CPU D
> c[1]++
> smp_mb(1)
> read c[0] -> 0
> c[0]++
> (implicit smp_mb (2))
> -> migrated ->
> (implicit smp_mb (3))
> smp_mb (4)
> smp_mb (5)
> c[1]--
> read c[1] -> -1
> read c[2] -> 1
> (false 0 sum)
> smp_mb (6)
> re-check each.
> c[1]--
>
> re-check: because we observed c[1] == -1, thanks to the implicit memory
> barriers within thread migration (2 and 3), we can assume that we _will_
> observe the updated value of c[0] after smp_mb (6).
>
> The current explanation states that memory barriers 4 and 5, along with
> 6, are responsible for ensuring that the increment will be observed by
> the re-check. However, I doubt they have anything to do with it: it's
> rather the implicit memory barriers in thread migration, along with
> program order guarantees on writes to the same address, that seems to be
> the reason why we can do this ordering assumption.

Please disregard the part about program order: CPU A writes to c[0], and
CPU B writes to c[1], which are two different memory locations. The rest
of my discussion stands though.

Simply reasoning about write to c[0], memory barriers 2-3, write to
c[1], along with c[1] read, memory barrier 6, and then c[0] read is
enough to explain the ordering guarantees you need, without invoking
program order.

Thanks,

Mathieu

>
> Does it make sense, or shall I get another coffee to wake myself up ?
> ;)
>
> Thanks,
>
> Mathieu
>
> > + *
> > + * Therefore, if srcu_readers_active_idx() returned zero, and
> > + * none of the counters changed, we know that the zero was the
> > + * correct sum.
> > + *
> > + * Of course, it is possible that a task might be delayed
> > + * for a very long time in __srcu_read_lock() after fetching
> > + * the index but before incrementing its counter. This
> > + * possibility will be dealt with in __synchronize_srcu().
> > + */
> > for_each_possible_cpu(cpu)
> > - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > - return sum;
> > + if (sp->snap[cpu] !=
> > + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > + return false; /* False zero reading! */
> > + return true;
> > }
>
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 06:35:21

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > The current implementation of synchronize_srcu_expedited() can cause
> > severe OS jitter due to its use of synchronize_sched(), which in turn
> > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > This can result in severe performance degradation for real-time workloads
> > and especially for short-iteration-length HPC workloads. Furthermore,
> > because only one instance of try_stop_cpus() can be making forward progress
> > at a given time, only one instance of synchronize_srcu_expedited() can
> > make forward progress at a time, even if they are all operating on
> > distinct srcu_struct structures.
> >
> > This commit, inspired by an earlier implementation by Peter Zijlstra
> > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > takes a strictly algorithmic bits-in-memory approach. This has the
> > disadvantage of requiring one explicit memory-barrier instruction in
> > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > completely dispenses with OS jitter and furthermore allows SRCU to be
> > used freely by CPUs that RCU believes to be idle or offline.
> >
> > The update-side implementation handles the single read-side memory
> > barrier by rechecking the per-CPU counters after summing them and
> > by running through the update-side state machine twice.
>
> Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> pure magic :-)
>
> > This implementation has passed moderate rcutorture testing on both 32-bit
> > x86 and 64-bit Power. A call_srcu() function will be present in a later
> > version of this patch.
>
> Goodness ;-)

Glad you like the magic and the prospect of call_srcu(). ;-)

> > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > int idx;
> >
> > preempt_disable();
> > - idx = sp->completed & 0x1;
> > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > + idx = rcu_dereference_index_check(sp->completed,
> > + rcu_read_lock_sched_held()) & 0x1;
> > + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > + SRCU_USAGE_COUNT + 1;
> > + smp_mb(); /* B */ /* Avoid leaking the critical section. */
> > preempt_enable();
> > return idx;
> > }
>
> You could use __this_cpu_* muck to shorten some of that.

Ah, so something like this?

ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
SRCU_USAGE_COUNT + 1;

Now that you mention it, this does look nicer, applied here and to
srcu_read_unlock().
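
For reference, the read-lock fast path with that change applied would
look something like the following; this is a sketch of the revised
function (srcu_read_unlock() gets the analogous change), not
necessarily the exact code that will appear in the next version:

int __srcu_read_lock(struct srcu_struct *sp)
{
        int idx;

        preempt_disable();
        idx = rcu_dereference_index_check(sp->completed,
                                          rcu_read_lock_sched_held()) & 0x1;
        ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
                SRCU_USAGE_COUNT + 1;
        smp_mb(); /* B */  /* Avoid leaking the critical section. */
        preempt_enable();
        return idx;
}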

> Acked-by: Peter Zijlstra <[email protected]>


Thanx, Paul

2012-02-16 06:38:18

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Paul E. McKenney ([email protected]) wrote:
> > [...]
> > > +/*
> > > + * To be called from the update side after an index flip. Returns true
> > > + * if the modulo sum of the counters is stably zero, false if there is
> > > + * some possibility of non-zero.
> > > + */
> > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > > {
> > > int cpu;
> > > - int sum;
> > >
> > > - sum = 0;
> > > + /*
> > > + * Note that srcu_readers_active_idx() can incorrectly return
> > > + * zero even though there is a pre-existing reader throughout.
> > > + * To see this, suppose that task A is in a very long SRCU
> > > + * read-side critical section that started on CPU 0, and that
> > > + * no other reader exists, so that the modulo sum of the counters
> > > + * is equal to one. Then suppose that task B starts executing
> > > + * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > + * task C starts reading on CPU 0, so that its increment is not
> > > + * summed, but finishes reading on CPU 2, so that its decrement
> > > + * -is- summed. Then when task B completes its sum, it will
> > > + * incorrectly get zero, despite the fact that task A has been
> > > + * in its SRCU read-side critical section the whole time.
> > > + *
> > > + * We therefore do a validation step should srcu_readers_active_idx()
> > > + * return zero.
> > > + */
> > > + if (srcu_readers_active_idx(sp, idx) != 0)
> > > + return false;
> > > +
> > > + /*
> > > + * Since the caller recently flipped ->completed, we can see at
> > > + * most one increment of each CPU's counter from this point
> > > + * forward. The reason for this is that the reader CPU must have
> > > + * fetched the index before srcu_readers_active_idx checked
> > > + * that CPU's counter, but not yet incremented its counter.
> > > + * Its eventual counter increment will follow the read in
> > > + * srcu_readers_active_idx(), and that increment is immediately
> > > + * followed by smp_mb() B. Because smp_mb() D is between
> > > + * the ->completed flip and srcu_readers_active_idx()'s read,
> > > + * that CPU's subsequent load of ->completed must see the new
> > > + * value, and therefore increment the counter in the other rank.
> > > + */
> > > + smp_mb(); /* A */
> > > +
> > > + /*
> > > + * Now, we check the ->snap array that srcu_readers_active_idx()
> > > + * filled in from the per-CPU counter values. Since both
> > > + * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > + * upper bits of the per-CPU counter, an increment/decrement
> > > + * pair will change the value of the counter. Since there is
> > > + * only one possible increment, the only way to wrap the counter
> > > + * is to have a huge number of counter decrements, which requires
> > > + * a huge number of tasks and huge SRCU read-side critical-section
> > > + * nesting levels, even on 32-bit systems.
> > > + *
> > > + * All of the ways of confusing the readings require that the scan
> > > + * in srcu_readers_active_idx() see the read-side task's decrement,
> > > + * but not its increment. However, between that decrement and
> > > + * increment are smp_mb() B and C. Either or both of these pair
> > > + * with smp_mb() A above to ensure that the scan below will see
> > > + * the read-side task's increment, thus noting a difference in
> > > + * the counter values between the two passes.
> >
> > Hi Paul,
> >
> > I think the implementation is correct, but the explanation above might
> > be improved. Let's consider the following scenario, where a reader is
> > migrated between increment of the counter and issuing the memory barrier
> > in the read lock:
> >
> > A,B,C are readers
> > D is synchronize_rcu (one flip'n'wait)
> >
> > CPU A CPU B CPU C CPU D
> > c[1]++
> > smp_mb(1)
> > read c[0] -> 0
> > c[0]++
> > (implicit smp_mb (2))
> > -> migrated ->
> > (implicit smp_mb (3))
> > smp_mb (4)
> > smp_mb (5)
> > c[1]--
> > read c[1] -> -1
> > read c[2] -> 1
> > (false 0 sum)
> > smp_mb (6)
> > re-check each.
> > c[1]--
> >
> > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > barriers within thread migration (2 and 3), we can assume that we _will_
> > observe the updated value of c[0] after smp_mb (6).
> >
> > The current explanation states that memory barriers 4 and 5, along with
> > 6, are responsible for ensuring that the increment will be observed by
> > the re-check. However, I doubt they have anything to do with it: it's
> > rather the implicit memory barriers in thread migration, along with
> > program order guarantees on writes to the same address, that seems to be
> > the reason why we can do this ordering assumption.
>
> Please disregard the part about program order: CPU A writes to c[0], and
> CPU B writes to c[1], which are two different memory locations. The rest
> of my discussion stands though.
>
> Simply reasoning about write to c[0], memory barriers 2-3, write to
> c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> enough to explain the ordering guarantees you need, without invoking
> program order.

I am assuming that if the scheduler migrates a process, it applies enough
memory ordering to allow the proof to operate as if it had stayed on a
single CPU throughout. The reasoning for this would consider the
scheduler access and memory barriers -- but there would be an arbitrarily
large number of migration patterns, so I am not convinced that it would
help...

Thanx, Paul

> Thanks,
>
> Mathieu
>
> >
> > Does it make sense, or shall I get another coffee to wake myself up ?
> > ;)
> >
> > Thanks,
> >
> > Mathieu
> >
> > > + *
> > > + * Therefore, if srcu_readers_active_idx() returned zero, and
> > > + * none of the counters changed, we know that the zero was the
> > > + * correct sum.
> > > + *
> > > + * Of course, it is possible that a task might be delayed
> > > + * for a very long time in __srcu_read_lock() after fetching
> > > + * the index but before incrementing its counter. This
> > > + * possibility will be dealt with in __synchronize_srcu().
> > > + */
> > > for_each_possible_cpu(cpu)
> > > - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > - return sum;
> > > + if (sp->snap[cpu] !=
> > > + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > + return false; /* False zero reading! */
> > > + return true;
> > > }
> >
> >
> > --
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>

2012-02-16 10:50:21

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Paul E. McKenney ([email protected]) wrote:
> On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> > On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > > The current implementation of synchronize_srcu_expedited() can cause
> > > severe OS jitter due to its use of synchronize_sched(), which in turn
> > > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > > This can result in severe performance degradation for real-time workloads
> > > and especially for short-iteration-length HPC workloads. Furthermore,
> > > because only one instance of try_stop_cpus() can be making forward progress
> > > at a given time, only one instance of synchronize_srcu_expedited() can
> > > make forward progress at a time, even if they are all operating on
> > > distinct srcu_struct structures.
> > >
> > > This commit, inspired by an earlier implementation by Peter Zijlstra
> > > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > > takes a strictly algorithmic bits-in-memory approach. This has the
> > > disadvantage of requiring one explicit memory-barrier instruction in
> > > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > > completely dispenses with OS jitter and furthermore allows SRCU to be
> > > used freely by CPUs that RCU believes to be idle or offline.
> > >
> > > The update-side implementation handles the single read-side memory
> > > barrier by rechecking the per-CPU counters after summing them and
> > > by running through the update-side state machine twice.
> >
> > Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> > pure magic :-)
> >
> > > This implementation has passed moderate rcutorture testing on both 32-bit
> > > x86 and 64-bit Power. A call_srcu() function will be present in a later
> > > version of this patch.
> >
> > Goodness ;-)
>
> Glad you like the magic and the prospect of call_srcu(). ;-)
>
> > > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > > int idx;
> > >
> > > preempt_disable();
> > > - idx = sp->completed & 0x1;
> > > - barrier(); /* ensure compiler looks -once- at sp->completed. */
> > > - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > - srcu_barrier(); /* ensure compiler won't misorder critical section. */
> > > + idx = rcu_dereference_index_check(sp->completed,
> > > + rcu_read_lock_sched_held()) & 0x1;
> > > + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > > + SRCU_USAGE_COUNT + 1;
> > > + smp_mb(); /* B */ /* Avoid leaking the critical section. */
> > > preempt_enable();
> > > return idx;
> > > }
> >
> > You could use __this_cpu_* muck to shorten some of that.
>
> Ah, so something like this?
>
> ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> SRCU_USAGE_COUNT + 1;
>
> Now that you mention it, this does look nicer, applied here and to
> srcu_read_unlock().

I think Peter refers to __this_cpu_add().

Thanks,

Mathieu

>
> > Acked-by: Peter Zijlstra <[email protected]>
>
>
> Thanx, Paul
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 10:52:48

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > Ah, so something like this?
> >
> > ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> > SRCU_USAGE_COUNT + 1;
> >
> > Now that you mention it, this does look nicer, applied here and to
> > srcu_read_unlock().
>
> I think Peter refers to __this_cpu_add().

I'm not sure that implies the ACCESS_ONCE() thing

2012-02-16 11:00:41

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Paul E. McKenney ([email protected]) wrote:
> On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers ([email protected]) wrote:
> > > * Paul E. McKenney ([email protected]) wrote:
> > > [...]
> > > > +/*
> > > > + * To be called from the update side after an index flip. Returns true
> > > > + * if the modulo sum of the counters is stably zero, false if there is
> > > > + * some possibility of non-zero.
> > > > + */
> > > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > > > {
> > > > int cpu;
> > > > - int sum;
> > > >
> > > > - sum = 0;
> > > > + /*
> > > > + * Note that srcu_readers_active_idx() can incorrectly return
> > > > + * zero even though there is a pre-existing reader throughout.
> > > > + * To see this, suppose that task A is in a very long SRCU
> > > > + * read-side critical section that started on CPU 0, and that
> > > > + * no other reader exists, so that the modulo sum of the counters
> > > > + * is equal to one. Then suppose that task B starts executing
> > > > + * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > > + * task C starts reading on CPU 0, so that its increment is not
> > > > + * summed, but finishes reading on CPU 2, so that its decrement
> > > > + * -is- summed. Then when task B completes its sum, it will
> > > > + * incorrectly get zero, despite the fact that task A has been
> > > > + * in its SRCU read-side critical section the whole time.
> > > > + *
> > > > + * We therefore do a validation step should srcu_readers_active_idx()
> > > > + * return zero.
> > > > + */
> > > > + if (srcu_readers_active_idx(sp, idx) != 0)
> > > > + return false;
> > > > +
> > > > + /*
> > > > + * Since the caller recently flipped ->completed, we can see at
> > > > + * most one increment of each CPU's counter from this point
> > > > + * forward. The reason for this is that the reader CPU must have
> > > > + * fetched the index before srcu_readers_active_idx checked
> > > > + * that CPU's counter, but not yet incremented its counter.
> > > > + * Its eventual counter increment will follow the read in
> > > > + * srcu_readers_active_idx(), and that increment is immediately
> > > > + * followed by smp_mb() B. Because smp_mb() D is between
> > > > + * the ->completed flip and srcu_readers_active_idx()'s read,
> > > > + * that CPU's subsequent load of ->completed must see the new
> > > > + * value, and therefore increment the counter in the other rank.
> > > > + */
> > > > + smp_mb(); /* A */
> > > > +
> > > > + /*
> > > > + * Now, we check the ->snap array that srcu_readers_active_idx()
> > > > + * filled in from the per-CPU counter values. Since both
> > > > + * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > > + * upper bits of the per-CPU counter, an increment/decrement
> > > > + * pair will change the value of the counter. Since there is
> > > > + * only one possible increment, the only way to wrap the counter
> > > > + * is to have a huge number of counter decrements, which requires
> > > > + * a huge number of tasks and huge SRCU read-side critical-section
> > > > + * nesting levels, even on 32-bit systems.
> > > > + *
> > > > + * All of the ways of confusing the readings require that the scan
> > > > + * in srcu_readers_active_idx() see the read-side task's decrement,
> > > > + * but not its increment. However, between that decrement and
> > > > + * increment are smp_mb() B and C. Either or both of these pair
> > > > + * with smp_mb() A above to ensure that the scan below will see
> > > > + * the read-side task's increment, thus noting a difference in
> > > > + * the counter values between the two passes.
> > >
> > > Hi Paul,
> > >
> > > I think the implementation is correct, but the explanation above might
> > > be improved. Let's consider the following scenario, where a reader is
> > > migrated between increment of the counter and issuing the memory barrier
> > > in the read lock:
> > >
> > > A,B,C are readers
> > > D is synchronize_rcu (one flip'n'wait)
> > >
> > > CPU A CPU B CPU C CPU D
> > > c[1]++
> > > smp_mb(1)
> > > read c[0] -> 0
> > > c[0]++
> > > (implicit smp_mb (2))
> > > -> migrated ->
> > > (implicit smp_mb (3))
> > > smp_mb (4)
> > > smp_mb (5)
> > > c[1]--
> > > read c[1] -> -1
> > > read c[2] -> 1
> > > (false 0 sum)
> > > smp_mb (6)
> > > re-check each.
> > > c[1]--
> > >
> > > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > > barriers within thread migration (2 and 3), we can assume that we _will_
> > > observe the updated value of c[0] after smp_mb (6).
> > >
> > > The current explanation states that memory barriers 4 and 5, along with
> > > 6, are responsible for ensuring that the increment will be observed by
> > > the re-check. However, I doubt they have anything to do with it: it's
> > > rather the implicit memory barriers in thread migration, along with
> > > program order guarantees on writes to the same address, that seems to be
> > > the reason why we can do this ordering assumption.
> >
> > Please disregard the part about program order: CPU A writes to c[0], and
> > CPU B writes to c[1], which are two different memory locations. The rest
> > of my discussion stands though.
> >
> > Simply reasoning about write to c[0], memory barriers 2-3, write to
> > c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> > enough to explain the ordering guarantees you need, without invoking
> > program order.
>
> I am assuming that if the scheduler migrates a process, it applies enough
> memory ordering to allow the proof to operate as if it had stayed on a
> single CPU throughout.

When applied to per-cpu variables, the reasoning cannot invoke program
order though, since we're touching two different memory locations.

> The reasoning for this would consider the
> scheduler access and memory barriers

indeed,

>-- but there would be an arbitrarily
> large number of migration patterns, so I am not convinced that it would
> help...

This brings the following question then: which memory barriers, in the
scheduler activity, act as full memory barriers to migrated threads ? I
see that the rq_lock is taken, but this lock is permeable in one
direction (operations can spill into the critical section). I'm probably
missing something else, but this something else probably needs to be
documented somewhere, since we are doing tons of assumptions based on
it.

Thanks,

Mathieu

>
> Thanx, Paul
>
> > Thanks,
> >
> > Mathieu
> >
> > >
> > > Does it make sense, or shall I get another coffee to wake myself up ?
> > > ;)
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > > + *
> > > > + * Therefore, if srcu_readers_active_idx() returned zero, and
> > > > + * none of the counters changed, we know that the zero was the
> > > > + * correct sum.
> > > > + *
> > > > + * Of course, it is possible that a task might be delayed
> > > > + * for a very long time in __srcu_read_lock() after fetching
> > > > + * the index but before incrementing its counter. This
> > > > + * possibility will be dealt with in __synchronize_srcu().
> > > > + */
> > > > for_each_possible_cpu(cpu)
> > > > - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > > - return sum;
> > > > + if (sp->snap[cpu] !=
> > > > + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > > + return false; /* False zero reading! */
> > > > + return true;
> > > > }
> > >
> > >
> > > --
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> >
> > --
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> >
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 11:14:23

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > > Ah, so something like this?
> > >
> > > ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> > > SRCU_USAGE_COUNT + 1;
> > >
> > > Now that you mention it, this does look nicer, applied here and to
> > > srcu_read_unlock().
> >
> > I think Peter refers to __this_cpu_add().
>
> I'm not sure that implies the ACCESS_ONCE() thing
>

Fair point. The "generic" fallback for this_cpu_add does not imply the
ACCESS_ONCE() semantic AFAIK. I don't know if there would be a clean way
to get both without duplicating these operations needlessly.
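
For concreteness, the two forms being compared look roughly like this
(the __this_cpu_add() line is a sketch of the suggested alternative,
not code from the posted patch):

        /* Posted form: ACCESS_ONCE() forces a volatile read-modify-write. */
        ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
                SRCU_USAGE_COUNT + 1;

        /*
         * Shorter __this_cpu_add() form; on architectures that fall back
         * to the generic implementation this is a plain, non-volatile RMW.
         */
        __this_cpu_add(sp->per_cpu_ref->c[idx], SRCU_USAGE_COUNT + 1);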

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 11:52:47

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> This brings the following question then: which memory barriers, in the
> scheduler activity, act as full memory barriers to migrated threads ? I
> see that the rq_lock is taken, but this lock is permeable in one
> direction (operations can spill into the critical section). I'm probably
> missing something else, but this something else probably needs to be
> documented somewhere, since we are doing tons of assumptions based on
> it.

A migration consists of two context switches, one switching out the task
on the old cpu, and one switching in the task on the new cpu.

Now on x86 all the rq->lock grabbery is plenty implied memory barriers
to make anybody happy.

But I think, since there's guaranteed order (can't switch to before
switching from) you can match the UNLOCK from the switch-from to the
LOCK from the switch-to to make your complete MB.

Does that work or do we need more?

2012-02-16 12:18:16

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> > This brings the following question then: which memory barriers, in the
> > scheduler activity, act as full memory barriers to migrated threads ? I
> > see that the rq_lock is taken, but this lock is permeable in one
> > direction (operations can spill into the critical section). I'm probably
> > missing something else, but this something else probably needs to be
> > documented somewhere, since we are doing tons of assumptions based on
> > it.
>
> A migration consists of two context switches, one switching out the task
> on the old cpu, and one switching in the task on the new cpu.

If we have memory barriers on both context switches, then we should be
good. I just fail to see them.

> Now on x86 all the rq->lock grabbery is plenty implied memory barriers
> to make anybody happy.

Indeed. Outside of x86 it is far less certain, though.

> But I think, since there's guaranteed order (can't switch to before
> switching from) you can match the UNLOCK from the switch-from to the
> LOCK from the switch-to to make your complete MB.
>
> Does that work or do we need more?

Hrm, I think we'd need a little more than just lock/unlock ordering
guarantees. Let's consider the following, where the stores would be
expected to be seen as "store A before store B" by CPU 2

CPU 0                       CPU 1                       CPU 2

                                                        load B, smp_rmb, load A in loop,
                                                        expecting that when updated A is
                                                        observed, B is always observed as
                                                        updated too.
store A
(lock is permeable:
 outside can leak
 inside)
lock(rq->lock)

         -> migration ->

                            unlock(rq->lock)
                            (lock is permeable:
                             outside can leak inside)
                            store B

As we can see, the "store A" could theoretically still be pending in
CPU 0's write buffers when store B occurs, because the memory barrier
associated with "lock" only has acquire semantics (so memory operations
prior to the lock can leak into the critical section).

Given that the unlock(rq->lock) on CPU 0 is not guaranteed to happen in
a bounded time frame, no memory barrier with release semantics can be
assumed to have happened. This could happen if we have a long critical
section holding the rq->lock on CPU 0, and a much shorter critical
section on CPU 1.

Does that make sense, or should I get my first morning coffee ? :)

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 12:45:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
>
> Hrm, I think we'd need a little more than just lock/unlock ordering
> guarantees. Let's consider the following, where the stores would be
> expected to be seen as "store A before store B" by CPU 2
>
> CPU 0 CPU 1 CPU 2
>
> load B, smp_rmb, load A in loop,
> expecting that when updated A is
> observed, B is always observed as
> updated too.
> store A
> (lock is permeable:
> outside can leak
> inside)
> lock(rq->lock)
>
> -> migration ->
>
> unlock(rq->lock)
> (lock is permeable:
> outside can leak inside)
> store B

You got the pairing the wrong way around, I suggested:

store A

switch-out
UNLOCK

-> migration ->

switch-in
LOCK

store B

While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
and B won't pass the LOCK.

Yes, A can pass switch-out LOCK, but that doesn't matter much since the
switch-in cannot happen until we've passed UNLOCK.

And yes B can pass switch-in UNLOCK, but again, I can't see that being a
problem since the LOCK will avoid it being visible before A.

> Does that make sense, or should I get my first morning coffee ? :)

Probably.. but that's not saying I'm not wrong ;-)

2012-02-16 14:52:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> >
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> >
> > CPU 0 CPU 1 CPU 2
> >
> > load B, smp_rmb, load A in loop,
> > expecting that when updated A is
> > observed, B is always observed as
> > updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> >
> > -> migration ->
> >
> > unlock(rq->lock)
> > (lock is permeable:
> > outside can leak inside)
> > store B
>
> You got the pairing the wrong way around, I suggested:
>
> store A
>
> switch-out
> UNLOCK
>
> -> migration ->
>
> switch-in
> LOCK
>
> store B
>
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
>
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
>
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.

Ah, so this is what I missed: the context switch has its lock/unlock
pair, the following migration is performed under its own lock/unlock
pair, and the following context switch also has its lock/unlock pair. So
yes, this should be sufficient to act as a full memory barrier.

>
> > Does that make sense, or should I get my first morning coffee ? :)
>
> Probably.. but that's not saying I'm not wrong ;-)

It does pass my 1st morning coffee test still, so it looks good, at
least to me. :-)

Back to the initial subject: for general code understanding, I think it
is important that when RCU plays tricks on per-cpu variables based on
assumptions about the scheduler's migration memory ordering, it says so
explicitly, rather than claiming that the memory barriers match those
at the RCU read-side lock/unlock sites, which is not quite right.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2012-02-16 14:59:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Thu, 2012-02-16 at 09:52 -0500, Mathieu Desnoyers wrote:
> Ah, so this is what I missed: the context switch has its lock/unlock
> pair, the following migration is performed under its own lock/unlock
> pair, and the following context switch also has its lock/unlock pair. So
> yes, this should be sufficient to act as a full memory barrier.

In fact it typically looks like:

LOCK(rq0->lock)
switch-from-A
UNLOCK(rq0->lock);

LOCK(rq0->lock);
LOCK(rq1->lock);
migrate-A
UNLOCK(rq1->lock);
UNLOCK(rq0->lock);

LOCK(rq1->lock);
switch-to-A
UNLOCK(rq1->lock);

the migrate taking both locks involved guarantees that the switch-to
always comes _after_ the switch_from. The migrate might be performed on
a completely unrelated cpu although typically it would be either the old
(push) or the new (pull).
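
To make that pairing concrete, here is a litmus-test-style sketch of
the argument (illustration only; the rq0/rq1 names follow the sequence
above, while A, B, and the ACCESS_ONCE() accesses are assumed variables,
not scheduler code):

        /* CPU 0: switch task T out */
        ACCESS_ONCE(A) = 1;             /* store A by T                        */
        raw_spin_lock(&rq0->lock);      /* A may leak past this LOCK ...       */
        /* switch-from T */
        raw_spin_unlock(&rq0->lock);    /* ... but cannot pass this UNLOCK     */

        /* some CPU: migrate T, holding both runqueue locks */
        raw_spin_lock(&rq0->lock);
        raw_spin_lock(&rq1->lock);
        /* move T from rq0 to rq1 */
        raw_spin_unlock(&rq1->lock);
        raw_spin_unlock(&rq0->lock);

        /* CPU 1: switch T in */
        raw_spin_lock(&rq1->lock);      /* store B below cannot pass this LOCK */
        /* switch-to T */
        raw_spin_unlock(&rq1->lock);
        ACCESS_ONCE(B) = 1;             /* store B by T, after the migration   */

        /* CPU 2: per the pairing argument above, seeing B implies seeing A */
        while (!ACCESS_ONCE(B))
                cpu_relax();
        smp_rmb();
        BUG_ON(ACCESS_ONCE(A) != 1);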

2012-02-16 15:14:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Thu, Feb 16, 2012 at 01:44:41PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> >
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> >
> > CPU 0 CPU 1 CPU 2
> >
> > load B, smp_rmb, load A in loop,
> > expecting that when updated A is
> > observed, B is always observed as
> > updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> >
> > -> migration ->
> >
> > unlock(rq->lock)
> > (lock is permeable:
> > outside can leak inside)
> > store B
>
> You got the pairing the wrong way around, I suggested:
>
> store A
>
> switch-out
> UNLOCK
>
> -> migration ->
>
> switch-in
> LOCK
>
> store B
>
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
>
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
>
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.
>
> > Does that make sense, or should I get my first morning coffee ? :)
>
> Probably.. but that's not saying I'm not wrong ;-)

It does look good to me, but given that I don't drink coffee, you should
take that with a large grain of salt.

Thanx, Paul

2012-02-20 07:11:15

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On 02/13/2012 10:09 AM, Paul E. McKenney wrote:

> /*
> * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> */
> -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> {
> int idx;
>
> @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> !lock_is_held(&rcu_sched_lock_map),
> "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
>
> - idx = sp->completed;
> + idx = ACCESS_ONCE(sp->completed);
> mutex_lock(&sp->mutex);
>
> /*
> * Check to see if someone else did the work for us while we were
> - * waiting to acquire the lock. We need -two- advances of
> + * waiting to acquire the lock. We need -three- advances of
> * the counter, not just one. If there was but one, we might have
> * shown up -after- our helper's first synchronize_sched(), thus
> * having failed to prevent CPU-reordering races with concurrent
> - * srcu_read_unlock()s on other CPUs (see comment below). So we
> - * either (1) wait for two or (2) supply the second ourselves.
> + * srcu_read_unlock()s on other CPUs (see comment below). If there
> + * was only two, we are guaranteed to have waited through only one
> + * full index-flip phase. So we either (1) wait for three or
> + * (2) supply the additional ones we need.
> */
>
> - if ((sp->completed - idx) >= 2) {
> + if (sp->completed == idx + 2)
> + idx = 1;
> + else if (sp->completed == idx + 3) {
> mutex_unlock(&sp->mutex);
> return;
> - }
> -
> - sync_func(); /* Force memory barrier on all CPUs. */
> + } else
> + idx = 0;


Hi, Paul

I don't think this check-and-return path is needed since we will introduce call_srcu().
We just need correct code to show how it works and to be used for a while;
the new call_srcu() will be implemented based on this correct code, and this path will then be removed.

And I think this unneeded check-and-return path is incorrect. See the following:

Reader                     Updater                                    Helper thread

                           old_ptr = rcu_ptr;
                           /* rcu_ptr = NULL; but be reordered to (1) */
                           start synchronize_srcu()
                           idx = ACCESS_ONCE(sp->completed); (2)
                                                                      synchronize_srcu()
                                                                      synchronize_srcu()
srcu_read_lock();
old_ptr = rcu_ptr;
                           rcu_ptr = NULL; (1)
                           mutex_lock() and read sp->completed
                           and return from synchronize_srcu()
                           free(old_ptr);
use freed old_ptr
srcu_read_unlock();


So, we need an smp_mb() between (1) and (2) to force the order.

__synchronize_srcu() {
        smp_mb(); /* F */
        idx = ACCESS_ONCE(sp->completed); /* (2) */
        ....
}

And this smp_mb() F is paired with the helper's smp_mb() D. So if the Updater sees X advances of
->completed, the Updater must have seen X *full* flip_and_wait() cycles. So we need to see only
-two- advances of ->completed from the Helper, not -three-.

if (sp->completed == idx + 1)
        idx = 1;
else if (sp->completed == idx + 2) {
        mutex_unlock(&sp->mutex);
        return;
} else
        idx = 0;


Or simpler:

__synchronize_srcu() {
        unsigned int idx; /* <-------- unsigned */

        /* comments for smp_mb() F */
        smp_mb(); /* F */
        idx = ACCESS_ONCE(sp->completed);

        mutex_lock(&sp->mutex);
        idx = sp->completed - idx;

        /* original comments */
        for (; idx < 2; idx++)
                flip_idx_and_wait(sp, expedited);
        mutex_unlock(&sp->mutex);
}

Finally, I can't understand the comments on this check-and-return path,
so maybe the above reply, and I, are totally wrong.
But the comments on this check-and-return path do not describe the code
well (especially the ordering), and they still mention the old
"synchronize_sched()", which confuses me.

My conclusion, we can just remove the check-and-return path to reduce
the complexity since we will introduce call_srcu().

This new srcu is great; I especially like the SRCU_USAGE_COUNT for every
lock/unlock, which forces any increment/decrement pair to change the
counter.


Thanks,
Lai

2012-02-20 17:44:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Mon, Feb 20, 2012 at 03:15:33PM +0800, Lai Jiangshan wrote:
> On 02/13/2012 10:09 AM, Paul E. McKenney wrote:
>
> > /*
> > * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> > */
> > -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> > +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> > {
> > int idx;
> >
> > @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> > !lock_is_held(&rcu_sched_lock_map),
> > "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
> >
> > - idx = sp->completed;
> > + idx = ACCESS_ONCE(sp->completed);
> > mutex_lock(&sp->mutex);
> >
> > /*
> > * Check to see if someone else did the work for us while we were
> > - * waiting to acquire the lock. We need -two- advances of
> > + * waiting to acquire the lock. We need -three- advances of
> > * the counter, not just one. If there was but one, we might have
> > * shown up -after- our helper's first synchronize_sched(), thus
> > * having failed to prevent CPU-reordering races with concurrent
> > - * srcu_read_unlock()s on other CPUs (see comment below). So we
> > - * either (1) wait for two or (2) supply the second ourselves.
> > + * srcu_read_unlock()s on other CPUs (see comment below). If there
> > + * was only two, we are guaranteed to have waited through only one
> > + * full index-flip phase. So we either (1) wait for three or
> > + * (2) supply the additional ones we need.
> > */
> >
> > - if ((sp->completed - idx) >= 2) {
> > + if (sp->completed == idx + 2)
> > + idx = 1;
> > + else if (sp->completed == idx + 3) {
> > mutex_unlock(&sp->mutex);
> > return;
> > - }
> > -
> > - sync_func(); /* Force memory barrier on all CPUs. */
> > + } else
> > + idx = 0;
>
>
> Hi, Paul
>
> I don't think this check-and-return path is needed since we will introduce call_srcu().
>
> We just need a correct code to show how it works and to be used for a while,
> and new call_srcu() will be implemented based on this correct code which will be removed.

Hello, Lai!

Yep, this code will be replaced with a state machine driven by callbacks.

> And I think this unneeded check-and-return path is incorrect. See the following:
>
> Reader Updater Helper thread
> old_ptr = rcu_ptr;
> /* rcu_ptr = NULL; but be reordered to (1) */
> start synchronize_srcu()
> idx = ACCESS_ONCE(sp->completed);(2)
> synchronize_srcu()
> synchronize_srcu()
> srcu_read_lock();
> old_ptr = rcu_ptr;
> rcu_ptr = NULL;(1)
> mutex_lock() and read sp->completed
> and return from synchronize_srcu()
> free(old_ptr);
> use freed old_ptr
> srcu_read_unlock();
>
>
> So, we need a smb_mb() between (1) and (2) to force the order.
>
> __synchronize_srcu() {
> smp_mb(); /* F */
> idx = ACCESS_ONCE(sp->completed); /* (2) */

And one here as well because mutex_lock() allows code to bleed in from
outside the critical section.

> ....
> }

Good catch! And shows the limitations of testing -- I hit this pretty
hard and didn't get a failure. I was focused so much on the complex
part of the patch that I failed to get the simple stuff right!!!

Shows the value of the Linux community's review processes, I guess. ;-)

> And this smp_mb() F is paired with helper's smp_mb() D. So if Updater sees X advances of
> ->completed, Updater must sees X times of *full* flip_and_wait(). So We need see -two- advances of
> ->completed from Helper only, not -three-.

Hmmm... Let's see... The case I was worried about is where the updater
samples ->completed just before it is incremented, then samples it again
just after it is incremented a second time. So you are arguing that this
cannot happen because the second sample occurs after acquiring the lock,
so that the second flip-and-wait cycle has to have already completed,
correct?

So waiting for three is appropriate for mutex_trylock(), but is overly
conservative for mutex_lock().

> if (sp->completed == idx + 1)
> idx = 1;
> else if (sp->completed == idx + 2) {
> mutex_unlock(&sp->mutex);
> return;
> } else
> idx = 0;
>
>
> Or simpler:
>
> __synchronize_srcu() {
> unsigned int idx; /* <-------- unsigned */
>
> /* comments for smp_mb() F */
> smp_mb(); /* F */
> idx = ACCESS_ONCE(sp->completed);
>
> mutex_lock(&sp->mutex);
> idx = sp->completed - idx;
>
> /* original comments */
> for (; idx < 2; idx++)
> flip_idx_and_wait(sp, expedited);
> mutex_unlock(&sp->mutex);
> }
>
> At last, I can't understand the comments of this check-and-return path.
> So maybe the above reply and I are totally wrong.

I -think- you might be correct, but my approach is going to be to implement
call_srcu() which will eliminate this anyway.

> But the comments of this check-and-return path does not describe the code
> well(especially the order), and it contains the old "synchronize_sched()"
> which make me confuse.

The diffs are confusing -- I have to look at the actual code in this case.

> My conclusion, we can just remove the check-and-return path to reduce
> the complexity since we will introduce call_srcu().

If I actually submit the above upstream, that would be quite reasonable.
My thought is that patch remains RFC and the upstream version has
call_srcu().

> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
> lock/unlock witch forces any increment/decrement pair changes the counter
> for me.

Glad you like it! ;-)

And thank you for your review and feedback!

Thanx, Paul

> Thanks,
> Lai
>

2012-02-21 01:07:32

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On 02/21/2012 01:44 AM, Paul E. McKenney wrote:

>
>> My conclusion, we can just remove the check-and-return path to reduce
>> the complexity since we will introduce call_srcu().
>
> If I actually submit the above upstream, that would be quite reasonable.
> My thought is that patch remains RFC and the upstream version has
> call_srcu().

Has the work on call_srcu() been started or drafted?

>
>> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
>> lock/unlock witch forces any increment/decrement pair changes the counter
>> for me.
>
> Glad you like it! ;-)
>
> And thank you for your review and feedback!

Could you add my Reviewed-by when this patch is last submitted?


Reviewed-by: Lai Jiangshan <[email protected]>

Thanks
Lai

2012-02-21 01:50:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
>
> >
> >> My conclusion, we can just remove the check-and-return path to reduce
> >> the complexity since we will introduce call_srcu().
> >
> > If I actually submit the above upstream, that would be quite reasonable.
> > My thought is that patch remains RFC and the upstream version has
> > call_srcu().
>
> Does the work of call_srcu() is started or drafted?

I do have a draft design, and am currently beating it into shape.
No actual code yet, though. The general idea at the moment is as follows:

o The state machine must be preemptible. I recently received
a bug report about 200-microsecond latency spikes on a system
with more than a thousand CPUs, so the summation of the per-CPU
counters and subsequent recheck cannot be in a preempt-disable
region. I am therefore currently thinking in terms of a kthread.

o At the moment, having a per-srcu_struct kthread seems excessive.
I am planning on a single kthread to do the counter summation
and checking. Further parallelism might be useful in the future,
but I would want to see someone run into problems before adding
more complexity.

o There needs to be a linked list of srcu_struct structures so
that they can be traversed by the state-machine kthread.

o If there are expedited SRCU callbacks anywhere, the kthread
would scan through the list of srcu_struct structures quickly
(perhaps pausing a few microseconds between). If there are no
expedited SRCU callbacks, the kthread would wait a jiffy or so
between scans.

o If a given srcu_struct structure has been scanned too many times
(say, more than ten times) while waiting for the counters to go
to zero, it loses expeditedness. It makes no sense for the kthread
to go CPU-bound just because some SRCU reader somewhere is blocked
in its SRCU read-side critical section.

o Expedited SRCU callbacks cannot be delayed by normal SRCU
callbacks, but neither can expedited callbacks be allowed to
starve normal callbacks. I am thinking in terms of invoking these
from softirq context, with a pair of multi-tailed callback queues
per CPU, stored in the same structure as the per-CPU counters.

o There are enough srcu_struct structures in the Linux kernel that
it does not make sense to force softirq to dig through them all
any time any one of them has callbacks ready to invoke. One way
to deal with this is to have a per-CPU set of linked lists
of srcu_struct_array structures, so that the kthread enqueues
a given structure when it transitions to having callbacks ready
to invoke, and softirq dequeues it. This can be done locklessly
given that there is only one producer and one consumer.

o We can no longer use the trick of pushing callbacks to another
CPU from the CPU_DYING notifier because it is likely that CPU
hotplug will stop using stop_cpus(). I am therefore thinking
in terms of a set of orphanages (two for normal, two more for
expedited -- one set of each for callbacks ready to invoke,
the other for still-waiting callbacks).

o There will need to be an srcu_barrier() that can be called
before cleanup_srcu_struct(). Otherwise, someone will end up
freeing up an srcu_struct that still has callbacks outstanding.
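
As a purely hypothetical sketch of the single state-machine kthread
described in the list above (the names srcu_struct_list,
srcu_list_mutex, ->list, and srcu_advance_state() are invented for
illustration and do not exist in the patch):

        static LIST_HEAD(srcu_struct_list);     /* all registered srcu_struct */
        static DEFINE_MUTEX(srcu_list_mutex);

        static int srcu_state_machine(void *unused)
        {
                struct srcu_struct *sp;
                bool expedited;

                while (!kthread_should_stop()) {
                        expedited = false;

                        mutex_lock(&srcu_list_mutex);
                        list_for_each_entry(sp, &srcu_struct_list, list) {
                                /*
                                 * Sum the per-CPU counters, advance this
                                 * srcu_struct's grace period if possible,
                                 * and note any remaining expedited work.
                                 */
                                expedited |= srcu_advance_state(sp);
                        }
                        mutex_unlock(&srcu_list_mutex);

                        if (expedited)
                                udelay(10);     /* rescan quickly */
                        else
                                schedule_timeout_interruptible(1);
                }
                return 0;
        }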

But what did you have in mind?

> >> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
> >> lock/unlock witch forces any increment/decrement pair changes the counter
> >> for me.
> >
> > Glad you like it! ;-)
> >
> > And thank you for your review and feedback!
>
> Could you add my Reviewed-by when this patch is last submitted?
>
>
> Reviewed-by: Lai Jiangshan <[email protected]>

Will do, thank you!

Thanx, Paul

2012-02-21 08:40:11

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
>> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
>>
>>>
>>>> My conclusion, we can just remove the check-and-return path to reduce
>>>> the complexity since we will introduce call_srcu().
>>>
>>> If I actually submit the above upstream, that would be quite reasonable.
>>> My thought is that patch remains RFC and the upstream version has
>>> call_srcu().
>>
>> Does the work of call_srcu() is started or drafted?
>
> I do have a draft design, and am currently beating it into shape.
> No actual code yet, though. The general idea at the moment is as follows:

If you don't mind, I will implement it. (It requires your new version of the SRCU implementation.)

>
> o The state machine must be preemptible. I recently received
> a bug report about 200-microsecond latency spikes on a system
> with more than a thousand CPUs, so the summation of the per-CPU
> counters and subsequent recheck cannot be in a preempt-disable
> region. I am therefore currently thinking in terms of a kthread.

sure.

Addition:
SRCU callbacks must run in process context and be allowed to sleep
(and therefore they finish out of order). We can't change the current
RCU callback type, so we must design a new one for SRCU. This does not
introduce any complexity, because we reuse the workqueue: all ready
SRCU callbacks are delivered to a workqueue, so the state-machine
thread only has to do counter summation, checking, and delivering.

The workqueue handles all the complexity for us; it will automatically
allocate threads when needed (when a callback sleeps, or when there are
too many ready callbacks).

struct srcu_head is a little bigger than struct rcu_head; it contains
a union for a struct work_struct.

(synchronize_srcu()'s callbacks are handled specially.)
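
As a hypothetical sketch of the srcu_head layout described above
(assumption for illustration, not code from this thread): while a
callback waits for its grace period it is linked through ->next; once
it is ready, the same storage is reused as a work_struct and queued:

        struct srcu_head {
                union {
                        struct srcu_head *next;   /* while waiting for a grace period  */
                        struct work_struct work;  /* once ready: handed to a workqueue */
                };
                void (*func)(struct srcu_head *head);
        };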

>
> o At the moment, having a per-srcu_struct kthread seems excessive.
> I am planning on a single kthread to do the counter summation
> and checking. Further parallelism might be useful in the future,
> but I would want to see someone run into problems before adding
> more complexity.

Simplicity is important. I vote for a single kthread to do the counter summation
and checking, while keeping it convenient to introduce parallel threads
later if needed.

>
> o There needs to be a linked list of srcu_struct structures so
> that they can be traversed by the state-machine kthread.
>
> o If there are expedited SRCU callbacks anywhere, the kthread
> would scan through the list of srcu_struct structures quickly
> (perhaps pausing a few microseconds between). If there are no
> expedited SRCU callbacks, the kthread would wait a jiffy or so
> between scans.

Sure.
But I think generic expedited SRCU callbacks make no sense;
we could just allow expedited SRCU callbacks for synchronize_srcu_expedited()
only.

>
> o If a given srcu_struct structure has been scanned too many times
> (say, more than ten times) while waiting for the counters to go
> to zero, it loses expeditedness. It makes no sense for the kthread
> to go CPU-bound just because some SRCU reader somewhere is blocked
> in its SRCU read-side critical section.
>
> o Expedited SRCU callbacks cannot be delayed by normal SRCU
> callbacks, but neither can expedited callbacks be allowed to
> starve normal callbacks. I am thinking in terms of invoking these
> from softirq context, with a pair of multi-tailed callback queues
> per CPU, stored in the same structure as the per-CPU counters.

In my additional design, callbacks do not run in softirq context,
and the state-machine thread is not delayed by any normal callbacks.

But all of synchronize_srcu()'s callbacks are done in the state-machine
thread (each is just a wake-up), not in the workqueue.

Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
expedited SRCU callbacks will not be delayed by normal SRCU callbacks.

I don't think we need per-CPU callback queues, since SRCU callbacks
are rare (compared to RCU callbacks).

>
> o There are enough srcu_struct structures in the Linux that
> it does not make sense to force softirq to dig through them all
> any time any one of them has callbacks ready to invoke. One way
> to deal with this is to have a per-CPU set of linked lists of
> of srcu_struct_array structures, so that the kthread enqueues
> a given structure when it transitions to having callbacks ready
> to invoke, and softirq dequeues it. This can be done locklessly
> given that there is only one producer and one consumer.
>
> o We can no longer use the trick of pushing callbacks to another
> CPU from the CPU_DYING notifier because it is likely that CPU
> hotplug will stop using stop_cpus(). I am therefore thinking
> in terms of a set of orphanages (two for normal, two more for
> expedited -- one set of each for callbacks ready to invoke,
> the other for still-waiting callbacks).

There are no such issues in srcu.c if we use a workqueue.

>
> o There will need to be an srcu_barrier() that can be called
> before cleanup_srcu_struct(). Otherwise, someone will end up
> freeing up an srcu_struct that still has callbacks outstanding.

Sure.

When using a workqueue, the delivery is ordered.

srcu_barrier()
{
        synchronize_srcu();
        flush_workqueue();
}


>
> But what did you have in mind?

I agree with most of your design, but my ideas of relaxing what callbacks
are allowed to do and of using workqueues make things simpler.

Thanks,
Lai

>
>>>> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
>>>> lock/unlock witch forces any increment/decrement pair changes the counter
>>>> for me.
>>>
>>> Glad you like it! ;-)
>>>
>>> And thank you for your review and feedback!
>>
>> Could you add my Reviewed-by when this patch is last submitted?
>>
>>
>> Reviewed-by: Lai Jiangshan <[email protected]>
>
> Will do, thank you!
>
> Thanx, Paul
>
>

2012-02-21 17:25:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation

On Tue, Feb 21, 2012 at 04:44:22PM +0800, Lai Jiangshan wrote:
> On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> > On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> >> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
> >>
> >>>
> >>>> My conclusion, we can just remove the check-and-return path to reduce
> >>>> the complexity since we will introduce call_srcu().
> >>>
> >>> If I actually submit the above upstream, that would be quite reasonable.
> >>> My thought is that patch remains RFC and the upstream version has
> >>> call_srcu().
> >>
> >> Does the work of call_srcu() is started or drafted?
> >
> > I do have a draft design, and am currently beating it into shape.
> > No actual code yet, though. The general idea at the moment is as follows:
>
> If you don't mind, I will implement it.(it requires your new version of SRCU implementation)

I would very much welcome a patch from you for call_srcu()!

I have an rcu/srcu branch for this purpose at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

> > o The state machine must be preemptible. I recently received
> > a bug report about 200-microsecond latency spikes on a system
> > with more than a thousand CPUs, so the summation of the per-CPU
> > counters and subsequent recheck cannot be in a preempt-disable
> > region. I am therefore currently thinking in terms of a kthread.
>
> sure.
>
> Addition:
> Srcu callbacks must run on process context and sleepable,
> (and therefore they finish orderless). We can't change
> the current rcu-callback, we must design a new for srcu.
> These do not introduce any complexity, we reuse the workqueue.
> All ready srcu-callback will be delivered to workqueue.
> so state-machine-thread only to do counter summation
> and checking and delivering.
>
> workqueue do all the complexity for us, it will auto
> alloc threads when needed(when the callback is sleep, or
> there are too much ready callbacks).
>
> struct srcu_head is a little bigger than struct rcu_head,
> it contain a union for struct work_struct.
>
> (synchronize_srcu()'s callbacks are special handled)

Full separation of counter summation etc. from callback invocation sounds
very good. There does need to be something like an srcu_barrier(), so the
SRCU callbacks posted by a given CPU do need to be invoked in the order
that they were posted (unless you have some other trick in mind, in which
case please let us all know what it is). Given the need to keep things
in order, I do not believe that we can allow the SRCU callbacks to sleep.
Note that there is no need for srcu_barrier() to wait on the expedited
SRCU callbacks -- there should not be a call_srcu_expedited(), so the
only way for the callbacks to appear is synchronize_srcu_expedited(),
which does not return until after its callback has been invoked.
This means that expedited SRCU callbacks need not be ordered -- in fact,
the expedited path need not use callbacks at all, if that works better.
(I was thinking in terms of expedited callbacks for SRCU because that
makes the state machine simpler.)

That said, the idea of invoking them from a workqueue seems reasonable,
for example, you can do local_bh_disable() around each invocation. Doing
this also gives us the flexibility to move SRCU callback invocation into
softirq if that proves necessary for whatever reason. (Yes, I do still
well remember the 3.0 fun and excitement with RCU and kthreads!)

But I have to ask... Why does SRCU need more space than RCU? Can't
the selection of workqueue be implicit based on which CPU the callback
is queued on?
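
A minimal sketch of that invocation style (assuming a hypothetical
srcu_head with a ->func field and an embedded work_struct; none of
these names are from the patch):

        static void srcu_invoke_one(struct work_struct *work)
        {
                struct srcu_head *head = container_of(work, struct srcu_head, work);

                local_bh_disable();     /* keeps the softirq option open */
                head->func(head);
                local_bh_enable();
        }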

> > o At the moment, having a per-srcu_struct kthread seems excessive.
> > I am planning on a single kthread to do the counter summation
> > and checking. Further parallelism might be useful in the future,
> > but I would want to see someone run into problems before adding
> > more complexity.
>
> Simple is important, I vote for a single kthread to do the counter summation
> and checking, and left the convenient that we can introduce parallelism
> thread easy.

Very good!

> > o There needs to be a linked list of srcu_struct structures so
> > that they can be traversed by the state-machine kthread.
> >
> > o If there are expedited SRCU callbacks anywhere, the kthread
> > would scan through the list of srcu_struct structures quickly
> > (perhaps pausing a few microseconds between). If there are no
> > expedited SRCU callbacks, the kthread would wait a jiffy or so
> > between scans.
>
> Sure
> But I think generic expedited SRCU callbacks are make no sense,
> we could just allow expedited SRCU callbacks for synchronize_srcu_expedited()
> only.

Agreed -- We don't have call_rcu_expedited(), so we should definitely
not provide call_srcu_expedited().

> > o If a given srcu_struct structure has been scanned too many times
> > (say, more than ten times) while waiting for the counters to go
> > to zero, it loses expeditedness. It makes no sense for the kthread
> > to go CPU-bound just because some SRCU reader somewhere is blocked
> > in its SRCU read-side critical section.
> >
> > o Expedited SRCU callbacks cannot be delayed by normal SRCU
> > callbacks, but neither can expedited callbacks be allowed to
> > starve normal callbacks. I am thinking in terms of invoking these
> > from softirq context, with a pair of multi-tailed callback queues
> > per CPU, stored in the same structure as the per-CPU counters.
>
> My Addition design, callbacks are not running in softirq context.
> And state-machine-thread is not delay by any normal callbacks.
>
> But all synchronize_srcu()'s callback are done in state-machine-thread
> (it is just a wake up), not in workqueue.

So the idea is that the function pointer is a task_struct pointer in this
case, and that you check for a text address to decide which? In any case,
a per-callback wake up should be OK.

> Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
> Expedited SRCU callbacks will not be delayed by normal srcu callbacks.

OK, good.

> I don't think we need to use per CPU callback queues, since SRCU callbacks
> are rare(compared to rcu callbacks)

Indeed, everything currently in the kernel would turn into a wakeup. ;-)

> > o There are enough srcu_struct structures in the Linux that
> > it does not make sense to force softirq to dig through them all
> > any time any one of them has callbacks ready to invoke. One way
> > to deal with this is to have a per-CPU set of linked lists of
> > of srcu_struct_array structures, so that the kthread enqueues
> > a given structure when it transitions to having callbacks ready
> > to invoke, and softirq dequeues it. This can be done locklessly
> > given that there is only one producer and one consumer.
> >
> > o We can no longer use the trick of pushing callbacks to another
> > CPU from the CPU_DYING notifier because it is likely that CPU
> > hotplug will stop using stop_cpus(). I am therefore thinking
> > in terms of a set of orphanages (two for normal, two more for
> > expedited -- one set of each for callbacks ready to invoke,
> > the other for still-waiting callbacks).
>
> no such issues in srcu.c if we use workqueue.

OK, sounds good.

> > o There will need to be an srcu_barrier() that can be called
> > before cleanup_srcu_struct(). Otherwise, someone will end up
> > freeing up an srcu_struct that still has callbacks outstanding.
>
> Sure.
>
> when use workqueue, the delivering is ordered.
>
> srcu_barrier()
> {
> synchronzie_srcu();
> flush_workqueue();
> }

Would this handle the case where the SRCU kthread migrated from one CPU
to another? My guess is that you would need to flush all the workqueues
that might have ever had SRCU-related work pending on them.

> > But what did you have in mind?
>
> I agree most of your design, but my relaxing the ability for callbacks and
> using workqueue ideas make things simpler.

Your approach looks good in general. My main concerns are:

1. We should prohibit sleeping in SRCU callback functions, at least
for the near future -- just in case something forces us to move
to softirq.

2. If SRCU callbacks can use rcu_head, that would be good -- I don't
yet see why any additional space is needed.

Thanx, Paul

> Thanks,
> Lai
>
> >
> >>>> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
> >>>> lock/unlock witch forces any increment/decrement pair changes the counter
> >>>> for me.
> >>>
> >>> Glad you like it! ;-)
> >>>
> >>> And thank you for your review and feedback!
> >>
> >> Could you add my Reviewed-by when this patch is last submitted?
> >>
> >>
> >> Reviewed-by: Lai Jiangshan <[email protected]>
> >
> > Will do, thank you!
> >
> > Thanx, Paul
> >
> >
>

2012-02-22 09:25:24

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path

Hi, All

These three patches reduce the number of states in SRCU.
call_srcu() will be implemented based on this new algorithm if it has no problems.

It is an aggressive algorithm and it needs more review; please examine it critically
and leave your comments.

Thanks,
Lai.

From abe3fd64d08f74f13e8111e333a9790e9e6d782c Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Wed, 22 Feb 2012 10:15:48 +0800
Subject: [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path

This fast-check path is used to optimize the situation in which
there are many concurrent update sites.

But we have no such situation in the current kernel,
and it introduces complexity, so we just remove it.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/srcu.c | 25 +------------------------
1 files changed, 1 insertions(+), 24 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index 84c9b97..17e95bc 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -308,7 +308,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
*/
static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
{
- int idx;
+ int idx = 0;

rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
!lock_is_held(&rcu_bh_lock_map) &&
@@ -316,32 +316,9 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
!lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");

- smp_mb(); /* Ensure prior action happens before grace period. */
- idx = ACCESS_ONCE(sp->completed);
- smp_mb(); /* Access to ->completed before lock acquisition. */
mutex_lock(&sp->mutex);

/*
- * Check to see if someone else did the work for us while we were
- * waiting to acquire the lock. We need -three- advances of
- * the counter, not just one. If there was but one, we might have
- * shown up -after- our helper's first synchronize_sched(), thus
- * having failed to prevent CPU-reordering races with concurrent
- * srcu_read_unlock()s on other CPUs (see comment below). If there
- * was only two, we are guaranteed to have waited through only one
- * full index-flip phase. So we either (1) wait for three or
- * (2) supply the additional ones we need.
- */
-
- if (sp->completed == idx + 2)
- idx = 1;
- else if (sp->completed == idx + 3) {
- mutex_unlock(&sp->mutex);
- return;
- } else
- idx = 0;
-
- /*
* If there were no helpers, then we need to do two flips of
* the index. The first flip is required if there are any
* outstanding SRCU readers even if there are no new readers
--
1.7.4.4

2012-02-22 09:26:17

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Wed, 22 Feb 2012 10:41:59 +0800
Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

The algorithm and its smp_mb()s ensure that 'there is only one
srcu_read_lock() between flip and recheck for a cpu'.
Incrementing the upper bit for srcu_read_lock() only is therefore
enough to ensure that a single lock/unlock pair changes the cpu counter.

Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/srcu.h | 2 +-
kernel/srcu.c | 11 +++++------
2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index a478c8e..5b49d41 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -35,7 +35,7 @@ struct srcu_struct_array {
};

/* Bit definitions for field ->c above and ->snap below. */
-#define SRCU_USAGE_BITS 2
+#define SRCU_USAGE_BITS 1
#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index 17e95bc..a51ac48 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)

/*
* Now, we check the ->snap array that srcu_readers_active_idx()
- * filled in from the per-CPU counter values. Since both
- * __srcu_read_lock() and __srcu_read_unlock() increment the
- * upper bits of the per-CPU counter, an increment/decrement
- * pair will change the value of the counter. Since there is
+ * filled in from the per-CPU counter values. Since
+ * __srcu_read_lock() increments the upper bits of
+ * the per-CPU counter, an increment/decrement pair will
+ * change the value of the counter. Since there is
* only one possible increment, the only way to wrap the counter
* is to have a huge number of counter decrements, which requires
* a huge number of tasks and huge SRCU read-side critical-section
@@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
preempt_disable();
smp_mb(); /* C */ /* Avoid leaking the critical section. */
- ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
- SRCU_USAGE_COUNT - 1;
+ ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
preempt_enable();
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);
--
1.7.4.4

2012-02-22 09:26:15

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Wed, 22 Feb 2012 14:12:02 +0800
Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

flip_idx_and_wait() is not changed; it is merely split into two functions,
and only a short comment is added for smp_mb() E.

__synchronize_srcu() uses a different algorithm for "leaked" readers;
the details are in the comments of the patch.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++----------------------
1 files changed, 64 insertions(+), 41 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index a51ac48..346f9d7 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
*/
#define SYNCHRONIZE_SRCU_READER_DELAY 5

+static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
+{
+ int trycount = 0;
+
+ /*
+ * SRCU read-side critical sections are normally short, so wait
+ * a small amount of time before possibly blocking.
+ */
+ if (!srcu_readers_active_idx_check(sp, idx)) {
+ udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+ while (!srcu_readers_active_idx_check(sp, idx)) {
+ if (expedited && ++ trycount < 10)
+ udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+ else
+ schedule_timeout_interruptible(1);
+ }
+ }
+
+ /*
+ * The following smp_mb() E pairs with srcu_read_unlock()'s
+ * smp_mb C to ensure that if srcu_readers_active_idx_check()
+ * sees srcu_read_unlock()'s counter decrement, then any
+ * of the current task's subsequent code will happen after
+ * that SRCU read-side critical section.
+ *
+ * It also ensures the order between the above waiting and
+ * the next flipping.
+ */
+ smp_mb(); /* E */
+}
+
/*
* Flip the readers' index by incrementing ->completed, then wait
* until there are no more readers using the counters referenced by
@@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
* Of course, it is possible that a reader might be delayed for the
* full duration of flip_idx_and_wait() between fetching the
* index and incrementing its counter. This possibility is handled
- * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
+ * by the next __synchronize_srcu() invoking wait_idx() for such readers
+ * before starting a new grace period.
*/
static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
{
int idx;
- int trycount = 0;

idx = sp->completed++ & 0x1;

@@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
*/
smp_mb(); /* D */

- /*
- * SRCU read-side critical sections are normally short, so wait
- * a small amount of time before possibly blocking.
- */
- if (!srcu_readers_active_idx_check(sp, idx)) {
- udelay(SYNCHRONIZE_SRCU_READER_DELAY);
- while (!srcu_readers_active_idx_check(sp, idx)) {
- if (expedited && ++ trycount < 10)
- udelay(SYNCHRONIZE_SRCU_READER_DELAY);
- else
- schedule_timeout_interruptible(1);
- }
- }
-
- /*
- * The following smp_mb() E pairs with srcu_read_unlock()'s
- * smp_mb C to ensure that if srcu_readers_active_idx_check()
- * sees srcu_read_unlock()'s counter decrement, then any
- * of the current task's subsequent code will happen after
- * that SRCU read-side critical section.
- */
- smp_mb(); /* E */
+ wait_idx(sp, idx, expedited);
}

/*
@@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
*/
static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
{
- int idx = 0;
-
rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
!lock_is_held(&rcu_bh_lock_map) &&
!lock_is_held(&rcu_lock_map) &&
@@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
mutex_lock(&sp->mutex);

/*
- * If there were no helpers, then we need to do two flips of
- * the index. The first flip is required if there are any
- * outstanding SRCU readers even if there are no new readers
- * running concurrently with the first counter flip.
- *
- * The second flip is required when a new reader picks up
+ * When in the previous grace period, if a reader picks up
* the old value of the index, but does not increment its
* counter until after its counters is summed/rechecked by
- * srcu_readers_active_idx_check(). In this case, the current SRCU
+ * srcu_readers_active_idx_check(). In this case, the previous SRCU
* grace period would be OK because the SRCU read-side critical
- * section started after this SRCU grace period started, so the
+ * section started after the SRCU grace period started, so the
* grace period is not required to wait for the reader.
*
- * However, the next SRCU grace period would be waiting for the
- * other set of counters to go to zero, and therefore would not
- * wait for the reader, which would be very bad. To avoid this
- * bad scenario, we flip and wait twice, clearing out both sets
- * of counters.
+ * However, such leftover readers affect this new SRCU grace period,
+ * so we have to wait for them. This wait_idx() should be
+ * considered as the wait_idx() in the flip_idx_and_wait() of
+ * the previous grace period, except that it is for leftover readers
+ * started before this synchronize_srcu(). So when it returns,
+ * there are no leftover readers that started before this grace period.
+ *
+ * If there are some leftover readers that do not increment their
+ * counter until after it is summed/rechecked by
+ * srcu_readers_active_idx_check(), then this SRCU
+ * grace period would be OK, as the above comment says. We define
+ * such readers as leftover-leftover readers; we consider these
+ * readers to have fetched the index of (sp->completed + 1), which means
+ * they are considered exactly the same as the readers after this
+ * grace period.
+ *
+ * wait_idx() is expected to be very fast, because leftover readers
+ * are unlikely to be produced.
*/
- for (; idx < 2; idx++)
- flip_idx_and_wait(sp, expedited);
+ wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
+
+ /*
+ * Starts a new grace period, this flip is required if there are
+ * any outstanding SRCU readers even if there are no new readers
+ * running concurrently with the counter flip.
+ */
+ flip_idx_and_wait(sp, expedited);
+
mutex_unlock(&sp->mutex);
}

--
1.7.4.4

2012-02-22 09:51:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

On Wed, 2012-02-22 at 17:29 +0800, Lai Jiangshan wrote:
> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;

That just looks funny..

2012-02-22 21:21:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

On Wed, Feb 22, 2012 at 05:29:32PM +0800, Lai Jiangshan wrote:
> >From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <[email protected]>
> Date: Wed, 22 Feb 2012 10:41:59 +0800
> Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
>
> The algorithm/smp_mb()s ensure 'there is only one srcu_read_lock()
> between flip and recheck for a cpu'.
> Increment of the upper bit for srcu_read_lock() only can
> ensure a single pair of lock/unlock change the cpu counter.

Very nice! Also makes it more clear in that no combination of operations
including exactly one increment can possibly wrap back to the same value,
because the upper bit would be different.

In deference to Peter Zijlstra's sensibilities, I changed the:

ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;

to:

ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= 1;

I did manage to resist the temptation to instead say:

ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= +1;

;-)

Queued, thank you!

Thanx, Paul

> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> include/linux/srcu.h | 2 +-
> kernel/srcu.c | 11 +++++------
> 2 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index a478c8e..5b49d41 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -35,7 +35,7 @@ struct srcu_struct_array {
> };
>
> /* Bit definitions for field ->c above and ->snap below. */
> -#define SRCU_USAGE_BITS 2
> +#define SRCU_USAGE_BITS 1
> #define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> #define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
>
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index 17e95bc..a51ac48 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>
> /*
> * Now, we check the ->snap array that srcu_readers_active_idx()
> - * filled in from the per-CPU counter values. Since both
> - * __srcu_read_lock() and __srcu_read_unlock() increment the
> - * upper bits of the per-CPU counter, an increment/decrement
> - * pair will change the value of the counter. Since there is
> + * filled in from the per-CPU counter values. Since
> + * __srcu_read_lock() increments the upper bits of
> + * the per-CPU counter, an increment/decrement pair will
> + * change the value of the counter. Since there is
> * only one possible increment, the only way to wrap the counter
> * is to have a huge number of counter decrements, which requires
> * a huge number of tasks and huge SRCU read-side critical-section
> @@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> preempt_disable();
> smp_mb(); /* C */ /* Avoid leaking the critical section. */
> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> - SRCU_USAGE_COUNT - 1;
> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
> preempt_enable();
> }
> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> --
> 1.7.4.4
>

2012-02-22 21:27:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

On Wed, Feb 22, 2012 at 01:20:56PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 22, 2012 at 05:29:32PM +0800, Lai Jiangshan wrote:
> > >From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
> > From: Lai Jiangshan <[email protected]>
> > Date: Wed, 22 Feb 2012 10:41:59 +0800
> > Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
> >
> > The algorithm/smp_mb()s ensure 'there is only one srcu_read_lock()
> > between flip and recheck for a cpu'.
> > Increment of the upper bit for srcu_read_lock() only can
> > ensure a single pair of lock/unlock change the cpu counter.
>
> Very nice! Also makes is more clear in that no combination of operations
> including exactly one increment can possibly wrap back to the same value,
> because the upper bit would be different.

Make that without underflow -- one increment and 2^31+1 decrements would
in fact return the counter to its original value, but that would require
cramming more than two billion tasks into a 32-bit address space, which
I believe to be sufficiently unlikely. (Famous last words...)
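
Concretely, assuming a 32-bit unsigned long and the SRCU_USAGE_BITS == 1
layout from patch 2/3 (so SRCU_USAGE_COUNT == 0x80000000), the arithmetic
behind that remark, modulo 2^32, is:

        one srcu_read_lock():           c += SRCU_USAGE_COUNT + 1    (c + 0x80000001)
        2^31 + 1 srcu_read_unlock()s:   c -= 2^31 + 1                (c - 0x80000001)
        net change:                     0x80000001 - 0x80000001 == 0 (counter unchanged)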

Thanx, Paul

> In deference to Peter Zijlstra's sensibilities, I changed the:
>
> ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
>
> to:
>
> ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= 1;
>
> I did manage to resist the temptation to instead say:
>
> ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= +1;
>
> ;-)
>
> Queued, thank you!
>
> Thanx, Paul
>
> > Signed-off-by: Lai Jiangshan <[email protected]>
> > ---
> > include/linux/srcu.h | 2 +-
> > kernel/srcu.c | 11 +++++------
> > 2 files changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index a478c8e..5b49d41 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -35,7 +35,7 @@ struct srcu_struct_array {
> > };
> >
> > /* Bit definitions for field ->c above and ->snap below. */
> > -#define SRCU_USAGE_BITS 2
> > +#define SRCU_USAGE_BITS 1
> > #define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> > #define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
> >
> > diff --git a/kernel/srcu.c b/kernel/srcu.c
> > index 17e95bc..a51ac48 100644
> > --- a/kernel/srcu.c
> > +++ b/kernel/srcu.c
> > @@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >
> > /*
> > * Now, we check the ->snap array that srcu_readers_active_idx()
> > - * filled in from the per-CPU counter values. Since both
> > - * __srcu_read_lock() and __srcu_read_unlock() increment the
> > - * upper bits of the per-CPU counter, an increment/decrement
> > - * pair will change the value of the counter. Since there is
> > + * filled in from the per-CPU counter values. Since
> > + * __srcu_read_lock() increments the upper bits of
> > + * the per-CPU counter, an increment/decrement pair will
> > + * change the value of the counter. Since there is
> > * only one possible increment, the only way to wrap the counter
> > * is to have a huge number of counter decrements, which requires
> > * a huge number of tasks and huge SRCU read-side critical-section
> > @@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> > {
> > preempt_disable();
> > smp_mb(); /* C */ /* Avoid leaking the critical section. */
> > - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> > - SRCU_USAGE_COUNT - 1;
> > + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
> > preempt_enable();
> > }
> > EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> > --
> > 1.7.4.4
> >

2012-02-22 21:39:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

On Wed, 2012-02-22 at 13:26 -0800, Paul E. McKenney wrote:

> Make that without underflow -- one increment and 2^31+1 decrements would
> in fact return the counter to its original value, but that would require
> cramming more than two billion tasks into a 32-bit address space, which
> I believe to be sufficiently unlikely. (Famous last words...)

I'll just expect to see you as President of the United States, counting
your money you won in the lottery, and being awarded a Nobel Prize for
curing cancer.

-- Steve


2012-02-23 01:01:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

On Wed, Feb 22, 2012 at 05:29:36PM +0800, Lai Jiangshan wrote:
> >From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <[email protected]>
> Date: Wed, 22 Feb 2012 14:12:02 +0800
> Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
>
> flip_idx_and_wait() is not changed, and is split as two functions
> and only a short comments is added for smp_mb() E.
>
> __synchronize_srcu() use a different algorithm for "leak" readers.
> detail is in the comments of the patch.
>
> Signed-off-by: Lai Jiangshan <[email protected]>

And I queued this one as well, with some adjustment to the comments.

These are now available at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/srcu

Assuming testing goes well, these might go into 3.4.

Thanx, Paul

> ---
> kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++----------------------
> 1 files changed, 64 insertions(+), 41 deletions(-)
>
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index a51ac48..346f9d7 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> */
> #define SYNCHRONIZE_SRCU_READER_DELAY 5
>
> +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> +{
> + int trycount = 0;
> +
> + /*
> + * SRCU read-side critical sections are normally short, so wait
> + * a small amount of time before possibly blocking.
> + */
> + if (!srcu_readers_active_idx_check(sp, idx)) {
> + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> + while (!srcu_readers_active_idx_check(sp, idx)) {
> + if (expedited && ++ trycount < 10)
> + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> + else
> + schedule_timeout_interruptible(1);
> + }
> + }
> +
> + /*
> + * The following smp_mb() E pairs with srcu_read_unlock()'s
> + * smp_mb C to ensure that if srcu_readers_active_idx_check()
> + * sees srcu_read_unlock()'s counter decrement, then any
> + * of the current task's subsequent code will happen after
> + * that SRCU read-side critical section.
> + *
> + * It also ensures the order between the above waiting and
> + * the next flipping.
> + */
> + smp_mb(); /* E */
> +}
> +
> /*
> * Flip the readers' index by incrementing ->completed, then wait
> * until there are no more readers using the counters referenced by
> @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> * Of course, it is possible that a reader might be delayed for the
> * full duration of flip_idx_and_wait() between fetching the
> * index and incrementing its counter. This possibility is handled
> - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> + * before start a new grace perioad.
> */
> static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> {
> int idx;
> - int trycount = 0;
>
> idx = sp->completed++ & 0x1;
>
> @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> */
> smp_mb(); /* D */
>
> - /*
> - * SRCU read-side critical sections are normally short, so wait
> - * a small amount of time before possibly blocking.
> - */
> - if (!srcu_readers_active_idx_check(sp, idx)) {
> - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> - while (!srcu_readers_active_idx_check(sp, idx)) {
> - if (expedited && ++ trycount < 10)
> - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> - else
> - schedule_timeout_interruptible(1);
> - }
> - }
> -
> - /*
> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> - * sees srcu_read_unlock()'s counter decrement, then any
> - * of the current task's subsequent code will happen after
> - * that SRCU read-side critical section.
> - */
> - smp_mb(); /* E */
> + wait_idx(sp, idx, expedited);
> }
>
> /*
> @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> */
> static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> {
> - int idx = 0;
> -
> rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
> !lock_is_held(&rcu_bh_lock_map) &&
> !lock_is_held(&rcu_lock_map) &&
> @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> mutex_lock(&sp->mutex);
>
> /*
> - * If there were no helpers, then we need to do two flips of
> - * the index. The first flip is required if there are any
> - * outstanding SRCU readers even if there are no new readers
> - * running concurrently with the first counter flip.
> - *
> - * The second flip is required when a new reader picks up
> + * When in the previous grace perioad, if a reader picks up
> * the old value of the index, but does not increment its
> * counter until after its counters is summed/rechecked by
> - * srcu_readers_active_idx_check(). In this case, the current SRCU
> + * srcu_readers_active_idx_check(). In this case, the previous SRCU
> * grace period would be OK because the SRCU read-side critical
> - * section started after this SRCU grace period started, so the
> + * section started after the SRCU grace period started, so the
> * grace period is not required to wait for the reader.
> *
> - * However, the next SRCU grace period would be waiting for the
> - * other set of counters to go to zero, and therefore would not
> - * wait for the reader, which would be very bad. To avoid this
> - * bad scenario, we flip and wait twice, clearing out both sets
> - * of counters.
> + * However, such leftover readers affect this new SRCU grace period.
> + * So we have to wait for such readers. This wait_idx() should be
> + * considerred as the wait_idx() in the flip_idx_and_wait() of
> + * the previous grace perioad except that it is for leftover readers
> + * started before this synchronize_srcu(). So when it returns,
> + * there is no leftover readers that starts before this grace period.
> + *
> + * If there are some leftover readers that do not increment its
> + * counter until after its counters is summed/rechecked by
> + * srcu_readers_active_idx_check(), In this case, this SRCU
> + * grace period would be OK as above comments says. We defines
> + * such readers as leftover-leftover readers, we consider these
> + * readers fteched index of (sp->completed + 1), it means they
> + * are considerred as exactly the same as the readers after this
> + * grace period.
> + *
> + * wait_idx() is expected very fast, because leftover readers
> + * are unlikely produced.
> */
> - for (; idx < 2; idx++)
> - flip_idx_and_wait(sp, expedited);
> + wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> +
> + /*
> + * Starts a new grace period, this flip is required if there are
> + * any outstanding SRCU readers even if there are no new readers
> + * running concurrently with the counter flip.
> + */
> + flip_idx_and_wait(sp, expedited);
> +
> mutex_unlock(&sp->mutex);
> }
>
> --
> 1.7.4.4
>
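
As an aside, the index arithmetic in the patch above can be walked through mechanically. The toy program below (my own sketch, not part of the patch) prints which counter rank each step waits on across successive grace periods, showing that each __synchronize_srcu() covers both ranks while incrementing ->completed only once.

#include <stdio.h>

int main(void)
{
	unsigned completed = 0;
	int gp;

	for (gp = 0; gp < 3; gp++) {
		int leftover_idx = (completed - 1) & 0x1;
		int busy_idx;

		printf("GP %d: wait_idx(%d) for leftover readers\n",
		       gp, leftover_idx);
		busy_idx = completed++ & 0x1;	/* flip_idx_and_wait() */
		printf("GP %d: flip, then wait_idx(%d)\n", gp, busy_idx);
	}
	return 0;
}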

2012-02-23 01:02:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

On Wed, Feb 22, 2012 at 04:39:15PM -0500, Steven Rostedt wrote:
> On Wed, 2012-02-22 at 13:26 -0800, Paul E. McKenney wrote:
>
> > Make that without underflow -- one increment and 2^31+1 decrements would
> > in fact return the counter to its original value, but that would require
> > cramming more than two billion tasks into a 32-bit address space, which
> > I believe to be sufficiently unlikely. (Famous last words...)
>
> I'll just expect to see you as President of the United States, counting
> your money you won in the lottery, and being awarded a Nobel Prize for
> curing cancer.

Those possibilities also seem to me to be sufficiently unlikely. ;-)

Thanx, Paul
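
Paul's 2^31+1 figure can be checked mechanically. A small sketch, assuming a 32-bit counter with the SRCU_USAGE_BITS=1 layout from the earlier patch (lock adds SRCU_USAGE_COUNT + 1, unlock subtracts 1):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SRCU_USAGE_BITS  1
#define SRCU_REF_MASK    (UINT32_MAX >> SRCU_USAGE_BITS)
#define SRCU_USAGE_COUNT ((uint32_t)(SRCU_REF_MASK + 1))

int main(void)
{
	/* Each unlock subtracts 1 (mod 2^32), so exactly this many bare
	 * decrements undo the effect of a single lock-side increment. */
	uint32_t lock_delta = SRCU_USAGE_COUNT + 1;

	printf("decrements needed to undo one lock: %" PRIu32 "\n", lock_delta);
	printf("2^31 + 1                          : %" PRIu32 "\n",
	       (uint32_t)((UINT32_C(1) << 31) + 1));
	return 0;
}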

2012-02-24 08:01:38

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

On 02/22/2012 05:29 PM, Lai Jiangshan wrote:
>>From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <[email protected]>
> Date: Wed, 22 Feb 2012 14:12:02 +0800
> Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
>
> flip_idx_and_wait() is not changed, and is split as two functions
> and only a short comments is added for smp_mb() E.
>
> __synchronize_srcu() use a different algorithm for "leak" readers.
> detail is in the comments of the patch.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++----------------------
> 1 files changed, 64 insertions(+), 41 deletions(-)
>
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index a51ac48..346f9d7 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> */
> #define SYNCHRONIZE_SRCU_READER_DELAY 5
>
> +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> +{
> + int trycount = 0;

Hi, Paul

smp_mb() D also needs to be moved here -- could you fix that before pushing it?
I had assumed in my mind that the smp_mb() was always here; that assumption was wrong.

Sorry.

Thanks,
Lai

> +
> + /*
> + * SRCU read-side critical sections are normally short, so wait
> + * a small amount of time before possibly blocking.
> + */
> + if (!srcu_readers_active_idx_check(sp, idx)) {
> + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> + while (!srcu_readers_active_idx_check(sp, idx)) {
> + if (expedited && ++ trycount < 10)
> + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> + else
> + schedule_timeout_interruptible(1);
> + }
> + }
> +
> + /*
> + * The following smp_mb() E pairs with srcu_read_unlock()'s
> + * smp_mb C to ensure that if srcu_readers_active_idx_check()
> + * sees srcu_read_unlock()'s counter decrement, then any
> + * of the current task's subsequent code will happen after
> + * that SRCU read-side critical section.
> + *
> + * It also ensures the order between the above waiting and
> + * the next flipping.
> + */
> + smp_mb(); /* E */
> +}
> +
> /*
> * Flip the readers' index by incrementing ->completed, then wait
> * until there are no more readers using the counters referenced by
> @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> * Of course, it is possible that a reader might be delayed for the
> * full duration of flip_idx_and_wait() between fetching the
> * index and incrementing its counter. This possibility is handled
> - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> + * before start a new grace perioad.
> */
> static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> {
> int idx;
> - int trycount = 0;
>
> idx = sp->completed++ & 0x1;
>
> @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> */
> smp_mb(); /* D */
>
> - /*
> - * SRCU read-side critical sections are normally short, so wait
> - * a small amount of time before possibly blocking.
> - */
> - if (!srcu_readers_active_idx_check(sp, idx)) {
> - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> - while (!srcu_readers_active_idx_check(sp, idx)) {
> - if (expedited && ++ trycount < 10)
> - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> - else
> - schedule_timeout_interruptible(1);
> - }
> - }
> -
> - /*
> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> - * sees srcu_read_unlock()'s counter decrement, then any
> - * of the current task's subsequent code will happen after
> - * that SRCU read-side critical section.
> - */
> - smp_mb(); /* E */
> + wait_idx(sp, idx, expedited);
> }
>
> /*
> @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> */
> static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> {
> - int idx = 0;
> -
> rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
> !lock_is_held(&rcu_bh_lock_map) &&
> !lock_is_held(&rcu_lock_map) &&
> @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> mutex_lock(&sp->mutex);
>
> /*
> - * If there were no helpers, then we need to do two flips of
> - * the index. The first flip is required if there are any
> - * outstanding SRCU readers even if there are no new readers
> - * running concurrently with the first counter flip.
> - *
> - * The second flip is required when a new reader picks up
> + * When in the previous grace perioad, if a reader picks up
> * the old value of the index, but does not increment its
> * counter until after its counters is summed/rechecked by
> - * srcu_readers_active_idx_check(). In this case, the current SRCU
> + * srcu_readers_active_idx_check(). In this case, the previous SRCU
> * grace period would be OK because the SRCU read-side critical
> - * section started after this SRCU grace period started, so the
> + * section started after the SRCU grace period started, so the
> * grace period is not required to wait for the reader.
> *
> - * However, the next SRCU grace period would be waiting for the
> - * other set of counters to go to zero, and therefore would not
> - * wait for the reader, which would be very bad. To avoid this
> - * bad scenario, we flip and wait twice, clearing out both sets
> - * of counters.
> + * However, such leftover readers affect this new SRCU grace period.
> + * So we have to wait for such readers. This wait_idx() should be
> + * considerred as the wait_idx() in the flip_idx_and_wait() of
> + * the previous grace perioad except that it is for leftover readers
> + * started before this synchronize_srcu(). So when it returns,
> + * there is no leftover readers that starts before this grace period.
> + *
> + * If there are some leftover readers that do not increment its
> + * counter until after its counters is summed/rechecked by
> + * srcu_readers_active_idx_check(), In this case, this SRCU
> + * grace period would be OK as above comments says. We defines
> + * such readers as leftover-leftover readers, we consider these
> + * readers fteched index of (sp->completed + 1), it means they
> + * are considerred as exactly the same as the readers after this
> + * grace period.
> + *
> + * wait_idx() is expected very fast, because leftover readers
> + * are unlikely produced.
> */
> - for (; idx < 2; idx++)
> - flip_idx_and_wait(sp, expedited);
> + wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> +
> + /*
> + * Starts a new grace period, this flip is required if there are
> + * any outstanding SRCU readers even if there are no new readers
> + * running concurrently with the counter flip.
> + */
> + flip_idx_and_wait(sp, expedited);
> +
> mutex_unlock(&sp->mutex);
> }
>

2012-02-24 20:03:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

On Fri, Feb 24, 2012 at 04:06:01PM +0800, Lai Jiangshan wrote:
> On 02/22/2012 05:29 PM, Lai Jiangshan wrote:
> >>From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> > From: Lai Jiangshan <[email protected]>
> > Date: Wed, 22 Feb 2012 14:12:02 +0800
> > Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
> >
> > flip_idx_and_wait() is not changed, and is split as two functions
> > and only a short comments is added for smp_mb() E.
> >
> > __synchronize_srcu() use a different algorithm for "leak" readers.
> > detail is in the comments of the patch.
> >
> > Signed-off-by: Lai Jiangshan <[email protected]>
> > ---
> > kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++----------------------
> > 1 files changed, 64 insertions(+), 41 deletions(-)
> >
> > diff --git a/kernel/srcu.c b/kernel/srcu.c
> > index a51ac48..346f9d7 100644
> > --- a/kernel/srcu.c
> > +++ b/kernel/srcu.c
> > @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> > */
> > #define SYNCHRONIZE_SRCU_READER_DELAY 5
> >
> > +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> > +{
> > + int trycount = 0;
>
> Hi, Paul
>
> smp_mb() D also needs to be moved here -- could you fix that before pushing it?
> I had assumed in my mind that the smp_mb() was always here; that assumption was wrong.

Good catch -- I should have seen this myself. I committed this and pushed
it to:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/srcu

> Sorry.

Not a problem, though it does point out the need for review and testing.
So I am feeling a bit nervous about pushing this into 3.4, and am
beginning to think that it needs mechanical proof as well as more testing.

Thoughts?

Thanx, Paul

> Thanks,
> Lai
>
> > +
> > + /*
> > + * SRCU read-side critical sections are normally short, so wait
> > + * a small amount of time before possibly blocking.
> > + */
> > + if (!srcu_readers_active_idx_check(sp, idx)) {
> > + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > + while (!srcu_readers_active_idx_check(sp, idx)) {
> > + if (expedited && ++ trycount < 10)
> > + udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > + else
> > + schedule_timeout_interruptible(1);
> > + }
> > + }
> > +
> > + /*
> > + * The following smp_mb() E pairs with srcu_read_unlock()'s
> > + * smp_mb C to ensure that if srcu_readers_active_idx_check()
> > + * sees srcu_read_unlock()'s counter decrement, then any
> > + * of the current task's subsequent code will happen after
> > + * that SRCU read-side critical section.
> > + *
> > + * It also ensures the order between the above waiting and
> > + * the next flipping.
> > + */
> > + smp_mb(); /* E */
> > +}
> > +
> > /*
> > * Flip the readers' index by incrementing ->completed, then wait
> > * until there are no more readers using the counters referenced by
> > @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> > * Of course, it is possible that a reader might be delayed for the
> > * full duration of flip_idx_and_wait() between fetching the
> > * index and incrementing its counter. This possibility is handled
> > - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> > + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> > + * before start a new grace perioad.
> > */
> > static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> > {
> > int idx;
> > - int trycount = 0;
> >
> > idx = sp->completed++ & 0x1;
> >
> > @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> > */
> > smp_mb(); /* D */
> >
> > - /*
> > - * SRCU read-side critical sections are normally short, so wait
> > - * a small amount of time before possibly blocking.
> > - */
> > - if (!srcu_readers_active_idx_check(sp, idx)) {
> > - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > - while (!srcu_readers_active_idx_check(sp, idx)) {
> > - if (expedited && ++ trycount < 10)
> > - udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > - else
> > - schedule_timeout_interruptible(1);
> > - }
> > - }
> > -
> > - /*
> > - * The following smp_mb() E pairs with srcu_read_unlock()'s
> > - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> > - * sees srcu_read_unlock()'s counter decrement, then any
> > - * of the current task's subsequent code will happen after
> > - * that SRCU read-side critical section.
> > - */
> > - smp_mb(); /* E */
> > + wait_idx(sp, idx, expedited);
> > }
> >
> > /*
> > @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> > */
> > static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> > {
> > - int idx = 0;
> > -
> > rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
> > !lock_is_held(&rcu_bh_lock_map) &&
> > !lock_is_held(&rcu_lock_map) &&
> > @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> > mutex_lock(&sp->mutex);
> >
> > /*
> > - * If there were no helpers, then we need to do two flips of
> > - * the index. The first flip is required if there are any
> > - * outstanding SRCU readers even if there are no new readers
> > - * running concurrently with the first counter flip.
> > - *
> > - * The second flip is required when a new reader picks up
> > + * When in the previous grace perioad, if a reader picks up
> > * the old value of the index, but does not increment its
> > * counter until after its counters is summed/rechecked by
> > - * srcu_readers_active_idx_check(). In this case, the current SRCU
> > + * srcu_readers_active_idx_check(). In this case, the previous SRCU
> > * grace period would be OK because the SRCU read-side critical
> > - * section started after this SRCU grace period started, so the
> > + * section started after the SRCU grace period started, so the
> > * grace period is not required to wait for the reader.
> > *
> > - * However, the next SRCU grace period would be waiting for the
> > - * other set of counters to go to zero, and therefore would not
> > - * wait for the reader, which would be very bad. To avoid this
> > - * bad scenario, we flip and wait twice, clearing out both sets
> > - * of counters.
> > + * However, such leftover readers affect this new SRCU grace period.
> > + * So we have to wait for such readers. This wait_idx() should be
> > + * considerred as the wait_idx() in the flip_idx_and_wait() of
> > + * the previous grace perioad except that it is for leftover readers
> > + * started before this synchronize_srcu(). So when it returns,
> > + * there is no leftover readers that starts before this grace period.
> > + *
> > + * If there are some leftover readers that do not increment its
> > + * counter until after its counters is summed/rechecked by
> > + * srcu_readers_active_idx_check(), In this case, this SRCU
> > + * grace period would be OK as above comments says. We defines
> > + * such readers as leftover-leftover readers, we consider these
> > + * readers fteched index of (sp->completed + 1), it means they
> > + * are considerred as exactly the same as the readers after this
> > + * grace period.
> > + *
> > + * wait_idx() is expected very fast, because leftover readers
> > + * are unlikely produced.
> > */
> > - for (; idx < 2; idx++)
> > - flip_idx_and_wait(sp, expedited);
> > + wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> > +
> > + /*
> > + * Starts a new grace period, this flip is required if there are
> > + * any outstanding SRCU readers even if there are no new readers
> > + * running concurrently with the counter flip.
> > + */
> > + flip_idx_and_wait(sp, expedited);
> > +
> > mutex_unlock(&sp->mutex);
> > }
> >
>
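
For reference, here is a condensed user-space model of the shape wait_idx() takes once smp_mb() D is hoisted into it, as requested above. This is my own sketch with stubbed-out helpers (smp_mb() and srcu_readers_active_idx_check() below are stand-ins, not the kernel implementations):

#include <stdbool.h>
#include <stdio.h>

static void smp_mb(void)
{
	__sync_synchronize();	/* stand-in for the kernel barrier */
}

static bool srcu_readers_active_idx_check(int idx)
{
	(void)idx;
	return true;		/* stub: pretend no readers remain on idx */
}

static void wait_idx(int idx)
{
	smp_mb();	/* D: pairs with smp_mb() B in __srcu_read_lock() */

	while (!srcu_readers_active_idx_check(idx))
		;	/* in the kernel: udelay()/schedule_timeout() */

	smp_mb();	/* E: pairs with smp_mb() C in __srcu_read_unlock() */
}

int main(void)
{
	wait_idx(0);
	wait_idx(1);
	printf("both index ranks drained\n");
	return 0;
}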

2012-02-27 07:56:35

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm

Hi, all

The call_srcu() series will be sent soon (perhaps within two days). While
implementing it I found a few things in the current SRCU that are not quite
right, so these patches do some preparation for it.

The second patch is inspired by Peter. I had decided to use a per-CPU state
machine, but the snap array made me unhappy: if one task is sleeping or
preempted while checking, no other task can check the same srcu_struct.
That is nothing big, but it also blocks synchronize_srcu_expedited().
I would like synchronize_srcu_expedited() not to be blocked when it tries to
do its fast checking, so I looked for a non-blocking checking algorithm and
found Peter's.

Most of the content of these two patches is comments, so I am afraid I bring
Paul a lot of trouble with my poor English.

Thanks,
Lai

>From 77af819872ddab065d3a46758471b80f31b30e5e Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Mon, 27 Feb 2012 10:52:00 +0800
Subject: [PATCH 1/2] srcu: change the comments of the wait algorithm

The original comments do not describe the essence of the wait algorithm well.

The safety of srcu-protected data and of SRCU critical sections is provided
by wait_idx(), not by the flipping.

The two indexes of the active counter array and the flipping are used only to
keep wait_idx() from being starved.
(The flip also guarantees "at most one srcu_read_lock() per CPU after the
flip"; this coupling will be removed in the future (next patch).)

For call_srcu() the code will be split into pieces across the state-machine
states, and it is very hard to preserve exactly the semantics described by
the original comments. So I had to work out exactly what the algorithm
guarantees, and I changed the comments accordingly.

The code is not changed, but it is refactored a little.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/srcu.c | 75 +++++++++++++++++++++++++++++----------------------------
1 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index b6b9ea2..47ee35d 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -249,6 +249,10 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
*/
#define SYNCHRONIZE_SRCU_READER_DELAY 5

+/*
+ * Wait until all pre-existing readers of the specified idx (those
+ * that started before this wait_idx()) have completed.
+ */
static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
{
int trycount = 0;
@@ -291,24 +295,9 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
smp_mb(); /* E */
}

-/*
- * Flip the readers' index by incrementing ->completed, then wait
- * until there are no more readers using the counters referenced by
- * the old index value. (Recall that the index is the bottom bit
- * of ->completed.)
- *
- * Of course, it is possible that a reader might be delayed for the
- * full duration of flip_idx_and_wait() between fetching the
- * index and incrementing its counter. This possibility is handled
- * by the next __synchronize_srcu() invoking wait_idx() for such readers
- * before starting a new grace period.
- */
-static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+static void srcu_flip(struct srcu_struct *sp)
{
- int idx;
-
- idx = sp->completed++ & 0x1;
- wait_idx(sp, idx, expedited);
+ sp->completed++;
}

/*
@@ -316,6 +305,8 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
*/
static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
{
+ int busy_idx;
+
rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
!lock_is_held(&rcu_bh_lock_map) &&
!lock_is_held(&rcu_lock_map) &&
@@ -323,8 +314,31 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
"Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");

mutex_lock(&sp->mutex);
+ busy_idx = sp->completed & 0X1UL;

/*
+ * Some readers start with idx=0 and others start with idx=1, so
+ * two wait_idx() calls are enough to synchronize:
+ * __synchronize_srcu() {
+ * wait_idx(sp, 0, expedited);
+ * wait_idx(sp, 1, expedited);
+ * }
+ * When this returns, all readers that had started have completed.
+ *
+ * But the synchronizer may be starved by the readers. For example,
+ * if sp->completed & 0x1 == 1, wait_idx(sp, 1, expedited)
+ * may never return if readers continuously start
+ * with idx=1.
+ *
+ * So we need to flip the busy index to keep the synchronizer
+ * from being starved.
+ */
+
+ /*
+ * The above comment assumes that there are readers using both
+ * index values. It is indeed possible that some readers still
+ * hold the read lock with idx = 1 - busy_idx:
+ *
* Suppose that during the previous grace period, a reader
* picked up the old value of the index, but did not increment
* its counter until after the previous instance of
@@ -333,31 +347,18 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
* not start until after the grace period started, so the grace
* period was not obligated to wait for that reader.
*
- * However, the current SRCU grace period does have to wait for
- * that reader. This is handled by invoking wait_idx() on the
- * non-active set of counters (hence sp->completed - 1). Once
- * wait_idx() returns, we know that all readers that picked up
- * the old value of ->completed and that already incremented their
- * counter will have completed.
- *
- * But what about readers that picked up the old value of
- * ->completed, but -still- have not managed to increment their
- * counter? We do not need to wait for those readers, because
- * they will have started their SRCU read-side critical section
- * after the current grace period starts.
- *
- * Because it is unlikely that readers will be preempted between
- * fetching ->completed and incrementing their counter, wait_idx()
+ * Because this probability is not high, wait_idx()
* will normally not need to wait.
*/
- wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
+ wait_idx(sp, 1 - busy_idx, expedited);
+
+ /* flip the index to ensure the return of the next wait_idx() */
+ srcu_flip(sp);

/*
- * Now that wait_idx() has waited for the really old readers,
- * invoke flip_idx_and_wait() to flip the counter and wait
- * for current SRCU readers.
+ * Now wait for the readers that are still using the old (busy) index.
*/
- flip_idx_and_wait(sp, expedited);
+ wait_idx(sp, busy_idx, expedited);

mutex_unlock(&sp->mutex);
}
--
1.7.4.4
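
The starvation argument in the new comment can be made concrete with a toy, single-threaded simulation (my own sketch, assuming one counter per index and readers that always arrive on the currently active index): without a flip the watched counter never drains, while after srcu_flip() new arrivals land on the other index and the watched counter can only go down.

#include <stdio.h>

int main(void)
{
	unsigned completed = 1;			/* busy index is 1 */
	long c[2] = { 0, 3 };			/* three readers on idx 1 */
	int step;

	for (step = 0; step < 4; step++) {	/* no flip: sum stays stuck */
		c[completed & 1]++;		/* new reader arrives */
		c[1]--;				/* one old reader finishes */
		printf("no flip, step %d: c[1] = %ld\n", step, c[1]);
	}

	completed++;				/* srcu_flip() */
	while (c[1] > 0) {
		c[completed & 1]++;		/* new readers now use idx 0 */
		c[1]--;				/* old idx-1 readers drain */
	}
	printf("after flip: c[1] = %ld, c[0] = %ld\n", c[1], c[0]);
	return 0;
}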

2012-02-27 07:56:37

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

>From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <[email protected]>
Date: Mon, 27 Feb 2012 14:22:47 +0800
Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm

This patch implements the algorithm from Peter:
https://lkml.org/lkml/2012/2/1/119

o Make the checking lock-free so that checks can run in parallel.
Parallel checking is rarely useful by itself, but we need it when
1) the original checking task is preempted for a long time, 2)
synchronize_srcu_expedited() is used, 3) we want to avoid a lock (see next).

o Since the checking is lock-free, we save a mutex in the state machine
for call_srcu().

o Remove SRCU_REF_MASK and remove the coupling with the flipping.
(This lets us remove the preempt_disable() in the future and use
__this_cpu_inc() instead.)

o Remove one smp_mb(), simplify the comments, and make the smp_mb()
pairings more intuitive.

Inspired-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/srcu.h | 7 +--
kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
2 files changed, 57 insertions(+), 87 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 5b49d41..15354db 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -32,18 +32,13 @@

struct srcu_struct_array {
unsigned long c[2];
+ unsigned long seq[2];
};

-/* Bit definitions for field ->c above and ->snap below. */
-#define SRCU_USAGE_BITS 1
-#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
-#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
-
struct srcu_struct {
unsigned completed;
struct srcu_struct_array __percpu *per_cpu_ref;
struct mutex mutex;
- unsigned long snap[NR_CPUS];
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 47ee35d..376b583 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */

/*
+ * Returns approximate total sequence of readers on the specified rank
+ * of per-CPU counters.
+ */
+static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
+{
+ int cpu;
+ unsigned long sum = 0;
+ unsigned long t;
+
+ for_each_possible_cpu(cpu) {
+ t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
+ sum += t;
+ }
+ return sum;
+}
+
+/*
* Returns approximate number of readers active on the specified rank
- * of per-CPU counters. Also snapshots each counter's value in the
- * corresponding element of sp->snap[] for later use validating
- * the sum.
+ * of per-CPU counters.
*/
static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
@@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
for_each_possible_cpu(cpu) {
t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
sum += t;
- sp->snap[cpu] = t;
}
- return sum & SRCU_REF_MASK;
+ return sum;
}

-/*
- * To be called from the update side after an index flip. Returns true
- * if the modulo sum of the counters is stably zero, false if there is
- * some possibility of non-zero.
- */
static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
{
int cpu;
+ unsigned long seq;
+
+ seq = srcu_readers_seq_idx(sp, idx);
+
+ /*
+ * smp_mb() A pairs with smp_mb() B for the critical section.
+ * It ensures that an SRCU read-side critical section whose
+ * read-lock is not seen by the following srcu_readers_active_idx()
+ * will see any updates that the current task performed beforehand.
+ * (So we do not need to care about such readers this time.)
+ *
+ * Also, if we see the increment of the seq, we must see the
+ * increment of the active counter in the following
+ * srcu_readers_active_idx().
+ */
+ smp_mb(); /* A */

/*
* Note that srcu_readers_active_idx() can incorrectly return
* zero even though there is a pre-existing reader throughout.
* To see this, suppose that task A is in a very long SRCU
* read-side critical section that started on CPU 0, and that
- * no other reader exists, so that the modulo sum of the counters
+ * no other reader exists, so that the sum of the counters
* is equal to one. Then suppose that task B starts executing
* srcu_readers_active_idx(), summing up to CPU 1, and then that
* task C starts reading on CPU 0, so that its increment is not
@@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
return false;

/*
- * Since the caller recently flipped ->completed, we can see at
- * most one increment of each CPU's counter from this point
- * forward. The reason for this is that the reader CPU must have
- * fetched the index before srcu_readers_active_idx checked
- * that CPU's counter, but not yet incremented its counter.
- * Its eventual counter increment will follow the read in
- * srcu_readers_active_idx(), and that increment is immediately
- * followed by smp_mb() B. Because smp_mb() D is between
- * the ->completed flip and srcu_readers_active_idx()'s read,
- * that CPU's subsequent load of ->completed must see the new
- * value, and therefore increment the counter in the other rank.
- */
- smp_mb(); /* A */
-
- /*
- * Now, we check the ->snap array that srcu_readers_active_idx()
- * filled in from the per-CPU counter values. Since
- * __srcu_read_lock() increments the upper bits of the per-CPU
- * counter, an increment/decrement pair will change the value
- * of the counter. Since there is only one possible increment,
- * the only way to wrap the counter is to have a huge number of
- * counter decrements, which requires a huge number of tasks and
- * huge SRCU read-side critical-section nesting levels, even on
- * 32-bit systems.
- *
- * All of the ways of confusing the readings require that the scan
- * in srcu_readers_active_idx() see the read-side task's decrement,
- * but not its increment. However, between that decrement and
- * increment are smb_mb() B and C. Either or both of these pair
- * with smp_mb() A above to ensure that the scan below will see
- * the read-side tasks's increment, thus noting a difference in
- * the counter values between the two passes.
+ * Validation step, smp_mb() D pairs with smp_mb() C. If the above
+ * srcu_readers_active_idx() see a decrement of the active counter
+ * in srcu_read_unlock(), it should see one of these for corresponding
+ * srcu_read_lock():
+ * See the increment of the active counter,
+ * Failed to see the increment of the active counter.
+ * The second one can cause srcu_readers_active_idx() incorrectly
+ * return zero, but it means the above srcu_readers_seq_idx() does not
+ * see the increment of the seq(ref: comments of smp_mb() A),
+ * and the following srcu_readers_seq_idx() sees the increment of
+ * the seq. The seq is changed.
*
- * Therefore, if srcu_readers_active_idx() returned zero, and
- * none of the counters changed, we know that the zero was the
- * correct sum.
- *
- * Of course, it is possible that a task might be delayed
- * for a very long time in __srcu_read_lock() after fetching
- * the index but before incrementing its counter. This
- * possibility will be dealt with in __synchronize_srcu().
+ * This smp_mb() D pairs with smp_mb() C for critical section.
+ * then any of the current task's subsequent code will happen after
+ * that SRCU read-side critical section whose read-unlock is seen in
+ * srcu_readers_active_idx().
*/
- for_each_possible_cpu(cpu)
- if (sp->snap[cpu] !=
- ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
- return false; /* False zero reading! */
- return true;
+ smp_mb(); /* D */
+
+ return srcu_readers_seq_idx(sp, idx) == seq;
}

/**
@@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
preempt_disable();
idx = rcu_dereference_index_check(sp->completed,
rcu_read_lock_sched_held()) & 0x1;
- ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
- SRCU_USAGE_COUNT + 1;
+ ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
smp_mb(); /* B */ /* Avoid leaking the critical section. */
+ ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
preempt_enable();
return idx;
}
@@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
int trycount = 0;

/*
- * If a reader fetches the index before the ->completed increment,
- * but increments its counter after srcu_readers_active_idx_check()
- * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
- * smp_mb() B to ensure that the SRCU read-side critical section
- * will see any updates that the current task performed before its
- * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
- * as the case may be.
- */
- smp_mb(); /* D */
-
- /*
* SRCU read-side critical sections are normally short, so wait
* a small amount of time before possibly blocking.
*/
@@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
schedule_timeout_interruptible(1);
}
}
-
- /*
- * The following smp_mb() E pairs with srcu_read_unlock()'s
- * smp_mb C to ensure that if srcu_readers_active_idx_check()
- * sees srcu_read_unlock()'s counter decrement, then any
- * of the current task's subsequent code will happen after
- * that SRCU read-side critical section.
- *
- * It also ensures the order between the above waiting and
- * the next flipping.
- */
- smp_mb(); /* E */
}

static void srcu_flip(struct srcu_struct *sp)
--
1.7.4.4
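
To see the shape of the new check in isolation, here is a simplified, single-threaded user-space model (my own sketch; names such as NR_FAKE_CPUS and readers_idle() are illustrative, and the real code also depends on the smp_mb() A/D pairings noted in the comments above): sum the per-CPU seq counters, sum the active counters, and declare the index idle only if the active sum is zero and a second seq sum matches the first.

#include <stdbool.h>
#include <stdio.h>

#define NR_FAKE_CPUS 4

struct fake_cpu {
	unsigned long c[2];	/* active-reader counts, as in ->c[] */
	unsigned long seq[2];	/* lock-event counts, as in ->seq[] */
};

static struct fake_cpu cpus[NR_FAKE_CPUS];

static unsigned long sum_seq(int idx)
{
	unsigned long sum = 0;
	for (int i = 0; i < NR_FAKE_CPUS; i++)
		sum += cpus[i].seq[idx];
	return sum;
}

static unsigned long sum_active(int idx)
{
	unsigned long sum = 0;
	for (int i = 0; i < NR_FAKE_CPUS; i++)
		sum += cpus[i].c[idx];
	return sum;
}

static bool readers_idle(int idx)
{
	unsigned long seq = sum_seq(idx);
	/* smp_mb() A would go here in the kernel */
	if (sum_active(idx) != 0)
		return false;
	/* smp_mb() D would go here in the kernel */
	return sum_seq(idx) == seq;
}

int main(void)
{
	/* One reader enters on CPU 2, index 0 (lock bumps c, then seq). */
	cpus[2].c[0]++;
	cpus[2].seq[0]++;
	printf("reader inside: idle(0) = %d\n", readers_idle(0));

	/* The reader leaves (unlock only decrements c). */
	cpus[2].c[0]--;
	printf("reader gone:   idle(0) = %d\n", readers_idle(0));
	return 0;
}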

2012-02-27 18:31:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <[email protected]>
> Date: Mon, 27 Feb 2012 14:22:47 +0800
> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>
> This patch implements the algorithm from Peter:
> https://lkml.org/lkml/2012/2/1/119
>
> o Make the checking lock-free so that checks can run in parallel.
> Parallel checking is rarely useful by itself, but we need it when
> 1) the original checking task is preempted for a long time, 2)
> synchronize_srcu_expedited() is used, 3) we want to avoid a lock (see next).
>
> o Since the checking is lock-free, we save a mutex in the state machine
> for call_srcu().
>
> o Remove SRCU_REF_MASK and remove the coupling with the flipping.
> (This lets us remove the preempt_disable() in the future and use
> __this_cpu_inc() instead.)
>
> o Remove one smp_mb(), simplify the comments, and make the smp_mb()
> pairings more intuitive.

Hello, Lai,

Interesting approach!

What happens given the following sequence of events?

o CPU 0 in srcu_readers_active_idx_check() invokes
srcu_readers_seq_idx(), getting some number back.

o CPU 0 invokes srcu_readers_active_idx(), summing the
->c[] array up through CPU 3.

o CPU 1 invokes __srcu_read_lock(), and increments its counter
but not yet its ->seq[] element.

o CPU 0 completes its summing of the ->c[] array, incorrectly
obtaining zero.

o CPU 0 invokes srcu_readers_seq_idx(), getting the same
number back that it got last time.

o In parallel with the previous step, CPU 1 executes out of order
(as permitted by the lack of a second memory barrier in
__srcu_read_lock()), starting up the critical section before
incrementing its ->seq[] element.

o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
completes the SRCU grace period before CPU 1 completes its
SRCU read-side critical section.

This actually might be safe, but I need to think more about it. In the
meantime, I figured I should ask your thoughts.

Thanx, Paul
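
The interleaving described above can be replayed step by step in a scripted, single-threaded sketch of my own (same fake per-CPU layout as in the model earlier; not an actual concurrency test). The active sum comes out zero and the two seq sums match, so the check passes even though CPU 1's critical section has begun, which is exactly the question raised; Lai's reply below explains why such a reader is treated as having started after the check.

#include <stdio.h>

#define NR_FAKE_CPUS 4

static unsigned long c[NR_FAKE_CPUS];
static unsigned long seq[NR_FAKE_CPUS];

static unsigned long sum(const unsigned long *a)
{
	unsigned long s = 0;
	for (int i = 0; i < NR_FAKE_CPUS; i++)
		s += a[i];
	return s;
}

int main(void)
{
	unsigned long seq_before, active = 0, seq_after;

	seq_before = sum(seq);		/* CPU 0: first srcu_readers_seq_idx() */

	for (int i = 0; i < NR_FAKE_CPUS; i++) {
		if (i == 2)
			c[1]++;		/* CPU 1: __srcu_read_lock() bumps ->c
					 * only, after its slot was already
					 * read and before it bumps ->seq */
		active += c[i];		/* CPU 0: srcu_readers_active_idx() */
	}

	seq_after = sum(seq);		/* CPU 0: second srcu_readers_seq_idx() */

	printf("active sum = %lu, seq %lu -> %lu, check %s\n",
	       active, seq_before, seq_after,
	       (active == 0 && seq_before == seq_after) ? "passes" : "fails");
	return 0;
}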

> Inspired-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> include/linux/srcu.h | 7 +--
> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
> 2 files changed, 57 insertions(+), 87 deletions(-)
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 5b49d41..15354db 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -32,18 +32,13 @@
>
> struct srcu_struct_array {
> unsigned long c[2];
> + unsigned long seq[2];
> };
>
> -/* Bit definitions for field ->c above and ->snap below. */
> -#define SRCU_USAGE_BITS 1
> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
> -
> struct srcu_struct {
> unsigned completed;
> struct srcu_struct_array __percpu *per_cpu_ref;
> struct mutex mutex;
> - unsigned long snap[NR_CPUS];
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> struct lockdep_map dep_map;
> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index 47ee35d..376b583 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> /*
> + * Returns approximate total sequence of readers on the specified rank
> + * of per-CPU counters.
> + */
> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> +{
> + int cpu;
> + unsigned long sum = 0;
> + unsigned long t;
> +
> + for_each_possible_cpu(cpu) {
> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> + sum += t;
> + }
> + return sum;
> +}
> +
> +/*
> * Returns approximate number of readers active on the specified rank
> - * of per-CPU counters. Also snapshots each counter's value in the
> - * corresponding element of sp->snap[] for later use validating
> - * the sum.
> + * of per-CPU counters.
> */
> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> {
> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> for_each_possible_cpu(cpu) {
> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> sum += t;
> - sp->snap[cpu] = t;
> }
> - return sum & SRCU_REF_MASK;
> + return sum;
> }
>
> -/*
> - * To be called from the update side after an index flip. Returns true
> - * if the modulo sum of the counters is stably zero, false if there is
> - * some possibility of non-zero.
> - */
> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> {
> int cpu;
> + unsigned long seq;
> +
> + seq = srcu_readers_seq_idx(sp, idx);
> +
> + /*
> + * smp_mb() A pairs with smp_mb() B for the critical section.
> + * It ensures that an SRCU read-side critical section whose
> + * read-lock is not seen by the following srcu_readers_active_idx()
> + * will see any updates that the current task performed beforehand.
> + * (So we do not need to care about such readers this time.)
> + *
> + * Also, if we see the increment of the seq, we must see the
> + * increment of the active counter in the following
> + * srcu_readers_active_idx().
> + */
> + smp_mb(); /* A */
>
> /*
> * Note that srcu_readers_active_idx() can incorrectly return
> * zero even though there is a pre-existing reader throughout.
> * To see this, suppose that task A is in a very long SRCU
> * read-side critical section that started on CPU 0, and that
> - * no other reader exists, so that the modulo sum of the counters
> + * no other reader exists, so that the sum of the counters
> * is equal to one. Then suppose that task B starts executing
> * srcu_readers_active_idx(), summing up to CPU 1, and then that
> * task C starts reading on CPU 0, so that its increment is not
> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> return false;
>
> /*
> - * Since the caller recently flipped ->completed, we can see at
> - * most one increment of each CPU's counter from this point
> - * forward. The reason for this is that the reader CPU must have
> - * fetched the index before srcu_readers_active_idx checked
> - * that CPU's counter, but not yet incremented its counter.
> - * Its eventual counter increment will follow the read in
> - * srcu_readers_active_idx(), and that increment is immediately
> - * followed by smp_mb() B. Because smp_mb() D is between
> - * the ->completed flip and srcu_readers_active_idx()'s read,
> - * that CPU's subsequent load of ->completed must see the new
> - * value, and therefore increment the counter in the other rank.
> - */
> - smp_mb(); /* A */
> -
> - /*
> - * Now, we check the ->snap array that srcu_readers_active_idx()
> - * filled in from the per-CPU counter values. Since
> - * __srcu_read_lock() increments the upper bits of the per-CPU
> - * counter, an increment/decrement pair will change the value
> - * of the counter. Since there is only one possible increment,
> - * the only way to wrap the counter is to have a huge number of
> - * counter decrements, which requires a huge number of tasks and
> - * huge SRCU read-side critical-section nesting levels, even on
> - * 32-bit systems.
> - *
> - * All of the ways of confusing the readings require that the scan
> - * in srcu_readers_active_idx() see the read-side task's decrement,
> - * but not its increment. However, between that decrement and
> - * increment are smb_mb() B and C. Either or both of these pair
> - * with smp_mb() A above to ensure that the scan below will see
> - * the read-side tasks's increment, thus noting a difference in
> - * the counter values between the two passes.
> + * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> + * srcu_readers_active_idx() see a decrement of the active counter
> + * in srcu_read_unlock(), it should see one of these for corresponding
> + * srcu_read_lock():
> + * See the increment of the active counter,
> + * Failed to see the increment of the active counter.
> + * The second one can cause srcu_readers_active_idx() incorrectly
> + * return zero, but it means the above srcu_readers_seq_idx() does not
> + * see the increment of the seq(ref: comments of smp_mb() A),
> + * and the following srcu_readers_seq_idx() sees the increment of
> + * the seq. The seq is changed.
> *
> - * Therefore, if srcu_readers_active_idx() returned zero, and
> - * none of the counters changed, we know that the zero was the
> - * correct sum.
> - *
> - * Of course, it is possible that a task might be delayed
> - * for a very long time in __srcu_read_lock() after fetching
> - * the index but before incrementing its counter. This
> - * possibility will be dealt with in __synchronize_srcu().
> + * This smp_mb() D pairs with smp_mb() C for critical section.
> + * then any of the current task's subsequent code will happen after
> + * that SRCU read-side critical section whose read-unlock is seen in
> + * srcu_readers_active_idx().
> */
> - for_each_possible_cpu(cpu)
> - if (sp->snap[cpu] !=
> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> - return false; /* False zero reading! */
> - return true;
> + smp_mb(); /* D */
> +
> + return srcu_readers_seq_idx(sp, idx) == seq;
> }
>
> /**
> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> preempt_disable();
> idx = rcu_dereference_index_check(sp->completed,
> rcu_read_lock_sched_held()) & 0x1;
> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> - SRCU_USAGE_COUNT + 1;
> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> preempt_enable();
> return idx;
> }
> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> int trycount = 0;
>
> /*
> - * If a reader fetches the index before the ->completed increment,
> - * but increments its counter after srcu_readers_active_idx_check()
> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> - * smp_mb() B to ensure that the SRCU read-side critical section
> - * will see any updates that the current task performed before its
> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> - * as the case may be.
> - */
> - smp_mb(); /* D */
> -
> - /*
> * SRCU read-side critical sections are normally short, so wait
> * a small amount of time before possibly blocking.
> */
> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> schedule_timeout_interruptible(1);
> }
> }
> -
> - /*
> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> - * sees srcu_read_unlock()'s counter decrement, then any
> - * of the current task's subsequent code will happen after
> - * that SRCU read-side critical section.
> - *
> - * It also ensures the order between the above waiting and
> - * the next flipping.
> - */
> - smp_mb(); /* E */
> }
>
> static void srcu_flip(struct srcu_struct *sp)
> --
> 1.7.4.4
>

2012-02-28 01:46:57

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>> From: Lai Jiangshan <[email protected]>
>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>
>> This patch implements the algorithm from Peter:
>> https://lkml.org/lkml/2012/2/1/119
>>
>> o Make the checking lock-free so that checks can run in parallel.
>> Parallel checking is rarely useful by itself, but we need it when
>> 1) the original checking task is preempted for a long time, 2)
>> synchronize_srcu_expedited() is used, 3) we want to avoid a lock (see next).
>>
>> o Since the checking is lock-free, we save a mutex in the state machine
>> for call_srcu().
>>
>> o Remove SRCU_REF_MASK and remove the coupling with the flipping.
>> (This lets us remove the preempt_disable() in the future and use
>> __this_cpu_inc() instead.)
>>
>> o Remove one smp_mb(), simplify the comments, and make the smp_mb()
>> pairings more intuitive.
>
> Hello, Lai,
>
> Interesting approach!
>
> What happens given the following sequence of events?
>
> o CPU 0 in srcu_readers_active_idx_check() invokes
> srcu_readers_seq_idx(), getting some number back.
>
> o CPU 0 invokes srcu_readers_active_idx(), summing the
> ->c[] array up through CPU 3.
>
> o CPU 1 invokes __srcu_read_lock(), and increments its counter
> but not yet its ->seq[] element.


Any __srcu_read_lock() whose increment of the active counter is not seen
by srcu_readers_active_idx() is considered a
"reader started after this srcu_readers_active_idx_check()",
so we do not need to wait for it.

As you said, that SRCU critical section's seq increment is not seen by the
srcu_readers_seq_idx() above.

>
> o CPU 0 completes its summing of the ->c[] array, incorrectly
> obtaining zero.
>
> o CPU 0 invokes srcu_readers_seq_idx(), getting the same
> number back that it got last time.

If it incorrectly obtains zero, that means an __srcu_read_unlock() is seen
by srcu_readers_active_idx(), and that in turn means the increment of
seq is seen by this second srcu_readers_seq_idx(), so it is different
from the seq value obtained the first time.

The increment of seq is not seen by the earlier srcu_readers_seq_idx()
but is seen by the later one, so the two returned seq values differ.
This is the core of Peter's algorithm, and it is described in the
comments (sorry for my bad English). Or maybe I have missed your
point in this mail.

Thanks,
Lai
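
Lai's argument can likewise be replayed on the fake per-CPU layout. In the scripted sequence below (my own sketch, following the task-A/task-C example from the patch comments), the ->c[] scan sees task C's unlock but not its lock, so the active sum is incorrectly zero, while the two seq sums differ and the check correctly fails.

#include <stdio.h>

#define NR_FAKE_CPUS 4

static unsigned long c[NR_FAKE_CPUS];
static unsigned long seq[NR_FAKE_CPUS];

static unsigned long sum_seq(void)
{
	unsigned long s = 0;
	for (int i = 0; i < NR_FAKE_CPUS; i++)
		s += seq[i];
	return s;
}

int main(void)
{
	unsigned long seq_before, seq_after, active = 0;

	c[0]++; seq[0]++;		/* task A: long-running reader on CPU 0 */

	seq_before = sum_seq();		/* checker: first seq sum = 1 */

	active += c[0];			/* checker scans CPU 0: sees A's +1 */
	active += c[1];			/* ... and CPU 1 */

	c[0]++; seq[0]++;		/* task C: lock on CPU 0, unseen by scan */
	c[3] -= 1;			/* task C: migrates, unlock on CPU 3 */

	active += c[2];			/* checker finishes the scan ... */
	active += c[3];			/* ... and sees only C's decrement */

	seq_after = sum_seq();		/* checker: second seq sum = 2 */

	printf("active sum = %lu (incorrectly zero), seq %lu -> %lu, check %s\n",
	       active, seq_before, seq_after,
	       (active == 0 && seq_before == seq_after) ? "passes" : "fails");
	return 0;
}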

>
> o In parallel with the previous step, CPU 1 executes out of order
> (as permitted by the lack of a second memory barrier in
> __srcu_read_lock()), starting up the critical section before
> incrementing its ->seq[] element.
>
> o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> completes the SRCU grace period before CPU 1 completes its
> SRCU read-side critical section.
>
> This actually might be safe, but I need to think more about it. In the
> meantime, I figured I should ask your thoughts.
>
> Thanx, Paul
>
>> Inspired-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> include/linux/srcu.h | 7 +--
>> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
>> 2 files changed, 57 insertions(+), 87 deletions(-)
>>
>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>> index 5b49d41..15354db 100644
>> --- a/include/linux/srcu.h
>> +++ b/include/linux/srcu.h
>> @@ -32,18 +32,13 @@
>>
>> struct srcu_struct_array {
>> unsigned long c[2];
>> + unsigned long seq[2];
>> };
>>
>> -/* Bit definitions for field ->c above and ->snap below. */
>> -#define SRCU_USAGE_BITS 1
>> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
>> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
>> -
>> struct srcu_struct {
>> unsigned completed;
>> struct srcu_struct_array __percpu *per_cpu_ref;
>> struct mutex mutex;
>> - unsigned long snap[NR_CPUS];
>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>> struct lockdep_map dep_map;
>> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>> index 47ee35d..376b583 100644
>> --- a/kernel/srcu.c
>> +++ b/kernel/srcu.c
>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>
>> /*
>> + * Returns approximate total sequence of readers on the specified rank
>> + * of per-CPU counters.
>> + */
>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>> +{
>> + int cpu;
>> + unsigned long sum = 0;
>> + unsigned long t;
>> +
>> + for_each_possible_cpu(cpu) {
>> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>> + sum += t;
>> + }
>> + return sum;
>> +}
>> +
>> +/*
>> * Returns approximate number of readers active on the specified rank
>> - * of per-CPU counters. Also snapshots each counter's value in the
>> - * corresponding element of sp->snap[] for later use validating
>> - * the sum.
>> + * of per-CPU counters.
>> */
>> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>> {
>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>> for_each_possible_cpu(cpu) {
>> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>> sum += t;
>> - sp->snap[cpu] = t;
>> }
>> - return sum & SRCU_REF_MASK;
>> + return sum;
>> }
>>
>> -/*
>> - * To be called from the update side after an index flip. Returns true
>> - * if the modulo sum of the counters is stably zero, false if there is
>> - * some possibility of non-zero.
>> - */
>> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>> {
>> int cpu;
>> + unsigned long seq;
>> +
>> + seq = srcu_readers_seq_idx(sp, idx);
>> +
>> + /*
>> + * smp_mb() A pairs with smp_mb() B for the critical section.
>> + * It ensures that an SRCU read-side critical section whose
>> + * read-lock is not seen by the following srcu_readers_active_idx()
>> + * will see any updates that the current task performed beforehand.
>> + * (So we do not need to care about such readers this time.)
>> + *
>> + * Also, if we see the increment of the seq, we must see the
>> + * increment of the active counter in the following
>> + * srcu_readers_active_idx().
>> + */
>> + smp_mb(); /* A */
>>
>> /*
>> * Note that srcu_readers_active_idx() can incorrectly return
>> * zero even though there is a pre-existing reader throughout.
>> * To see this, suppose that task A is in a very long SRCU
>> * read-side critical section that started on CPU 0, and that
>> - * no other reader exists, so that the modulo sum of the counters
>> + * no other reader exists, so that the sum of the counters
>> * is equal to one. Then suppose that task B starts executing
>> * srcu_readers_active_idx(), summing up to CPU 1, and then that
>> * task C starts reading on CPU 0, so that its increment is not
>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>> return false;
>>
>> /*
>> - * Since the caller recently flipped ->completed, we can see at
>> - * most one increment of each CPU's counter from this point
>> - * forward. The reason for this is that the reader CPU must have
>> - * fetched the index before srcu_readers_active_idx checked
>> - * that CPU's counter, but not yet incremented its counter.
>> - * Its eventual counter increment will follow the read in
>> - * srcu_readers_active_idx(), and that increment is immediately
>> - * followed by smp_mb() B. Because smp_mb() D is between
>> - * the ->completed flip and srcu_readers_active_idx()'s read,
>> - * that CPU's subsequent load of ->completed must see the new
>> - * value, and therefore increment the counter in the other rank.
>> - */
>> - smp_mb(); /* A */
>> -
>> - /*
>> - * Now, we check the ->snap array that srcu_readers_active_idx()
>> - * filled in from the per-CPU counter values. Since
>> - * __srcu_read_lock() increments the upper bits of the per-CPU
>> - * counter, an increment/decrement pair will change the value
>> - * of the counter. Since there is only one possible increment,
>> - * the only way to wrap the counter is to have a huge number of
>> - * counter decrements, which requires a huge number of tasks and
>> - * huge SRCU read-side critical-section nesting levels, even on
>> - * 32-bit systems.
>> - *
>> - * All of the ways of confusing the readings require that the scan
>> - * in srcu_readers_active_idx() see the read-side task's decrement,
>> - * but not its increment. However, between that decrement and
>> - * increment are smb_mb() B and C. Either or both of these pair
>> - * with smp_mb() A above to ensure that the scan below will see
>> - * the read-side tasks's increment, thus noting a difference in
>> - * the counter values between the two passes.
>> + * Validation step: smp_mb() D pairs with smp_mb() C. If the above
>> + * srcu_readers_active_idx() saw a decrement of the active counter
>> + * in srcu_read_unlock(), then for the corresponding srcu_read_lock()
>> + * it either:
>> + * saw the increment of the active counter, or
>> + * failed to see the increment of the active counter.
>> + * The second case can make srcu_readers_active_idx() incorrectly
>> + * return zero, but it also means that the earlier srcu_readers_seq_idx()
>> + * did not see the increment of the seq (see the comment on smp_mb() A),
>> + * while the following srcu_readers_seq_idx() does see it. The two
>> + * seq sums therefore differ.
>> *
>> - * Therefore, if srcu_readers_active_idx() returned zero, and
>> - * none of the counters changed, we know that the zero was the
>> - * correct sum.
>> - *
>> - * Of course, it is possible that a task might be delayed
>> - * for a very long time in __srcu_read_lock() after fetching
>> - * the index but before incrementing its counter. This
>> - * possibility will be dealt with in __synchronize_srcu().
>> + * Because this smp_mb() D pairs with smp_mb() C for the critical
>> + * section, any of the current task's subsequent code will happen
>> + * after any SRCU read-side critical section whose read-unlock was
>> + * seen by srcu_readers_active_idx().
>> */
>> - for_each_possible_cpu(cpu)
>> - if (sp->snap[cpu] !=
>> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
>> - return false; /* False zero reading! */
>> - return true;
>> + smp_mb(); /* D */
>> +
>> + return srcu_readers_seq_idx(sp, idx) == seq;
>> }
>>
>> /**
>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>> preempt_disable();
>> idx = rcu_dereference_index_check(sp->completed,
>> rcu_read_lock_sched_held()) & 0x1;
>> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
>> - SRCU_USAGE_COUNT + 1;
>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>> smp_mb(); /* B */ /* Avoid leaking the critical section. */
>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>> preempt_enable();
>> return idx;
>> }
>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>> int trycount = 0;
>>
>> /*
>> - * If a reader fetches the index before the ->completed increment,
>> - * but increments its counter after srcu_readers_active_idx_check()
>> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
>> - * smp_mb() B to ensure that the SRCU read-side critical section
>> - * will see any updates that the current task performed before its
>> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
>> - * as the case may be.
>> - */
>> - smp_mb(); /* D */
>> -
>> - /*
>> * SRCU read-side critical sections are normally short, so wait
>> * a small amount of time before possibly blocking.
>> */
>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>> schedule_timeout_interruptible(1);
>> }
>> }
>> -
>> - /*
>> - * The following smp_mb() E pairs with srcu_read_unlock()'s
>> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
>> - * sees srcu_read_unlock()'s counter decrement, then any
>> - * of the current task's subsequent code will happen after
>> - * that SRCU read-side critical section.
>> - *
>> - * It also ensures the order between the above waiting and
>> - * the next flipping.
>> - */
>> - smp_mb(); /* E */
>> }
>>
>> static void srcu_flip(struct srcu_struct *sp)
>> --
>> 1.7.4.4
>>
>
>
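
For reference, the checking scheme above boils down to the rough
user-space model below. It is only a sketch: the model_*() names, the
fixed CPU count, and the GCC atomic builtins standing in for
ACCESS_ONCE() and smp_mb() are all inventions for illustration, not
kernel code.

/* Illustrative user-space model of the seq-validated check above. */
#include <stdbool.h>

#define NR_MODEL_CPUS 4         /* made-up CPU count */

struct model_cpu {
        unsigned long c[2];     /* active-reader counts, one per rank */
        unsigned long seq[2];   /* lock-event counts, one per rank */
};

static struct model_cpu ref[NR_MODEL_CPUS];
static unsigned completed;      /* models sp->completed */

/* Models __srcu_read_lock(): bump the active count, then the seq. */
static int model_read_lock(int cpu)
{
        int idx = __atomic_load_n(&completed, __ATOMIC_RELAXED) & 0x1;

        __atomic_fetch_add(&ref[cpu].c[idx], 1, __ATOMIC_RELAXED);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() B */
        __atomic_fetch_add(&ref[cpu].seq[idx], 1, __ATOMIC_RELAXED);
        return idx;
}

/* Models __srcu_read_unlock(). */
static void model_read_unlock(int cpu, int idx)
{
        __atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() C */
        __atomic_fetch_sub(&ref[cpu].c[idx], 1, __ATOMIC_RELAXED);
}

static unsigned long model_sum(int idx, bool want_seq)
{
        unsigned long sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_MODEL_CPUS; cpu++)
                sum += __atomic_load_n(want_seq ? &ref[cpu].seq[idx]
                                                : &ref[cpu].c[idx],
                                       __ATOMIC_RELAXED);
        return sum;
}

/*
 * Models srcu_readers_active_idx_check(): sample the seq sum, check
 * that the active sum is zero, then verify that the seq sum has not
 * moved.  If the active-sum scan saw a reader's unlock but missed the
 * matching lock, the first seq sum cannot have seen that lock's seq
 * increment while the second one will, so the sums differ and the
 * possibly-false zero is rejected.
 */
static bool model_check_zero(int idx)
{
        unsigned long seq = model_sum(idx, true);

        __atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() A */
        if (model_sum(idx, false) != 0)
                return false;
        __atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() D */
        return model_sum(idx, true) == seq;
}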

2012-02-28 13:47:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >> From: Lai Jiangshan <[email protected]>
> >> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>
> >> This patch implement the algorithm as Peter's:
> >> https://lkml.org/lkml/2012/2/1/119
> >>
> >> o Make the checking lock-free and we can perform parallel checking,
> >> Although almost parallel checking makes no sense, but we need it
> >> when 1) the original checking task is preempted for long, 2)
> >> sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>
> >> o Since it is lock-free, we save a mutex in state machine for
> >> call_srcu().
> >>
> >> o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >> (so we can remove the preempt_disable() in future, but use
> >> __this_cpu_inc() instead.)
> >>
> >> o reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >> more intuitive.
> >
> > Hello, Lai,
> >
> > Interesting approach!
> >
> > What happens given the following sequence of events?
> >
> > o CPU 0 in srcu_readers_active_idx_check() invokes
> > srcu_readers_seq_idx(), getting some number back.
> >
> > o CPU 0 invokes srcu_readers_active_idx(), summing the
> > ->c[] array up through CPU 3.
> >
> > o CPU 1 invokes __srcu_read_lock(), and increments its counter
> > but not yet its ->seq[] element.
>
>
> Any __srcu_read_lock() whose increment of active counter is not seen
> by srcu_readers_active_idx() is considerred as
> "reader-started-after-this-srcu_readers_active_idx_check()",
> We don't need to wait.
>
> As you said, this srcu C.S 's increment seq is not seen by above
> srcu_readers_seq_idx().
>
> >
> > o CPU 0 completes its summing of the ->c[] array, incorrectly
> > obtaining zero.
> >
> > o CPU 0 invokes srcu_readers_seq_idx(), getting the same
> > number back that it got last time.
>
> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> in srcu_readers_active_idx(), and it means the increment of
> seq is seen in this srcu_readers_seq_idx(), it is different
> from the above seq that it got last time.
>
> increment of seq is not seen by above srcu_readers_seq_idx(),
> but is seen by later one, so the two returned seq is different,
> this is the core of Peter's algorithm, and this was written
> in the comments(Sorry for my bad English). Or maybe I miss
> your means in this mail.

OK, good, this analysis agrees with what I was thinking.

So my next question is about the lock freedom. This lock freedom has to
be limited in nature and carefully implemented. The reasons for this are:

1. Readers can block in any case, which can of course block both
synchronize_srcu_expedited() and synchronize_srcu().

2. Because only one CPU at a time can be incrementing ->completed,
some sort of lock with preemption disabling will of course be
needed. Alternatively, an rt_mutex could be used for its
priority-inheritance properties.

3. Once some CPU has incremented ->completed, all CPUs that might
still be summing up the old indexes must stop. If they don't,
they might incorrectly call a too-short grace period in case of
->seq[]-sum overflow on 32-bit systems.

Or did you have something else in mind?
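
For concreteness, what I have in mind for #2 and #3 is something like
the sketch below. It is not a real patch: check_zero_committed() is a
made-up name for whatever records a successful check, and the flip
simply stays under the existing ->mutex (an rt_mutex would work the
same way).

static void srcu_flip_locked(struct srcu_struct *sp)
{
        mutex_lock(&sp->mutex);         /* only one CPU advances ->completed */
        sp->completed++;
        mutex_unlock(&sp->mutex);
}

static bool srcu_try_check_zero(struct srcu_struct *sp, int idx)
{
        unsigned snap = ACCESS_ONCE(sp->completed);

        if (!srcu_readers_active_idx_check(sp, idx))
                return false;
        /*
         * If someone flipped while we were summing, our result might
         * be based on the old indexes; throw it away rather than risk
         * reporting a too-short grace period after ->seq[] wrap.
         */
        if (ACCESS_ONCE(sp->completed) != snap)
                return false;
        return check_zero_committed(sp, idx);   /* made-up helper */
}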

Thanx, Paul

> Thanks,
> Lai
>
> >
> > o In parallel with the previous step, CPU 1 executes out of order
> > (as permitted by the lack of a second memory barrier in
> > __srcu_read_lock()), starting up the critical section before
> > incrementing its ->seq[] element.
> >
> > o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> > completes the SRCU grace period before CPU 1 completes its
> > SRCU read-side critical section.
> >
> > This actually might be safe, but I need to think more about it. In the
> > meantime, I figured I should ask your thoughts.
> >
> > Thanx, Paul
> >
> >> Inspired-by: Peter Zijlstra <[email protected]>
> >> Signed-off-by: Lai Jiangshan <[email protected]>
> >> ---
> >> include/linux/srcu.h | 7 +--
> >> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
> >> 2 files changed, 57 insertions(+), 87 deletions(-)
> >>
> >> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >> index 5b49d41..15354db 100644
> >> --- a/include/linux/srcu.h
> >> +++ b/include/linux/srcu.h
> >> @@ -32,18 +32,13 @@
> >>
> >> struct srcu_struct_array {
> >> unsigned long c[2];
> >> + unsigned long seq[2];
> >> };
> >>
> >> -/* Bit definitions for field ->c above and ->snap below. */
> >> -#define SRCU_USAGE_BITS 1
> >> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> >> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
> >> -
> >> struct srcu_struct {
> >> unsigned completed;
> >> struct srcu_struct_array __percpu *per_cpu_ref;
> >> struct mutex mutex;
> >> - unsigned long snap[NR_CPUS];
> >> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >> struct lockdep_map dep_map;
> >> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >> index 47ee35d..376b583 100644
> >> --- a/kernel/srcu.c
> >> +++ b/kernel/srcu.c
> >> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>
> >> /*
> >> + * Returns approximate total sequence of readers on the specified rank
> >> + * of per-CPU counters.
> >> + */
> >> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> >> +{
> >> + int cpu;
> >> + unsigned long sum = 0;
> >> + unsigned long t;
> >> +
> >> + for_each_possible_cpu(cpu) {
> >> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> >> + sum += t;
> >> + }
> >> + return sum;
> >> +}
> >> +
> >> +/*
> >> * Returns approximate number of readers active on the specified rank
> >> - * of per-CPU counters. Also snapshots each counter's value in the
> >> - * corresponding element of sp->snap[] for later use validating
> >> - * the sum.
> >> + * of per-CPU counters.
> >> */
> >> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >> {
> >> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >> for_each_possible_cpu(cpu) {
> >> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> >> sum += t;
> >> - sp->snap[cpu] = t;
> >> }
> >> - return sum & SRCU_REF_MASK;
> >> + return sum;
> >> }
> >>
> >> -/*
> >> - * To be called from the update side after an index flip. Returns true
> >> - * if the modulo sum of the counters is stably zero, false if there is
> >> - * some possibility of non-zero.
> >> - */
> >> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >> {
> >> int cpu;
> >> + unsigned long seq;
> >> +
> >> + seq = srcu_readers_seq_idx(sp, idx);
> >> +
> >> + /*
> >> + * smp_mb() A pairs with smp_mb() B for critical section.
> >> + * It ensures that the SRCU read-side critical section whose
> >> + * read-lock is not seen by the following srcu_readers_active_idx()
> >> + * will see any updates that before the current task performed before.
> >> + * (So we don't need to care these readers this time)
> >> + *
> >> + * Also, if we see the increment of the seq, we must see the
> >> + * increment of the active counter in the following
> >> + * srcu_readers_active_idx().
> >> + */
> >> + smp_mb(); /* A */
> >>
> >> /*
> >> * Note that srcu_readers_active_idx() can incorrectly return
> >> * zero even though there is a pre-existing reader throughout.
> >> * To see this, suppose that task A is in a very long SRCU
> >> * read-side critical section that started on CPU 0, and that
> >> - * no other reader exists, so that the modulo sum of the counters
> >> + * no other reader exists, so that the sum of the counters
> >> * is equal to one. Then suppose that task B starts executing
> >> * srcu_readers_active_idx(), summing up to CPU 1, and then that
> >> * task C starts reading on CPU 0, so that its increment is not
> >> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >> return false;
> >>
> >> /*
> >> - * Since the caller recently flipped ->completed, we can see at
> >> - * most one increment of each CPU's counter from this point
> >> - * forward. The reason for this is that the reader CPU must have
> >> - * fetched the index before srcu_readers_active_idx checked
> >> - * that CPU's counter, but not yet incremented its counter.
> >> - * Its eventual counter increment will follow the read in
> >> - * srcu_readers_active_idx(), and that increment is immediately
> >> - * followed by smp_mb() B. Because smp_mb() D is between
> >> - * the ->completed flip and srcu_readers_active_idx()'s read,
> >> - * that CPU's subsequent load of ->completed must see the new
> >> - * value, and therefore increment the counter in the other rank.
> >> - */
> >> - smp_mb(); /* A */
> >> -
> >> - /*
> >> - * Now, we check the ->snap array that srcu_readers_active_idx()
> >> - * filled in from the per-CPU counter values. Since
> >> - * __srcu_read_lock() increments the upper bits of the per-CPU
> >> - * counter, an increment/decrement pair will change the value
> >> - * of the counter. Since there is only one possible increment,
> >> - * the only way to wrap the counter is to have a huge number of
> >> - * counter decrements, which requires a huge number of tasks and
> >> - * huge SRCU read-side critical-section nesting levels, even on
> >> - * 32-bit systems.
> >> - *
> >> - * All of the ways of confusing the readings require that the scan
> >> - * in srcu_readers_active_idx() see the read-side task's decrement,
> >> - * but not its increment. However, between that decrement and
> >> - * increment are smb_mb() B and C. Either or both of these pair
> >> - * with smp_mb() A above to ensure that the scan below will see
> >> - * the read-side tasks's increment, thus noting a difference in
> >> - * the counter values between the two passes.
> >> + * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> >> + * srcu_readers_active_idx() see a decrement of the active counter
> >> + * in srcu_read_unlock(), it should see one of these for corresponding
> >> + * srcu_read_lock():
> >> + * See the increment of the active counter,
> >> + * Failed to see the increment of the active counter.
> >> + * The second one can cause srcu_readers_active_idx() incorrectly
> >> + * return zero, but it means the above srcu_readers_seq_idx() does not
> >> + * see the increment of the seq(ref: comments of smp_mb() A),
> >> + * and the following srcu_readers_seq_idx() sees the increment of
> >> + * the seq. The seq is changed.
> >> *
> >> - * Therefore, if srcu_readers_active_idx() returned zero, and
> >> - * none of the counters changed, we know that the zero was the
> >> - * correct sum.
> >> - *
> >> - * Of course, it is possible that a task might be delayed
> >> - * for a very long time in __srcu_read_lock() after fetching
> >> - * the index but before incrementing its counter. This
> >> - * possibility will be dealt with in __synchronize_srcu().
> >> + * This smp_mb() D pairs with smp_mb() C for critical section.
> >> + * then any of the current task's subsequent code will happen after
> >> + * that SRCU read-side critical section whose read-unlock is seen in
> >> + * srcu_readers_active_idx().
> >> */
> >> - for_each_possible_cpu(cpu)
> >> - if (sp->snap[cpu] !=
> >> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> >> - return false; /* False zero reading! */
> >> - return true;
> >> + smp_mb(); /* D */
> >> +
> >> + return srcu_readers_seq_idx(sp, idx) == seq;
> >> }
> >>
> >> /**
> >> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >> preempt_disable();
> >> idx = rcu_dereference_index_check(sp->completed,
> >> rcu_read_lock_sched_held()) & 0x1;
> >> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> >> - SRCU_USAGE_COUNT + 1;
> >> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> >> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> >> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> >> preempt_enable();
> >> return idx;
> >> }
> >> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >> int trycount = 0;
> >>
> >> /*
> >> - * If a reader fetches the index before the ->completed increment,
> >> - * but increments its counter after srcu_readers_active_idx_check()
> >> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> >> - * smp_mb() B to ensure that the SRCU read-side critical section
> >> - * will see any updates that the current task performed before its
> >> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> >> - * as the case may be.
> >> - */
> >> - smp_mb(); /* D */
> >> -
> >> - /*
> >> * SRCU read-side critical sections are normally short, so wait
> >> * a small amount of time before possibly blocking.
> >> */
> >> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >> schedule_timeout_interruptible(1);
> >> }
> >> }
> >> -
> >> - /*
> >> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> >> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> >> - * sees srcu_read_unlock()'s counter decrement, then any
> >> - * of the current task's subsequent code will happen after
> >> - * that SRCU read-side critical section.
> >> - *
> >> - * It also ensures the order between the above waiting and
> >> - * the next flipping.
> >> - */
> >> - smp_mb(); /* E */
> >> }
> >>
> >> static void srcu_flip(struct srcu_struct *sp)
> >> --
> >> 1.7.4.4
> >>
> >
> >
>

2012-02-29 10:11:12

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>>>> From: Lai Jiangshan <[email protected]>
>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>>>
>>>> This patch implement the algorithm as Peter's:
>>>> https://lkml.org/lkml/2012/2/1/119
>>>>
>>>> o Make the checking lock-free and we can perform parallel checking,
>>>> Although almost parallel checking makes no sense, but we need it
>>>> when 1) the original checking task is preempted for long, 2)
>>>> sychronize_srcu_expedited(), 3) avoid lock(see next)
>>>>
>>>> o Since it is lock-free, we save a mutex in state machine for
>>>> call_srcu().
>>>>
>>>> o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>>>> (so we can remove the preempt_disable() in future, but use
>>>> __this_cpu_inc() instead.)
>>>>
>>>> o reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>>>> more intuitive.
>>>
>>> Hello, Lai,
>>>
>>> Interesting approach!
>>>
>>> What happens given the following sequence of events?
>>>
>>> o CPU 0 in srcu_readers_active_idx_check() invokes
>>> srcu_readers_seq_idx(), getting some number back.
>>>
>>> o CPU 0 invokes srcu_readers_active_idx(), summing the
>>> ->c[] array up through CPU 3.
>>>
>>> o CPU 1 invokes __srcu_read_lock(), and increments its counter
>>> but not yet its ->seq[] element.
>>
>>
>> Any __srcu_read_lock() whose increment of active counter is not seen
>> by srcu_readers_active_idx() is considerred as
>> "reader-started-after-this-srcu_readers_active_idx_check()",
>> We don't need to wait.
>>
>> As you said, this srcu C.S 's increment seq is not seen by above
>> srcu_readers_seq_idx().
>>
>>>
>>> o CPU 0 completes its summing of the ->c[] array, incorrectly
>>> obtaining zero.
>>>
>>> o CPU 0 invokes srcu_readers_seq_idx(), getting the same
>>> number back that it got last time.
>>
>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>> in srcu_readers_active_idx(), and it means the increment of
>> seq is seen in this srcu_readers_seq_idx(), it is different
>> from the above seq that it got last time.
>>
>> increment of seq is not seen by above srcu_readers_seq_idx(),
>> but is seen by later one, so the two returned seq is different,
>> this is the core of Peter's algorithm, and this was written
>> in the comments(Sorry for my bad English). Or maybe I miss
>> your means in this mail.
>
> OK, good, this analysis agrees with what I was thinking.
>
> So my next question is about the lock freedom. This lock freedom has to
> be limited in nature and carefully implemented. The reasons for this are:
>
> 1. Readers can block in any case, which can of course block both
> synchronize_srcu_expedited() and synchronize_srcu().
>
> 2. Because only one CPU at a time can be incrementing ->completed,
> some sort of lock with preemption disabling will of course be
> needed. Alternatively, an rt_mutex could be used for its
> priority-inheritance properties.
>
> 3. Once some CPU has incremented ->completed, all CPUs that might
> still be summing up the old indexes must stop. If they don't,
> they might incorrectly call a too-short grace period in case of
> ->seq[]-sum overflow on 32-bit systems.
>
> Or did you have something else in mind?

When a flip happens while check_zero is in progress, that check_zero
will not be committed even if it succeeds.

I played with lock-free too much for call_srcu() and the code became
complicated, so I am giving up on lock-free for call_srcu(); the main
aim for call_srcu() is simplicity.

(But I still like Peter's approach; it has other good properties
besides the lock-free checking. If you don't like it, I will send
another patch to fix srcu_readers_active().)

Thanks,
Lai

>
> Thanx, Paul
>
>> Thanks,
>> Lai
>>
>>>
>>> o In parallel with the previous step, CPU 1 executes out of order
>>> (as permitted by the lack of a second memory barrier in
>>> __srcu_read_lock()), starting up the critical section before
>>> incrementing its ->seq[] element.
>>>
>>> o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
>>> completes the SRCU grace period before CPU 1 completes its
>>> SRCU read-side critical section.
>>>
>>> This actually might be safe, but I need to think more about it. In the
>>> meantime, I figured I should ask your thoughts.
>>>
>>> Thanx, Paul
>>>
>>>> Inspired-by: Peter Zijlstra <[email protected]>
>>>> Signed-off-by: Lai Jiangshan <[email protected]>
>>>> ---
>>>> include/linux/srcu.h | 7 +--
>>>> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
>>>> 2 files changed, 57 insertions(+), 87 deletions(-)
>>>>
>>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>>>> index 5b49d41..15354db 100644
>>>> --- a/include/linux/srcu.h
>>>> +++ b/include/linux/srcu.h
>>>> @@ -32,18 +32,13 @@
>>>>
>>>> struct srcu_struct_array {
>>>> unsigned long c[2];
>>>> + unsigned long seq[2];
>>>> };
>>>>
>>>> -/* Bit definitions for field ->c above and ->snap below. */
>>>> -#define SRCU_USAGE_BITS 1
>>>> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
>>>> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
>>>> -
>>>> struct srcu_struct {
>>>> unsigned completed;
>>>> struct srcu_struct_array __percpu *per_cpu_ref;
>>>> struct mutex mutex;
>>>> - unsigned long snap[NR_CPUS];
>>>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>>> struct lockdep_map dep_map;
>>>> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>>>> index 47ee35d..376b583 100644
>>>> --- a/kernel/srcu.c
>>>> +++ b/kernel/srcu.c
>>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>>>> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>>
>>>> /*
>>>> + * Returns approximate total sequence of readers on the specified rank
>>>> + * of per-CPU counters.
>>>> + */
>>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>>>> +{
>>>> + int cpu;
>>>> + unsigned long sum = 0;
>>>> + unsigned long t;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>>>> + sum += t;
>>>> + }
>>>> + return sum;
>>>> +}
>>>> +
>>>> +/*
>>>> * Returns approximate number of readers active on the specified rank
>>>> - * of per-CPU counters. Also snapshots each counter's value in the
>>>> - * corresponding element of sp->snap[] for later use validating
>>>> - * the sum.
>>>> + * of per-CPU counters.
>>>> */
>>>> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>> {
>>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>> for_each_possible_cpu(cpu) {
>>>> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>>>> sum += t;
>>>> - sp->snap[cpu] = t;
>>>> }
>>>> - return sum & SRCU_REF_MASK;
>>>> + return sum;
>>>> }
>>>>
>>>> -/*
>>>> - * To be called from the update side after an index flip. Returns true
>>>> - * if the modulo sum of the counters is stably zero, false if there is
>>>> - * some possibility of non-zero.
>>>> - */
>>>> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>> {
>>>> int cpu;
>>>> + unsigned long seq;
>>>> +
>>>> + seq = srcu_readers_seq_idx(sp, idx);
>>>> +
>>>> + /*
>>>> + * smp_mb() A pairs with smp_mb() B for critical section.
>>>> + * It ensures that the SRCU read-side critical section whose
>>>> + * read-lock is not seen by the following srcu_readers_active_idx()
>>>> + * will see any updates that before the current task performed before.
>>>> + * (So we don't need to care these readers this time)
>>>> + *
>>>> + * Also, if we see the increment of the seq, we must see the
>>>> + * increment of the active counter in the following
>>>> + * srcu_readers_active_idx().
>>>> + */
>>>> + smp_mb(); /* A */
>>>>
>>>> /*
>>>> * Note that srcu_readers_active_idx() can incorrectly return
>>>> * zero even though there is a pre-existing reader throughout.
>>>> * To see this, suppose that task A is in a very long SRCU
>>>> * read-side critical section that started on CPU 0, and that
>>>> - * no other reader exists, so that the modulo sum of the counters
>>>> + * no other reader exists, so that the sum of the counters
>>>> * is equal to one. Then suppose that task B starts executing
>>>> * srcu_readers_active_idx(), summing up to CPU 1, and then that
>>>> * task C starts reading on CPU 0, so that its increment is not
>>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>> return false;
>>>>
>>>> /*
>>>> - * Since the caller recently flipped ->completed, we can see at
>>>> - * most one increment of each CPU's counter from this point
>>>> - * forward. The reason for this is that the reader CPU must have
>>>> - * fetched the index before srcu_readers_active_idx checked
>>>> - * that CPU's counter, but not yet incremented its counter.
>>>> - * Its eventual counter increment will follow the read in
>>>> - * srcu_readers_active_idx(), and that increment is immediately
>>>> - * followed by smp_mb() B. Because smp_mb() D is between
>>>> - * the ->completed flip and srcu_readers_active_idx()'s read,
>>>> - * that CPU's subsequent load of ->completed must see the new
>>>> - * value, and therefore increment the counter in the other rank.
>>>> - */
>>>> - smp_mb(); /* A */
>>>> -
>>>> - /*
>>>> - * Now, we check the ->snap array that srcu_readers_active_idx()
>>>> - * filled in from the per-CPU counter values. Since
>>>> - * __srcu_read_lock() increments the upper bits of the per-CPU
>>>> - * counter, an increment/decrement pair will change the value
>>>> - * of the counter. Since there is only one possible increment,
>>>> - * the only way to wrap the counter is to have a huge number of
>>>> - * counter decrements, which requires a huge number of tasks and
>>>> - * huge SRCU read-side critical-section nesting levels, even on
>>>> - * 32-bit systems.
>>>> - *
>>>> - * All of the ways of confusing the readings require that the scan
>>>> - * in srcu_readers_active_idx() see the read-side task's decrement,
>>>> - * but not its increment. However, between that decrement and
>>>> - * increment are smb_mb() B and C. Either or both of these pair
>>>> - * with smp_mb() A above to ensure that the scan below will see
>>>> - * the read-side tasks's increment, thus noting a difference in
>>>> - * the counter values between the two passes.
>>>> + * Validation step, smp_mb() D pairs with smp_mb() C. If the above
>>>> + * srcu_readers_active_idx() see a decrement of the active counter
>>>> + * in srcu_read_unlock(), it should see one of these for corresponding
>>>> + * srcu_read_lock():
>>>> + * See the increment of the active counter,
>>>> + * Failed to see the increment of the active counter.
>>>> + * The second one can cause srcu_readers_active_idx() incorrectly
>>>> + * return zero, but it means the above srcu_readers_seq_idx() does not
>>>> + * see the increment of the seq(ref: comments of smp_mb() A),
>>>> + * and the following srcu_readers_seq_idx() sees the increment of
>>>> + * the seq. The seq is changed.
>>>> *
>>>> - * Therefore, if srcu_readers_active_idx() returned zero, and
>>>> - * none of the counters changed, we know that the zero was the
>>>> - * correct sum.
>>>> - *
>>>> - * Of course, it is possible that a task might be delayed
>>>> - * for a very long time in __srcu_read_lock() after fetching
>>>> - * the index but before incrementing its counter. This
>>>> - * possibility will be dealt with in __synchronize_srcu().
>>>> + * This smp_mb() D pairs with smp_mb() C for critical section.
>>>> + * then any of the current task's subsequent code will happen after
>>>> + * that SRCU read-side critical section whose read-unlock is seen in
>>>> + * srcu_readers_active_idx().
>>>> */
>>>> - for_each_possible_cpu(cpu)
>>>> - if (sp->snap[cpu] !=
>>>> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
>>>> - return false; /* False zero reading! */
>>>> - return true;
>>>> + smp_mb(); /* D */
>>>> +
>>>> + return srcu_readers_seq_idx(sp, idx) == seq;
>>>> }
>>>>
>>>> /**
>>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>>>> preempt_disable();
>>>> idx = rcu_dereference_index_check(sp->completed,
>>>> rcu_read_lock_sched_held()) & 0x1;
>>>> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
>>>> - SRCU_USAGE_COUNT + 1;
>>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>>>> smp_mb(); /* B */ /* Avoid leaking the critical section. */
>>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>>>> preempt_enable();
>>>> return idx;
>>>> }
>>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>> int trycount = 0;
>>>>
>>>> /*
>>>> - * If a reader fetches the index before the ->completed increment,
>>>> - * but increments its counter after srcu_readers_active_idx_check()
>>>> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
>>>> - * smp_mb() B to ensure that the SRCU read-side critical section
>>>> - * will see any updates that the current task performed before its
>>>> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
>>>> - * as the case may be.
>>>> - */
>>>> - smp_mb(); /* D */
>>>> -
>>>> - /*
>>>> * SRCU read-side critical sections are normally short, so wait
>>>> * a small amount of time before possibly blocking.
>>>> */
>>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>> schedule_timeout_interruptible(1);
>>>> }
>>>> }
>>>> -
>>>> - /*
>>>> - * The following smp_mb() E pairs with srcu_read_unlock()'s
>>>> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
>>>> - * sees srcu_read_unlock()'s counter decrement, then any
>>>> - * of the current task's subsequent code will happen after
>>>> - * that SRCU read-side critical section.
>>>> - *
>>>> - * It also ensures the order between the above waiting and
>>>> - * the next flipping.
>>>> - */
>>>> - smp_mb(); /* E */
>>>> }
>>>>
>>>> static void srcu_flip(struct srcu_struct *sp)
>>>> --
>>>> 1.7.4.4
>>>>
>>>
>>>
>>
>
>

2012-02-29 13:56:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> >> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> >>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >>>> From: Lai Jiangshan <[email protected]>
> >>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>>>
> >>>> This patch implement the algorithm as Peter's:
> >>>> https://lkml.org/lkml/2012/2/1/119
> >>>>
> >>>> o Make the checking lock-free and we can perform parallel checking,
> >>>> Although almost parallel checking makes no sense, but we need it
> >>>> when 1) the original checking task is preempted for long, 2)
> >>>> sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>>>
> >>>> o Since it is lock-free, we save a mutex in state machine for
> >>>> call_srcu().
> >>>>
> >>>> o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >>>> (so we can remove the preempt_disable() in future, but use
> >>>> __this_cpu_inc() instead.)
> >>>>
> >>>> o reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >>>> more intuitive.
> >>>
> >>> Hello, Lai,
> >>>
> >>> Interesting approach!
> >>>
> >>> What happens given the following sequence of events?
> >>>
> >>> o CPU 0 in srcu_readers_active_idx_check() invokes
> >>> srcu_readers_seq_idx(), getting some number back.
> >>>
> >>> o CPU 0 invokes srcu_readers_active_idx(), summing the
> >>> ->c[] array up through CPU 3.
> >>>
> >>> o CPU 1 invokes __srcu_read_lock(), and increments its counter
> >>> but not yet its ->seq[] element.
> >>
> >>
> >> Any __srcu_read_lock() whose increment of active counter is not seen
> >> by srcu_readers_active_idx() is considerred as
> >> "reader-started-after-this-srcu_readers_active_idx_check()",
> >> We don't need to wait.
> >>
> >> As you said, this srcu C.S 's increment seq is not seen by above
> >> srcu_readers_seq_idx().
> >>
> >>>
> >>> o CPU 0 completes its summing of the ->c[] array, incorrectly
> >>> obtaining zero.
> >>>
> >>> o CPU 0 invokes srcu_readers_seq_idx(), getting the same
> >>> number back that it got last time.
> >>
> >> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> >> in srcu_readers_active_idx(), and it means the increment of
> >> seq is seen in this srcu_readers_seq_idx(), it is different
> >> from the above seq that it got last time.
> >>
> >> increment of seq is not seen by above srcu_readers_seq_idx(),
> >> but is seen by later one, so the two returned seq is different,
> >> this is the core of Peter's algorithm, and this was written
> >> in the comments(Sorry for my bad English). Or maybe I miss
> >> your means in this mail.
> >
> > OK, good, this analysis agrees with what I was thinking.
> >
> > So my next question is about the lock freedom. This lock freedom has to
> > be limited in nature and carefully implemented. The reasons for this are:
> >
> > 1. Readers can block in any case, which can of course block both
> > synchronize_srcu_expedited() and synchronize_srcu().
> >
> > 2. Because only one CPU at a time can be incrementing ->completed,
> > some sort of lock with preemption disabling will of course be
> > needed. Alternatively, an rt_mutex could be used for its
> > priority-inheritance properties.
> >
> > 3. Once some CPU has incremented ->completed, all CPUs that might
> > still be summing up the old indexes must stop. If they don't,
> > they might incorrectly call a too-short grace period in case of
> > ->seq[]-sum overflow on 32-bit systems.
> >
> > Or did you have something else in mind?
>
> When a flip happens while check_zero is in progress, that check_zero
> will not be committed even if it succeeds.

But if the CPU in check_zero isn't blocking the grace period, then
->completed could overflow while that CPU was preempted. Then how
would this CPU know that the flip had happened?
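
(Illustrative only, with a made-up function name, but the failure mode
I am worried about looks roughly like this:)

static bool flip_went_unnoticed(struct srcu_struct *sp)
{
        unsigned snap = ACCESS_ONCE(sp->completed);

        /*
         * Suppose we are preempted right here while the other CPUs
         * run enough grace periods to wrap the 32-bit ->completed
         * all the way around (UINT_MAX + 1 flips).
         */

        /*
         * This compares equal even though many flips happened, so a
         * "stable zero" based on it could end a grace period early.
         */
        return ACCESS_ONCE(sp->completed) == snap;
}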

> I played with lock-free too much for call_srcu() and the code became
> complicated, so I am giving up on lock-free for call_srcu(); the main
> aim for call_srcu() is simplicity.

Makes sense to me!

> (But I still like Peter's approach; it has other good properties
> besides the lock-free checking. If you don't like it, I will send
> another patch to fix srcu_readers_active().)

Try them both and check their performance &c. If within epsilon of
each other, pick whichever one you prefer.

Thanx, Paul

> Thanks,
> Lai
>
> >
> > Thanx, Paul
> >
> >> Thanks,
> >> Lai
> >>
> >>>
> >>> o In parallel with the previous step, CPU 1 executes out of order
> >>> (as permitted by the lack of a second memory barrier in
> >>> __srcu_read_lock()), starting up the critical section before
> >>> incrementing its ->seq[] element.
> >>>
> >>> o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> >>> completes the SRCU grace period before CPU 1 completes its
> >>> SRCU read-side critical section.
> >>>
> >>> This actually might be safe, but I need to think more about it. In the
> >>> meantime, I figured I should ask your thoughts.
> >>>
> >>> Thanx, Paul
> >>>
> >>>> Inspired-by: Peter Zijlstra <[email protected]>
> >>>> Signed-off-by: Lai Jiangshan <[email protected]>
> >>>> ---
> >>>> include/linux/srcu.h | 7 +--
> >>>> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
> >>>> 2 files changed, 57 insertions(+), 87 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >>>> index 5b49d41..15354db 100644
> >>>> --- a/include/linux/srcu.h
> >>>> +++ b/include/linux/srcu.h
> >>>> @@ -32,18 +32,13 @@
> >>>>
> >>>> struct srcu_struct_array {
> >>>> unsigned long c[2];
> >>>> + unsigned long seq[2];
> >>>> };
> >>>>
> >>>> -/* Bit definitions for field ->c above and ->snap below. */
> >>>> -#define SRCU_USAGE_BITS 1
> >>>> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> >>>> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
> >>>> -
> >>>> struct srcu_struct {
> >>>> unsigned completed;
> >>>> struct srcu_struct_array __percpu *per_cpu_ref;
> >>>> struct mutex mutex;
> >>>> - unsigned long snap[NR_CPUS];
> >>>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>>> struct lockdep_map dep_map;
> >>>> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >>>> index 47ee35d..376b583 100644
> >>>> --- a/kernel/srcu.c
> >>>> +++ b/kernel/srcu.c
> >>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >>>> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>>
> >>>> /*
> >>>> + * Returns approximate total sequence of readers on the specified rank
> >>>> + * of per-CPU counters.
> >>>> + */
> >>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> >>>> +{
> >>>> + int cpu;
> >>>> + unsigned long sum = 0;
> >>>> + unsigned long t;
> >>>> +
> >>>> + for_each_possible_cpu(cpu) {
> >>>> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> >>>> + sum += t;
> >>>> + }
> >>>> + return sum;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> * Returns approximate number of readers active on the specified rank
> >>>> - * of per-CPU counters. Also snapshots each counter's value in the
> >>>> - * corresponding element of sp->snap[] for later use validating
> >>>> - * the sum.
> >>>> + * of per-CPU counters.
> >>>> */
> >>>> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>> {
> >>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>> for_each_possible_cpu(cpu) {
> >>>> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> >>>> sum += t;
> >>>> - sp->snap[cpu] = t;
> >>>> }
> >>>> - return sum & SRCU_REF_MASK;
> >>>> + return sum;
> >>>> }
> >>>>
> >>>> -/*
> >>>> - * To be called from the update side after an index flip. Returns true
> >>>> - * if the modulo sum of the counters is stably zero, false if there is
> >>>> - * some possibility of non-zero.
> >>>> - */
> >>>> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>> {
> >>>> int cpu;
> >>>> + unsigned long seq;
> >>>> +
> >>>> + seq = srcu_readers_seq_idx(sp, idx);
> >>>> +
> >>>> + /*
> >>>> + * smp_mb() A pairs with smp_mb() B for critical section.
> >>>> + * It ensures that the SRCU read-side critical section whose
> >>>> + * read-lock is not seen by the following srcu_readers_active_idx()
> >>>> + * will see any updates that before the current task performed before.
> >>>> + * (So we don't need to care these readers this time)
> >>>> + *
> >>>> + * Also, if we see the increment of the seq, we must see the
> >>>> + * increment of the active counter in the following
> >>>> + * srcu_readers_active_idx().
> >>>> + */
> >>>> + smp_mb(); /* A */
> >>>>
> >>>> /*
> >>>> * Note that srcu_readers_active_idx() can incorrectly return
> >>>> * zero even though there is a pre-existing reader throughout.
> >>>> * To see this, suppose that task A is in a very long SRCU
> >>>> * read-side critical section that started on CPU 0, and that
> >>>> - * no other reader exists, so that the modulo sum of the counters
> >>>> + * no other reader exists, so that the sum of the counters
> >>>> * is equal to one. Then suppose that task B starts executing
> >>>> * srcu_readers_active_idx(), summing up to CPU 1, and then that
> >>>> * task C starts reading on CPU 0, so that its increment is not
> >>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>> return false;
> >>>>
> >>>> /*
> >>>> - * Since the caller recently flipped ->completed, we can see at
> >>>> - * most one increment of each CPU's counter from this point
> >>>> - * forward. The reason for this is that the reader CPU must have
> >>>> - * fetched the index before srcu_readers_active_idx checked
> >>>> - * that CPU's counter, but not yet incremented its counter.
> >>>> - * Its eventual counter increment will follow the read in
> >>>> - * srcu_readers_active_idx(), and that increment is immediately
> >>>> - * followed by smp_mb() B. Because smp_mb() D is between
> >>>> - * the ->completed flip and srcu_readers_active_idx()'s read,
> >>>> - * that CPU's subsequent load of ->completed must see the new
> >>>> - * value, and therefore increment the counter in the other rank.
> >>>> - */
> >>>> - smp_mb(); /* A */
> >>>> -
> >>>> - /*
> >>>> - * Now, we check the ->snap array that srcu_readers_active_idx()
> >>>> - * filled in from the per-CPU counter values. Since
> >>>> - * __srcu_read_lock() increments the upper bits of the per-CPU
> >>>> - * counter, an increment/decrement pair will change the value
> >>>> - * of the counter. Since there is only one possible increment,
> >>>> - * the only way to wrap the counter is to have a huge number of
> >>>> - * counter decrements, which requires a huge number of tasks and
> >>>> - * huge SRCU read-side critical-section nesting levels, even on
> >>>> - * 32-bit systems.
> >>>> - *
> >>>> - * All of the ways of confusing the readings require that the scan
> >>>> - * in srcu_readers_active_idx() see the read-side task's decrement,
> >>>> - * but not its increment. However, between that decrement and
> >>>> - * increment are smb_mb() B and C. Either or both of these pair
> >>>> - * with smp_mb() A above to ensure that the scan below will see
> >>>> - * the read-side tasks's increment, thus noting a difference in
> >>>> - * the counter values between the two passes.
> >>>> + * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> >>>> + * srcu_readers_active_idx() see a decrement of the active counter
> >>>> + * in srcu_read_unlock(), it should see one of these for corresponding
> >>>> + * srcu_read_lock():
> >>>> + * See the increment of the active counter,
> >>>> + * Failed to see the increment of the active counter.
> >>>> + * The second one can cause srcu_readers_active_idx() incorrectly
> >>>> + * return zero, but it means the above srcu_readers_seq_idx() does not
> >>>> + * see the increment of the seq(ref: comments of smp_mb() A),
> >>>> + * and the following srcu_readers_seq_idx() sees the increment of
> >>>> + * the seq. The seq is changed.
> >>>> *
> >>>> - * Therefore, if srcu_readers_active_idx() returned zero, and
> >>>> - * none of the counters changed, we know that the zero was the
> >>>> - * correct sum.
> >>>> - *
> >>>> - * Of course, it is possible that a task might be delayed
> >>>> - * for a very long time in __srcu_read_lock() after fetching
> >>>> - * the index but before incrementing its counter. This
> >>>> - * possibility will be dealt with in __synchronize_srcu().
> >>>> + * This smp_mb() D pairs with smp_mb() C for critical section.
> >>>> + * then any of the current task's subsequent code will happen after
> >>>> + * that SRCU read-side critical section whose read-unlock is seen in
> >>>> + * srcu_readers_active_idx().
> >>>> */
> >>>> - for_each_possible_cpu(cpu)
> >>>> - if (sp->snap[cpu] !=
> >>>> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> >>>> - return false; /* False zero reading! */
> >>>> - return true;
> >>>> + smp_mb(); /* D */
> >>>> +
> >>>> + return srcu_readers_seq_idx(sp, idx) == seq;
> >>>> }
> >>>>
> >>>> /**
> >>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >>>> preempt_disable();
> >>>> idx = rcu_dereference_index_check(sp->completed,
> >>>> rcu_read_lock_sched_held()) & 0x1;
> >>>> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> >>>> - SRCU_USAGE_COUNT + 1;
> >>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> >>>> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> >>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> >>>> preempt_enable();
> >>>> return idx;
> >>>> }
> >>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>> int trycount = 0;
> >>>>
> >>>> /*
> >>>> - * If a reader fetches the index before the ->completed increment,
> >>>> - * but increments its counter after srcu_readers_active_idx_check()
> >>>> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> >>>> - * smp_mb() B to ensure that the SRCU read-side critical section
> >>>> - * will see any updates that the current task performed before its
> >>>> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> >>>> - * as the case may be.
> >>>> - */
> >>>> - smp_mb(); /* D */
> >>>> -
> >>>> - /*
> >>>> * SRCU read-side critical sections are normally short, so wait
> >>>> * a small amount of time before possibly blocking.
> >>>> */
> >>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>> schedule_timeout_interruptible(1);
> >>>> }
> >>>> }
> >>>> -
> >>>> - /*
> >>>> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> >>>> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> >>>> - * sees srcu_read_unlock()'s counter decrement, then any
> >>>> - * of the current task's subsequent code will happen after
> >>>> - * that SRCU read-side critical section.
> >>>> - *
> >>>> - * It also ensures the order between the above waiting and
> >>>> - * the next flipping.
> >>>> - */
> >>>> - smp_mb(); /* E */
> >>>> }
> >>>>
> >>>> static void srcu_flip(struct srcu_struct *sp)
> >>>> --
> >>>> 1.7.4.4
> >>>>
> >>>
> >>>
> >>
> >
> >
>

2012-03-01 02:26:46

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On 02/29/2012 09:55 PM, Paul E. McKenney wrote:
> On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
>> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
>>> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>>>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>>>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>>>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>>>>>> From: Lai Jiangshan <[email protected]>
>>>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>>>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>>>>>
>>>>>> This patch implement the algorithm as Peter's:
>>>>>> https://lkml.org/lkml/2012/2/1/119
>>>>>>
>>>>>> o Make the checking lock-free and we can perform parallel checking,
>>>>>> Although almost parallel checking makes no sense, but we need it
>>>>>> when 1) the original checking task is preempted for long, 2)
>>>>>> sychronize_srcu_expedited(), 3) avoid lock(see next)
>>>>>>
>>>>>> o Since it is lock-free, we save a mutex in state machine for
>>>>>> call_srcu().
>>>>>>
>>>>>> o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>>>>>> (so we can remove the preempt_disable() in future, but use
>>>>>> __this_cpu_inc() instead.)
>>>>>>
>>>>>> o reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>>>>>> more intuitive.
>>>>>
>>>>> Hello, Lai,
>>>>>
>>>>> Interesting approach!
>>>>>
>>>>> What happens given the following sequence of events?
>>>>>
>>>>> o CPU 0 in srcu_readers_active_idx_check() invokes
>>>>> srcu_readers_seq_idx(), getting some number back.
>>>>>
>>>>> o CPU 0 invokes srcu_readers_active_idx(), summing the
>>>>> ->c[] array up through CPU 3.
>>>>>
>>>>> o CPU 1 invokes __srcu_read_lock(), and increments its counter
>>>>> but not yet its ->seq[] element.
>>>>
>>>>
>>>> Any __srcu_read_lock() whose increment of active counter is not seen
>>>> by srcu_readers_active_idx() is considerred as
>>>> "reader-started-after-this-srcu_readers_active_idx_check()",
>>>> We don't need to wait.
>>>>
>>>> As you said, this srcu C.S 's increment seq is not seen by above
>>>> srcu_readers_seq_idx().
>>>>
>>>>>
>>>>> o CPU 0 completes its summing of the ->c[] array, incorrectly
>>>>> obtaining zero.
>>>>>
>>>>> o CPU 0 invokes srcu_readers_seq_idx(), getting the same
>>>>> number back that it got last time.
>>>>
>>>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>>>> in srcu_readers_active_idx(), and it means the increment of
>>>> seq is seen in this srcu_readers_seq_idx(), it is different
>>>> from the above seq that it got last time.
>>>>
>>>> increment of seq is not seen by above srcu_readers_seq_idx(),
>>>> but is seen by later one, so the two returned seq is different,
>>>> this is the core of Peter's algorithm, and this was written
>>>> in the comments(Sorry for my bad English). Or maybe I miss
>>>> your means in this mail.
>>>
>>> OK, good, this analysis agrees with what I was thinking.
>>>
>>> So my next question is about the lock freedom. This lock freedom has to
>>> be limited in nature and carefully implemented. The reasons for this are:
>>>
>>> 1. Readers can block in any case, which can of course block both
>>> synchronize_srcu_expedited() and synchronize_srcu().
>>>
>>> 2. Because only one CPU at a time can be incrementing ->completed,
>>> some sort of lock with preemption disabling will of course be
>>> needed. Alternatively, an rt_mutex could be used for its
>>> priority-inheritance properties.
>>>
>>> 3. Once some CPU has incremented ->completed, all CPUs that might
>>> still be summing up the old indexes must stop. If they don't,
>>> they might incorrectly call a too-short grace period in case of
>>> ->seq[]-sum overflow on 32-bit systems.
>>>
>>> Or did you have something else in mind?
>>
>> When a flip happens while check_zero is in progress, that check_zero
>> will not be committed even if it succeeds.
>
> But if the CPU in check_zero isn't blocking the grace period, then
> ->completed could overflow while that CPU was preempted. Then how
> would this CPU know that the flip had happened?

As you said, it would check ->completed,
but with ->completed prevented from overflowing.

There is a spinlock for the srcu_struct (it also covers the flipping):

1) assume we need to wait on widx
2) use srcu_read_lock() to hold a reference on the 1-widx active counter
3) release the spinlock
4) do_check_zero
5) re-acquire the spinlock
6) srcu_read_unlock()
7) if ->completed has not changed, and no later check_zero has been
committed ahead of ours, commit our check_zero if it succeeded
(roughly as sketched below)
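
Roughly, in code, with all helper names (do_check_zero,
check_zero_already_committed, commit_check_zero) and sp->lock made up
just to illustrate the sequence:

static bool try_check_zero(struct srcu_struct *sp, int widx)
{
        unsigned snap;
        bool zero;
        int ridx;

        spin_lock(&sp->lock);                   /* 1) decide to wait on widx */
        snap = sp->completed;
        ridx = __srcu_read_lock(sp);            /* 2) pin the 1-widx rank so
                                                 *    ->completed cannot wrap */
        spin_unlock(&sp->lock);                 /* 3) */

        zero = do_check_zero(sp, widx);         /* 4) the lock-free summing */

        spin_lock(&sp->lock);                   /* 5) */
        __srcu_read_unlock(sp, ridx);           /* 6) */
        if (zero && sp->completed == snap &&    /* 7) no flip in the meantime */
            !check_zero_already_committed(sp, widx))  /* and nobody beat us */
                commit_check_zero(sp, widx);
        spin_unlock(&sp->lock);

        return zero;
}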

too complicated.

Thanks,
Lai

>
>> I played with lock-free too much for call_srcu() and the code became
>> complicated, so I am giving up on lock-free for call_srcu(); the main
>> aim for call_srcu() is simplicity.
>
> Makes sense to me!
>
>> (But I still like Peter's approach; it has other good properties
>> besides the lock-free checking. If you don't like it, I will send
>> another patch to fix srcu_readers_active().)
>
> Try them both and check their performance &c. If within epsilon of
> each other, pick whichever one you prefer.
>
> Thanx, Paul