2016-11-14 18:36:48

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu] SRCU rewrite

SRCU uses two per-cpu counters: a nesting counter to count the number of
active critical sections, and a sequence counter to ensure that the nesting
counters don't change while they are being added together in
srcu_readers_active_idx_check().

This patch instead uses per-cpu lock and unlock counters. Because both
counters only increase and srcu_readers_active_idx_check() reads the unlock
counter before the lock counter, this achieves the same end without having
to increment two different counters in srcu_read_lock(). This also saves an
smp_mb() in srcu_readers_active_idx_check().
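
To make the ordering argument concrete, the new check boils down to the
following sketch, condensed from the diff below (the names match the
patch; the commentary here is illustrative, not part of the submission):

static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
{
	unsigned long unlocks;

	/* Sum the per-CPU unlock counters for this index first... */
	unlocks = srcu_readers_unlock_idx(sp, idx);

	/*
	 * ...then order that sum before the lock-counter sum, pairing
	 * with smp_mb() B in __srcu_read_lock(): any reader whose unlock
	 * was counted above must have its lock counted below.
	 */
	smp_mb(); /* A */

	/* Equal sums imply a moment with no readers on this index. */
	return srcu_readers_lock_idx(sp, idx) == unlocks;
}

Because both counters only grow, summing the unlocks first means the
comparison can err only toward claiming that readers are still present,
which is the safe direction.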

A possible problem with this patch is that it can only handle
ULONG_MAX - NR_CPUS simultaneous readers, whereas the old version could
handle up to ULONG_MAX.

Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Lance Roy <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Queued for 4.12, that is, merge window after this coming one. ]

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index dc8eb63c6568..0caea34d8c5f 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -34,8 +34,8 @@
#include <linux/workqueue.h>

struct srcu_struct_array {
- unsigned long c[2];
- unsigned long seq[2];
+ unsigned long lock_count[2];
+ unsigned long unlock_count[2];
};

struct rcu_batch {
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 87c51225ceec..6e4fd7680c70 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -564,10 +564,24 @@ static void srcu_torture_stats(void)
pr_alert("%s%s per-CPU(idx=%d):",
torture_type, TORTURE_FLAG, idx);
for_each_possible_cpu(cpu) {
+ unsigned long l0, l1;
+ unsigned long u0, u1;
long c0, c1;
+ struct srcu_struct_array* counts =
+ per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);

- c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
- c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
+ u0 = counts->unlock_count[!idx];
+ u1 = counts->unlock_count[idx];
+
+ /* Make sure that a lock is always counted if the corresponding
+ unlock is counted. */
+ smp_rmb();
+
+ l0 = counts->lock_count[!idx];
+ l1 = counts->lock_count[idx];
+
+ c0 = (long)(l0 - u0);
+ c1 = (long)(l1 - u1);
pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
}
pr_cont("\n");
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 9b9cdd549caa..edfdfadec821 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */

/*
- * Returns approximate total of the readers' ->seq[] values for the
+ * Returns approximate total of the readers' ->lock_count[] values for the
* rank of per-CPU counters specified by idx.
*/
-static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
+static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
{
int cpu;
unsigned long sum = 0;
unsigned long t;

for_each_possible_cpu(cpu) {
- t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
+ struct srcu_struct_array* cpu_counts =
+ per_cpu_ptr(sp->per_cpu_ref, cpu);
+ t = READ_ONCE(cpu_counts->lock_count[idx]);
sum += t;
}
return sum;
}

/*
- * Returns approximate number of readers active on the specified rank
- * of the per-CPU ->c[] counters.
+ * Returns approximate total of the readers' ->unlock_count[] values for the
+ * rank of per-CPU counters specified by idx.
*/
-static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
{
int cpu;
unsigned long sum = 0;
unsigned long t;

for_each_possible_cpu(cpu) {
- t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
+ struct srcu_struct_array* cpu_counts =
+ per_cpu_ptr(sp->per_cpu_ref, cpu);
+ t = READ_ONCE(cpu_counts->unlock_count[idx]);
sum += t;
}
return sum;
@@ -176,79 +180,43 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)

/*
* Return true if the number of pre-existing readers is determined to
- * be stably zero. An example unstable zero can occur if the call
- * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
- * but due to task migration, sees the corresponding __srcu_read_unlock()
- * decrement. This can happen because srcu_readers_active_idx() takes
- * time to sum the array, and might in fact be interrupted or preempted
- * partway through the summation.
+ * be zero.
*/
static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
{
- unsigned long seq;
+ unsigned long unlocks;

- seq = srcu_readers_seq_idx(sp, idx);
+ unlocks = srcu_readers_unlock_idx(sp, idx);

/*
- * The following smp_mb() A pairs with the smp_mb() B located in
- * __srcu_read_lock(). This pairing ensures that if an
- * __srcu_read_lock() increments its counter after the summation
- * in srcu_readers_active_idx(), then the corresponding SRCU read-side
- * critical section will see any changes made prior to the start
- * of the current SRCU grace period.
+ * Make sure that a lock is always counted if the corresponding unlock
+ * is counted. Needs to be a smp_mb() as the read side may contain a
+ * read from a variable that is written to before the synchronize_srcu()
+ * in the write side. In this case smp_mb()s A and B act like the store
+ * buffering pattern.
*
- * Also, if the above call to srcu_readers_seq_idx() saw the
- * increment of ->seq[], then the call to srcu_readers_active_idx()
- * must see the increment of ->c[].
+ * This smp_mb() also pairs with smp_mb() C to prevent writes after the
+ * synchronize_srcu() from being executed before the grace period ends.
*/
smp_mb(); /* A */

/*
- * Note that srcu_readers_active_idx() can incorrectly return
- * zero even though there is a pre-existing reader throughout.
- * To see this, suppose that task A is in a very long SRCU
- * read-side critical section that started on CPU 0, and that
- * no other reader exists, so that the sum of the counters
- * is equal to one. Then suppose that task B starts executing
- * srcu_readers_active_idx(), summing up to CPU 1, and then that
- * task C starts reading on CPU 0, so that its increment is not
- * summed, but finishes reading on CPU 2, so that its decrement
- * -is- summed. Then when task B completes its sum, it will
- * incorrectly get zero, despite the fact that task A has been
- * in its SRCU read-side critical section the whole time.
- *
- * We therefore do a validation step should srcu_readers_active_idx()
- * return zero.
- */
- if (srcu_readers_active_idx(sp, idx) != 0)
- return false;
-
- /*
- * The remainder of this function is the validation step.
- * The following smp_mb() D pairs with the smp_mb() C in
- * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
- * by srcu_readers_active_idx() above, then any destructive
- * operation performed after the grace period will happen after
- * the corresponding SRCU read-side critical section.
+ * If the locks are the same as the unlocks, then there must have
+ * been no readers on this index at some time in between. This does not
+ * mean that there are no more readers, as one could have read the
+ * current index but not yet have incremented the lock counter.
*
- * Note that there can be at most NR_CPUS worth of readers using
- * the old index, which is not enough to overflow even a 32-bit
- * integer. (Yes, this does mean that systems having more than
- * a billion or so CPUs need to be 64-bit systems.) Therefore,
- * the sum of the ->seq[] counters cannot possibly overflow.
- * Therefore, the only way that the return values of the two
- * calls to srcu_readers_seq_idx() can be equal is if there were
- * no increments of the corresponding rank of ->seq[] counts
- * in the interim. But the missed-increment scenario laid out
- * above includes an increment of the ->seq[] counter by
- * the corresponding __srcu_read_lock(). Therefore, if this
- * scenario occurs, the return values from the two calls to
- * srcu_readers_seq_idx() will differ, and thus the validation
- * step below suffices.
+ * Note that there can be at most NR_CPUS worth of readers using the old
+ * index that haven't incremented ->lock_count[] yet. Therefore, the
+ * sum of the ->lock_count[]s cannot increment enough times to overflow
+ * and end up equal to the sum of the ->unlock_count[]s, as long as there
+ * are at most ULONG_MAX - NR_CPUS readers at a time. (Yes, this does
+ * mean that systems having more than a billion or so CPUs need to be
+ * 64-bit systems.) Therefore, the only way that the return values of
+ * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
+ * are no active readers using this index.
*/
- smp_mb(); /* D */
-
- return srcu_readers_seq_idx(sp, idx) == seq;
+ return srcu_readers_lock_idx(sp, idx) == unlocks;
}

/**
@@ -266,8 +234,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
unsigned long sum = 0;

for_each_possible_cpu(cpu) {
- sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
- sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
+ struct srcu_struct_array* cpu_counts =
+ per_cpu_ptr(sp->per_cpu_ref, cpu);
+ sum += READ_ONCE(cpu_counts->lock_count[0]);
+ sum += READ_ONCE(cpu_counts->lock_count[1]);
+ sum -= READ_ONCE(cpu_counts->unlock_count[0]);
+ sum -= READ_ONCE(cpu_counts->unlock_count[1]);
}
return sum;
}
@@ -298,9 +270,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
int idx;

idx = READ_ONCE(sp->completed) & 0x1;
- __this_cpu_inc(sp->per_cpu_ref->c[idx]);
+ __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
smp_mb(); /* B */ /* Avoid leaking the critical section. */
- __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
return idx;
}
EXPORT_SYMBOL_GPL(__srcu_read_lock);
@@ -314,7 +285,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
smp_mb(); /* C */ /* Avoid leaking the critical section. */
- this_cpu_dec(sp->per_cpu_ref->c[idx]);
+ this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);

@@ -349,7 +320,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)

/*
* Increment the ->completed counter so that future SRCU readers will
- * use the other rank of the ->c[] and ->seq[] arrays. This allows
+ * use the other rank of the ->(un)lock_count[] arrays. This allows
* us to wait for pre-existing readers in a starvation-free manner.
*/
static void srcu_flip(struct srcu_struct *sp)


2016-11-14 18:59:56

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

----- On Nov 14, 2016, at 1:36 PM, Paul E. McKenney [email protected] wrote:

> SRCU uses two per-cpu counters: a nesting counter to count the number of
> active critical sections, and a sequence counter to ensure that the nesting
> counters don't change while they are being added together in
> srcu_readers_active_idx_check().
>
> This patch instead uses per-cpu lock and unlock counters. Because both
> counters only increase and srcu_readers_active_idx_check() reads the unlock
> counter before the lock counter, this achieves the same end without having
> to increment two different counters in srcu_read_lock(). This also saves an
> smp_mb() in srcu_readers_active_idx_check().
>
> A possible problem with this patch is that it can only handle
> ULONG_MAX - NR_CPUS simultaneous readers, whereas the old version could
> handle up to ULONG_MAX.

I think for the above we ended up agreeing that the old version did have
limitations similar to the new one? I would have expected the sentence
above to be removed from the changelog.

Thanks,

Mathieu

>
> Suggested-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Lance Roy <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> [ paulmck: Queued for 4.12, that is, merge window after this coming one. ]
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index dc8eb63c6568..0caea34d8c5f 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -34,8 +34,8 @@
> #include <linux/workqueue.h>
>
> struct srcu_struct_array {
> - unsigned long c[2];
> - unsigned long seq[2];
> + unsigned long lock_count[2];
> + unsigned long unlock_count[2];
> };
>
> struct rcu_batch {
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 87c51225ceec..6e4fd7680c70 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -564,10 +564,24 @@ static void srcu_torture_stats(void)
> pr_alert("%s%s per-CPU(idx=%d):",
> torture_type, TORTURE_FLAG, idx);
> for_each_possible_cpu(cpu) {
> + unsigned long l0, l1;
> + unsigned long u0, u1;
> long c0, c1;
> + struct srcu_struct_array* counts =
> + per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);
>
> - c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
> - c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
> + u0 = counts->unlock_count[!idx];
> + u1 = counts->unlock_count[idx];
> +
> + /* Make sure that a lock is always counted if the corresponding
> + unlock is counted. */
> + smp_rmb();
> +
> + l0 = counts->lock_count[!idx];
> + l1 = counts->lock_count[idx];
> +
> + c0 = (long)(l0 - u0);
> + c1 = (long)(l1 - u1);
> pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
> }
> pr_cont("\n");
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 9b9cdd549caa..edfdfadec821 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> /*
> - * Returns approximate total of the readers' ->seq[] values for the
> + * Returns approximate total of the readers' ->lock_count[] values for the
> * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->lock_count[idx]);
> sum += t;
> }
> return sum;
> }
>
> /*
> - * Returns approximate number of readers active on the specified rank
> - * of the per-CPU ->c[] counters.
> + * Returns approximate total of the readers' ->unlock_count[] values for the
> + * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->unlock_count[idx]);
> sum += t;
> }
> return sum;
> @@ -176,79 +180,43 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>
> /*
> * Return true if the number of pre-existing readers is determined to
> - * be stably zero. An example unstable zero can occur if the call
> - * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
> - * but due to task migration, sees the corresponding __srcu_read_unlock()
> - * decrement. This can happen because srcu_readers_active_idx() takes
> - * time to sum the array, and might in fact be interrupted or preempted
> - * partway through the summation.
> + * be zero.
> */
> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> {
> - unsigned long seq;
> + unsigned long unlocks;
>
> - seq = srcu_readers_seq_idx(sp, idx);
> + unlocks = srcu_readers_unlock_idx(sp, idx);
>
> /*
> - * The following smp_mb() A pairs with the smp_mb() B located in
> - * __srcu_read_lock(). This pairing ensures that if an
> - * __srcu_read_lock() increments its counter after the summation
> - * in srcu_readers_active_idx(), then the corresponding SRCU read-side
> - * critical section will see any changes made prior to the start
> - * of the current SRCU grace period.
> + * Make sure that a lock is always counted if the corresponding unlock
> + * is counted. Needs to be a smp_mb() as the read side may contain a
> + * read from a variable that is written to before the synchronize_srcu()
> + * in the write side. In this case smp_mb()s A and B act like the store
> + * buffering pattern.
> *
> - * Also, if the above call to srcu_readers_seq_idx() saw the
> - * increment of ->seq[], then the call to srcu_readers_active_idx()
> - * must see the increment of ->c[].
> + * This smp_mb() also pairs with smp_mb() C to prevent writes after the
> + * synchronize_srcu() from being executed before the grace period ends.
> */
> smp_mb(); /* A */
>
> /*
> - * Note that srcu_readers_active_idx() can incorrectly return
> - * zero even though there is a pre-existing reader throughout.
> - * To see this, suppose that task A is in a very long SRCU
> - * read-side critical section that started on CPU 0, and that
> - * no other reader exists, so that the sum of the counters
> - * is equal to one. Then suppose that task B starts executing
> - * srcu_readers_active_idx(), summing up to CPU 1, and then that
> - * task C starts reading on CPU 0, so that its increment is not
> - * summed, but finishes reading on CPU 2, so that its decrement
> - * -is- summed. Then when task B completes its sum, it will
> - * incorrectly get zero, despite the fact that task A has been
> - * in its SRCU read-side critical section the whole time.
> - *
> - * We therefore do a validation step should srcu_readers_active_idx()
> - * return zero.
> - */
> - if (srcu_readers_active_idx(sp, idx) != 0)
> - return false;
> -
> - /*
> - * The remainder of this function is the validation step.
> - * The following smp_mb() D pairs with the smp_mb() C in
> - * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
> - * by srcu_readers_active_idx() above, then any destructive
> - * operation performed after the grace period will happen after
> - * the corresponding SRCU read-side critical section.
> + * If the locks are the same as the unlocks, then there must have
> + * been no readers on this index at some time in between. This does not
> + * mean that there are no more readers, as one could have read the
> + * current index but not yet have incremented the lock counter.
> *
> - * Note that there can be at most NR_CPUS worth of readers using
> - * the old index, which is not enough to overflow even a 32-bit
> - * integer. (Yes, this does mean that systems having more than
> - * a billion or so CPUs need to be 64-bit systems.) Therefore,
> - * the sum of the ->seq[] counters cannot possibly overflow.
> - * Therefore, the only way that the return values of the two
> - * calls to srcu_readers_seq_idx() can be equal is if there were
> - * no increments of the corresponding rank of ->seq[] counts
> - * in the interim. But the missed-increment scenario laid out
> - * above includes an increment of the ->seq[] counter by
> - * the corresponding __srcu_read_lock(). Therefore, if this
> - * scenario occurs, the return values from the two calls to
> - * srcu_readers_seq_idx() will differ, and thus the validation
> - * step below suffices.
> + * Note that there can be at most NR_CPUS worth of readers using the old
> + * index that haven't incremented ->lock_count[] yet. Therefore, the
> + * sum of the ->lock_count[]s cannot increment enough times to overflow
> + * and end up equal to the sum of the ->unlock_count[]s, as long as there
> + * are at most ULONG_MAX - NR_CPUS readers at a time. (Yes, this does
> + * mean that systems having more than a billion or so CPUs need to be
> + * 64-bit systems.) Therefore, the only way that the return values of
> + * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
> + * are no active readers using this index.
> */
> - smp_mb(); /* D */
> -
> - return srcu_readers_seq_idx(sp, idx) == seq;
> + return srcu_readers_lock_idx(sp, idx) == unlocks;
> }
>
> /**
> @@ -266,8 +234,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
> unsigned long sum = 0;
>
> for_each_possible_cpu(cpu) {
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + sum += READ_ONCE(cpu_counts->lock_count[0]);
> + sum += READ_ONCE(cpu_counts->lock_count[1]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[0]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[1]);
> }
> return sum;
> }
> @@ -298,9 +270,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> - __this_cpu_inc(sp->per_cpu_ref->c[idx]);
> + __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> - __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
> return idx;
> }
> EXPORT_SYMBOL_GPL(__srcu_read_lock);
> @@ -314,7 +285,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> smp_mb(); /* C */ /* Avoid leaking the critical section. */
> - this_cpu_dec(sp->per_cpu_ref->c[idx]);
> + this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
> }
> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>
> @@ -349,7 +320,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
>
> /*
> * Increment the ->completed counter so that future SRCU readers will
> - * use the other rank of the ->c[] and ->seq[] arrays. This allows
> + * use the other rank of the ->(un)lock_count[] arrays. This allows
> * us to wait for pre-existing readers in a starvation-free manner.
> */
> static void srcu_flip(struct srcu_struct *sp)

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2016-11-15 01:44:54

by Boqun Feng

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
> SRCU uses two per-cpu counters: a nesting counter to count the number of
> active critical sections, and a sequence counter to ensure that the nesting
> counters don't change while they are being added together in
> srcu_readers_active_idx_check().
>
> This patch instead uses per-cpu lock and unlock counters. Because both
> counters only increase and srcu_readers_active_idx_check() reads the unlock
> counter before the lock counter, this achieves the same end without having
> to increment two different counters in srcu_read_lock(). This also saves an
> smp_mb() in srcu_readers_active_idx_check().
>
> A possible problem with this patch is that it can only handle
> ULONG_MAX - NR_CPUS simultaneous readers, whereas the old version could
> handle up to ULONG_MAX.
>
> Suggested-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Lance Roy <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> [ paulmck: Queued for 4.12, that is, merge window after this coming one. ]
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index dc8eb63c6568..0caea34d8c5f 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -34,8 +34,8 @@
> #include <linux/workqueue.h>
>
> struct srcu_struct_array {
> - unsigned long c[2];
> - unsigned long seq[2];
> + unsigned long lock_count[2];
> + unsigned long unlock_count[2];
> };
>
> struct rcu_batch {
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 87c51225ceec..6e4fd7680c70 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -564,10 +564,24 @@ static void srcu_torture_stats(void)
> pr_alert("%s%s per-CPU(idx=%d):",
> torture_type, TORTURE_FLAG, idx);
> for_each_possible_cpu(cpu) {
> + unsigned long l0, l1;
> + unsigned long u0, u1;
> long c0, c1;
> + struct srcu_struct_array* counts =
> + per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);
>
> - c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
> - c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
> + u0 = counts->unlock_count[!idx];
> + u1 = counts->unlock_count[idx];
> +
> + /* Make sure that a lock is always counted if the corresponding
> + unlock is counted. */
> + smp_rmb();
> +
> + l0 = counts->lock_count[!idx];
> + l1 = counts->lock_count[idx];
> +
> + c0 = (long)(l0 - u0);
> + c1 = (long)(l1 - u1);
> pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
> }
> pr_cont("\n");
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 9b9cdd549caa..edfdfadec821 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> /*
> - * Returns approximate total of the readers' ->seq[] values for the
> + * Returns approximate total of the readers' ->lock_count[] values for the
> * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->lock_count[idx]);
> sum += t;
> }
> return sum;
> }
>
> /*
> - * Returns approximate number of readers active on the specified rank
> - * of the per-CPU ->c[] counters.
> + * Returns approximate total of the readers' ->unlock_count[] values for the
> + * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->unlock_count[idx]);
> sum += t;
> }
> return sum;
> @@ -176,79 +180,43 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>
> /*
> * Return true if the number of pre-existing readers is determined to
> - * be stably zero. An example unstable zero can occur if the call
> - * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
> - * but due to task migration, sees the corresponding __srcu_read_unlock()
> - * decrement. This can happen because srcu_readers_active_idx() takes
> - * time to sum the array, and might in fact be interrupted or preempted
> - * partway through the summation.
> + * be zero.
> */
> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> {
> - unsigned long seq;
> + unsigned long unlocks;
>
> - seq = srcu_readers_seq_idx(sp, idx);
> + unlocks = srcu_readers_unlock_idx(sp, idx);
>
> /*
> - * The following smp_mb() A pairs with the smp_mb() B located in
> - * __srcu_read_lock(). This pairing ensures that if an
> - * __srcu_read_lock() increments its counter after the summation
> - * in srcu_readers_active_idx(), then the corresponding SRCU read-side
> - * critical section will see any changes made prior to the start
> - * of the current SRCU grace period.
> + * Make sure that a lock is always counted if the corresponding unlock
> + * is counted. Needs to be a smp_mb() as the read side may contain a
> + * read from a variable that is written to before the synchronize_srcu()
> + * in the write side. In this case smp_mb()s A and B act like the store
> + * buffering pattern.
> *
> - * Also, if the above call to srcu_readers_seq_idx() saw the
> - * increment of ->seq[], then the call to srcu_readers_active_idx()
> - * must see the increment of ->c[].
> + * This smp_mb() also pairs with smp_mb() C to prevent writes after the
> + * synchronize_srcu() from being executed before the grace period ends.
> */
> smp_mb(); /* A */
>
> /*
> - * Note that srcu_readers_active_idx() can incorrectly return
> - * zero even though there is a pre-existing reader throughout.
> - * To see this, suppose that task A is in a very long SRCU
> - * read-side critical section that started on CPU 0, and that
> - * no other reader exists, so that the sum of the counters
> - * is equal to one. Then suppose that task B starts executing
> - * srcu_readers_active_idx(), summing up to CPU 1, and then that
> - * task C starts reading on CPU 0, so that its increment is not
> - * summed, but finishes reading on CPU 2, so that its decrement
> - * -is- summed. Then when task B completes its sum, it will
> - * incorrectly get zero, despite the fact that task A has been
> - * in its SRCU read-side critical section the whole time.
> - *
> - * We therefore do a validation step should srcu_readers_active_idx()
> - * return zero.
> - */
> - if (srcu_readers_active_idx(sp, idx) != 0)
> - return false;
> -
> - /*
> - * The remainder of this function is the validation step.
> - * The following smp_mb() D pairs with the smp_mb() C in
> - * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
> - * by srcu_readers_active_idx() above, then any destructive
> - * operation performed after the grace period will happen after
> - * the corresponding SRCU read-side critical section.
> > + * If the locks are the same as the unlocks, then there must have
> > + * been no readers on this index at some time in between. This does not
> > + * mean that there are no more readers, as one could have read the
> > + * current index but not yet have incremented the lock counter.
> *
> - * Note that there can be at most NR_CPUS worth of readers using
> - * the old index, which is not enough to overflow even a 32-bit
> - * integer. (Yes, this does mean that systems having more than
> - * a billion or so CPUs need to be 64-bit systems.) Therefore,
> - * the sum of the ->seq[] counters cannot possibly overflow.
> - * Therefore, the only way that the return values of the two
> - * calls to srcu_readers_seq_idx() can be equal is if there were
> - * no increments of the corresponding rank of ->seq[] counts
> - * in the interim. But the missed-increment scenario laid out
> - * above includes an increment of the ->seq[] counter by
> - * the corresponding __srcu_read_lock(). Therefore, if this
> - * scenario occurs, the return values from the two calls to
> - * srcu_readers_seq_idx() will differ, and thus the validation
> - * step below suffices.
> + * Note that there can be at most NR_CPUS worth of readers using the old
> + * index that haven't incremented ->lock_count[] yet. Therefore, the
> + * sum of the ->lock_count[]s cannot increment enough times to overflow
> > + * and end up equal to the sum of the ->unlock_count[]s, as long as there
> + * are at most ULONG_MAX - NR_CPUS readers at a time. (Yes, this does
> + * mean that systems having more than a billion or so CPUs need to be
> + * 64-bit systems.) Therefore, the only way that the return values of
> + * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
> + * are no active readers using this index.
> */
> - smp_mb(); /* D */
> -
> - return srcu_readers_seq_idx(sp, idx) == seq;
> + return srcu_readers_lock_idx(sp, idx) == unlocks;
> }
>
> /**
> @@ -266,8 +234,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
> unsigned long sum = 0;
>
> for_each_possible_cpu(cpu) {
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + sum += READ_ONCE(cpu_counts->lock_count[0]);
> + sum += READ_ONCE(cpu_counts->lock_count[1]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[0]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[1]);
> }
> return sum;
> }
> @@ -298,9 +270,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> - __this_cpu_inc(sp->per_cpu_ref->c[idx]);
> + __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> - __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
> return idx;

__srcu_read_lock() used to be called with preemption disabled. I guess
the reason was that we had two percpu variables to increase. So with
only one percpu variable to increase now, could we remove the
preempt_{dis,en}able() in srcu_read_lock() and use this_cpu_inc() here?
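
In other words, the suggestion amounts to something like the following
sketch (illustrative only; it assumes the preempt_disable()/preempt_enable()
pair around __srcu_read_lock() is dropped at the call site):

int __srcu_read_lock(struct srcu_struct *sp)
{
	int idx;

	idx = READ_ONCE(sp->completed) & 0x1;
	/*
	 * Unlike __this_cpu_inc(), this_cpu_inc() is atomic against
	 * preemption and interrupts, so the caller no longer needs to
	 * disable preemption around the index fetch and the increment.
	 */
	this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
	smp_mb(); /* B */  /* Avoid leaking the critical section. */
	return idx;
}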

Regards,
Boqun

> }
> EXPORT_SYMBOL_GPL(__srcu_read_lock);
> @@ -314,7 +285,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> smp_mb(); /* C */ /* Avoid leaking the critical section. */
> - this_cpu_dec(sp->per_cpu_ref->c[idx]);
> + this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
> }
> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>
> @@ -349,7 +320,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
>
> /*
> * Increment the ->completed counter so that future SRCU readers will
> - * use the other rank of the ->c[] and ->seq[] arrays. This allows
> + * use the other rank of the ->(un)lock_count[] arrays. This allows
> * us to wait for pre-existing readers in a starvation-free manner.
> */
> static void srcu_flip(struct srcu_struct *sp)
>



2016-11-15 07:51:20

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
> SRCU uses two per-cpu counters: a nesting counter to count the number of
> active critical sections, and a sequence counter to ensure that the nesting
> counters don't change while they are being added together in
> srcu_readers_active_idx_check().
>
> This patch instead uses per-cpu lock and unlock counters. Because both
> counters only increase and srcu_readers_active_idx_check() reads the unlock
> counter before the lock counter, this achieves the same end without having
> to increment two different counters in srcu_read_lock(). This also saves an
> smp_mb() in srcu_readers_active_idx_check().

A very small improvement... I feel SRCU has much bigger issues :/

2016-11-15 13:54:05

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

----- On Nov 15, 2016, at 2:51 AM, Peter Zijlstra [email protected] wrote:

> On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
>> SRCU uses two per-cpu counters: a nesting counter to count the number of
>> active critical sections, and a sequence counter to ensure that the nesting
>> counters don't change while they are being added together in
>> srcu_readers_active_idx_check().
>>
>> This patch instead uses per-cpu lock and unlock counters. Because both
>> counters only increase and srcu_readers_active_idx_check() reads the unlock
>> counter before the lock counter, this achieves the same end without having
>> to increment two different counters in srcu_read_lock(). This also saves an
>> smp_mb() in srcu_readers_active_idx_check().
>
> A very small improvement... I feel SRCU has much bigger issues :/

Do you have specific issues in mind?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2016-11-15 13:59:22

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 01:54:50PM +0000, Mathieu Desnoyers wrote:
> ----- On Nov 15, 2016, at 2:51 AM, Peter Zijlstra [email protected] wrote:
>
> > On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
> >> SRCU uses two per-cpu counters: a nesting counter to count the number of
> >> active critical sections, and a sequence counter to ensure that the nesting
> >> counters don't change while they are being added together in
> >> srcu_readers_active_idx_check().
> >>
> >> This patch instead uses per-cpu lock and unlock counters. Because both
> >> counters only increase and srcu_readers_active_idx_check() reads the unlock
> >> counter before the lock counter, this achieves the same end without having
> >> to increment two different counters in srcu_read_lock(). This also saves an
> >> smp_mb() in srcu_readers_active_idx_check().
> >
> > A very small improvement... I feel SRCU has much bigger issues :/
>
> Do you have specific issues in mind?

The smp_mb()s in read_{un,}lock() and the lock in call_srcu() come to
mind.

2016-11-15 14:26:34

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 02:59:12PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 15, 2016 at 01:54:50PM +0000, Mathieu Desnoyers wrote:
> > ----- On Nov 15, 2016, at 2:51 AM, Peter Zijlstra [email protected] wrote:
> >
> > > On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
> > >> SRCU uses two per-cpu counters: a nesting counter to count the number of
> > >> active critical sections, and a sequence counter to ensure that the nesting
> > >> counters don't change while they are being added together in
> > >> srcu_readers_active_idx_check().
> > >>
> > >> This patch instead uses per-cpu lock and unlock counters. Because both
> > >> counters only increase and srcu_readers_active_idx_check() reads the unlock
> > >> counter before the lock counter, this achieves the same end without having
> > >> to increment two different counters in srcu_read_lock(). This also saves an
> > >> smp_mb() in srcu_readers_active_idx_check().
> > >
> > > A very small improvement... I feel SRCU has much bigger issues :/
> >
> > Do you have specific issues in mind?
>
> The smp_mb()s in read_{un,}lock() and the lock in call_srcu() come to
> mind.

There is some possibility of weakening the srcu_read_unlock() ordering,
but one step at a time.

Has the lock in call_srcu() been causing trouble? Easy to fix if so,
but as you noted in another email today, we don't need complexity for
complexity's sake. And no reports of problems with this have reached
me thus far.

Thanx, Paul

2016-11-15 14:37:10

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> On Mon, Nov 14, 2016 at 10:36:36AM -0800, Paul E. McKenney wrote:
> > SRCU uses two per-cpu counters: a nesting counter to count the number of
> > active critical sections, and a sequence counter to ensure that the nesting
> > counters don't change while they are being added together in
> > srcu_readers_active_idx_check().
> >
> > This patch instead uses per-cpu lock and unlock counters. Because both
> > counters only increase and srcu_readers_active_idx_check() reads the unlock
> > counter before the lock counter, this achieves the same end without having
> > to increment two different counters in srcu_read_lock(). This also saves an
> > smp_mb() in srcu_readers_active_idx_check().
> >
> > A possible problem with this patch is that it can only handle
> > ULONG_MAX - NR_CPUS simultaneous readers, whereas the old version could
> > handle up to ULONG_MAX.
> >
> > Suggested-by: Mathieu Desnoyers <[email protected]>
> > Signed-off-by: Lance Roy <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > [ paulmck: Queued for 4.12, that is, merge window after this coming one. ]
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index dc8eb63c6568..0caea34d8c5f 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -34,8 +34,8 @@
> > #include <linux/workqueue.h>
> >
> > struct srcu_struct_array {
> > - unsigned long c[2];
> > - unsigned long seq[2];
> > + unsigned long lock_count[2];
> > + unsigned long unlock_count[2];
> > };
> >
> > struct rcu_batch {
> > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > index 87c51225ceec..6e4fd7680c70 100644
> > --- a/kernel/rcu/rcutorture.c
> > +++ b/kernel/rcu/rcutorture.c
> > @@ -564,10 +564,24 @@ static void srcu_torture_stats(void)
> > pr_alert("%s%s per-CPU(idx=%d):",
> > torture_type, TORTURE_FLAG, idx);
> > for_each_possible_cpu(cpu) {
> > + unsigned long l0, l1;
> > + unsigned long u0, u1;
> > long c0, c1;
> > + struct srcu_struct_array* counts =
> > + per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);
> >
> > - c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
> > - c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
> > + u0 = counts->unlock_count[!idx];
> > + u1 = counts->unlock_count[idx];
> > +
> > + /* Make sure that a lock is always counted if the corresponding
> > + unlock is counted. */
> > + smp_rmb();
> > +
> > + l0 = counts->lock_count[!idx];
> > + l1 = counts->lock_count[idx];
> > +
> > + c0 = (long)(l0 - u0);
> > + c1 = (long)(l1 - u1);
> > pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
> > }
> > pr_cont("\n");
> > diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> > index 9b9cdd549caa..edfdfadec821 100644
> > --- a/kernel/rcu/srcu.c
> > +++ b/kernel/rcu/srcu.c
> > @@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> > #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >
> > /*
> > - * Returns approximate total of the readers' ->seq[] values for the
> > + * Returns approximate total of the readers' ->lock_count[] values for the
> > * rank of per-CPU counters specified by idx.
> > */
> > -static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> > +static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
> > {
> > int cpu;
> > unsigned long sum = 0;
> > unsigned long t;
> >
> > for_each_possible_cpu(cpu) {
> > - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> > + struct srcu_struct_array* cpu_counts =
> > + per_cpu_ptr(sp->per_cpu_ref, cpu);
> > + t = READ_ONCE(cpu_counts->lock_count[idx]);
> > sum += t;
> > }
> > return sum;
> > }
> >
> > /*
> > - * Returns approximate number of readers active on the specified rank
> > - * of the per-CPU ->c[] counters.
> > + * Returns approximate total of the readers' ->unlock_count[] values for the
> > + * rank of per-CPU counters specified by idx.
> > */
> > -static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> > +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
> > {
> > int cpu;
> > unsigned long sum = 0;
> > unsigned long t;
> >
> > for_each_possible_cpu(cpu) {
> > - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> > + struct srcu_struct_array* cpu_counts =
> > + per_cpu_ptr(sp->per_cpu_ref, cpu);
> > + t = READ_ONCE(cpu_counts->unlock_count[idx]);
> > sum += t;
> > }
> > return sum;
> > @@ -176,79 +180,43 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >
> > /*
> > * Return true if the number of pre-existing readers is determined to
> > - * be stably zero. An example unstable zero can occur if the call
> > - * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
> > - * but due to task migration, sees the corresponding __srcu_read_unlock()
> > - * decrement. This can happen because srcu_readers_active_idx() takes
> > - * time to sum the array, and might in fact be interrupted or preempted
> > - * partway through the summation.
> > + * be zero.
> > */
> > static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > {
> > - unsigned long seq;
> > + unsigned long unlocks;
> >
> > - seq = srcu_readers_seq_idx(sp, idx);
> > + unlocks = srcu_readers_unlock_idx(sp, idx);
> >
> > /*
> > - * The following smp_mb() A pairs with the smp_mb() B located in
> > - * __srcu_read_lock(). This pairing ensures that if an
> > - * __srcu_read_lock() increments its counter after the summation
> > - * in srcu_readers_active_idx(), then the corresponding SRCU read-side
> > - * critical section will see any changes made prior to the start
> > - * of the current SRCU grace period.
> > + * Make sure that a lock is always counted if the corresponding unlock
> > + * is counted. Needs to be a smp_mb() as the read side may contain a
> > + * read from a variable that is written to before the synchronize_srcu()
> > + * in the write side. In this case smp_mb()s A and B act like the store
> > + * buffering pattern.
> > *
> > - * Also, if the above call to srcu_readers_seq_idx() saw the
> > - * increment of ->seq[], then the call to srcu_readers_active_idx()
> > - * must see the increment of ->c[].
> > + * This smp_mb() also pairs with smp_mb() C to prevent writes after the
> > + * synchronize_srcu() from being executed before the grace period ends.
> > */
> > smp_mb(); /* A */
> >
> > /*
> > - * Note that srcu_readers_active_idx() can incorrectly return
> > - * zero even though there is a pre-existing reader throughout.
> > - * To see this, suppose that task A is in a very long SRCU
> > - * read-side critical section that started on CPU 0, and that
> > - * no other reader exists, so that the sum of the counters
> > - * is equal to one. Then suppose that task B starts executing
> > - * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > - * task C starts reading on CPU 0, so that its increment is not
> > - * summed, but finishes reading on CPU 2, so that its decrement
> > - * -is- summed. Then when task B completes its sum, it will
> > - * incorrectly get zero, despite the fact that task A has been
> > - * in its SRCU read-side critical section the whole time.
> > - *
> > - * We therefore do a validation step should srcu_readers_active_idx()
> > - * return zero.
> > - */
> > - if (srcu_readers_active_idx(sp, idx) != 0)
> > - return false;
> > -
> > - /*
> > - * The remainder of this function is the validation step.
> > - * The following smp_mb() D pairs with the smp_mb() C in
> > - * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
> > - * by srcu_readers_active_idx() above, then any destructive
> > - * operation performed after the grace period will happen after
> > - * the corresponding SRCU read-side critical section.
> > > + * If the locks are the same as the unlocks, then there must have
> > > + * been no readers on this index at some time in between. This does not
> > > + * mean that there are no more readers, as one could have read the
> > > + * current index but not yet have incremented the lock counter.
> > *
> > - * Note that there can be at most NR_CPUS worth of readers using
> > - * the old index, which is not enough to overflow even a 32-bit
> > - * integer. (Yes, this does mean that systems having more than
> > - * a billion or so CPUs need to be 64-bit systems.) Therefore,
> > - * the sum of the ->seq[] counters cannot possibly overflow.
> > - * Therefore, the only way that the return values of the two
> > - * calls to srcu_readers_seq_idx() can be equal is if there were
> > - * no increments of the corresponding rank of ->seq[] counts
> > - * in the interim. But the missed-increment scenario laid out
> > - * above includes an increment of the ->seq[] counter by
> > - * the corresponding __srcu_read_lock(). Therefore, if this
> > - * scenario occurs, the return values from the two calls to
> > - * srcu_readers_seq_idx() will differ, and thus the validation
> > - * step below suffices.
> > + * Note that there can be at most NR_CPUS worth of readers using the old
> > + * index that haven't incremented ->lock_count[] yet. Therefore, the
> > + * sum of the ->lock_count[]s cannot increment enough times to overflow
> > > + * and end up equal to the sum of the ->unlock_count[]s, as long as there
> > + * are at most ULONG_MAX - NR_CPUS readers at a time. (Yes, this does
> > + * mean that systems having more than a billion or so CPUs need to be
> > + * 64-bit systems.) Therefore, the only way that the return values of
> > + * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
> > + * are no active readers using this index.
> > */
> > - smp_mb(); /* D */
> > -
> > - return srcu_readers_seq_idx(sp, idx) == seq;
> > + return srcu_readers_lock_idx(sp, idx) == unlocks;
> > }
> >
> > /**
> > @@ -266,8 +234,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
> > unsigned long sum = 0;
> >
> > for_each_possible_cpu(cpu) {
> > - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
> > - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
> > + struct srcu_struct_array* cpu_counts =
> > + per_cpu_ptr(sp->per_cpu_ref, cpu);
> > + sum += READ_ONCE(cpu_counts->lock_count[0]);
> > + sum += READ_ONCE(cpu_counts->lock_count[1]);
> > + sum -= READ_ONCE(cpu_counts->unlock_count[0]);
> > + sum -= READ_ONCE(cpu_counts->unlock_count[1]);
> > }
> > return sum;
> > }
> > @@ -298,9 +270,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > int idx;
> >
> > idx = READ_ONCE(sp->completed) & 0x1;
> > - __this_cpu_inc(sp->per_cpu_ref->c[idx]);
> > + __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> > smp_mb(); /* B */ /* Avoid leaking the critical section. */
> > - __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
> > return idx;
>
> __srcu_read_lock() used to be called with preemption disabled. I guess
> the reason was that we had two percpu variables to increase. So with
> only one percpu variable to increase now, could we remove the
> preempt_{dis,en}able() in srcu_read_lock() and use this_cpu_inc() here?

Quite possibly...

Thanx, Paul

> Regards,
> Boqun
>
> > }
> > EXPORT_SYMBOL_GPL(__srcu_read_lock);
> > @@ -314,7 +285,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> > void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> > {
> > smp_mb(); /* C */ /* Avoid leaking the critical section. */
> > - this_cpu_dec(sp->per_cpu_ref->c[idx]);
> > + this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
> > }
> > EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >
> > @@ -349,7 +320,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
> >
> > /*
> > * Increment the ->completed counter so that future SRCU readers will
> > - * use the other rank of the ->c[] and ->seq[] arrays. This allows
> > + * use the other rank of the ->(un)lock_count[] arrays. This allows
> > * us to wait for pre-existing readers in a starvation-free manner.
> > */
> > static void srcu_flip(struct srcu_struct *sp)
> >


2016-11-15 14:55:55

by Peter Zijlstra

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 06:26:27AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 15, 2016 at 02:59:12PM +0100, Peter Zijlstra wrote:

> > The smp_mb()s in read_{un,}lock() and the lock in call_srcu() come to
> > mind.
>
> There is some possibility of weakening the srcu_read_unlock() ordering,
> but one step at a time.
>
> Has the lock in call_srcu() been causing trouble? Easy to fix if so,
> but as you noted in another email today, we don't need complexity for
> complexity's sake. And no reports of problems with this have reached
> me thus far.

It was a cause for concern in the optimistic fault series, but since
that never got sorted, it hasn't shown up in practice afaik.

2016-11-15 15:43:34

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 03:55:36PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 15, 2016 at 06:26:27AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 15, 2016 at 02:59:12PM +0100, Peter Zijlstra wrote:
>
> > > The smp_mb()s in read_{un,}lock() and the lock in call_srcu() come to
> > > mind.
> >
> > There is some possibility of weakening the srcu_read_unlock() ordering,
> > but one step at a time.
> >
> > Has the lock in call_srcu() been causing trouble? Easy to fix if so,
> > but as you noted in another email today, we don't need complexity for
> > complexity's sake. And no reports of problems with this have reached
> > me thus far.
>
> It was a cause for concern in the optimistic fault series, but since
> that never got sorted, it hasn't shown up in practise afaik.

Sounds like it is not too early for me to be thinking about fixes,
probably similar to the approach used for expedited grace periods.

And thank you for letting me know!

Thanx, Paul

2016-11-17 14:31:17

by Boqun Feng

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>
> >>
> >> __srcu_read_lock() used to be called with preemption disabled. I guess
> >> the reason was that we had two percpu variables to increase. So with
> >> only one percpu variable to increase now, could we remove the
> >> preempt_{dis,en}able() in srcu_read_lock() and use this_cpu_inc() here?
> >
> > Quite possibly...
> >
>

Hello, Lai ;-)

> It will be nicer if it is removed.
>
> The reason for disabling preemption was also that we have to
> disallow any preemption between the fetching of the idx and the
> increment, so that we have at most NR_CPUS worth of readers using
> the old index that haven't incremented the counters.
>

After reading the comment for a while, I actually have a question; maybe
I am missing something ;-)

Why would "at most NR_CPUS worth of readers using the old index haven't
incremented the counters" save us from overflowing the counter?

Please consider the following case in the current implementation:


{sp->completed = 0} so idx = 1 in srcu_advance_batches(...)

One thread A is currently in __srcu_read_lock(), using idx = 1 and
about to increase the percpu c[idx], while ULONG_MAX __srcu_read_lock()s
have already been called and returned with idx = 1. Please note I think
this is possible because I assume we may have some code like this:

unsigned long i = 0;
for (; i < ULONG_MAX; i++)
	srcu_read_lock(sp); /* returns the same idx, 1 */

And none of the corresponding srcu_read_unlock()s has been called.

In this case, at the time thread A increases the percpu c[idx], that
will result in an overflow, right? So even one reader using the old idx
will result in overflow.
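
A standalone user-space sketch of the wrap being described (hypothetical
illustration, not kernel code; locks/unlocks stand in for the summed
per-cpu counters):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned long locks = 0, unlocks = 0;

	locks += ULONG_MAX;	/* ULONG_MAX readers entered, none exited */
	locks += 1;		/* thread A's increment wraps the sum to 0 */

	/* The zero-readers check now falsely passes. */
	printf("check passes: %d\n", locks == unlocks);	/* prints 1 */
	return 0;
}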


I think the reason we won't be hit by overflow is not that we have few
readers using the old idx; it's that it is unlikely that ULONG_MAX + 1
__srcu_read_lock() calls happen for the same idx, right? And the reason
for this is more complex: we won't have a huge number of threads in
the system, no thread will nest srcu many levels deep, and there
won't be a lot of readers using the old idx.

And this will still be true if we use the new mechanism and shrink the
preemption-disabled section, right?

Regards,
Boqun

> If we remove the preempt_{dis,en}able(), we must change the
> "NR_CPUS" in the comment into ULONG_MAX/4 (I assume one
> on-going reader needs at least 4 bytes on the stack). It is still safe.
>
> But we still need to think more if we want to remove the preempt_{dis,en}able().
>
> Thanks
> Lai



2016-11-17 17:08:40

by Mathieu Desnoyers

Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

----- On Nov 17, 2016, at 10:53 AM, Paul E. McKenney [email protected] wrote:

> On Thu, Nov 17, 2016 at 03:38:08PM +0000, Mathieu Desnoyers wrote:
>> ----- On Nov 17, 2016, at 10:31 AM, Mathieu Desnoyers
>> [email protected] wrote:
>>
>> > ----- On Nov 17, 2016, at 10:07 AM, Lai Jiangshan [email protected] wrote:
>> >
>> >> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
>> >>> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>> >>>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>> >>>> <[email protected]> wrote:
>> >>>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>> >>>>
>> >>>> >>
>> >>>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
>> >>>> >> the reason was that we had two percpu variables to increase. So with
>> >>>> >> only one percpu variable to increase now, could we remove the
>> >>>> >> preempt_{dis,en}able() in srcu_read_lock() and use this_cpu_inc() here?
>> >>>> >
>> >>>> > Quite possibly...
>> >>>> >
>> >>>>
>> >>>
>> >>> Hello, Lai ;-)
>> >>>
>> >>>> It will be nicer if it is removed.
>> >>>>
>> >>>> The reason for disabling preemption was also that we have to
>> >>>> disallow any preemption between the fetching of the idx and the
>> >>>> increment, so that we have at most NR_CPUS worth of readers using
>> >>>> the old index that haven't incremented the counters.
>> >>>>
>> >>>
>> >>> After reading the comment for a while, I actually have a question; maybe
>> >>> I am missing something ;-)
>> >>>
>> >>> Why would "at most NR_CPUS worth of readers using the old index haven't
>> >>> incremented the counters" save us from overflowing the counter?
>> >>>
>> >>> Please consider the following case in the current implementation:
>> >>>
>> >>>
>> >>> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>> >>>
>> >>> One thread A is currently in __srcu_read_lock(), using idx = 1 and
>> >>> about to increase the percpu c[idx], while ULONG_MAX __srcu_read_lock()s
>> >>> have already been called and returned with idx = 1. Please note I think
>> >>> this is possible because I assume we may have some code like this:
>> >>>
>> >>> unsigned long i = 0;
>> >>> for (; i < ULONG_MAX; i++)
>> >>> 	srcu_read_lock(sp); /* returns the same idx, 1 */
>> >>
>> >> this is the wrong usage of the api.
>> >>
>> >>
>> >> you might rewrite it as:
>> >>
>> >> unsigned long index[2] = {0, 0};
>> >> unsigned long i = 0;
>> >> for (; index[1] < ULONG_MAX; i++)
>> >> index[srcu_read_lock()]++;
>> >>
>> >>
>> >> I think we should add documentation to disallow this kind of usage;
>> >> a reader should eat at least 4 bytes of memory.
>> >>
>> >
>> > (the analysis below refers to the rewritten SRCU scheme)
>> >
>> > Let's presume we use the API correctly, as you describe (saving
>> > the returned index of srcu_read_lock() somewhere).
>> >
>> > So for the sake of argument, we can either call srcu_read_lock
>> > in a loop (during which we can be migrated), or call it
>> > concurrently from various threads. The effect in terms of
>> > overflow is pretty much the same.
>> >
>> > What is done here is incrementing per-cpu split-counters. In
>> > the worst-case scenario, let's assume we're incrementing those
>> > counters for a single index (0 or 1).
>> >
>> > If we think about this system-wide, we don't really care about
>> > the overflow of a single CPU counter, because what matters is
>> > the difference between the overall nr_lock - nr_unlock counts
>> > for a given index, once summed up by synchronize_srcu().
>> >
>> > So the only situation that could lead to an overflow that matters
>> > is if synchronize_srcu() sees ULONG_MAX more increments of nr_lock
>> > than the observed number of nr_unlock increments.
>> >
>> > So the bound is not only about the number of concurrent SRCU
>> > readers, but also about the number of SRCU readers that may
>> > appear between the moment synchronize_srcu() reads the nr_unlock
>> > per-cpu counters and the moment it reads the nr_lock counters.
>> >
>> > This maximum bound of ULONG_MAX - 1 therefore applies to the
>> > sum of:
>> > - number of concurrent SRCU read-side critical sections active
>> > at the same time,
>> > - number of SRCU read-side critical sections beginning after
>> > synchronize_srcu() has read the nr_unlock counters, before
>> > it reads the nr_lock counters.
>>
>> Now that I think of it, since we flip the period before summing
>> the nr_unlock counter, we cannot have any incoming readers appearing
>> within the target period while we execute synchronize_srcu().
>> So it ends up being a limit on the number of concurrent SRCU
>> read-side c.s. active at the same time. (you can scratch the
>> second bullet above).
>
> We can have NR_CPUS worth of them -- those that have fetched the
> index, but not yet incremented their counter.

Ah, yes, due to preemption being disabled across those operations.
More precisely, it would be NR_CPUS - 1, because synchronize_srcu()
needs to run somewhere ;-)
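
To picture the window being counted (an informal sketch using the
patch's names; this is not the patch itself):

        CPU k (reader)                         CPU j (updater)
        --------------                         ---------------
        preempt_disable();
        idx = READ_ONCE(sp->completed) & 0x1;
                                               srcu_flip(): sp->completed++;
        __this_cpu_inc(...lock_count[idx]);    /* k still uses the old idx */
        preempt_enable();

With preemption disabled across the fetch and the increment, each CPU
can host at most one task inside that window at the instant of the
flip, and the CPU running synchronize_srcu() hosts none.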

>
> But if the updater fails to see their counter increment, then
> their next srcu_read_lock() is guaranteed to see the new index.

Got it, thanks!

Mathieu

>
> Thanx, Paul
>
>> Thanks,
>>
>> Mathieu
>>
>>
>>
>> > You guys seem to see cases that would require a lower max nr
>> > reader bound, but I'm afraid I don't quite understand them.
>> >
>> > Thanks,
>> >
>> > Mathieu
>> >
>> >
>> >>>
>> >>> And none of the corresponding srcu_read_unlock() has been called;
>> >>>
>> >>> In this case, at the time thread A increases the percpu c[idx], that
>> >>> will result in an overflow, right? So even one reader using old idx will
>> >>> result in overflow.
>> >>>
>> >>>
>> >>> I think the reason we won't be hit by overflow is not that we have few
>> >>> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
>> >>> __srcu_read_lock()s are called for the same idx, right? And the reason for
>> >>> this is more complex: because we won't have a fair amount of threads in
>> >>> the system, because no thread will nest srcu many levels, because there
>> >>> won't be a lot of readers using the old idx.
>> >>>
>> >>> And this will still be true if we use new mechanism and shrink the
>> >>> preemption disabled section, right?
>> >>>
>> >>> Regards,
>> >>> Boqun
>> >>>
>> >>>> if we remove the preempt_{dis,en}able(), we must change the
>> >>>> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>> >>>> one on-going reader needs at least 4 bytes on the stack). it is still safe.
>> >>>>
>> >>>> but we still need to think more if we want to remove the preempt_{dis,en}able().
>> >>>>
>> >>>> Thanks
>> >>>> Lai
>> >
>> > --
>> > Mathieu Desnoyers
>> > EfficiOS Inc.
>> > http://www.efficios.com
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2016-11-17 17:15:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 10:31:00PM +0800, Boqun Feng wrote:
> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> >
> > >>
> > >> __srcu_read_lock() used to be called with preemption disabled. I guess
> > >> the reason was because we have two percpu variables to increase. So with
> > >> only one percpu right, could we remove the preempt_{dis,en}able() in
> > >> srcu_read_lock() and use this_cpu_inc() here?
> > >
> > > Quite possibly...
> > >
> >
>
> Hello, Lai ;-)
>
> > it will be nicer if it is removed.
> >
> > The reason for the preemption-disabled was also because we
> > have to disallow any preemption between the fetching of the idx
> > and the increment, so that we have at most NR_CPUS worth
> > of readers using the old index that haven't incremented the counters.
> >
>
> After reading the comment for a while, I actually got a question, maybe
> I miss something ;-)
>
> Why "at most NR_CPUS worth of readers using the old index haven't
> incremented the counters" could save us from overflow the counter?
>
> Please consider the following case in current implementation:
>
>
> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>
> one thread A is currently in __srcu_read_lock() and using idx = 1 and
> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
> have been called and returned with idx = 1, please note I think this is
> possible because I assume we may have some code like this:
>
> unsigned long i = 0;
> for (; i < ULONG_MAX; i++)
> srcu_read_lock(); // return the same idx 1;

First, please don't do this. For any number of reasons! ;-)

Second, the theory is that if the updater fails to see the update from
one of the srcu_read_lock() calls in the loop, then the reader must see
the new index on the next pass through the loop. Which would be one of
the problems with the above loop -- it cannot be guaranteed that they
all will return the same index.
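
An informal sketch of that guarantee (my reading; B is the smp_mb() in
__srcu_read_lock(), A the one in the updater's check):

        reader                                 updater
        ------                                 -------
        idx = completed & 0x1;  /* old */      completed++;  (flip)
        lock_count[idx]++;                     sum unlock_count[old];
        smp_mb(); /* B */                      smp_mb(); /* A */
        ...                                    sum lock_count[old];
        idx = completed & 0x1;  /* next pass */

If the updater's summation misses the reader's lock_count increment,
the B/A pairing forces that reader's next load of ->completed to see
the flipped value, so it cannot keep accumulating on the old index.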

> And none of the corresponding srcu_read_unlock() has been called;
>
> In this case, at the time thread A increases the percpu c[idx], that
> will result in an overflow, right? So even one reader using old idx will
> result in overflow.

It is quite possible that the NR_CPUS bound is too tight, but the memory
barriers do prevent readers from seeing the old index beyond a certain
point.

> I think the reason we won't be hit by overflow is not that we have few
> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
> __srcu_read_lock()s are called for the same idx, right? And the reason for
> this is more complex: because we won't have a fair amount of threads in
> the system, because no thread will nest srcu many levels, because there
> won't be a lot of readers using the old idx.
>
> And this will still be true if we use new mechanism and shrink the
> preemption disabled section, right?

Well, the analysis needs to be revisited, for sure. ;-)

Thanx, Paul

> Regards,
> Boqun
>
> > if we remove the preempt_{dis,en}able(), we must change the
> > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> > one on-going reader needs at least 4 bytes on the stack). it is still safe.
> >
> > but we still need to think more if we want to remove the preempt_{dis,en}able().
> >
> > Thanks
> > Lai


2016-11-17 17:14:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 05:49:57AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> >
> > >>
> > >> __srcu_read_lock() used to be called with preemption disabled. I guess
> > >> the reason was because we have two percpu variables to increase. So with
> > >> only one percpu right, could we remove the preempt_{dis,en}able() in
> > >> srcu_read_lock() and use this_cpu_inc() here?
> > >
> > > Quite possibly...
> > >
> >
> > it will be nicer if it is removed.
> >
> > The reason for the preemption-disabled was also because we
> > have to disallow any preemption between the fetching of the idx
> > and the increment, so that we have at most NR_CPUS worth
> > of readers using the old index that haven't incremented the counters.
> >
> > if we remove the preempt_{dis,en}able(), we must change the
> > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> > one on-going reader needs at least 4 bytes on the stack). it is still safe.
> >
> > but we still need to think more if we want to remove the preempt_{dis,en}able().
>
> Good points! Agreed, any change in the preemption needs careful thought
> and needs to be a separate patch.

And one area needing special thought is the call to __srcu_read_lock()
and __srcu_read_unlock() in do_exit().

Thanx, Paul

2016-11-17 17:25:02

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>>
>> >>
>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
>> >> the reason was because we have two percpu variables to increase. So with
>> >> only one percpu right, could we remove the preempt_{dis,en}able() in
>> >> srcu_read_lock() and use this_cpu_inc() here?
>> >
>> > Quite possibly...
>> >
>>
>
> Hello, Lai ;-)
>
>> it will be nicer if it is removed.
>>
>> The reason for the preemption-disabled was also because we
>> have to disallow any preemption between the fetching of the idx
>> and the increment, so that we have at most NR_CPUS worth
>> of readers using the old index that haven't incremented the counters.
>>
>
> After reading the comment for a while, I actually got a question, maybe
> I miss something ;-)
>
> Why "at most NR_CPUS worth of readers using the old index haven't
> incremented the counters" could save us from overflow the counter?
>
> Please consider the following case in current implementation:
>
>
> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>
> one thread A is currently in __srcu_read_lock() and using idx = 1 and
> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
> have been called and returned with idx = 1, please note I think this is
> possible because I assume we may have some code like this:
>
> unsigned long i = 0;
> for (; i < ULONG_MAX; i++)
> srcu_read_lock(); // return the same idx 1;

this is the wrong usage of the api.


you might rewrite it as:

unsigned long index[2] = {0, 0};
unsigned long i = 0;
for (; index[1] < ULONG_MAX; i++)
        index[srcu_read_lock(sp)]++;


I think we should add documentation to disallow this kind of usage;
a reader should eat at least 4 bytes of memory.
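
A minimal sketch of the usage being described (hypothetical caller-side
code; sp is assumed to be an initialized srcu_struct and MAX_NESTING a
caller-chosen bound):

        int idx[MAX_NESTING];   /* each live reader keeps its idx in memory */
        int level = 0;

        idx[level++] = srcu_read_lock(sp);
        /* ... SRCU read-side critical section, possibly nesting further ... */
        srcu_read_unlock(sp, idx[--level]);

Since every reader still inside a critical section must keep its
returned index alive somewhere, the number of simultaneous readers is
bounded by available memory at (at least) 4 bytes per reader.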

>
> And none of the corresponding srcu_read_unlock() has been called;
>
> In this case, at the time thread A increases the percpu c[idx], that
> will result in an overflow, right? So even one reader using old idx will
> result in overflow.
>
>
> I think the reason we won't be hit by overflow is not that we have few
> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
> __srcu_read_lock()s are called for the same idx, right? And the reason for
> this is more complex: because we won't have a fair amount of threads in
> the system, because no thread will nest srcu many levels, because there
> won't be a lot of readers using the old idx.
>
> And this will still be true if we use new mechanism and shrink the
> preemption disabled section, right?
>
> Regards,
> Boqun
>
>> if we remove the preempt_{dis,en}able(), we must change the
>> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>> one on-going reader needs at least 4 bytes on the stack). it is still safe.
>>
>> but we still need to think more if we want to remove the preempt_{dis,en}able().
>>
>> Thanks
>> Lai

2016-11-17 17:42:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 11:55:07PM +0800, Lai Jiangshan wrote:
> On Thu, Nov 17, 2016 at 10:45 PM, Boqun Feng <[email protected]> wrote:
> > On Thu, Nov 17, 2016 at 06:38:29AM -0800, Paul E. McKenney wrote:
> >> On Thu, Nov 17, 2016 at 05:49:57AM -0800, Paul E. McKenney wrote:
> >> > On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> >> > > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> >> > > <[email protected]> wrote:
> >> > > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> >> > >
> >> > > >>
> >> > > >> __srcu_read_lock() used to be called with preemption disabled. I guess
> >> > > >> the reason was because we have two percpu variables to increase. So with
> >> > > >> only one percpu right, could we remove the preempt_{dis,en}able() in
> >> > > >> srcu_read_lock() and use this_cpu_inc() here?
> >> > > >
> >> > > > Quite possibly...
> >> > > >
> >> > >
> >> > > it will be nicer if it is removed.
> >> > >
> >> > > The reason for the preemption-disabled was also because we
> >> > > have to disallow any preemption between the fetching of the idx
> >> > > and the increment, so that we have at most NR_CPUS worth
> >> > > of readers using the old index that haven't incremented the counters.
> >> > >
> >> > > if we remove the preempt_{dis,en}able(), we must change the
> >> > > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> >> > > one on-going reader needs at least 4 bytes on the stack). it is still safe.
> >> > >
> >> > > but we still need to think more if we want to remove the preempt_{dis,en}able().
> >> >
> >> > Good points! Agreed, any change in the preemption needs careful thought
> >> > and needs to be a separate patch.
> >>
> >> And one area needing special thought is the call to __srcu_read_lock()
> >> and __srcu_read_unlock() in do_exit().
> >>
> >
> > So before commit 49f5903b473c5, we don't have the read of ->completed in
> > preemption disable section?
> >
> > And following "git blame", I found commit 7a6b55e7108b3 ;-)
>
> Ouch, it shows 7a6b55e7108b3 at least has a bug in the comments about NR_CPUS.
>
> we should focus on the total number of all active readers instead of the
> number of readers using the old index that haven't incremented the counters.
> the latter is smaller than the former, which is smaller than ULONG_MAX/4
> or even smaller. so we can simplify the comments.
>
> + * Note that the sum of the ->lock_count[]s cannot increment enough
> + * times to overflow and end up equal to the sum of the ->unlock_count[]s,
> + * even with readers using the old index that haven't incremented
> + * ->lock_count[] yet, as long as there are at most ULONG_MAX/4
> + * readers at a time. Therefore, the only way that the return values of
> + * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
> + * are no active readers using this index.

I would welcome a patch making the limitations simpler and more accurate.
That is, once we work out exactly what those limitations are. ;-)

Thanx, Paul

2016-11-17 17:52:58

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 2:36 AM, Paul E. McKenney
<[email protected]> wrote:
> SRCU uses two per-cpu counters: a nesting counter to count the number of
> active critical sections, and a sequence counter to ensure that the nesting
> counters don't change while they are being added together in
> srcu_readers_active_idx_check().
>
> This patch instead uses per-cpu lock and unlock counters. Because the both
> counters only increase and srcu_readers_active_idx_check() reads the unlock
> counter before the lock counter, this achieves the same end without having
> to increment two different counters in srcu_read_lock(). This also saves a
> smp_mb() in srcu_readers_active_idx_check().
>
> A possible problem with this patch is that it can only handle
> ULONG_MAX - NR_CPUS simultaneous readers, whereas the old version could
> handle up to ULONG_MAX.
>
> Suggested-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Lance Roy <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> [ paulmck: Queued for 4.12, that is, merge window after this coming one. ]
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index dc8eb63c6568..0caea34d8c5f 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -34,8 +34,8 @@
> #include <linux/workqueue.h>
>
> struct srcu_struct_array {
> - unsigned long c[2];
> - unsigned long seq[2];
> + unsigned long lock_count[2];
> + unsigned long unlock_count[2];
> };
>
> struct rcu_batch {
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 87c51225ceec..6e4fd7680c70 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -564,10 +564,24 @@ static void srcu_torture_stats(void)
> pr_alert("%s%s per-CPU(idx=%d):",
> torture_type, TORTURE_FLAG, idx);
> for_each_possible_cpu(cpu) {
> + unsigned long l0, l1;
> + unsigned long u0, u1;
> long c0, c1;
> + struct srcu_struct_array* counts =
> + per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);
>
> - c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
> - c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
> + u0 = counts->unlock_count[!idx];
> + u1 = counts->unlock_count[idx];
> +
> + /* Make sure that a lock is always counted if the corresponding
> + unlock is counted. */
> + smp_rmb();
> +
> + l0 = counts->lock_count[!idx];
> + l1 = counts->lock_count[idx];
> +
> + c0 = (long)(l0 - u0);
> + c1 = (long)(l1 - u1);
> pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
> }
> pr_cont("\n");
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 9b9cdd549caa..edfdfadec821 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> /*
> - * Returns approximate total of the readers' ->seq[] values for the
> + * Returns approximate total of the readers' ->lock_count[] values for the
> * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->lock_count[idx]);
> sum += t;
> }
> return sum;
> }
>
> /*
> - * Returns approximate number of readers active on the specified rank
> - * of the per-CPU ->c[] counters.
> + * Returns approximate total of the readers' ->unlock_count[] values for the
> + * rank of per-CPU counters specified by idx.
> */
> -static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
> {
> int cpu;
> unsigned long sum = 0;
> unsigned long t;
>
> for_each_possible_cpu(cpu) {
> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + t = READ_ONCE(cpu_counts->unlock_count[idx]);
> sum += t;
> }
> return sum;
> @@ -176,79 +180,43 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>
> /*
> * Return true if the number of pre-existing readers is determined to
> - * be stably zero. An example unstable zero can occur if the call
> - * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
> - * but due to task migration, sees the corresponding __srcu_read_unlock()
> - * decrement. This can happen because srcu_readers_active_idx() takes
> - * time to sum the array, and might in fact be interrupted or preempted
> - * partway through the summation.
> + * be zero.
> */
> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> {
> - unsigned long seq;
> + unsigned long unlocks;
>
> - seq = srcu_readers_seq_idx(sp, idx);
> + unlocks = srcu_readers_unlock_idx(sp, idx);
>
> /*
> - * The following smp_mb() A pairs with the smp_mb() B located in
> - * __srcu_read_lock(). This pairing ensures that if an
> - * __srcu_read_lock() increments its counter after the summation
> - * in srcu_readers_active_idx(), then the corresponding SRCU read-side
> - * critical section will see any changes made prior to the start
> - * of the current SRCU grace period.
> + * Make sure that a lock is always counted if the corresponding unlock
> + * is counted. Needs to be a smp_mb() as the read side may contain a
> + * read from a variable that is written to before the synchronize_srcu()
> + * in the write side. In this case smp_mb()s A and B act like the store
> + * buffering pattern.
> *
> - * Also, if the above call to srcu_readers_seq_idx() saw the
> - * increment of ->seq[], then the call to srcu_readers_active_idx()
> - * must see the increment of ->c[].
> + * This smp_mb() also pairs with smp_mb() C to prevent writes after the
> + * synchronize_srcu() from being executed before the grace period ends.
> */
> smp_mb(); /* A */
>
> /*
> - * Note that srcu_readers_active_idx() can incorrectly return
> - * zero even though there is a pre-existing reader throughout.
> - * To see this, suppose that task A is in a very long SRCU
> - * read-side critical section that started on CPU 0, and that
> - * no other reader exists, so that the sum of the counters
> - * is equal to one. Then suppose that task B starts executing
> - * srcu_readers_active_idx(), summing up to CPU 1, and then that
> - * task C starts reading on CPU 0, so that its increment is not
> - * summed, but finishes reading on CPU 2, so that its decrement
> - * -is- summed. Then when task B completes its sum, it will
> - * incorrectly get zero, despite the fact that task A has been
> - * in its SRCU read-side critical section the whole time.
> - *
> - * We therefore do a validation step should srcu_readers_active_idx()
> - * return zero.
> - */
> - if (srcu_readers_active_idx(sp, idx) != 0)
> - return false;
> -
> - /*
> - * The remainder of this function is the validation step.
> - * The following smp_mb() D pairs with the smp_mb() C in
> - * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
> - * by srcu_readers_active_idx() above, then any destructive
> - * operation performed after the grace period will happen after
> - * the corresponding SRCU read-side critical section.
> + * If the locks are the same as the unlocks, then there must have
> + * been no readers on this index at some time in between. This does not
> + * mean that there are no more readers, as one could have read the
> + * current index but not yet have incremented the lock counter.
> *
> - * Note that there can be at most NR_CPUS worth of readers using
> - * the old index, which is not enough to overflow even a 32-bit
> - * integer. (Yes, this does mean that systems having more than
> - * a billion or so CPUs need to be 64-bit systems.) Therefore,
> - * the sum of the ->seq[] counters cannot possibly overflow.
> - * Therefore, the only way that the return values of the two
> - * calls to srcu_readers_seq_idx() can be equal is if there were
> - * no increments of the corresponding rank of ->seq[] counts
> - * in the interim. But the missed-increment scenario laid out
> - * above includes an increment of the ->seq[] counter by
> - * the corresponding __srcu_read_lock(). Therefore, if this
> - * scenario occurs, the return values from the two calls to
> - * srcu_readers_seq_idx() will differ, and thus the validation
> - * step below suffices.
> + * Note that there can be at most NR_CPUS worth of readers using the old
> + * index that haven't incremented ->lock_count[] yet. Therefore, the
> + * sum of the ->lock_count[]s cannot increment enough times to overflow
> + * and end up equal to the sum of the ->unlock_count[]s, as long as there
> + * are at most ULONG_MAX - NR_CPUS readers at a time. (Yes, this does

from the changelog, it sounds like "ULONG_MAX - NR_CPUS" is the limit
of the implementations (old or this one). but actually the real max number of
active readers is much smaller; I think ULONG_MAX/4 can be used here instead
and that part of the changelog can be removed.


> + * mean that systems having more than a billion or so CPUs need to be
> + * 64-bit systems.) Therefore, the only way that the return values of
> + * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
> + * are no active readers using this index.
> */
> - smp_mb(); /* D */
> -
> - return srcu_readers_seq_idx(sp, idx) == seq;
> + return srcu_readers_lock_idx(sp, idx) == unlocks;
> }
>
> /**
> @@ -266,8 +234,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
> unsigned long sum = 0;
>
> for_each_possible_cpu(cpu) {
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
> + struct srcu_struct_array* cpu_counts =
> + per_cpu_ptr(sp->per_cpu_ref, cpu);
> + sum += READ_ONCE(cpu_counts->lock_count[0]);
> + sum += READ_ONCE(cpu_counts->lock_count[1]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[0]);
> + sum -= READ_ONCE(cpu_counts->unlock_count[1]);
> }
> return sum;
> }
> @@ -298,9 +270,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> - __this_cpu_inc(sp->per_cpu_ref->c[idx]);
> + __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> - __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
> return idx;
> }
> EXPORT_SYMBOL_GPL(__srcu_read_lock);
> @@ -314,7 +285,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
> smp_mb(); /* C */ /* Avoid leaking the critical section. */
> - this_cpu_dec(sp->per_cpu_ref->c[idx]);
> + this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
> }
> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>
> @@ -349,7 +320,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
>
> /*
> * Increment the ->completed counter so that future SRCU readers will
> - * use the other rank of the ->c[] and ->seq[] arrays. This allows
> + * use the other rank of the ->(un)lock_count[] arrays. This allows
> * us to wait for pre-existing readers in a starvation-free manner.
> */
> static void srcu_flip(struct srcu_struct *sp)
>

Acked-by: Lai Jiangshan <[email protected]>

2016-11-17 17:58:15

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

----- On Nov 17, 2016, at 10:07 AM, Lai Jiangshan [email protected] wrote:

> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
>> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>>> <[email protected]> wrote:
>>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>>>
>>> >>
>>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
>>> >> the reason was because we have two percpu variables to increase. So with
>>> >> only one percpu right, could we remove the preempt_{dis,en}able() in
>>> >> srcu_read_lock() and use this_cpu_inc() here?
>>> >
>>> > Quite possibly...
>>> >
>>>
>>
>> Hello, Lai ;-)
>>
>>> it will be nicer if it is removed.
>>>
>>> The reason for the preemption-disabled was also because we
>>> have to disallow any preemption between the fetching of the idx
>>> and the increment, so that we have at most NR_CPUS worth
>>> of readers using the old index that haven't incremented the counters.
>>>
>>
>> After reading the comment for a while, I actually got a question, maybe
>> I miss something ;-)
>>
>> Why "at most NR_CPUS worth of readers using the old index haven't
>> incremented the counters" could save us from overflow the counter?
>>
>> Please consider the following case in current implementation:
>>
>>
>> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>>
>> one thread A is currently in __srcu_read_lock() and using idx = 1 and
>> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
>> have been called and returned with idx = 1, please note I think this is
>> possible because I assume we may have some code like this:
>>
>> unsigned long i = 0;
>> for (; i < ULONG_MAX; i++)
>> srcu_read_lock(); // return the same idx 1;
>
> this is the wrong usage of the api.
>
>
> you might rewrite it as:
>
> unsigned long index[2] = {0, 0};
> unsigned long i = 0;
> for (; index[1] < ULONG_MAX; i++)
> index[srcu_read_lock()]++;
>
>
> I think we should add documentation to disallow this kind of usage;
> a reader should eat at least 4 bytes of memory.
>

(the analysis below refers to the rewritten SRCU scheme)

Let's presume we use the API correctly, as you describe (saving
the returned index of srcu_read_lock() somewhere).

So for the sake of argument, we can either call srcu_read_lock
in a loop (during which we can be migrated), or call it
concurrently from various threads. The effect in terms of
overflow is pretty much the same.

What is done here is incrementing per-cpu split-counters. In
the worst-case scenario, let's assume we're incrementing those
counters for a single index (0 or 1).

If we think about this system-wide, we don't really care about
the overflow of a single CPU counter, because what matters is
the difference between the overall nr_lock - nr_unlock counts
for a given index, once summed up by synchronize_srcu().

So the only situation that could lead to an overflow that matters
is if synchronize_srcu() sees ULONG_MAX more increments of nr_lock
than the observed number of nr_unlock increments.

So the bound is not only about the number of concurrent SRCU
readers, but also about the number of SRCU readers that may
appear between the moment synchronize_srcu() reads the nr_unlock
per-cpu counters and the moment it reads the nr_lock counters.

This maximum bound of ULONG_MAX - 1 therefore applies to the
sum of:
- number of concurrent SRCU read-side critical sections active
at the same time,
- number of SRCU read-side critical sections beginning after
synchronize_srcu() has read the nr_unlock counters, before
it reads the nr_lock counters.

You guys seem to see cases that would require a lower max nr
reader bound, but I'm afraid I don't quite understand them.

Thanks,

Mathieu


>>
>> And none of the corresponding srcu_read_unlock() has been called;
>>
>> In this case, at the time thread A increases the percpu c[idx], that
>> will result in an overflow, right? So even one reader using old idx will
>> result in overflow.
>>
>>
>> I think the reason we won't be hit by overflow is not that we have few
>> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
>> __srcu_read_lock()s are called for the same idx, right? And the reason for
>> this is more complex: because we won't have a fair amount of threads in
>> the system, because no thread will nest srcu many levels, because there
>> won't be a lot of readers using the old idx.
>>
>> And this will still be true if we use new mechanism and shrink the
>> preemption disabled section, right?
>>
>> Regards,
>> Boqun
>>
>>> if we remove the preempt_{dis,en}able(), we must change the
>>> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>>> one on-going reader needs at least 4 bytes on the stack). it is still safe.
>>>
>>> but we still need to think more if we want to remove the preempt_{dis,en}able().
>>>
>>> Thanks
>>> Lai

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2016-11-17 18:03:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

----- On Nov 17, 2016, at 10:31 AM, Mathieu Desnoyers [email protected] wrote:

> ----- On Nov 17, 2016, at 10:07 AM, Lai Jiangshan [email protected] wrote:
>
>> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
>>> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>>>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>>>> <[email protected]> wrote:
>>>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>>>>
>>>> >>
>>>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
>>>> >> the reason was because we have two percpu variables to increase. So with
>>>> >> only one percpu right, could we remove the preempt_{dis,en}able() in
>>>> >> srcu_read_lock() and use this_cpu_inc() here?
>>>> >
>>>> > Quite possibly...
>>>> >
>>>>
>>>
>>> Hello, Lai ;-)
>>>
>>>> it will be nicer if it is removed.
>>>>
>>>> The reason for the preemption-disabled was also because we
>>>> have to disallow any preemption between the fetching of the idx
>>>> and the increment, so that we have at most NR_CPUS worth
>>>> of readers using the old index that haven't incremented the counters.
>>>>
>>>
>>> After reading the comment for a while, I actually got a question, maybe
>>> I miss something ;-)
>>>
>>> Why "at most NR_CPUS worth of readers using the old index haven't
>>> incremented the counters" could save us from overflow the counter?
>>>
>>> Please consider the following case in current implementation:
>>>
>>>
>>> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>>>
>>> one thread A is currently in __srcu_read_lock() and using idx = 1 and
>>> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
>>> have been called and returned with idx = 1, please note I think this is
>>> possible because I assume we may have some code like this:
>>>
>>> unsigned long i = 0;
>>> for (; i < ULONG_MAX; i++)
>>> srcu_read_lock(); // return the same idx 1;
>>
>> this is the wrong usage of the api.
>>
>>
>> you might rewrite it as:
>>
>> unsigned long index[2] = {0, 0};
>> unsigned long i = 0;
>> for (; index[1] < ULONG_MAX; i++)
>> index[srcu_read_lock()]++;
>>
>>
>> I think we should add documentation to disallow this kind of usage;
>> a reader should eat at least 4 bytes of memory.
>>
>
> (the analysis below refers to the rewritten SRCU scheme)
>
> Let's presume we use the API correctly, as you describe (saving
> the returned index of srcu_read_lock() somewhere).
>
> So for the sake of argument, we can either call srcu_read_lock
> in a loop (during which we can be migrated), or call it
> concurrently from various threads. The effect in terms of
> overflow is pretty much the same.
>
> What is done here is incrementing per-cpu split-counters. In
> the worst-case scenario, let's assume we're incrementing those
> counters for a single index (0 or 1).
>
> If we think about this system-wide, we don't really care about
> the overflow of a single CPU counter, because what matters is
> the difference between the overall nr_lock - nr_unlock counts
> for a given index, once summed up by synchronize_srcu().
>
> So the only situation that could lead to an overflow that matters
> is if synchronize_srcu() sees ULONG_MAX more increments of nr_lock
> than the observed number of nr_unlock increments.
>
> So the bound is not only about the number of concurrent SRCU
> readers, but also about the number of SRCU readers that may
> appear between the moment synchronize_srcu() reads the nr_unlock
> per-cpu counters and the moment it reads the nr_lock counters.
>
> This maximum bound of ULONG_MAX - 1 therefore applies to the
> sum of:
> - number of concurrent SRCU read-side critical sections active
> at the same time,
> - number of SRCU read-side critical sections beginning after
> synchronize_srcu() has read the nr_unlock counters, before
> it reads the nr_lock counters.

Now that I think of it, since we flip the period before summing
the nr_unlock counter, we cannot have any incoming readers appearing
within the target period while we execute synchronize_srcu().
So it ends up being a limit on the number of concurrent SRCU
read-side c.s. active at the same time. (you can scratch the
second bullet above).

Thanks,

Mathieu



> You guys seem to see cases that would require a lower max nr
> reader bound, but I'm afraid I don't quite understand them.
>
> Thanks,
>
> Mathieu
>
>
>>>
>>> And none of the corresponding srcu_read_unlock() has been called;
>>>
>>> In this case, at the time thread A increases the percpu c[idx], that
>>> will result in an overflow, right? So even one reader using old idx will
>>> result in overflow.
>>>
>>>
>>> I think the reason we won't be hit by overflow is not that we have few
>>> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
>>> __srcu_read_lock()s are called for the same idx, right? And the reason for
>>> this is more complex: because we won't have a fair amount of threads in
>>> the system, because no thread will nest srcu many levels, because there
>>> won't be a lot of readers using the old idx.
>>>
>>> And this will still be true if we use new mechanism and shrink the
>>> preemption disabled section, right?
>>>
>>> Regards,
>>> Boqun
>>>
>>>> if we remove the preempt_{dis,en}able(), we must change the
>>>> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>>>> one on-going reader needs at least 4 bytes on the stack). it is still safe.
>>>>
>>>> but we still need to think more if we want to remove the preempt_{dis,en}able().
>>>>
>>>> Thanks
>>>> Lai
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2016-11-17 18:19:52

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 10:45 PM, Boqun Feng <[email protected]> wrote:
> On Thu, Nov 17, 2016 at 06:38:29AM -0800, Paul E. McKenney wrote:
>> On Thu, Nov 17, 2016 at 05:49:57AM -0800, Paul E. McKenney wrote:
>> > On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>> > > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>> > > <[email protected]> wrote:
>> > > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>> > >
>> > > >>
>> > > >> __srcu_read_lock() used to be called with preemption disabled. I guess
>> > > >> the reason was because we have two percpu variables to increase. So with
>> > > >> only one percpu right, could we remove the preempt_{dis,en}able() in
>> > > >> srcu_read_lock() and use this_cpu_inc() here?
>> > > >
>> > > > Quite possibly...
>> > > >
>> > >
>> > > it will be nicer if it is removed.
>> > >
>> > > The reason for the preemption-disabled was also because we
>> > > have to disallow any preemption between the fetching of the idx
>> > > and the increment, so that we have at most NR_CPUS worth
>> > > of readers using the old index that haven't incremented the counters.
>> > >
>> > > if we remove the preempt_{dis,en}able(), we must change the
>> > > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>> > > one on-going reader needs at least 4 bytes on the stack). it is still safe.
>> > >
>> > > but we still need to think more if we want to remove the preempt_{dis,en}able().
>> >
>> > Good points! Agreed, any change in the preemption needs careful thought
>> > and needs to be a separate patch.
>>
>> And one area needing special thought is the call to __srcu_read_lock()
>> and __srcu_read_unlock() in do_exit().
>>
>
> So before commit 49f5903b473c5, we don't have the read of ->completed in
> preemption disable section?
>
> And following "git blame", I found commit 7a6b55e7108b3 ;-)

Ouch, it shows 7a6b55e7108b3 at least has a bug in the comments about NR_CPUS.

we should focus on the total number of all active readers instead of the
number of readers using the old index that haven't incremented the counters.
the latter is smaller than the former, which is smaller than ULONG_MAX/4
or even smaller. so we can simplify the comments.

+ * Note that the sum of the ->lock_count[]s cannot increment enough
+ * times to overflow and end up equal to the sum of the ->unlock_count[]s,
+ * even with readers using the old index that haven't incremented
+ * ->lock_count[] yet, as long as there are at most ULONG_MAX/4
+ * readers at a time. Therefore, the only way that the return values of
+ * the two calls to srcu_readers_(un)lock_idx() can be equal is if there
+ * are no active readers using this index.

>
> Regards,
> Boqun
>
>> Thanx, Paul
>>

2016-11-17 18:37:22

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 06:38:29AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2016 at 05:49:57AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> > > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> > > <[email protected]> wrote:
> > > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> > >
> > > >>
> > > >> __srcu_read_lock() used to be called with preemption disabled. I guess
> > > >> the reason was because we have two percpu variables to increase. So with
> > > >> only one percpu right, could we remove the preempt_{dis,en}able() in
> > > >> srcu_read_lock() and use this_cpu_inc() here?
> > > >
> > > > Quite possibly...
> > > >
> > >
> > > it will be nicer if it is removed.
> > >
> > > The reason for the preemption-disabled was also because we
> > > have to disallow any preemption between the fetching of the idx
> > > and the increment, so that we have at most NR_CPUS worth
> > > of readers using the old index that haven't incremented the counters.
> > >
> > > if we remove the preempt_{dis,en}able(), we must change the
> > > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> > > one on-going reader needs at least 4 bytes on the stack). it is still safe.
> > >
> > > but we still need to think more if we want to remove the preempt_{dis,en}able().
> >
> > Good points! Agreed, any change in the preemption needs careful thought
> > and needs to be a separate patch.
>
> And one area needing special thought is the call to __srcu_read_lock()
> and __srcu_read_unlock() in do_exit().
>

So before commit 49f5903b473c5, we didn't have the read of ->completed in
a preemption-disabled section?

And following "git blame", I found commit 7a6b55e7108b3 ;-)

Regards,
Boqun

> Thanx, Paul
>



2016-11-17 18:41:59

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:

>>
>> __srcu_read_lock() used to be called with preemption disabled. I guess
>> the reason was because we have two percpu variables to increase. So with
>> only one percpu right, could we remove the preempt_{dis,en}able() in
>> srcu_read_lock() and use this_cpu_inc() here?
>
> Quite possibly...
>

it will be nicer if it is removed.

The reason for the preemption-disabled was also because we
have to disallow any preemption between the fetching of the idx
and the increment, so that we have at most NR_CPUS worth
of readers using the old index that haven't incremented the counters.

if we remove the preempt_{dis,en}able(), we must change the
"NR_CPUS" in the comment into ULONG_MAX/4. (I assume
one on-going reader needs at least 4 bytes on the stack). it is still safe.
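
A back-of-the-envelope version of that bound (my arithmetic, assuming
each on-going reader must keep its returned int alive in memory and the
address space holds at most ULONG_MAX bytes):

        max on-going readers <= ULONG_MAX bytes / 4 bytes per reader
                              = ULONG_MAX / 4

so the summed lock counts can run ahead of the summed unlock counts by
at most ULONG_MAX/4, nowhere near enough to wrap an unsigned long.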

but we still need to think more if we want to remove the preempt_{dis,en}able().

Thanks
Lai

2016-11-17 18:45:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>
> >>
> >> __srcu_read_lock() used to be called with preemption disabled. I guess
> >> the reason was because we have two percpu variables to increase. So with
> >> only one percpu right, could we remove the preempt_{dis,en}able() in
> >> srcu_read_lock() and use this_cpu_inc() here?
> >
> > Quite possibly...
> >
>
> it will be nicer if it is removed.
>
> The reason for the preemption-disabled was also because we
> have to disallow any preemption between the fetching of the idx
> and the increment, so that we have at most NR_CPUS worth
> of readers using the old index that haven't incremented the counters.
>
> if we remove the preempt_{dis,en}able(), we must change the
> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> one on-going reader needs at least 4 bytes on the stack). it is still safe.
>
> but we still need to think more if we want to remove the preempt_{dis,en}able().

Good points! Agreed, any change in the preemption needs careful thought
and needs to be a separate patch.

Thanx, Paul

2016-11-17 19:16:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 10:45:44PM +0800, Boqun Feng wrote:
> On Thu, Nov 17, 2016 at 06:38:29AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2016 at 05:49:57AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> > > > On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> > > > <[email protected]> wrote:
> > > > > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> > > >
> > > > >>
> > > > >> __srcu_read_lock() used to be called with preemption disabled. I guess
> > > > >> the reason was because we have two percpu variables to increase. So with
> > > > >> only one percpu right, could we remove the preempt_{dis,en}able() in
> > > > >> srcu_read_lock() and use this_cpu_inc() here?
> > > > >
> > > > > Quite possibly...
> > > > >
> > > >
> > > > it will be nicer if it is removed.
> > > >
> > > > The reason for the preemption-disabled was also because we
> > > > have to disallow any preemption between the fetching of the idx
> > > > and the increment, so that we have at most NR_CPUS worth
> > > > of readers using the old index that haven't incremented the counters.
> > > >
> > > > if we remove the preempt_{dis,en}able(), we must change the
> > > > "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
> > > > one on-going reader needs at least 4 bytes on the stack). it is still safe.
> > > >
> > > > but we still need to think more if we want to remove the preempt_{dis,en}able().
> > >
> > > Good points! Agreed, any change in the preemption needs careful thought
> > > and needs to be a separate patch.
> >
> > And one area needing special thought is the call to __srcu_read_lock()
> > and __srcu_read_unlock() in do_exit().
> >
>
> So before commit 49f5903b473c5, we don't have the read of ->completed in
> preemption disable section?
>
> And following "git blame", I found commit 7a6b55e7108b3 ;-)

Thankfully the first (49f5903b473c5) came after the second. ;-)

Thanx, Paul

2016-11-17 19:52:52

by Lance Roy

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, 17 Nov 2016 21:58:34 +0800
Lai Jiangshan <[email protected]> wrote:
> from the changelog, it sounds like "ULONG_MAX - NR_CPUS" is the limit
> of the implementations (old or this one). but actually the real max number of
> active readers is much smaller; I think ULONG_MAX/4 can be used here instead
> and that part of the changelog can be removed.
In the old version, there are two separate limits. The first is that there
are no more than ULONG_MAX nested or parallel readers, as otherwise ->c[] would
overflow.

The other limit is to prevent ->seq[] from overflowing during
srcu_readers_active_idx_check(). For this to happen, there must be ULONG_MAX+1
readers that loaded ->completed before srcu_flip() was run which then increment
->seq[]. The ->seq[] array is supposed to prevent
srcu_readers_active_idx_check() from completing successfully if any such
readers increment ->seq[], because otherwise they could decrement ->c[] while
it is being read, which could cause it to incorrectly report that there are no
active readers. If ->seq[] overflows then there is nothing (except how
improbable it is) to prevent this from happening.

I used to think (because of the previous comment) that there could be at most
one such increment of ->seq[] per CPU, as they would have to be using the old
value of ->completed and preemption would be disabled. This is not the case
because there are no barriers around srcu_flip(), so the processor is not
required to increment ->completed before reading ->seq[] the first time, nor is
it required to wait until it is done reading ->seq[] the second time before
incrementing. This means that the following code could cause ->seq[] to
increment an arbitrarily large number of times between the two ->seq[] loads in
srcu_readers_active_idx_check().
        while (true) {
                int idx = srcu_read_lock(sp);
                srcu_read_unlock(sp, idx);
        }
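
An informal picture of the reordering being described (my reading;
"old" is the rank being checked):

        updater (program order)                reader loop (as above)
        -----------------------                ----------------------
        srcu_flip(): completed++               idx = completed & 0x1;
          (store not ordered against             /* may still see old value */
           the loads below)                    c[idx]++; seq[idx]++;
        s1 = sum of ->seq[old]                 c[idx]--;
        ...                                    (repeat)
        s2 = sum of ->seq[old]

With nothing ordering the ->completed store against the two ->seq[]
summations, readers can keep loading the old ->completed value, so
->seq[old] may be bumped any number of times between s1 and s2.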

Thanks,
Lance

2016-11-17 20:31:21

by Lance Roy

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, 17 Nov 2016 23:07:02 +0800
Lai Jiangshan <[email protected]> wrote:

> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
> > After reading the comment for a while, I actually got a question, maybe
> > I miss something ;-)
> >
> > Why "at most NR_CPUS worth of readers using the old index haven't
> > incremented the counters" could save us from overflow the counter?
> >
> > Please consider the following case in current implementation:
> >
> >
> > {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
> >
> > one thread A is currently in __srcu_read_lock() and using idx = 1 and
> > about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
> > have been called and returned with idx = 1, please note I think this is
> > possible because I assume we may have some code like this:
> >
> > unsigned long i = 0;
> > for (; i < ULONG_MAX; i++)
> > srcu_read_lock(); // return the same idx 1;
>
> this is the wrong usage of the api.
>
>
> you might rewrite it as:
>
> unsigned long index[2] = {0, 0};
> unsigned long i = 0;
> for (; index[1] < ULONG_MAX; i++)
> index[srcu_read_lock()]++;

Doesn't this still fail the API by not nesting correctly? If you don't keep
track of the order the indices were returned in, then it seems like you would
exit the critical sections in the wrong order.

Thanks,
Lance

2016-11-17 21:22:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 03:38:08PM +0000, Mathieu Desnoyers wrote:
> ----- On Nov 17, 2016, at 10:31 AM, Mathieu Desnoyers [email protected] wrote:
>
> > ----- On Nov 17, 2016, at 10:07 AM, Lai Jiangshan [email protected] wrote:
> >
> >> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <[email protected]> wrote:
> >>> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
> >>>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
> >>>> <[email protected]> wrote:
> >>>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
> >>>>
> >>>> >>
> >>>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
> >>>> >> the reason was because we have two percpu variables to increase. So with
> >>>> >> only one percpu right, could we remove the preempt_{dis,en}able() in
> >>>> >> srcu_read_lock() and use this_cpu_inc() here?
> >>>> >
> >>>> > Quite possibly...
> >>>> >
> >>>>
> >>>
> >>> Hello, Lai ;-)
> >>>
> >>>> it will be nicer if it is removed.
> >>>>
> >>>> The reason for the preemption-disabled was also because we
> >>>> have to disallow any preemption between the fetching of the idx
> >>>> and the increment, so that we have at most NR_CPUS worth
> >>>> of readers using the old index that haven't incremented the counters.
> >>>>
> >>>
> >>> After reading the comment for a while, I actually got a question, maybe
> >>> I miss something ;-)
> >>>
> >>> Why "at most NR_CPUS worth of readers using the old index haven't
> >>> incremented the counters" could save us from overflow the counter?
> >>>
> >>> Please consider the following case in current implementation:
> >>>
> >>>
> >>> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
> >>>
> >>> one thread A is currently in __srcu_read_lock() and using idx = 1 and
> >>> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
> >>> have been called and returned with idx = 1, please note I think this is
> >>> possible because I assume we may have some code like this:
> >>>
> >>> unsigned long i = 0;
> >>> for (; i < ULONG_MAX; i++)
> >>> srcu_read_lock(); // return the same idx 1;
> >>
> >> this is the wrong usage of the api.
> >>
> >>
> >> you might rewrite it as:
> >>
> >> unsigned long index[2] = {0, 0};
> >> unsigned long i = 0;
> >> for (; index[1] < ULONG_MAX; i++)
> >> index[srcu_read_lock()]++;
> >>
> >>
> >> I think we should add documentation to disallow this kind of usage;
> >> a reader should eat at least 4 bytes of memory.
> >>
> >
> > (the analysis below refers to the rewritten SRCU scheme)
> >
> > Let's presume we use the API correctly, as you describe (saving
> > the returned index of srcu_read_lock() somewhere).
> >
> > So for the sake of argument, we can either call srcu_read_lock
> > in a loop (during which we can be migrated), or call it
> > concurrently from various threads. The effect in terms of
> > overflow is pretty much the same.
> >
> > What is done here is incrementing per-cpu split-counters. In
> > the worst-case scenario, let's assume we're incrementing those
> > counters for a single index (0 or 1).
> >
> > If we think about this system-wide, we don't really care about
> > the overflow of a single CPU counter, because what matters is
> > the difference between the overall nr_lock - nr_unlock counts
> > for a given index, once summed up by synchronize_srcu().
> >
> > So the only situation that could lead to an overflow that matters
> > is if synchronize_srcu() sees ULONG_MAX more increments of nr_lock
> > than the observed number of nr_unlock increments.
> >
> > So the bound is not only about the number of concurrent SRCU
> > readers, but also about the number of SRCU readers that may
> > appear between the moment synchronize_srcu() reads the nr_unlock
> > per-cpu counters and the moment it reads the nr_lock counters.
> >
> > This maximum bound of ULONG_MAX - 1 therefore applies to the
> > sum of:
> > - number of concurrent SRCU read-side critical sections active
> > at the same time,
> > - number of SRCU read-side critical sections beginning after
> > synchronize_srcu() has read the nr_unlock counters, before
> > it reads the nr_lock counters.
>
> Now that I think of it, since we flip the period before summing
> the nr_unlock counter, we cannot have any incoming readers appearing
> within the target period while we execute synchronize_srcu().
> So it ends up being a limit on the number of concurrent SRCU
> read-side c.s. active at the same time. (you can scratch the
> second bullet above).

We can have NR_CPUS worth of them -- those that have fetched the
index, but not yet incremented their counter.

But if the updater fails to see their counter increment, then
their next srcu_read_lock() is guaranteed to see the new index.

Thanx, Paul

> Thanks,
>
> Mathieu
>
>
>
> > You guys seem to see cases that would require a lower max nr
> > reader bound, but I'm afraid I don't quite understand them.
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> >>>
> >>> And none of the corresponding srcu_read_unlock() has been called;
> >>>
> >>> In this case, at the time thread A increases the percpu c[idx], that
> >>> will result in an overflow, right? So even one reader using old idx will
> >>> result in overflow.
> >>>
> >>>
> >>> I think the reason we won't be hit by overflow is not that we have few
> >>> readers using the old idx, it's that it is unlikely that ULONG_MAX + 1
> >>> __srcu_read_lock()s are called for the same idx, right? And the reason for
> >>> this is more complex: because we won't have a fair amount of threads in
> >>> the system, because no thread will nest srcu many levels, because there
> >>> won't be a lot of readers using the old idx.
> >>>
> >>> And this will still be true if we use the new mechanism and shrink the
> >>> preemption-disabled section, right?
> >>>
> >>> Regards,
> >>> Boqun
> >>>
> >>>> If we remove the preempt_{dis,en}able(), we must change the
> >>>> "NR_CPUS" in the comment to ULONG_MAX/4 (I assume one on-going
> >>>> reader needs at least 4 bytes on the stack). It is still safe.
> >>>>
> >>>> But we still need to think more if we want to remove the
> >>>> preempt_{dis,en}able().
> >>>>
> >>>> Thanks
> >>>> Lai
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

2016-11-18 13:28:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite

On Thu, Nov 17, 2016 at 11:53:04AM -0800, Lance Roy wrote:
> On Thu, 17 Nov 2016 21:58:34 +0800
> Lai Jiangshan <[email protected]> wrote:
> > from the changelog, it sounds like that "ULONG_MAX - NR_CPUS" is the limit
> > of the implements(old or this one). but actually the real max number of
> > active readers is much smaller, I think ULONG_MAX/4 can be used here instead
> > and that part of the changelog can be removed.
> In the old version, there are two separate limits. The first is that there
> are no more than ULONG_MAX nested or parallel readers, as otherwise ->c[] would
> overflow.
>
> The other limit is to prevent ->seq[] from overflowing during
> srcu_readers_active_idx_check(). For this to happen, there must be ULONG_MAX+1
> readers that loaded ->completed before srcu_flip() was run and that then increment
> ->seq[]. The ->seq[] array is supposed to prevent
> srcu_readers_active_idx_check() from completing successfully if any such
> readers increment ->seq[], because otherwise they could decrement ->c[] while
> it is being read, which could cause it to incorrectly report that there are no
> active readers. If ->seq[] overflows then there is nothing (except how
> improbable it is) to prevent this from happening.
>
> I used to think (because of the previous comment) that there could be at most
> one such increment of ->seq[] per CPU, as they would have to be using the old
> value of ->completed and preemption would be disabled. This is not the case
> because there are no barriers around srcu_flip(), so the processor is not
> required to increment ->completed before reading ->seq[] the first time, nor is
> it required to wait until it is done reading ->seq[] the second time before
> incrementing. This means that the following code could cause ->seq[] to
> increment an arbitrarily large number of times between the two ->seq[] loads in
> srcu_readers_active_idx_check().
> 	while (true) {
> 		int idx = srcu_read_lock(sp);
> 		srcu_read_unlock(sp, idx);
> 	}

I also initially thought that there would need to be a memory barrier
immediately after srcu_flip(). But after further thought, I don't
believe that this is the case.

The key point is that updaters do the flip, sum the unlock counters,
do a full memory barrier, then sum the lock counters.
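
In code, that ordering looks roughly like the sketch below, which
assumes a srcu_readers_unlock_idx() helper mirroring the patch's
srcu_readers_lock_idx(); the real srcu_readers_active_idx_check() has
further subtleties, so this is only the skeleton of the argument:

	static bool readers_quiesced_sketch(struct srcu_struct *sp, int idx)
	{
		unsigned long unlocks = srcu_readers_unlock_idx(sp, idx);

		/*
		 * Ensure that if we saw a reader's unlock above, we will
		 * also see its lock below, so the sum cannot go negative.
		 */
		smp_mb();

		return srcu_readers_lock_idx(sp, idx) == unlocks;
	}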

We therefore know that if an updater sees an unlock, it is guaranteed
to see the corresponding lock, which prevents negative sums. However,
it is true that the flip and the unlock reads can be interchanged.
This can result in failing to see a count of zero, but it cannot result
in spuriously seeing a count of zero.
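
As a standalone illustration of why per-CPU counter wraparound is
harmless here: the sums are unsigned, so the subtraction recovers the
true lock-minus-unlock gap as long as that gap stays below ULONG_MAX.
The numbers below are made up for the example:

	#include <stdio.h>
	#include <limits.h>

	int main(void)
	{
		/* ULONG_MAX + 4 lock increments have wrapped the counter
		 * around to 3; ULONG_MAX - 2 unlocks have occurred. */
		unsigned long lock_sum = 3;
		unsigned long unlock_sum = ULONG_MAX - 2;

		/* Unsigned subtraction wraps, yielding the true gap: 6. */
		printf("active readers: %ld\n", (long)(lock_sum - unlock_sum));
		return 0;
	}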

More to the point, if an updater fails to see a lock, the next time
that CPU/task does an srcu_read_lock(), that CPU/task is guaranteed
to see the new value of the index. This limits the number of CPUs/tasks
that can be using the old value of the index. Given that preemption
is disabled across the fetch of the index and the increment of the lock
count, that number is NR_CPUS-1, since the updater has to be running
on one of the CPUs (as Mathieu pointed out earlier in this thread).

Or am I missing something?

Thanx, Paul