2022-06-20 22:53:08

by Paul E. McKenney

Subject: [PATCH rcu 0/12] Polled grace-period updates for v5.20

Hello!

This series adds support for polled expedited RCU grace periods:

1. Make normal polling GP be more precise about sequence numbers.

2. Provide a get_completed_synchronize_rcu() function.

3. Validate get_completed_synchronize_rcu().

4. Switch polled grace-period APIs to ->gp_seq_polled.

5. Make polled grace-period API account for expedited grace periods.

6. Make Tiny RCU grace periods visible to polled APIs.

7. Verify that polled GP API sees synchronous grace periods.

8. Add polled expedited grace-period primitives.

9. Test polled expedited grace-period primitives.

10. Put panic_on_rcu_stall() after expedited RCU CPU stall warnings,
courtesy of Zqiang.

11. Diagnose extended sync_rcu_do_polled_gp() loops.

12. Add irqs-disabled indicator to expedited RCU CPU stall warnings,
courtesy of Zqiang.

Thanx, Paul

------------------------------------------------------------------------

b/include/linux/rcupdate.h | 1
b/include/linux/rcutiny.h | 10 +++
b/include/linux/rcutree.h | 2
b/kernel/rcu/rcu.h | 12 ++++
b/kernel/rcu/rcutorture.c | 6 ++
b/kernel/rcu/tiny.c | 4 -
b/kernel/rcu/tree.c | 4 -
b/kernel/rcu/tree.h | 2
b/kernel/rcu/tree_exp.h | 16 +++++-
b/kernel/rcu/update.c | 13 ++++
kernel/rcu/rcu.h | 3 +
kernel/rcu/rcutorture.c | 97 +++++++++++++++++++++++++++++++-----
kernel/rcu/tiny.c | 7 +-
kernel/rcu/tree.c | 119 ++++++++++++++++++++++++++++++++++++++++-----
kernel/rcu/tree.h | 8 +++
kernel/rcu/tree_exp.h | 99 +++++++++++++++++++++++++++++++++++--
16 files changed, 365 insertions(+), 38 deletions(-)


2022-06-20 22:53:35

by Paul E. McKenney

Subject: [PATCH rcu 01/12] rcu: Make normal polling GP be more precise about sequence numbers

Currently, poll_state_synchronize_rcu() uses rcu_seq_done() to check
whether the specified grace period has completed. However, rcu_seq_done()
does a simple comparison that reserves half of the sequence-number space
for uncompleted grace periods. This has the unfortunate side-effect
of not handling sequence-number wrap gracefully. Of course, one can
argue that if someone has already waited for half of the full range of
grace periods, they can wait for the other half, but why wait at all in
this case?

This commit therefore creates an rcu_seq_done_exact() function that
treats a cookie as uncompleted only during the two grace periods within
which its sequence number might have been handed out. This way,
if sequence-number wrap happens to hit that range, at most two additional
grace periods need be waited for.

This commit is in preparation for polled expedited grace periods.
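
To make the guard-band difference concrete, here is a standalone
user-space sketch (my own illustration, not part of this patch) that
mirrors the kernel's ULONG_CMP_*() macros and shows how the two
predicates treat a counter that has wrapped far away from the cookie:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Mirrors of the kernel's wraparound-aware comparison macros. */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
#define RCU_SEQ_STATE_MASK	3UL

/* Half-range guard band, as in rcu_seq_done(). */
static bool seq_done(unsigned long cur, unsigned long s)
{
	return ULONG_CMP_GE(cur, s);
}

/* Pending only within two grace periods below s, as in rcu_seq_done_exact(). */
static bool seq_done_exact(unsigned long cur, unsigned long s)
{
	return ULONG_CMP_GE(cur, s) ||
	       ULONG_CMP_LT(cur, s - (2 * RCU_SEQ_STATE_MASK + 1));
}

int main(void)
{
	unsigned long s = 8;			/* Cookie handed out long ago. */
	unsigned long cur = ULONG_MAX - 100;	/* Counter now far "below" s after wrap. */

	printf("seq_done:       %d\n", seq_done(cur, s));	/* 0: still "waiting". */
	printf("seq_done_exact: %d\n", seq_done_exact(cur, s));	/* 1: treated as done. */
	return 0;
}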

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcu.h | 12 ++++++++++++
kernel/rcu/tree.c | 4 ++--
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 4916077119f3f..0adb55941aeb3 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -119,6 +119,18 @@ static inline bool rcu_seq_done(unsigned long *sp, unsigned long s)
return ULONG_CMP_GE(READ_ONCE(*sp), s);
}

+/*
+ * Given a snapshot from rcu_seq_snap(), determine whether or not a
+ * full update-side operation has occurred, but do not allow the
+ * (ULONG_MAX / 2) safety-factor/guard-band.
+ */
+static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
+{
+ unsigned long cur_s = READ_ONCE(*sp);
+
+ return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
+}
+
/*
* Has a grace period completed since the time the old gp_seq was collected?
*/
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c25ba442044a6..ec28e259774e7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3911,7 +3911,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
*
* Yes, this function does not take counter wrap into account.
* But counter wrap is harmless. If the counter wraps, we have waited for
- * more than 2 billion grace periods (and way more on a 64-bit system!).
+ * more than a billion grace periods (and way more on a 64-bit system!).
* Those needing to keep oldstate values for very long time periods
* (many hours even on 32-bit systems) should check them occasionally
* and either refresh them or set a flag indicating that the grace period
@@ -3924,7 +3924,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
*/
bool poll_state_synchronize_rcu(unsigned long oldstate)
{
- if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
+ if (rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
smp_mb(); /* Ensure GP ends before subsequent accesses. */
return true;
}
--
2.31.1.189.g2e36527f23

2022-06-20 22:53:41

by Paul E. McKenney

Subject: [PATCH rcu 06/12] rcu: Make Tiny RCU grace periods visible to polled APIs

This commit makes the Tiny RCU implementation of synchronize_rcu()
increment the rcu_ctrlblk.gp_seq counter, thus making both
synchronize_rcu() and synchronize_rcu_expedited() visible to
get_state_synchronize_rcu() and friends.
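
For illustration, the resulting flow on a Tiny RCU (!SMP, !PREEMPT)
kernel is sketched below; this is my own summary, relying on Tiny's
get_state_synchronize_rcu() essentially returning rcu_ctrlblk.gp_seq
and on poll_state_synchronize_rcu() comparing against that value, as
updated earlier in this series. The increment is by 2 so that gp_seq
stays even and cannot collide with the odd RCU_GET_STATE_COMPLETED
cookie:

	unsigned long cookie;

	cookie = get_state_synchronize_rcu();	/* Snapshot rcu_ctrlblk.gp_seq. */
	synchronize_rcu();			/* Now also does gp_seq += 2. */
	WARN_ON_ONCE(!poll_state_synchronize_rcu(cookie));	/* Sees the bump. */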

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tiny.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index dbee6bea67269..60071817d9399 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -139,8 +139,10 @@ static __latent_entropy void rcu_process_callbacks(struct softirq_action *unused
/*
* Wait for a grace period to elapse. But it is illegal to invoke
* synchronize_rcu() from within an RCU read-side critical section.
- * Therefore, any legal call to synchronize_rcu() is a quiescent
- * state, and so on a UP system, synchronize_rcu() need do nothing.
+ * Therefore, any legal call to synchronize_rcu() is a quiescent state,
+ * and so on a UP system, synchronize_rcu() need do nothing, other than
+ * let the polled APIs know that another grace period elapsed.
+ *
* (But Lai Jiangshan points out the benefits of doing might_sleep()
* to reduce latency.)
*
@@ -152,6 +154,7 @@ void synchronize_rcu(void)
lock_is_held(&rcu_lock_map) ||
lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_rcu() in RCU read-side critical section");
+ WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 2);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);

--
2.31.1.189.g2e36527f23

2022-06-20 22:53:48

by Paul E. McKenney

Subject: [PATCH rcu 12/12] rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings

From: Zqiang <[email protected]>

If a CPU has interrupts disabled continuously starting before the
beginning of a given expedited RCU grace period, that CPU will not
execute that grace period's IPI handler. This will in turn mean
that the ->cpu_no_qs.b.exp field in that CPU's rcu_data structure
will continue to contain the boolean value false.

Knowing whether or not a CPU has had interrupts disabled can be helpful
when debugging an expedited RCU CPU stall warning, so this commit
adds a "D" indicator expedited RCU CPU stall warnings that signifies
that the corresponding CPU has had interrupts disabled throughout.

This capability was tested as follows:

runqemu kvm slirp nographic qemuparams="-m 4096 -smp 4" bootparams=
"isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 rcutree.dump_tree=1
rcutorture.stall_cpu_holdoff=30 rcutorture.stall_cpu=40
rcutorture.stall_cpu_irqsoff=1 rcutorture.stall_cpu_block=0
rcutorture.stall_no_softlockup=1" -d

The rcu_torture_stall() function ran on CPU 1, which displays the "D"
as expected given the rcutorture.stall_cpu_irqsoff=1 module parameter:

............
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
{ 1-...D } 26467 jiffies s: 13317 root: 0x1/.
rcu: blocking rcu_node structures (internal RCU debug): l=1:0-1:0x2/.
Task dump for CPU 1:
task:rcu_torture_sta state:R running task stack: 0 pid: 76 ppid: 2 flags:0x00004008

Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_exp.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 4c7037b507032..f092c7f18a5f3 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -637,10 +637,11 @@ static void synchronize_rcu_expedited_wait(void)
continue;
ndetected++;
rdp = per_cpu_ptr(&rcu_data, cpu);
- pr_cont(" %d-%c%c%c", cpu,
+ pr_cont(" %d-%c%c%c%c", cpu,
"O."[!!cpu_online(cpu)],
"o."[!!(rdp->grpmask & rnp->expmaskinit)],
- "N."[!!(rdp->grpmask & rnp->expmaskinitnext)]);
+ "N."[!!(rdp->grpmask & rnp->expmaskinitnext)],
+ "D."[!!(rdp->cpu_no_qs.b.exp)]);
}
}
pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
--
2.31.1.189.g2e36527f23

2022-06-20 22:54:04

by Paul E. McKenney

Subject: [PATCH rcu 09/12] rcutorture: Test polled expedited grace-period primitives

This commit adds tests of start_poll_synchronize_rcu_expedited() and
poll_state_synchronize_rcu_expedited().
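
As an illustrative invocation (not part of this patch), the new code
paths can be exercised on their own by enabling only the corresponding
module parameters added below, for example:

	modprobe rcutorture gp_cond_exp=1 gp_poll_exp=1

or, for a built-in rcutorture, via the rcutorture.gp_cond_exp=1 and
rcutorture.gp_poll_exp=1 kernel boot parameters.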

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 87 +++++++++++++++++++++++++++++++++++------
1 file changed, 74 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index d2edc763bb92a..0788ef2a44911 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -86,10 +86,12 @@ torture_param(int, fwd_progress_holdoff, 60,
torture_param(bool, fwd_progress_need_resched, 1,
"Hide cond_resched() behind need_resched()");
torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives");
+torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives");
torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
torture_param(bool, gp_normal, false,
"Use normal (non-expedited) GP wait primitives");
torture_param(bool, gp_poll, false, "Use polling GP wait primitives");
+torture_param(bool, gp_poll_exp, false, "Use polling expedited GP wait primitives");
torture_param(bool, gp_sync, false, "Use synchronous GP wait primitives");
torture_param(int, irqreader, 1, "Allow RCU readers from irq handlers");
torture_param(int, leakpointer, 0, "Leak pointer dereferences from readers");
@@ -209,12 +211,16 @@ static int rcu_torture_writer_state;
#define RTWS_DEF_FREE 3
#define RTWS_EXP_SYNC 4
#define RTWS_COND_GET 5
-#define RTWS_COND_SYNC 6
-#define RTWS_POLL_GET 7
-#define RTWS_POLL_WAIT 8
-#define RTWS_SYNC 9
-#define RTWS_STUTTER 10
-#define RTWS_STOPPING 11
+#define RTWS_COND_GET_EXP 6
+#define RTWS_COND_SYNC 7
+#define RTWS_COND_SYNC_EXP 8
+#define RTWS_POLL_GET 9
+#define RTWS_POLL_GET_EXP 10
+#define RTWS_POLL_WAIT 11
+#define RTWS_POLL_WAIT_EXP 12
+#define RTWS_SYNC 13
+#define RTWS_STUTTER 14
+#define RTWS_STOPPING 15
static const char * const rcu_torture_writer_state_names[] = {
"RTWS_FIXED_DELAY",
"RTWS_DELAY",
@@ -222,9 +228,13 @@ static const char * const rcu_torture_writer_state_names[] = {
"RTWS_DEF_FREE",
"RTWS_EXP_SYNC",
"RTWS_COND_GET",
+ "RTWS_COND_GET_EXP",
"RTWS_COND_SYNC",
+ "RTWS_COND_SYNC_EXP",
"RTWS_POLL_GET",
+ "RTWS_POLL_GET_EXP",
"RTWS_POLL_WAIT",
+ "RTWS_POLL_WAIT_EXP",
"RTWS_SYNC",
"RTWS_STUTTER",
"RTWS_STOPPING",
@@ -337,6 +347,10 @@ struct rcu_torture_ops {
void (*deferred_free)(struct rcu_torture *p);
void (*sync)(void);
void (*exp_sync)(void);
+ unsigned long (*get_gp_state_exp)(void);
+ unsigned long (*start_gp_poll_exp)(void);
+ bool (*poll_gp_state_exp)(unsigned long oldstate);
+ void (*cond_sync_exp)(unsigned long oldstate);
unsigned long (*get_gp_state)(void);
unsigned long (*get_gp_completed)(void);
unsigned long (*start_gp_poll)(void);
@@ -509,6 +523,10 @@ static struct rcu_torture_ops rcu_ops = {
.start_gp_poll = start_poll_synchronize_rcu,
.poll_gp_state = poll_state_synchronize_rcu,
.cond_sync = cond_synchronize_rcu,
+ .get_gp_state_exp = get_state_synchronize_rcu,
+ .start_gp_poll_exp = start_poll_synchronize_rcu_expedited,
+ .poll_gp_state_exp = poll_state_synchronize_rcu,
+ .cond_sync_exp = cond_synchronize_rcu_expedited,
.call = call_rcu,
.cb_barrier = rcu_barrier,
.fqs = rcu_force_quiescent_state,
@@ -1138,9 +1156,8 @@ rcu_torture_fqs(void *arg)
return 0;
}

-// Used by writers to randomly choose from the available grace-period
-// primitives. The only purpose of the initialization is to size the array.
-static int synctype[] = { RTWS_DEF_FREE, RTWS_EXP_SYNC, RTWS_COND_GET, RTWS_POLL_GET, RTWS_SYNC };
+// Used by writers to randomly choose from the available grace-period primitives.
+static int synctype[ARRAY_SIZE(rcu_torture_writer_state_names)] = { };
static int nsynctypes;

/*
@@ -1148,18 +1165,27 @@ static int nsynctypes;
*/
static void rcu_torture_write_types(void)
{
- bool gp_cond1 = gp_cond, gp_exp1 = gp_exp, gp_normal1 = gp_normal;
- bool gp_poll1 = gp_poll, gp_sync1 = gp_sync;
+ bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_exp1 = gp_exp;
+ bool gp_poll_exp1 = gp_poll_exp, gp_normal1 = gp_normal, gp_poll1 = gp_poll;
+ bool gp_sync1 = gp_sync;

/* Initialize synctype[] array. If none set, take default. */
- if (!gp_cond1 && !gp_exp1 && !gp_normal1 && !gp_poll1 && !gp_sync1)
- gp_cond1 = gp_exp1 = gp_normal1 = gp_poll1 = gp_sync1 = true;
+ if (!gp_cond1 && !gp_cond_exp1 && !gp_exp1 && !gp_poll_exp &&
+ !gp_normal1 && !gp_poll1 && !gp_sync1)
+ gp_cond1 = gp_cond_exp1 = gp_exp1 = gp_poll_exp1 =
+ gp_normal1 = gp_poll1 = gp_sync1 = true;
if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) {
synctype[nsynctypes++] = RTWS_COND_GET;
pr_info("%s: Testing conditional GPs.\n", __func__);
} else if (gp_cond && (!cur_ops->get_gp_state || !cur_ops->cond_sync)) {
pr_alert("%s: gp_cond without primitives.\n", __func__);
}
+ if (gp_cond_exp1 && cur_ops->get_gp_state_exp && cur_ops->cond_sync_exp) {
+ synctype[nsynctypes++] = RTWS_COND_GET_EXP;
+ pr_info("%s: Testing conditional expedited GPs.\n", __func__);
+ } else if (gp_cond_exp && (!cur_ops->get_gp_state_exp || !cur_ops->cond_sync_exp)) {
+ pr_alert("%s: gp_cond_exp without primitives.\n", __func__);
+ }
if (gp_exp1 && cur_ops->exp_sync) {
synctype[nsynctypes++] = RTWS_EXP_SYNC;
pr_info("%s: Testing expedited GPs.\n", __func__);
@@ -1178,6 +1204,12 @@ static void rcu_torture_write_types(void)
} else if (gp_poll && (!cur_ops->start_gp_poll || !cur_ops->poll_gp_state)) {
pr_alert("%s: gp_poll without primitives.\n", __func__);
}
+ if (gp_poll_exp1 && cur_ops->start_gp_poll_exp && cur_ops->poll_gp_state_exp) {
+ synctype[nsynctypes++] = RTWS_POLL_GET_EXP;
+ pr_info("%s: Testing polling expedited GPs.\n", __func__);
+ } else if (gp_poll_exp && (!cur_ops->start_gp_poll_exp || !cur_ops->poll_gp_state_exp)) {
+ pr_alert("%s: gp_poll_exp without primitives.\n", __func__);
+ }
if (gp_sync1 && cur_ops->sync) {
synctype[nsynctypes++] = RTWS_SYNC;
pr_info("%s: Testing normal GPs.\n", __func__);
@@ -1285,6 +1317,14 @@ rcu_torture_writer(void *arg)
cur_ops->cond_sync(gp_snap);
rcu_torture_pipe_update(old_rp);
break;
+ case RTWS_COND_GET_EXP:
+ rcu_torture_writer_state = RTWS_COND_GET_EXP;
+ gp_snap = cur_ops->get_gp_state_exp();
+ torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand);
+ rcu_torture_writer_state = RTWS_COND_SYNC_EXP;
+ cur_ops->cond_sync_exp(gp_snap);
+ rcu_torture_pipe_update(old_rp);
+ break;
case RTWS_POLL_GET:
rcu_torture_writer_state = RTWS_POLL_GET;
gp_snap = cur_ops->start_gp_poll();
@@ -1294,6 +1334,15 @@ rcu_torture_writer(void *arg)
&rand);
rcu_torture_pipe_update(old_rp);
break;
+ case RTWS_POLL_GET_EXP:
+ rcu_torture_writer_state = RTWS_POLL_GET_EXP;
+ gp_snap = cur_ops->start_gp_poll_exp();
+ rcu_torture_writer_state = RTWS_POLL_WAIT_EXP;
+ while (!cur_ops->poll_gp_state_exp(gp_snap))
+ torture_hrtimeout_jiffies(torture_random(&rand) % 16,
+ &rand);
+ rcu_torture_pipe_update(old_rp);
+ break;
case RTWS_SYNC:
rcu_torture_writer_state = RTWS_SYNC;
if (cur_ops->get_gp_state && cur_ops->poll_gp_state)
@@ -1400,6 +1449,11 @@ rcu_torture_fakewriter(void *arg)
torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand);
cur_ops->cond_sync(gp_snap);
break;
+ case RTWS_COND_GET_EXP:
+ gp_snap = cur_ops->get_gp_state_exp();
+ torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand);
+ cur_ops->cond_sync_exp(gp_snap);
+ break;
case RTWS_POLL_GET:
gp_snap = cur_ops->start_gp_poll();
while (!cur_ops->poll_gp_state(gp_snap)) {
@@ -1407,6 +1461,13 @@ rcu_torture_fakewriter(void *arg)
&rand);
}
break;
+ case RTWS_POLL_GET_EXP:
+ gp_snap = cur_ops->start_gp_poll_exp();
+ while (!cur_ops->poll_gp_state_exp(gp_snap)) {
+ torture_hrtimeout_jiffies(torture_random(&rand) % 16,
+ &rand);
+ }
+ break;
case RTWS_SYNC:
cur_ops->sync();
break;
--
2.31.1.189.g2e36527f23

2022-06-20 22:54:13

by Paul E. McKenney

Subject: [PATCH rcu 11/12] rcu: Diagnose extended sync_rcu_do_polled_gp() loops

This commit dumps out state when the sync_rcu_do_polled_gp() function
loops more than expected. This is a debugging aid.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_exp.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index f05a15b11fa0c..4c7037b507032 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -970,6 +970,7 @@ EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
static void sync_rcu_do_polled_gp(struct work_struct *wp)
{
unsigned long flags;
+ int i = 0;
struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq);
unsigned long s;

@@ -979,8 +980,12 @@ static void sync_rcu_do_polled_gp(struct work_struct *wp)
raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
if (s == RCU_GET_STATE_COMPLETED)
return;
- while (!poll_state_synchronize_rcu(s))
+ while (!poll_state_synchronize_rcu(s)) {
synchronize_rcu_expedited();
+ if (i == 10 || i == 20)
+ pr_info("%s: i = %d s = %lx gp_seq_polled = %lx\n", __func__, i, s, READ_ONCE(rcu_state.gp_seq_polled));
+ i++;
+ }
raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
s = rnp->exp_seq_poll_rq;
if (poll_state_synchronize_rcu(s))
--
2.31.1.189.g2e36527f23

2022-06-20 22:54:30

by Paul E. McKenney

Subject: [PATCH rcu 07/12] rcutorture: Verify that polled GP API sees synchronous grace periods

This commit causes rcu_torture_writer() to use WARN_ON_ONCE() to check
that the cookie returned by the current RCU flavor's ->get_gp_state()
function (get_state_synchronize_rcu() for vanilla RCU) causes that
flavor's ->poll_gp_state function (poll_state_synchronize_rcu() for
vanilla RCU) to unconditionally return true.

Note that a pair of calls to synchronous grace-period-wait functions is
used. This is necessary to account for partially overlapping normal and
expedited grace periods aligning in just the wrong way with polled API
invocations, which can cause those polled API invocations to ignore one or
the other of those partially overlapping grace periods. It is unlikely
that this sort of ignored grace period will be a problem in production,
but rcutorture can make it happen within a few tens of seconds.

This commit is in preparation for polled expedited grace periods.

[ paulmck: Apply feedback from Frederic Weisbecker. ]

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

---
kernel/rcu/rcutorture.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 4ceec9f4169c7..d2edc763bb92a 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1269,7 +1269,12 @@ rcu_torture_writer(void *arg)
break;
case RTWS_EXP_SYNC:
rcu_torture_writer_state = RTWS_EXP_SYNC;
+ if (cur_ops->get_gp_state && cur_ops->poll_gp_state)
+ cookie = cur_ops->get_gp_state();
cur_ops->exp_sync();
+ cur_ops->exp_sync();
+ if (cur_ops->get_gp_state && cur_ops->poll_gp_state)
+ WARN_ON_ONCE(!cur_ops->poll_gp_state(cookie));
rcu_torture_pipe_update(old_rp);
break;
case RTWS_COND_GET:
@@ -1291,7 +1296,12 @@ rcu_torture_writer(void *arg)
break;
case RTWS_SYNC:
rcu_torture_writer_state = RTWS_SYNC;
+ if (cur_ops->get_gp_state && cur_ops->poll_gp_state)
+ cookie = cur_ops->get_gp_state();
cur_ops->sync();
+ cur_ops->sync();
+ if (cur_ops->get_gp_state && cur_ops->poll_gp_state)
+ WARN_ON_ONCE(!cur_ops->poll_gp_state(cookie));
rcu_torture_pipe_update(old_rp);
break;
default:
--
2.31.1.189.g2e36527f23

2022-06-20 22:54:31

by Paul E. McKenney

Subject: [PATCH rcu 05/12] rcu: Make polled grace-period API account for expedited grace periods

Currently, this code could splat:

oldstate = get_state_synchronize_rcu();
synchronize_rcu_expedited();
WARN_ON_ONCE(!poll_state_synchronize_rcu(oldstate));

This situation is counter-intuitive and user-unfriendly. After all, there
really was a perfectly valid full grace period right after the call to
get_state_synchronize_rcu(), so why shouldn't poll_state_synchronize_rcu()
know about it?

This commit therefore makes the polled grace-period API aware of expedited
grace periods in addition to the normal grace periods that it is already
aware of. With this change, the above code is guaranteed not to splat.

Please note that the above code can still splat due to counter wrap on the
one hand and situations involving partially overlapping normal/expedited
grace periods on the other. On 64-bit systems, the second is of course
much more likely than the first. It is possible to modify this approach
to prevent overlapping grace periods from causing splats, but only at
the expense of greatly increasing the probability of counter wrap:
within milliseconds on 32-bit systems and within minutes on 64-bit
systems.

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 9 +++++----
kernel/rcu/tree.h | 1 +
kernel/rcu/tree_exp.h | 16 ++++++++++++++--
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 637e8f9454573..251eb9a8cd925 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1812,6 +1812,7 @@ static void rcu_poll_gp_seq_end(unsigned long *snap)
if (*snap && *snap == rcu_state.gp_seq_polled) {
rcu_seq_end(&rcu_state.gp_seq_polled);
rcu_state.gp_seq_polled_snap = 0;
+ rcu_state.gp_seq_polled_exp_snap = 0;
} else {
*snap = 0;
}
@@ -3913,10 +3914,10 @@ void synchronize_rcu(void)
"Illegal synchronize_rcu() in RCU read-side critical section");
if (rcu_blocking_is_gp()) {
// Note well that this code runs with !PREEMPT && !SMP.
- // In addition, all code that advances grace periods runs
- // at process level. Therefore, this GP overlaps with other
- // GPs only by being fully nested within them, which allows
- // reuse of ->gp_seq_polled_snap.
+ // In addition, all code that advances grace periods runs at
+ // process level. Therefore, this normal GP overlaps with
+ // other normal GPs only by being fully nested within them,
+ // which allows reuse of ->gp_seq_polled_snap.
rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
if (rcu_init_invoked())
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 9c853033f159d..5634e76106c48 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -325,6 +325,7 @@ struct rcu_state {
unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
unsigned long gp_seq_polled; /* GP seq for polled API. */
unsigned long gp_seq_polled_snap; /* ->gp_seq_polled at normal GP start. */
+ unsigned long gp_seq_polled_exp_snap; /* ->gp_seq_polled at expedited GP start. */

/* End of fields guarded by root rcu_node's lock. */

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 0f70f62039a90..e0258066b881e 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -18,6 +18,7 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp);
static void rcu_exp_gp_seq_start(void)
{
rcu_seq_start(&rcu_state.expedited_sequence);
+ rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_exp_snap);
}

/*
@@ -34,6 +35,7 @@ static __maybe_unused unsigned long rcu_exp_gp_seq_endval(void)
*/
static void rcu_exp_gp_seq_end(void)
{
+ rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_exp_snap);
rcu_seq_end(&rcu_state.expedited_sequence);
smp_mb(); /* Ensure that consecutive grace periods serialize. */
}
@@ -913,8 +915,18 @@ void synchronize_rcu_expedited(void)
"Illegal synchronize_rcu_expedited() in RCU read-side critical section");

/* Is the state is such that the call is a grace period? */
- if (rcu_blocking_is_gp())
- return;
+ if (rcu_blocking_is_gp()) {
+ // Note well that this code runs with !PREEMPT && !SMP.
+ // In addition, all code that advances grace periods runs
+ // at process level. Therefore, this expedited GP overlaps
+ // with other expedited GPs only by being fully nested within
+ // them, which allows reuse of ->gp_seq_polled_exp_snap.
+ rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_exp_snap);
+ rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_exp_snap);
+ if (rcu_init_invoked())
+ cond_resched();
+ return; // Context allows vacuous grace periods.
+ }

/* If expedited grace periods are prohibited, fall back to normal. */
if (rcu_gp_is_normal()) {
--
2.31.1.189.g2e36527f23

2022-06-20 22:56:08

by Paul E. McKenney

Subject: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

This commit switches the existing polled grace-period APIs to use a
new ->gp_seq_polled counter in the rcu_state structure. An additional
->gp_seq_polled_snap counter in that same structure allows the normal
grace period kthread to interact properly with the !SMP !PREEMPT fastpath
through synchronize_rcu(). The first of the two to note the end of a
given grace period will make knowledge of this transition available to
the polled API.

This commit is in preparation for polled expedited grace periods.
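
For reference, a brief sketch (my own summary of the existing
rcu_seq_*() helpers in kernel/rcu/rcu.h, not new code) of the
sequence-counter encoding that ->gp_seq_polled reuses:

/*
 * gp_seq_polled layout (RCU_SEQ_CTR_SHIFT == 2):
 *
 *	|  grace-period count  | state |
 *	 63                   2 1     0
 *
 * rcu_seq_start():  leaves the count alone, makes the state nonzero
 *                   (grace period in flight).
 * rcu_seq_end():    advances the count by one, returns the state to zero.
 * rcu_seq_snap():   returns the earliest counter value at which a full
 *                   grace period is guaranteed to have elapsed.
 * rcu_seq_state():  extracts the low-order state bits.
 */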

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 90 +++++++++++++++++++++++++++++++++++++++++++++--
kernel/rcu/tree.h | 2 ++
2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 46cfceea87847..637e8f9454573 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1775,6 +1775,78 @@ static void rcu_strict_gp_boundary(void *unused)
invoke_rcu_core();
}

+// Has rcu_init() been invoked? This is used (for example) to determine
+// whether spinlocks may be acquired safely.
+static bool rcu_init_invoked(void)
+{
+ return !!rcu_state.n_online_cpus;
+}
+
+// Make the polled API aware of the beginning of a grace period.
+static void rcu_poll_gp_seq_start(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked())
+ raw_lockdep_assert_held_rcu_node(rnp);
+
+ // If RCU was idle, note beginning of GP.
+ if (!rcu_seq_state(rcu_state.gp_seq_polled))
+ rcu_seq_start(&rcu_state.gp_seq_polled);
+
+ // Either way, record current state.
+ *snap = rcu_state.gp_seq_polled;
+}
+
+// Make the polled API aware of the end of a grace period.
+static void rcu_poll_gp_seq_end(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked())
+ raw_lockdep_assert_held_rcu_node(rnp);
+
+ // If the previously noted GP is still in effect, record the
+ // end of that GP. Either way, zero counter to avoid counter-wrap
+ // problems.
+ if (*snap && *snap == rcu_state.gp_seq_polled) {
+ rcu_seq_end(&rcu_state.gp_seq_polled);
+ rcu_state.gp_seq_polled_snap = 0;
+ } else {
+ *snap = 0;
+ }
+}
+
+// Make the polled API aware of the beginning of a grace period, but
+// where caller does not hold the root rcu_node structure's lock.
+static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked()) {
+ lockdep_assert_irqs_enabled();
+ raw_spin_lock_irq_rcu_node(rnp);
+ }
+ rcu_poll_gp_seq_start(snap);
+ if (rcu_init_invoked())
+ raw_spin_unlock_irq_rcu_node(rnp);
+}
+
+// Make the polled API aware of the end of a grace period, but where
+// caller does not hold the root rcu_node structure's lock.
+static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked()) {
+ lockdep_assert_irqs_enabled();
+ raw_spin_lock_irq_rcu_node(rnp);
+ }
+ rcu_poll_gp_seq_end(snap);
+ if (rcu_init_invoked())
+ raw_spin_unlock_irq_rcu_node(rnp);
+}
+
/*
* Initialize a new grace period. Return false if no grace period required.
*/
@@ -1810,6 +1882,7 @@ static noinline_for_stack bool rcu_gp_init(void)
rcu_seq_start(&rcu_state.gp_seq);
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
+ rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);

/*
@@ -2069,6 +2142,7 @@ static noinline void rcu_gp_cleanup(void)
* safe for us to drop the lock in order to mark the grace
* period as completed in all of the rcu_node structures.
*/
+ rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);

/*
@@ -3837,8 +3911,18 @@ void synchronize_rcu(void)
lock_is_held(&rcu_lock_map) ||
lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_rcu() in RCU read-side critical section");
- if (rcu_blocking_is_gp())
+ if (rcu_blocking_is_gp()) {
+ // Note well that this code runs with !PREEMPT && !SMP.
+ // In addition, all code that advances grace periods runs
+ // at process level. Therefore, this GP overlaps with other
+ // GPs only by being fully nested within them, which allows
+ // reuse of ->gp_seq_polled_snap.
+ rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
+ rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
+ if (rcu_init_invoked())
+ cond_resched_tasks_rcu_qs();
return; // Context allows vacuous grace periods.
+ }
if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
@@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
* before the load from ->gp_seq.
*/
smp_mb(); /* ^^^ */
- return rcu_seq_snap(&rcu_state.gp_seq);
+ return rcu_seq_snap(&rcu_state.gp_seq_polled);
}
EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

@@ -3925,7 +4009,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
bool poll_state_synchronize_rcu(unsigned long oldstate)
{
if (oldstate == RCU_GET_STATE_COMPLETED ||
- rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
+ rcu_seq_done_exact(&rcu_state.gp_seq_polled, oldstate)) {
smp_mb(); /* Ensure GP ends before subsequent accesses. */
return true;
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2ccf5845957df..9c853033f159d 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -323,6 +323,8 @@ struct rcu_state {
short gp_state; /* GP kthread sleep state. */
unsigned long gp_wake_time; /* Last GP kthread wake. */
unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
+ unsigned long gp_seq_polled; /* GP seq for polled API. */
+ unsigned long gp_seq_polled_snap; /* ->gp_seq_polled at normal GP start. */

/* End of fields guarded by root rcu_node's lock. */

--
2.31.1.189.g2e36527f23

2022-06-20 23:11:29

by Paul E. McKenney

Subject: [PATCH rcu 08/12] rcu: Add polled expedited grace-period primitives

This commit adds expedited grace-period functionality to RCU's polled
grace-period API, adding start_poll_synchronize_rcu_expedited() and
cond_synchronize_rcu_expedited(), which are similar to the existing
start_poll_synchronize_rcu() and cond_synchronize_rcu() functions,
respectively.

Note that although start_poll_synchronize_rcu_expedited() can be invoked
very early, the resulting expedited grace periods are not guaranteed
to start until after workqueues are fully initialized. On the other
hand, both synchronize_rcu() and synchronize_rcu_expedited() can also
be invoked very early, and the resulting grace periods will be taken
into account as they occur.
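
A minimal caller-side sketch of the new primitives (my own
illustration; do_cleanup() is a hypothetical stand-in for whatever the
caller does once the grace period has elapsed):

	unsigned long cookie;

	cookie = start_poll_synchronize_rcu_expedited();	/* Kick off an expedited GP if needed. */

	/* ... other work can overlap the grace period ... */

	if (!poll_state_synchronize_rcu(cookie))
		cond_synchronize_rcu_expedited(cookie);	/* Wait (expedited) only if still needed. */
	do_cleanup();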

[ paulmck: Apply feedback from Neeraj Upadhyay. ]

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcutiny.h | 10 +++++
include/linux/rcutree.h | 2 +
kernel/rcu/tree.c | 17 ++++++---
kernel/rcu/tree.h | 7 ++++
kernel/rcu/tree_exp.h | 85 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 5fed476f977f6..ab7e20dfb07b0 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -23,6 +23,16 @@ static inline void cond_synchronize_rcu(unsigned long oldstate)
might_sleep();
}

+static inline unsigned long start_poll_synchronize_rcu_expedited(void)
+{
+ return start_poll_synchronize_rcu();
+}
+
+static inline void cond_synchronize_rcu_expedited(unsigned long oldstate)
+{
+ cond_synchronize_rcu(oldstate);
+}
+
extern void rcu_barrier(void);

static inline void synchronize_rcu_expedited(void)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 9c6cfb742504f..20dbaa9a38820 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -40,6 +40,8 @@ bool rcu_eqs_special_set(int cpu);
void rcu_momentary_dyntick_idle(void);
void kfree_rcu_scheduler_running(void);
bool rcu_gp_might_be_stalled(void);
+unsigned long start_poll_synchronize_rcu_expedited(void);
+void cond_synchronize_rcu_expedited(unsigned long oldstate);
unsigned long get_state_synchronize_rcu(void);
unsigned long start_poll_synchronize_rcu(void);
bool poll_state_synchronize_rcu(unsigned long oldstate);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 251eb9a8cd925..8b6e5414fbaa7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4021,20 +4021,20 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);
/**
* cond_synchronize_rcu - Conditionally wait for an RCU grace period
*
- * @oldstate: value from get_state_synchronize_rcu() or start_poll_synchronize_rcu()
+ * @oldstate: value from get_state_synchronize_rcu(), start_poll_synchronize_rcu(), or start_poll_synchronize_rcu_expedited()
*
* If a full RCU grace period has elapsed since the earlier call to
* get_state_synchronize_rcu() or start_poll_synchronize_rcu(), just return.
* Otherwise, invoke synchronize_rcu() to wait for a full grace period.
*
- * Yes, this function does not take counter wrap into account. But
- * counter wrap is harmless. If the counter wraps, we have waited for
+ * Yes, this function does not take counter wrap into account.
+ * But counter wrap is harmless. If the counter wraps, we have waited for
* more than 2 billion grace periods (and way more on a 64-bit system!),
- * so waiting for one additional grace period should be just fine.
+ * so waiting for a couple of additional grace periods should be just fine.
*
* This function provides the same memory-ordering guarantees that
* would be provided by a synchronize_rcu() that was invoked at the call
- * to the function that provided @oldstate, and that returned at the end
+ * to the function that provided @oldstate and that returned at the end
* of this function.
*/
void cond_synchronize_rcu(unsigned long oldstate)
@@ -4787,6 +4787,9 @@ static void __init rcu_init_one(void)
init_waitqueue_head(&rnp->exp_wq[3]);
spin_lock_init(&rnp->exp_lock);
mutex_init(&rnp->boost_kthread_mutex);
+ raw_spin_lock_init(&rnp->exp_poll_lock);
+ rnp->exp_seq_poll_rq = RCU_GET_STATE_COMPLETED;
+ INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp);
}
}

@@ -5012,6 +5015,10 @@ void __init rcu_init(void)
qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
else
qovld_calc = qovld;
+
+ // Kick-start any polled grace periods that started early.
+ if (!(per_cpu_ptr(&rcu_data, cpu)->mynode->exp_seq_poll_rq & 0x1))
+ (void)start_poll_synchronize_rcu_expedited();
}

#include "tree_stall.h"
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 5634e76106c48..fb77deca5f5c6 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -133,6 +133,10 @@ struct rcu_node {
wait_queue_head_t exp_wq[4];
struct rcu_exp_work rew;
bool exp_need_flush; /* Need to flush workitem? */
+ raw_spinlock_t exp_poll_lock;
+ /* Lock and data for polled expedited grace periods. */
+ unsigned long exp_seq_poll_rq;
+ struct work_struct exp_poll_wq;
} ____cacheline_internodealigned_in_smp;

/*
@@ -484,3 +488,6 @@ static void rcu_iw_handler(struct irq_work *iwp);
static void check_cpu_stall(struct rcu_data *rdp);
static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
const unsigned long gpssdelay);
+
+/* Forward declarations for tree_exp.h. */
+static void sync_rcu_do_polled_gp(struct work_struct *wp);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e0258066b881e..571b0a700cced 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -962,3 +962,88 @@ void synchronize_rcu_expedited(void)
synchronize_rcu_expedited_destroy_work(&rew);
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
+
+/*
+ * Ensure that start_poll_synchronize_rcu_expedited() has the expedited
+ * RCU grace periods that it needs.
+ */
+static void sync_rcu_do_polled_gp(struct work_struct *wp)
+{
+ unsigned long flags;
+ struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq);
+ unsigned long s;
+
+ raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+ s = rnp->exp_seq_poll_rq;
+ rnp->exp_seq_poll_rq = RCU_GET_STATE_COMPLETED;
+ raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+ if (s == RCU_GET_STATE_COMPLETED)
+ return;
+ while (!poll_state_synchronize_rcu(s))
+ synchronize_rcu_expedited();
+ raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+ s = rnp->exp_seq_poll_rq;
+ if (poll_state_synchronize_rcu(s))
+ rnp->exp_seq_poll_rq = RCU_GET_STATE_COMPLETED;
+ raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+}
+
+/**
+ * start_poll_synchronize_rcu_expedited - Snapshot current RCU state and start expedited grace period
+ *
+ * Returns a cookie to pass to a call to cond_synchronize_rcu(),
+ * cond_synchronize_rcu_expedited(), or poll_state_synchronize_rcu(),
+ * allowing them to determine whether or not any sort of grace period has
+ * elapsed in the meantime. If the needed expedited grace period is not
+ * already slated to start, initiates that grace period.
+ */
+unsigned long start_poll_synchronize_rcu_expedited(void)
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+ struct rcu_node *rnp;
+ unsigned long s;
+
+ s = get_state_synchronize_rcu();
+ rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id());
+ rnp = rdp->mynode;
+ if (rcu_init_invoked())
+ raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+ if (!poll_state_synchronize_rcu(s)) {
+ rnp->exp_seq_poll_rq = s;
+ if (rcu_init_invoked())
+ queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
+ }
+ if (rcu_init_invoked())
+ raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+
+ return s;
+}
+EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited);
+
+/**
+ * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period
+ *
+ * @oldstate: value from get_state_synchronize_rcu(), start_poll_synchronize_rcu(), or start_poll_synchronize_rcu_expedited()
+ *
+ * If any type of full RCU grace period has elapsed since the earlier
+ * call to get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
+ * or start_poll_synchronize_rcu_expedited(), just return. Otherwise,
+ * invoke synchronize_rcu_expedited() to wait for a full grace period.
+ *
+ * Yes, this function does not take counter wrap into account.
+ * But counter wrap is harmless. If the counter wraps, we have waited for
+ * more than 2 billion grace periods (and way more on a 64-bit system!),
+ * so waiting for a couple of additional grace periods should be just fine.
+ *
+ * This function provides the same memory-ordering guarantees that
+ * would be provided by a synchronize_rcu() that was invoked at the call
+ * to the function that provided @oldstate and that returned at the end
+ * of this function.
+ */
+void cond_synchronize_rcu_expedited(unsigned long oldstate)
+{
+ if (!poll_state_synchronize_rcu(oldstate))
+ synchronize_rcu_expedited();
+}
+EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited);
--
2.31.1.189.g2e36527f23

2022-06-20 23:14:47

by Paul E. McKenney

Subject: [PATCH rcu 03/12] rcutorture: Validate get_completed_synchronize_rcu()

This commit verifies that the RCU grace-period state cookie returned
from get_completed_synchronize_rcu() causes poll_state_synchronize_rcu()
to return true, as required.

This commit is in preparation for polled expedited grace periods.
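
In other words, the new check asserts the following invariant (a
one-line sketch; the actual rcutorture hook appears in the diff below):

	WARN_ON_ONCE(!poll_state_synchronize_rcu(get_completed_synchronize_rcu()));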

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7120165a93426..4ceec9f4169c7 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -338,6 +338,7 @@ struct rcu_torture_ops {
void (*sync)(void);
void (*exp_sync)(void);
unsigned long (*get_gp_state)(void);
+ unsigned long (*get_gp_completed)(void);
unsigned long (*start_gp_poll)(void);
bool (*poll_gp_state)(unsigned long oldstate);
void (*cond_sync)(unsigned long oldstate);
@@ -504,6 +505,7 @@ static struct rcu_torture_ops rcu_ops = {
.sync = synchronize_rcu,
.exp_sync = synchronize_rcu_expedited,
.get_gp_state = get_state_synchronize_rcu,
+ .get_gp_completed = get_completed_synchronize_rcu,
.start_gp_poll = start_poll_synchronize_rcu,
.poll_gp_state = poll_state_synchronize_rcu,
.cond_sync = cond_synchronize_rcu,
@@ -1254,6 +1256,10 @@ rcu_torture_writer(void *arg)
rcu_torture_writer_state_getname(),
rcu_torture_writer_state,
cookie, cur_ops->get_gp_state());
+ if (cur_ops->get_gp_completed) {
+ cookie = cur_ops->get_gp_completed();
+ WARN_ON_ONCE(!cur_ops->poll_gp_state(cookie));
+ }
cur_ops->readunlock(idx);
}
switch (synctype[torture_random(&rand) % nsynctypes]) {
--
2.31.1.189.g2e36527f23

2022-06-21 00:02:22

by Paul E. McKenney

Subject: [PATCH rcu 02/12] rcu: Provide a get_completed_synchronize_rcu() function

It is currently up to the caller to handle stale return values from
get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned
true once, a grace period has elapsed, regardless of the fact that counter
wrap might cause some future poll_state_synchronize_rcu() invocation to
return false. For example, the caller might store a separate flag that
indicates whether some previous call to poll_state_synchronize_rcu()
determined that the relevant grace period had already ended.

This approach works, but it requires extra storage and is easy to get
wrong. This commit therefore introduces a get_completed_synchronize_rcu()
that returns a cookie that causes poll_state_synchronize_rcu() to always
return true. This already-completed cookie can be stored in place of the
cookie that previously caused poll_state_synchronize_rcu() to return true.
It can also be used to flag a given structure as not having been exposed
to readers, and thus not requiring a grace period to elapse.
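
As a hedged sketch of that last use case (my own illustration; struct
foo, global_foo, and the foo_*() helpers are hypothetical names, not
from this patch), a structure's polled-GP cookie can be initialized to
the already-completed value and only replaced once the structure has
actually been exposed to readers:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rh;
	unsigned long rcu_cookie;
	/* ... payload ... */
};

static struct foo __rcu *global_foo;	/* Hypothetical published pointer. */

static void foo_init(struct foo *fp)
{
	/* Not yet visible to readers, so no grace period will be needed. */
	fp->rcu_cookie = get_completed_synchronize_rcu();
}

static void foo_unpublish(struct foo *fp)
{
	rcu_assign_pointer(global_foo, NULL);
	/* Record when the structure was last visible to readers. */
	fp->rcu_cookie = get_state_synchronize_rcu();
}

static void foo_free_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

static void foo_free(struct foo *fp)
{
	if (poll_state_synchronize_rcu(fp->rcu_cookie))
		kfree(fp);	/* Never exposed, or a full GP has already elapsed. */
	else
		call_rcu(&fp->rh, foo_free_cb);
}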

This commit is in preparation for polled expedited grace periods.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 1 +
kernel/rcu/rcu.h | 3 +++
kernel/rcu/tiny.c | 4 ++--
kernel/rcu/tree.c | 3 ++-
kernel/rcu/update.c | 13 +++++++++++++
5 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1a32036c918cd..7f12daa4708b1 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -41,6 +41,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func);
void rcu_barrier_tasks(void);
void rcu_barrier_tasks_rude(void);
void synchronize_rcu(void);
+unsigned long get_completed_synchronize_rcu(void);

#ifdef CONFIG_PREEMPT_RCU

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 0adb55941aeb3..32291f4eefde9 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -23,6 +23,9 @@
#define RCU_SEQ_CTR_SHIFT 2
#define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)

+/* Low-order bit definition for polled grace-period APIs. */
+#define RCU_GET_STATE_COMPLETED 0x1
+
extern int sysctl_sched_rt_runtime;

/*
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 340b3f8b090d4..dbee6bea67269 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -58,7 +58,7 @@ void rcu_qs(void)
rcu_ctrlblk.donetail = rcu_ctrlblk.curtail;
raise_softirq_irqoff(RCU_SOFTIRQ);
}
- WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 1);
+ WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 2);
local_irq_restore(flags);
}

@@ -213,7 +213,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
*/
bool poll_state_synchronize_rcu(unsigned long oldstate)
{
- return READ_ONCE(rcu_ctrlblk.gp_seq) != oldstate;
+ return oldstate == RCU_GET_STATE_COMPLETED || READ_ONCE(rcu_ctrlblk.gp_seq) != oldstate;
}
EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ec28e259774e7..46cfceea87847 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3924,7 +3924,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
*/
bool poll_state_synchronize_rcu(unsigned long oldstate)
{
- if (rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
+ if (oldstate == RCU_GET_STATE_COMPLETED ||
+ rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
smp_mb(); /* Ensure GP ends before subsequent accesses. */
return true;
}
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index fc7fef5756064..2e93acad1e31f 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -516,6 +516,19 @@ int rcu_cpu_stall_suppress_at_boot __read_mostly; // !0 = suppress boot stalls.
EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress_at_boot);
module_param(rcu_cpu_stall_suppress_at_boot, int, 0444);

+/**
+ * get_completed_synchronize_rcu - Return a pre-completed polled state cookie
+ *
+ * Returns a value that will always be treated by functions like
+ * poll_state_synchronize_rcu() as a cookie whose grace period has already
+ * completed.
+ */
+unsigned long get_completed_synchronize_rcu(void)
+{
+ return RCU_GET_STATE_COMPLETED;
+}
+EXPORT_SYMBOL_GPL(get_completed_synchronize_rcu);
+
#ifdef CONFIG_PROVE_RCU

/*
--
2.31.1.189.g2e36527f23

2022-06-21 00:11:30

by Paul E. McKenney

Subject: [PATCH rcu 10/12] rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings

From: Zqiang <[email protected]>

When a normal RCU CPU stall warning is encountered with the
panic_on_rcu_stall sysfs variable set, the system panics only after
the stall warning is printed. But when an expedited RCU CPU stall
warning is encountered with the panic_on_rcu_stall sysfs variable
set, the system panics first, thus never printing the stall warning.
This commit therefore brings the expedited stall warning into line with
the normal stall warning by printing first and panicking afterwards.

Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_exp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 571b0a700cced..f05a15b11fa0c 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -623,7 +623,6 @@ static void synchronize_rcu_expedited_wait(void)
return;
if (rcu_stall_is_suppressed())
continue;
- panic_on_rcu_stall();
trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall"));
pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
rcu_state.name);
@@ -671,6 +670,7 @@ static void synchronize_rcu_expedited_wait(void)
}
}
jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3;
+ panic_on_rcu_stall();
}
}

--
2.31.1.189.g2e36527f23

2022-07-21 01:37:35

by Boqun Feng

Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

Hi Paul,

On Mon, Jun 20, 2022 at 03:51:20PM -0700, Paul E. McKenney wrote:
> This commit switches the existing polled grace-period APIs to use a
> new ->gp_seq_polled counter in the rcu_state structure. An additional
> ->gp_seq_polled_snap counter in that same structure allows the normal
> grace period kthread to interact properly with the !SMP !PREEMPT fastpath
> through synchronize_rcu(). The first of the two to note the end of a
> given grace period will make knowledge of this transition available to
> the polled API.
>
> This commit is in preparation for polled expedited grace periods.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
> Cc: Brian Foster <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Ian Kent <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 90 +++++++++++++++++++++++++++++++++++++++++++++--
> kernel/rcu/tree.h | 2 ++
> 2 files changed, 89 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 46cfceea87847..637e8f9454573 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1775,6 +1775,78 @@ static void rcu_strict_gp_boundary(void *unused)
> invoke_rcu_core();
> }
>
> +// Has rcu_init() been invoked? This is used (for example) to determine
> +// whether spinlocks may be acquired safely.
> +static bool rcu_init_invoked(void)
> +{
> + return !!rcu_state.n_online_cpus;
> +}
> +
> +// Make the polled API aware of the beginning of a grace period.
> +static void rcu_poll_gp_seq_start(unsigned long *snap)
> +{
> + struct rcu_node *rnp = rcu_get_root();
> +
> + if (rcu_init_invoked())
> + raw_lockdep_assert_held_rcu_node(rnp);
> +
> + // If RCU was idle, note beginning of GP.
> + if (!rcu_seq_state(rcu_state.gp_seq_polled))
> + rcu_seq_start(&rcu_state.gp_seq_polled);
> +
> + // Either way, record current state.
> + *snap = rcu_state.gp_seq_polled;
> +}
> +
> +// Make the polled API aware of the end of a grace period.
> +static void rcu_poll_gp_seq_end(unsigned long *snap)
> +{
> + struct rcu_node *rnp = rcu_get_root();
> +
> + if (rcu_init_invoked())
> + raw_lockdep_assert_held_rcu_node(rnp);
> +
> > + // If the previously noted GP is still in effect, record the
> + // end of that GP. Either way, zero counter to avoid counter-wrap
> + // problems.
> + if (*snap && *snap == rcu_state.gp_seq_polled) {
> + rcu_seq_end(&rcu_state.gp_seq_polled);
> + rcu_state.gp_seq_polled_snap = 0;
> + } else {
> + *snap = 0;
> + }
> +}
> +
> +// Make the polled API aware of the beginning of a grace period, but
> +// where caller does not hold the root rcu_node structure's lock.
> +static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
> +{
> + struct rcu_node *rnp = rcu_get_root();
> +
> + if (rcu_init_invoked()) {
> + lockdep_assert_irqs_enabled();
> + raw_spin_lock_irq_rcu_node(rnp);
> + }
> + rcu_poll_gp_seq_start(snap);
> + if (rcu_init_invoked())
> + raw_spin_unlock_irq_rcu_node(rnp);
> +}
> +
> +// Make the polled API aware of the end of a grace period, but where
> +// caller does not hold the root rcu_node structure's lock.
> +static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> +{
> + struct rcu_node *rnp = rcu_get_root();
> +
> + if (rcu_init_invoked()) {
> + lockdep_assert_irqs_enabled();
> + raw_spin_lock_irq_rcu_node(rnp);
> + }
> + rcu_poll_gp_seq_end(snap);
> + if (rcu_init_invoked())
> + raw_spin_unlock_irq_rcu_node(rnp);
> +}
> +
> /*
> * Initialize a new grace period. Return false if no grace period required.
> */
> @@ -1810,6 +1882,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> rcu_seq_start(&rcu_state.gp_seq);
> ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> + rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> raw_spin_unlock_irq_rcu_node(rnp);
>
> /*
> @@ -2069,6 +2142,7 @@ static noinline void rcu_gp_cleanup(void)
> * safe for us to drop the lock in order to mark the grace
> * period as completed in all of the rcu_node structures.
> */
> + rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap);
> raw_spin_unlock_irq_rcu_node(rnp);
>
> /*
> @@ -3837,8 +3911,18 @@ void synchronize_rcu(void)
> lock_is_held(&rcu_lock_map) ||
> lock_is_held(&rcu_sched_lock_map),
> "Illegal synchronize_rcu() in RCU read-side critical section");
> - if (rcu_blocking_is_gp())
> + if (rcu_blocking_is_gp()) {
> + // Note well that this code runs with !PREEMPT && !SMP.
> + // In addition, all code that advances grace periods runs
> + // at process level. Therefore, this GP overlaps with other
> + // GPs only by being fully nested within them, which allows
> + // reuse of ->gp_seq_polled_snap.
> + rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
> + rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
> + if (rcu_init_invoked())
> + cond_resched_tasks_rcu_qs();
> return; // Context allows vacuous grace periods.
> + }
> if (rcu_gp_is_expedited())
> synchronize_rcu_expedited();
> else
> @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> * before the load from ->gp_seq.
> */
> smp_mb(); /* ^^^ */
> - return rcu_seq_snap(&rcu_state.gp_seq);
> + return rcu_seq_snap(&rcu_state.gp_seq_polled);

I happened to run into this. There is one usage of
get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
the return value of get_state_synchronize_rcu() ("gp_seq") will be used
for rcu_start_this_gp(). I don't think this is quite right, because
after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
different values; in fact, ->gp_seq_polled is greater than ->gp_seq
by however many times synchronize_rcu() is called in early boot.

Am I missing something here?

Regards,
Boqun

> }
> EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
>
> @@ -3925,7 +4009,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
> bool poll_state_synchronize_rcu(unsigned long oldstate)
> {
> if (oldstate == RCU_GET_STATE_COMPLETED ||
> - rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
> + rcu_seq_done_exact(&rcu_state.gp_seq_polled, oldstate)) {
> smp_mb(); /* Ensure GP ends before subsequent accesses. */
> return true;
> }
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 2ccf5845957df..9c853033f159d 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -323,6 +323,8 @@ struct rcu_state {
> short gp_state; /* GP kthread sleep state. */
> unsigned long gp_wake_time; /* Last GP kthread wake. */
> unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
> + unsigned long gp_seq_polled; /* GP seq for polled API. */
> + unsigned long gp_seq_polled_snap; /* ->gp_seq_polled at normal GP start. */
>
> /* End of fields guarded by root rcu_node's lock. */
>
> --
> 2.31.1.189.g2e36527f23
>

2022-07-21 01:39:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 05:53:38PM -0700, Boqun Feng wrote:
> Hi Paul,
>
> On Mon, Jun 20, 2022 at 03:51:20PM -0700, Paul E. McKenney wrote:
> > This commit switches the existing polled grace-period APIs to use a
> > new ->gp_seq_polled counter in the rcu_state structure. An additional
> > ->gp_seq_polled_snap counter in that same structure allows the normal
> > grace period kthread to interact properly with the !SMP !PREEMPT fastpath
> > through synchronize_rcu(). The first of the two to note the end of a
> > given grace period will make knowledge of this transition available to
> > the polled API.
> >
> > This commit is in preparation for polled expedited grace periods.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
> > Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
> > Cc: Brian Foster <[email protected]>
> > Cc: Dave Chinner <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Ian Kent <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 90 +++++++++++++++++++++++++++++++++++++++++++++--
> > kernel/rcu/tree.h | 2 ++
> > 2 files changed, 89 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 46cfceea87847..637e8f9454573 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1775,6 +1775,78 @@ static void rcu_strict_gp_boundary(void *unused)
> > invoke_rcu_core();
> > }
> >
> > +// Has rcu_init() been invoked? This is used (for example) to determine
> > +// whether spinlocks may be acquired safely.
> > +static bool rcu_init_invoked(void)
> > +{
> > + return !!rcu_state.n_online_cpus;
> > +}
> > +
> > +// Make the polled API aware of the beginning of a grace period.
> > +static void rcu_poll_gp_seq_start(unsigned long *snap)
> > +{
> > + struct rcu_node *rnp = rcu_get_root();
> > +
> > + if (rcu_init_invoked())
> > + raw_lockdep_assert_held_rcu_node(rnp);
> > +
> > + // If RCU was idle, note beginning of GP.
> > + if (!rcu_seq_state(rcu_state.gp_seq_polled))
> > + rcu_seq_start(&rcu_state.gp_seq_polled);
> > +
> > + // Either way, record current state.
> > + *snap = rcu_state.gp_seq_polled;
> > +}
> > +
> > +// Make the polled API aware of the end of a grace period.
> > +static void rcu_poll_gp_seq_end(unsigned long *snap)
> > +{
> > + struct rcu_node *rnp = rcu_get_root();
> > +
> > + if (rcu_init_invoked())
> > + raw_lockdep_assert_held_rcu_node(rnp);
> > +
> > + // If the previously noted GP is still in effect, record the
> > + // end of that GP. Either way, zero counter to avoid counter-wrap
> > + // problems.
> > + if (*snap && *snap == rcu_state.gp_seq_polled) {
> > + rcu_seq_end(&rcu_state.gp_seq_polled);
> > + rcu_state.gp_seq_polled_snap = 0;
> > + } else {
> > + *snap = 0;
> > + }
> > +}
> > +
> > +// Make the polled API aware of the beginning of a grace period, but
> > +// where caller does not hold the root rcu_node structure's lock.
> > +static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
> > +{
> > + struct rcu_node *rnp = rcu_get_root();
> > +
> > + if (rcu_init_invoked()) {
> > + lockdep_assert_irqs_enabled();
> > + raw_spin_lock_irq_rcu_node(rnp);
> > + }
> > + rcu_poll_gp_seq_start(snap);
> > + if (rcu_init_invoked())
> > + raw_spin_unlock_irq_rcu_node(rnp);
> > +}
> > +
> > +// Make the polled API aware of the end of a grace period, but where
> > +// caller does not hold the root rcu_node structure's lock.
> > +static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> > +{
> > + struct rcu_node *rnp = rcu_get_root();
> > +
> > + if (rcu_init_invoked()) {
> > + lockdep_assert_irqs_enabled();
> > + raw_spin_lock_irq_rcu_node(rnp);
> > + }
> > + rcu_poll_gp_seq_end(snap);
> > + if (rcu_init_invoked())
> > + raw_spin_unlock_irq_rcu_node(rnp);
> > +}
> > +
> > /*
> > * Initialize a new grace period. Return false if no grace period required.
> > */
> > @@ -1810,6 +1882,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > rcu_seq_start(&rcu_state.gp_seq);
> > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > + rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > raw_spin_unlock_irq_rcu_node(rnp);
> >
> > /*
> > @@ -2069,6 +2142,7 @@ static noinline void rcu_gp_cleanup(void)
> > * safe for us to drop the lock in order to mark the grace
> > * period as completed in all of the rcu_node structures.
> > */
> > + rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap);
> > raw_spin_unlock_irq_rcu_node(rnp);
> >
> > /*
> > @@ -3837,8 +3911,18 @@ void synchronize_rcu(void)
> > lock_is_held(&rcu_lock_map) ||
> > lock_is_held(&rcu_sched_lock_map),
> > "Illegal synchronize_rcu() in RCU read-side critical section");
> > - if (rcu_blocking_is_gp())
> > + if (rcu_blocking_is_gp()) {
> > + // Note well that this code runs with !PREEMPT && !SMP.
> > + // In addition, all code that advances grace periods runs
> > + // at process level. Therefore, this GP overlaps with other
> > + // GPs only by being fully nested within them, which allows
> > + // reuse of ->gp_seq_polled_snap.
> > + rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
> > + rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
> > + if (rcu_init_invoked())
> > + cond_resched_tasks_rcu_qs();
> > return; // Context allows vacuous grace periods.
> > + }
> > if (rcu_gp_is_expedited())
> > synchronize_rcu_expedited();
> > else
> > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > * before the load from ->gp_seq.
> > */
> > smp_mb(); /* ^^^ */
> > - return rcu_seq_snap(&rcu_state.gp_seq);
> > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
>
> I happened to run into this. There is one usage of
> get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> for rcu_start_this_gp(). I don't think this is quite right, because
> after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> different values, in fact ->gp_seq_polled is greater than ->gp_seq
> by how many times synchronize_rcu() is called in early boot.
>
> Am I missing something here?

It does not appear that you are missing anything, sad to say!

Does the following make it work better?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2122359f0c862..cf2fd58a93a41 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3571,7 +3571,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
unsigned long start_poll_synchronize_rcu(void)
{
unsigned long flags;
- unsigned long gp_seq = get_state_synchronize_rcu();
+ unsigned long gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
bool needwake;
struct rcu_data *rdp;
struct rcu_node *rnp;
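
For anyone following along, the drift itself is easy to model outside the
kernel. This is a minimal userspace sketch, not kernel code: the helpers
below only paraphrase the rcu_seq arithmetic in kernel/rcu/rcu.h (two
low-order state bits), and the loop stands in for early-boot
synchronize_rcu() calls taking the rcu_blocking_is_gp() fastpath:

#include <assert.h>
#include <stdio.h>

#define SEQ_STATE_MASK 0x3UL	/* two low-order state bits */

/* First counter value at which a full grace period has elapsed. */
static unsigned long seq_snap(unsigned long s)
{
	return (s + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

static void seq_start(unsigned long *sp) { *sp += 1; }
static void seq_end(unsigned long *sp) { *sp = (*sp | SEQ_STATE_MASK) + 1; }

int main(void)
{
	unsigned long gp_seq = 0, gp_seq_polled = 0;
	int i;

	/* Each early-boot synchronize_rcu() fastpath advances only the
	 * polled counter by a full GP while gp_seq stays put. */
	for (i = 0; i < 3; i++) {
		seq_start(&gp_seq_polled);
		seq_end(&gp_seq_polled);
	}

	printf("gp_seq=%lu gp_seq_polled=%lu\n", gp_seq, gp_seq_polled);
	printf("polled cookie: %lu\n", seq_snap(gp_seq_polled));	/* 16 */
	printf("next ->gp_seq GP: %lu\n", seq_snap(gp_seq));		/* 4 */

	/* Feeding the polled cookie to rcu_start_this_gp() would request a
	 * grace period several GPs in the future rather than the next one. */
	assert(seq_snap(gp_seq_polled) > seq_snap(gp_seq));
	return 0;
}

With the fastpath having run three times, the polled cookie lands well
ahead of anything a ->gp_seq-based request would ask for, which is the
mismatch being discussed here.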

2022-07-21 02:03:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 06:04:55PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 20, 2022 at 05:53:38PM -0700, Boqun Feng wrote:
> > Hi Paul,
> >
> > On Mon, Jun 20, 2022 at 03:51:20PM -0700, Paul E. McKenney wrote:
> > > This commit switches the existing polled grace-period APIs to use a
> > > new ->gp_seq_polled counter in the rcu_state structure. An additional
> > > ->gp_seq_polled_snap counter in that same structure allows the normal
> > > grace period kthread to interact properly with the !SMP !PREEMPT fastpath
> > > through synchronize_rcu(). The first of the two to note the end of a
> > > given grace period will make knowledge of this transition available to
> > > the polled API.
> > >
> > > This commit is in preparation for polled expedited grace periods.
> > >
> > > Link: https://lore.kernel.org/all/[email protected]/
> > > Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
> > > Cc: Brian Foster <[email protected]>
> > > Cc: Dave Chinner <[email protected]>
> > > Cc: Al Viro <[email protected]>
> > > Cc: Ian Kent <[email protected]>
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > kernel/rcu/tree.c | 90 +++++++++++++++++++++++++++++++++++++++++++++--
> > > kernel/rcu/tree.h | 2 ++
> > > 2 files changed, 89 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 46cfceea87847..637e8f9454573 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1775,6 +1775,78 @@ static void rcu_strict_gp_boundary(void *unused)
> > > invoke_rcu_core();
> > > }
> > >
> > > +// Has rcu_init() been invoked? This is used (for example) to determine
> > > +// whether spinlocks may be acquired safely.
> > > +static bool rcu_init_invoked(void)
> > > +{
> > > + return !!rcu_state.n_online_cpus;
> > > +}
> > > +
> > > +// Make the polled API aware of the beginning of a grace period.
> > > +static void rcu_poll_gp_seq_start(unsigned long *snap)
> > > +{
> > > + struct rcu_node *rnp = rcu_get_root();
> > > +
> > > + if (rcu_init_invoked())
> > > + raw_lockdep_assert_held_rcu_node(rnp);
> > > +
> > > + // If RCU was idle, note beginning of GP.
> > > + if (!rcu_seq_state(rcu_state.gp_seq_polled))
> > > + rcu_seq_start(&rcu_state.gp_seq_polled);
> > > +
> > > + // Either way, record current state.
> > > + *snap = rcu_state.gp_seq_polled;
> > > +}
> > > +
> > > +// Make the polled API aware of the end of a grace period.
> > > +static void rcu_poll_gp_seq_end(unsigned long *snap)
> > > +{
> > > + struct rcu_node *rnp = rcu_get_root();
> > > +
> > > + if (rcu_init_invoked())
> > > + raw_lockdep_assert_held_rcu_node(rnp);
> > > +
> > > + // If the previously noted GP is still in effect, record the
> > > + // end of that GP. Either way, zero counter to avoid counter-wrap
> > > + // problems.
> > > + if (*snap && *snap == rcu_state.gp_seq_polled) {
> > > + rcu_seq_end(&rcu_state.gp_seq_polled);
> > > + rcu_state.gp_seq_polled_snap = 0;
> > > + } else {
> > > + *snap = 0;
> > > + }
> > > +}
> > > +
> > > +// Make the polled API aware of the beginning of a grace period, but
> > > +// where caller does not hold the root rcu_node structure's lock.
> > > +static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
> > > +{
> > > + struct rcu_node *rnp = rcu_get_root();
> > > +
> > > + if (rcu_init_invoked()) {
> > > + lockdep_assert_irqs_enabled();
> > > + raw_spin_lock_irq_rcu_node(rnp);
> > > + }
> > > + rcu_poll_gp_seq_start(snap);
> > > + if (rcu_init_invoked())
> > > + raw_spin_unlock_irq_rcu_node(rnp);
> > > +}
> > > +
> > > +// Make the polled API aware of the end of a grace period, but where
> > > +// caller does not hold the root rcu_node structure's lock.
> > > +static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> > > +{
> > > + struct rcu_node *rnp = rcu_get_root();
> > > +
> > > + if (rcu_init_invoked()) {
> > > + lockdep_assert_irqs_enabled();
> > > + raw_spin_lock_irq_rcu_node(rnp);
> > > + }
> > > + rcu_poll_gp_seq_end(snap);
> > > + if (rcu_init_invoked())
> > > + raw_spin_unlock_irq_rcu_node(rnp);
> > > +}
> > > +
> > > /*
> > > * Initialize a new grace period. Return false if no grace period required.
> > > */
> > > @@ -1810,6 +1882,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > rcu_seq_start(&rcu_state.gp_seq);
> > > ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > + rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > /*
> > > @@ -2069,6 +2142,7 @@ static noinline void rcu_gp_cleanup(void)
> > > * safe for us to drop the lock in order to mark the grace
> > > * period as completed in all of the rcu_node structures.
> > > */
> > > + rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap);
> > > raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > /*
> > > @@ -3837,8 +3911,18 @@ void synchronize_rcu(void)
> > > lock_is_held(&rcu_lock_map) ||
> > > lock_is_held(&rcu_sched_lock_map),
> > > "Illegal synchronize_rcu() in RCU read-side critical section");
> > > - if (rcu_blocking_is_gp())
> > > + if (rcu_blocking_is_gp()) {
> > > + // Note well that this code runs with !PREEMPT && !SMP.
> > > + // In addition, all code that advances grace periods runs
> > > + // at process level. Therefore, this GP overlaps with other
> > > + // GPs only by being fully nested within them, which allows
> > > + // reuse of ->gp_seq_polled_snap.
> > > + rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
> > > + rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
> > > + if (rcu_init_invoked())
> > > + cond_resched_tasks_rcu_qs();
> > > return; // Context allows vacuous grace periods.
> > > + }
> > > if (rcu_gp_is_expedited())
> > > synchronize_rcu_expedited();
> > > else
> > > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > > * before the load from ->gp_seq.
> > > */
> > > smp_mb(); /* ^^^ */
> > > - return rcu_seq_snap(&rcu_state.gp_seq);
> > > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
> >
> > I happened to run into this. There is one usage of
> > get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> > the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> > for rcu_start_this_gp(). I don't think this is quite right, because
> > after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> > different values, in fact ->gp_seq_polled is greater than ->gp_seq
> > by how many times synchronize_rcu() is called in early boot.
> >
> > Am I missing something here?
>
It does not appear that you are missing anything, sad to say!
>
> Does the following make it work better?

Well, rcutorture doesn't like this change much. ;-)

No surprise, given that it is only the value feeding into
rcu_start_this_gp() that needs to change, not the value returned from
start_poll_synchronize_rcu().

Take 2, still untested.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2122359f0c862..061c1f6737ddc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3581,7 +3581,7 @@ unsigned long start_poll_synchronize_rcu(void)
rdp = this_cpu_ptr(&rcu_data);
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp); // irqs already disabled.
- needwake = rcu_start_this_gp(rnp, rdp, gp_seq);
+ needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(&rcu_state.gp_seq));
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (needwake)
rcu_gp_kthread_wake();
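
To make the distinction concrete: the cookie that start_poll_synchronize_rcu()
hands back is only ever compared against ->gp_seq_polled by the polling side,
whereas rcu_start_this_gp() wants a ->gp_seq-relative value. A purely
illustrative caller (example_defer_free() and struct foo are made up here,
not part of this series) would use the cookie like this:

struct foo {
	struct list_head list;
	/* ... payload ... */
};

static void example_defer_free(struct foo *p)
{
	unsigned long cookie;

	list_del_rcu(&p->list);			/* unpublish the object */
	cookie = start_poll_synchronize_rcu();	/* start a GP, keep the cookie */

	/* ... possibly much later ... */

	if (!poll_state_synchronize_rcu(cookie))
		cond_synchronize_rcu(cookie);	/* block only if the GP has not ended */
	kfree(p);				/* all pre-existing readers are done */
}

Either way, the cookie the caller polls on must stay in ->gp_seq_polled units,
which is why only the value fed to rcu_start_this_gp() changes above.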

2022-07-21 02:03:48

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 06:04:55PM -0700, Paul E. McKenney wrote:
[...]
> > > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > > * before the load from ->gp_seq.
> > > */
> > > smp_mb(); /* ^^^ */
> > > - return rcu_seq_snap(&rcu_state.gp_seq);
> > > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
> >
> > I happened to run into this. There is one usage of
> > get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> > the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> > for rcu_start_this_gp(). I don't think this is quite right, because
> > after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> > different values, in fact ->gp_seq_polled is greater than ->gp_seq
> > by how many times synchronize_rcu() is called in early boot.
> >
> > Am I missing something here?
>
> It does not appear that your are missing anything, sad to say!
>
> Does the following make it work better?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2122359f0c862..cf2fd58a93a41 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3571,7 +3571,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> unsigned long start_poll_synchronize_rcu(void)
> {
> unsigned long flags;
> - unsigned long gp_seq = get_state_synchronize_rcu();
> + unsigned long gp_seq = rcu_seq_snap(&rcu_state.gp_seq);

get_state_synchronize_rcu() is still needed, because we need to return
a cookie for polling for this function. Something like below maybe? Hope
I didn't mess up the ordering ;-)

Regards,
Boqun

---------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84d281776688..0f9134871289 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3571,11 +3583,39 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
unsigned long start_poll_synchronize_rcu(void)
{
unsigned long flags;
- unsigned long gp_seq = get_state_synchronize_rcu();
+ unsigned long gp_seq_poll = get_state_synchronize_rcu();
+ unsigned long gp_seq;
bool needwake;
struct rcu_data *rdp;
struct rcu_node *rnp;

+ /*
+ * Need to start a gp if no gp has been started yet.
+ *
+ * Note that we need to snapshot gp_seq after gp_seq_poll, otherwise
+ * consider the following case:
+ *
+ * <no gp in progress> // gp# is 0
+ * snapshot gp_seq // gp #2 will be set as needed
+ * <a gp passed>
+ * // gp# is 1
+ * snapshot gp_seq_poll // polling not ready until gp #3
+ *
+ * then the following rcu_start_this_gp() won't mark gp #3 as needed,
+ * and polling won't become ready if others don't start a gp.
+ *
+ * And the following case is fine:
+ *
+ * <no gp in progress> // gp# is 0
+ * snapshot gp_seq_poll // polling not ready until gp #2
+ * <a gp passed>
+ * // gp# is 1
+ * snapshot gp_seq // gp #3 will be set as needed
+ *
+ * Also note, we rely on the smp_mb() in get_state_synchronize_rcu()
+ * to order the two snapshots.
+ */
+ gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
lockdep_assert_irqs_enabled();
local_irq_save(flags);
rdp = this_cpu_ptr(&rcu_data);
@@ -3585,7 +3625,7 @@ unsigned long start_poll_synchronize_rcu(void)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (needwake)
rcu_gp_kthread_wake();
- return gp_seq;
+ return gp_seq_poll;
}
EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);

2022-07-21 05:17:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 06:51:45PM -0700, Boqun Feng wrote:
> On Wed, Jul 20, 2022 at 06:04:55PM -0700, Paul E. McKenney wrote:
> [...]
> > > > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > > > * before the load from ->gp_seq.
> > > > */
> > > > smp_mb(); /* ^^^ */
> > > > - return rcu_seq_snap(&rcu_state.gp_seq);
> > > > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
> > >
> > > I happened to run into this. There is one usage of
> > > get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> > > the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> > > for rcu_start_this_gp(). I don't think this is quite right, because
> > > after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> > > different values, in fact ->gp_seq_polled is greater than ->gp_seq
> > > by how many times synchronize_rcu() is called in early boot.
> > >
> > > Am I missing something here?
> >
> > It does not appear that you are missing anything, sad to say!
> >
> > Does the following make it work better?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 2122359f0c862..cf2fd58a93a41 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3571,7 +3571,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> > unsigned long start_poll_synchronize_rcu(void)
> > {
> > unsigned long flags;
> > - unsigned long gp_seq = get_state_synchronize_rcu();
> > + unsigned long gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
>
> get_state_synchronize_rcu() is still needed, because we need to return
> a cookie for polling for this function. Something like below maybe? Hope
> I didn't mess up the ordering ;-)

My thought is to combine your comment with my functionally equivalent
code that avoids the extra variable. If that works for you (and if it
works, for that matter), does Co-developed-by work for you?

Thanx, Paul

> Regards,
> Boqun
>
> ---------------
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 84d281776688..0f9134871289 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3571,11 +3583,39 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> unsigned long start_poll_synchronize_rcu(void)
> {
> unsigned long flags;
> - unsigned long gp_seq = get_state_synchronize_rcu();
> + unsigned long gp_seq_poll = get_state_synchronize_rcu();
> + unsigned long gp_seq;
> bool needwake;
> struct rcu_data *rdp;
> struct rcu_node *rnp;
>
> + /*
> + * Need to start a gp if no gp has been started yet.
> + *
> + * Note that we need to snapshot gp_seq after gp_seq_poll, otherwise
> + * consider the following case:
> + *
> + * <no gp in progress> // gp# is 0
> + * snapshot gp_seq // gp #2 will be set as needed
> + * <a gp passed>
> + * // gp# is 1
> + * snapshot gp_seq_poll // polling not ready until gp #3
> + *
> + * then the following rcu_start_this_gp() won't mark gp #3 as needed,
> + * and polling won't become ready if others don't start a gp.
> + *
> + * And the following case is fine:
> + *
> + * <no gp in progress> // gp# is 0
> + * snapshot gp_seq_poll // polling not ready until gp #2
> + * <a gp passed>
> + * // gp# is 1
> + * snapshot gp_seq // gp #3 will be set as needed
> + *
> + * Also note, we rely on the smp_mb() in get_state_synchronize_rcu()
> + * to order the two snapshots.
> + */
> + gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
> lockdep_assert_irqs_enabled();
> local_irq_save(flags);
> rdp = this_cpu_ptr(&rcu_data);
> @@ -3585,7 +3625,7 @@ unsigned long start_poll_synchronize_rcu(void)
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (needwake)
> rcu_gp_kthread_wake();
> - return gp_seq;
> + return gp_seq_poll;
> }
> EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);

2022-07-21 06:31:48

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 09:47:08PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 20, 2022 at 06:51:45PM -0700, Boqun Feng wrote:
> > On Wed, Jul 20, 2022 at 06:04:55PM -0700, Paul E. McKenney wrote:
> > [...]
> > > > > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > > > > * before the load from ->gp_seq.
> > > > > */
> > > > > smp_mb(); /* ^^^ */
> > > > > - return rcu_seq_snap(&rcu_state.gp_seq);
> > > > > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
> > > >
> > > > I happened to run into this. There is one usage of
> > > > get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> > > > the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> > > > for rcu_start_this_gp(). I don't think this is quite right, because
> > > > after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> > > > different values, in fact ->gp_seq_polled is greater than ->gp_seq
> > > > by how many times synchronize_rcu() is called in early boot.
> > > >
> > > > Am I missing something here?
> > >
> > > It does not appear that you are missing anything, sad to say!
> > >
> > > Does the following make it work better?
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 2122359f0c862..cf2fd58a93a41 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3571,7 +3571,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> > > unsigned long start_poll_synchronize_rcu(void)
> > > {
> > > unsigned long flags;
> > > - unsigned long gp_seq = get_state_synchronize_rcu();
> > > + unsigned long gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
> >
> > get_state_synchronize_rcu() is still needed, because we need to return
> > a cookie for polling for this function. Something like below maybe? Hope
> > I didn't mess up the ordering ;-)
>
> My thought is to combine your comment with my functionally equivalent
> code that avoids the extra variable. If that works for you (and if it
> works, for that matter), does Co-developed-by work for you?
>

Sure! Thanks ;-)

Regards,
Boqun

> Thanx, Paul
>
> > Regards,
> > Boqun
> >
> > ---------------
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 84d281776688..0f9134871289 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3571,11 +3583,39 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> > unsigned long start_poll_synchronize_rcu(void)
> > {
> > unsigned long flags;
> > - unsigned long gp_seq = get_state_synchronize_rcu();
> > + unsigned long gp_seq_poll = get_state_synchronize_rcu();
> > + unsigned long gp_seq;
> > bool needwake;
> > struct rcu_data *rdp;
> > struct rcu_node *rnp;
> >
> > + /*
> > + * Need to start a gp if no gp has been started yet.
> > + *
> > + * Note that we need to snapshot gp_seq after gp_seq_poll, otherwise
> > + * consider the following case:
> > + *
> > + * <no gp in progress> // gp# is 0
> > + * snapshot gp_seq // gp #2 will be set as needed
> > + * <a gp passed>
> > + * // gp# is 1
> > + * snapshot gp_seq_poll // polling not ready until gp #3
> > + *
> > + * then the following rcu_start_this_gp() won't mark gp #3 as needed,
> > + * and polling won't become ready if others don't start a gp.
> > + *
> > + * And the following case is fine:
> > + *
> > + * <no gp in progress> // gp# is 0
> > + * snapshot gp_seq_poll // polling not ready until gp #2
> > + * <a gp passed>
> > + * // gp# is 1
> > + * snapshot gp_seq // gp #3 will be set as needed
> > + *
> > + * Also note, we rely on the smp_mb() in get_state_synchronize_rcu()
> > + * to order the two snapshots.
> > + */
> > + gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
> > lockdep_assert_irqs_enabled();
> > local_irq_save(flags);
> > rdp = this_cpu_ptr(&rcu_data);
> > @@ -3585,7 +3625,7 @@ unsigned long start_poll_synchronize_rcu(void)
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > if (needwake)
> > rcu_gp_kthread_wake();
> > - return gp_seq;
> > + return gp_seq_poll;
> > }
> > EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);

2022-07-22 01:23:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH rcu 04/12] rcu: Switch polled grace-period APIs to ->gp_seq_polled

On Wed, Jul 20, 2022 at 10:49:26PM -0700, Boqun Feng wrote:
> On Wed, Jul 20, 2022 at 09:47:08PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 20, 2022 at 06:51:45PM -0700, Boqun Feng wrote:
> > > On Wed, Jul 20, 2022 at 06:04:55PM -0700, Paul E. McKenney wrote:
> > > [...]
> > > > > > @@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
> > > > > > * before the load from ->gp_seq.
> > > > > > */
> > > > > > smp_mb(); /* ^^^ */
> > > > > > - return rcu_seq_snap(&rcu_state.gp_seq);
> > > > > > + return rcu_seq_snap(&rcu_state.gp_seq_polled);
> > > > >
> > > > > I happened to run into this. There is one usage of
> > > > > get_state_synchronize_rcu() in start_poll_synchronize_rcu(), in which
> > > > > the return value of get_state_synchronize_rcu() ("gp_seq") will be used
> > > > > for rcu_start_this_gp(). I don't think this is quite right, because
> > > > > after this change, rcu_state.gp_seq and rcu_state.gp_seq_polled are
> > > > > different values, in fact ->gp_seq_polled is greater than ->gp_seq
> > > > > by how many times synchronize_rcu() is called in early boot.
> > > > >
> > > > > Am I missing something here?
> > > >
> > > > It does not appear that you are missing anything, sad to say!
> > > >
> > > > Does the following make it work better?
> > > >
> > > > Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 2122359f0c862..cf2fd58a93a41 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3571,7 +3571,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> > > > unsigned long start_poll_synchronize_rcu(void)
> > > > {
> > > > unsigned long flags;
> > > > - unsigned long gp_seq = get_state_synchronize_rcu();
> > > > + unsigned long gp_seq = rcu_seq_snap(&rcu_state.gp_seq);
> > >
> > > get_state_synchronize_rcu() is still needed, because we need to return
> > > a cookie for polling for this function. Something like below maybe? Hope
> > > I didn't mess up the ordering ;-)
> >
> > My thought is to combine your comment with my functionally equivalent
> > code that avoids the extra variable. If that works for you (and if it
> > works, for that matter), does Co-developed-by work for you?
>
> Sure! Thanks ;-)

I did some summarization on the comment, folded it into the original,
and ended up with the patch shown below. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit bf95b2bc3e42f11f4d7a5e8a98376c2b4a2aa82f
Author: Paul E. McKenney <[email protected]>
Date: Wed Apr 13 17:46:15 2022 -0700

rcu: Switch polled grace-period APIs to ->gp_seq_polled

This commit switches the existing polled grace-period APIs to use a
new ->gp_seq_polled counter in the rcu_state structure. An additional
->gp_seq_polled_snap counter in that same structure allows the normal
grace period kthread to interact properly with the !SMP !PREEMPT fastpath
through synchronize_rcu(). The first of the two to note the end of a
given grace period will make knowledge of this transition available to
the polled API.

This commit is in preparation for polled expedited grace periods.

[ paulmck: Fix use of rcu_state.gp_seq_polled to start normal grace period. ]

Link: https://lore.kernel.org/all/[email protected]/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ian Kent <[email protected]>
Co-developed-by: Boqun Feng <[email protected]>
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 46cfceea87847..b40a5a19ddd2a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1775,6 +1775,78 @@ static void rcu_strict_gp_boundary(void *unused)
invoke_rcu_core();
}

+// Has rcu_init() been invoked? This is used (for example) to determine
+// whether spinlocks may be acquired safely.
+static bool rcu_init_invoked(void)
+{
+ return !!rcu_state.n_online_cpus;
+}
+
+// Make the polled API aware of the beginning of a grace period.
+static void rcu_poll_gp_seq_start(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked())
+ raw_lockdep_assert_held_rcu_node(rnp);
+
+ // If RCU was idle, note beginning of GP.
+ if (!rcu_seq_state(rcu_state.gp_seq_polled))
+ rcu_seq_start(&rcu_state.gp_seq_polled);
+
+ // Either way, record current state.
+ *snap = rcu_state.gp_seq_polled;
+}
+
+// Make the polled API aware of the end of a grace period.
+static void rcu_poll_gp_seq_end(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked())
+ raw_lockdep_assert_held_rcu_node(rnp);
+
+ // If the previously noted GP is still in effect, record the
+ // end of that GP. Either way, zero counter to avoid counter-wrap
+ // problems.
+ if (*snap && *snap == rcu_state.gp_seq_polled) {
+ rcu_seq_end(&rcu_state.gp_seq_polled);
+ rcu_state.gp_seq_polled_snap = 0;
+ } else {
+ *snap = 0;
+ }
+}
+
+// Make the polled API aware of the beginning of a grace period, but
+// where caller does not hold the root rcu_node structure's lock.
+static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked()) {
+ lockdep_assert_irqs_enabled();
+ raw_spin_lock_irq_rcu_node(rnp);
+ }
+ rcu_poll_gp_seq_start(snap);
+ if (rcu_init_invoked())
+ raw_spin_unlock_irq_rcu_node(rnp);
+}
+
+// Make the polled API aware of the end of a grace period, but where
+// caller does not hold the root rcu_node structure's lock.
+static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
+{
+ struct rcu_node *rnp = rcu_get_root();
+
+ if (rcu_init_invoked()) {
+ lockdep_assert_irqs_enabled();
+ raw_spin_lock_irq_rcu_node(rnp);
+ }
+ rcu_poll_gp_seq_end(snap);
+ if (rcu_init_invoked())
+ raw_spin_unlock_irq_rcu_node(rnp);
+}
+
/*
* Initialize a new grace period. Return false if no grace period required.
*/
@@ -1810,6 +1882,7 @@ static noinline_for_stack bool rcu_gp_init(void)
rcu_seq_start(&rcu_state.gp_seq);
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
+ rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);

/*
@@ -2069,6 +2142,7 @@ static noinline void rcu_gp_cleanup(void)
* safe for us to drop the lock in order to mark the grace
* period as completed in all of the rcu_node structures.
*/
+ rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap);
raw_spin_unlock_irq_rcu_node(rnp);

/*
@@ -3837,8 +3911,18 @@ void synchronize_rcu(void)
lock_is_held(&rcu_lock_map) ||
lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_rcu() in RCU read-side critical section");
- if (rcu_blocking_is_gp())
+ if (rcu_blocking_is_gp()) {
+ // Note well that this code runs with !PREEMPT && !SMP.
+ // In addition, all code that advances grace periods runs
+ // at process level. Therefore, this GP overlaps with other
+ // GPs only by being fully nested within them, which allows
+ // reuse of ->gp_seq_polled_snap.
+ rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap);
+ rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap);
+ if (rcu_init_invoked())
+ cond_resched_tasks_rcu_qs();
return; // Context allows vacuous grace periods.
+ }
if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
@@ -3860,7 +3944,7 @@ unsigned long get_state_synchronize_rcu(void)
* before the load from ->gp_seq.
*/
smp_mb(); /* ^^^ */
- return rcu_seq_snap(&rcu_state.gp_seq);
+ return rcu_seq_snap(&rcu_state.gp_seq_polled);
}
EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

@@ -3889,7 +3973,13 @@ unsigned long start_poll_synchronize_rcu(void)
rdp = this_cpu_ptr(&rcu_data);
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp); // irqs already disabled.
- needwake = rcu_start_this_gp(rnp, rdp, gp_seq);
+ // Note it is possible for a grace period to have elapsed between
+ // the above call to get_state_synchronize_rcu() and the below call
+ // to rcu_seq_snap. This is OK, the worst that happens is that we
+ // get a grace period that no one needed. These accesses are ordered
+ // by smp_mb(), and we are accessing them in the opposite order
+ // from which they are updated at grace-period start, as required.
+ needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(&rcu_state.gp_seq));
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (needwake)
rcu_gp_kthread_wake();
@@ -3925,7 +4015,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
bool poll_state_synchronize_rcu(unsigned long oldstate)
{
if (oldstate == RCU_GET_STATE_COMPLETED ||
- rcu_seq_done_exact(&rcu_state.gp_seq, oldstate)) {
+ rcu_seq_done_exact(&rcu_state.gp_seq_polled, oldstate)) {
smp_mb(); /* Ensure GP ends before subsequent accesses. */
return true;
}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2ccf5845957df..9c853033f159d 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -323,6 +323,8 @@ struct rcu_state {
short gp_state; /* GP kthread sleep state. */
unsigned long gp_wake_time; /* Last GP kthread wake. */
unsigned long gp_wake_seq; /* ->gp_seq at ^^^. */
+ unsigned long gp_seq_polled; /* GP seq for polled API. */
+ unsigned long gp_seq_polled_snap; /* ->gp_seq_polled at normal GP start. */

/* End of fields guarded by root rcu_node's lock. */