2018-06-26 00:33:45

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/27] Post-gp_seq miscellaneous fixes for v4.19

Hello!

This series contains miscellaneous fixes that are sequenced after the
infamous ->gp_seq conversion:

1. Make rcu_read_unlock_special() static.

2. Improve rcu_note_voluntary_context_switch() reporting, courtesy of
Byungchul Park.

3. Get rid of Sphinx warnings at rcu_pointer_handoff(), courtesy of
Mauro Carvalho Chehab.

4. Use pr_fmt to prefix "rcu: " to logging output, courtesy of
Joe Perches.

5. Improve RCU-tasks naming and comments.

6. Mark task as .need_qs less aggressively.

7-14. Inline, remove "inline", remove __maybe_unused, remove unused
variables, remove unused functions, and fix comments.

15. Use RCU CPU stall timeout for rcu_check_gp_start_stall().

16. Add comment documenting how rcu_seq_snap() works, courtesy of
Joel Fernandes.

17. Speed up calling of RCU tasks callbacks, courtesy of Steven Rostedt.

18. Add comment to the last sleep in the rcu tasks loop, also courtesy
of Steven Rostedt.

19. Add diagnostics for rcutorture writer stall warning.

20. Check the range of jiffies_till_{first,next}_fqs when setting them,
courtesy of Byungchul Park.

21. Update synchronize_rcu() definition in whatisRCU.txt, courtesy
of Andrea Parri.

22. Make rcu_seq_diff() more exact, which in turn allows more exact
rcutorture diagnostics of busted RCU implementations.

23. Update RCU, SRCU, and TORTURE-TEST entries in MAINTAINERS file.

24. Print stall-warning NMI dyntick state in hexadecimal.

25. Add grace-period number to rcutorture statistics printout.

26. Improve documentation for list_for_each_entry_from_rcu(), courtesy
of NeilBrown.

27. Assign higher prio to RCU threads if rcutorture is built-in,
courtesy of Joel Fernandes.

Thanx, Paul

------------------------------------------------------------------------

Documentation/RCU/whatisRCU.txt | 18 +++--
MAINTAINERS | 9 +-
include/linux/rculist.h | 19 +++++
include/linux/rcupdate.h | 22 +++---
include/linux/rcutiny.h | 2
kernel/rcu/rcu.h | 49 ++++++++++++++
kernel/rcu/rcuperf.c | 9 +-
kernel/rcu/rcutorture.c | 6 -
kernel/rcu/srcutree.c | 8 +-
kernel/rcu/tiny.c | 4 -
kernel/rcu/tree.c | 132 +++++++++++++++++++++++++---------------
kernel/rcu/tree.h | 1
kernel/rcu/tree_plugin.h | 55 ++++++----------
kernel/rcu/update.c | 45 +++++++++----
14 files changed, 245 insertions(+), 134 deletions(-)



2018-06-26 00:34:25

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 10/27] rcu: Remove unused local variable "cpu"

One danger of using __maybe_unused is that the compiler doesn't yell
at you when you remove the last reference, witness rcu_bind_gp_kthread()
and its local variable "cpu". This commit removes this local variable.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 75a91d58b8f7..2cc9bf0d363a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2670,8 +2670,6 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp)
*/
static void rcu_bind_gp_kthread(void)
{
- int __maybe_unused cpu;
-
if (!tick_nohz_full_enabled())
return;
housekeeping_affine(current, HK_FLAG_RCU);
--
2.17.1


2018-06-26 00:34:44

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 27/27] rcu: Assign higher prio to RCU threads if rcutorture is built-in

From: "Joel Fernandes (Google)" <[email protected]>

The rcutorture RCU priority boosting tests fail even with CONFIG_RCU_BOOST
set because rcutorture's threads run at the same priority as the default
RCU kthreads (RT class with priority of 1).

This commit checks whether rcutorture is built into the kernel and, if so,
assigns RT priority 2 to the RCU kthreads, placing them above the priority-1
rcutorture threads and allowing the rcutorture boost tests to pass.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c59cca40ea92..988683089a4c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3921,12 +3921,16 @@ static int __init rcu_spawn_gp_kthread(void)
struct task_struct *t;

/* Force priority into range. */
- if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 1)
+ if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 2
+ && IS_BUILTIN(CONFIG_RCU_TORTURE_TEST))
+ kthread_prio = 2;
+ else if (IS_ENABLED(CONFIG_RCU_BOOST) && kthread_prio < 1)
kthread_prio = 1;
else if (kthread_prio < 0)
kthread_prio = 0;
else if (kthread_prio > 99)
kthread_prio = 99;
+
if (kthread_prio != kthread_prio_in)
pr_alert("rcu_spawn_gp_kthread(): Limited prio to %d from %d\n",
kthread_prio, kthread_prio_in);
--
2.17.1
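
For reference, the clamping that results from the hunk above can be modeled in
ordinary user-space C. In the sketch below, clamp_kthread_prio() is a
hypothetical stand-in for the patched logic, and the two boolean parameters
stand in for IS_ENABLED(CONFIG_RCU_BOOST) and IS_BUILTIN(CONFIG_RCU_TORTURE_TEST):

#include <stdio.h>
#include <stdbool.h>

/* Mirror of the priority-clamping branches above (user-space sketch only). */
static int clamp_kthread_prio(int kthread_prio, bool rcu_boost, bool torture_builtin)
{
	if (rcu_boost && kthread_prio < 2 && torture_builtin)
		kthread_prio = 2;	/* rcutorture built in: lift GP kthreads above prio 1 */
	else if (rcu_boost && kthread_prio < 1)
		kthread_prio = 1;	/* boosting enabled: need at least RT prio 1 */
	else if (kthread_prio < 0)
		kthread_prio = 0;	/* no negative priorities */
	else if (kthread_prio > 99)
		kthread_prio = 99;	/* cap at the highest RT priority */
	return kthread_prio;
}

int main(void)
{
	int samples[] = { -5, 0, 1, 50, 200 };

	for (int i = 0; i < (int)(sizeof(samples) / sizeof(samples[0])); i++)
		printf("in=%4d -> boost+torture=%2d, boost only=%2d\n",
		       samples[i],
		       clamp_kthread_prio(samples[i], true, true),
		       clamp_kthread_prio(samples[i], true, false));
	return 0;
}

With boosting enabled and rcutorture built in, any requested priority below 2
is raised to 2, which is what puts the grace-period kthreads above the
priority-1 rcutorture boost-test threads.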


2018-06-26 00:35:46

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

From: "Joel Fernandes (Google)" <[email protected]>

rcu_seq_snap() may be tricky to decipher. Let's document how it works with
an example to make it easier to understand.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcu.h | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index aa215d6355f8..60f089d08c47 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,39 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
}

-/* Take a snapshot of the update side's sequence number. */
+/*
+ * rcu_seq_snap - Take a snapshot of the update side's sequence number.
+ *
+ * This function returns the earliest value of the grace-period sequence number
+ * that will indicate that a full grace period has elapsed since the current
+ * time. Once the grace-period sequence number has reached this value, it will
+ * be safe to invoke all callbacks that have been registered prior to the
+ * current time. This value is the current grace-period number plus two to the
+ * power of the number of low-order bits reserved for state, then rounded up to
+ * the next value in which the state bits are all zero.
+ *
+ * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
+ * the seq is used to track if a GP is in progress or not. Given this, it is
+ * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
+ * why with an example:
+ *
+ * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
+ * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
+ * to account for the shift due to 2 state bits. Now, if the current seq is
+ * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
+ * is already in progress so the next GP that a future callback will be queued
+ * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
+ * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
+ * will cause the extra +1 to the GP, along with the usual +1 explained before.
+ * This gives us GP+2. Finally we mask the lower two bits by ~0x3 in case the
+ * overflow didn't occur. This masking is needed because in case RCU was idle
+ * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
+ * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
+ *
+ * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
+ * which can be generalized to:
+ * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
+ */
static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
unsigned long s;
--
2.17.1
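
For readers who want to check the arithmetic in the new comment, here is a
minimal user-space sketch (not the kernel implementation itself) that applies
the generalized formula from the comment's last paragraph, assuming two
low-order state bits (RCU_SEQ_CTR_SHIFT == 2, RCU_SEQ_STATE_MASK == 3):

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)	/* 0b11 */

/* Apply the formula quoted above: (seq + MASK + (MASK + 1)) & ~MASK. */
static unsigned long seq_snap(unsigned long seq)
{
	return (seq + RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) &
	       ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	/* seq = 12: GP 3, idle (state 0b00) -> snapshot 16, that is, GP 4. */
	printf("snap(12) = %lu (GP %lu)\n", seq_snap(12),
	       seq_snap(12) >> RCU_SEQ_CTR_SHIFT);
	/* seq = 13: GP 3, in progress (state 0b01) -> snapshot 20, that is, GP 5. */
	printf("snap(13) = %lu (GP %lu)\n", seq_snap(13),
	       seq_snap(13) >> RCU_SEQ_CTR_SHIFT);
	return 0;
}

The two outputs match the worked example in the comment: an idle seq of 12
snaps to GP 4, while an in-progress seq of 13 snaps to GP 5.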


2018-06-26 00:35:57

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 09/27] rcu: Remove unused rcu_kick_nohz_cpu() function

The rcu_kick_nohz_cpu() function is no longer used, and the functionality
it used to provide is now provided by a call to resched_cpu() in the
force-quiescent-state function rcu_implicit_dynticks_qs(). This commit
therefore removes rcu_kick_nohz_cpu().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.h | 1 -
kernel/rcu/tree_plugin.h | 17 -----------------
2 files changed, 18 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d51e6edc8e83..4e74df768c57 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -483,7 +483,6 @@ static void __init rcu_spawn_nocb_kthreads(void);
#ifdef CONFIG_RCU_NOCB_CPU
static void __init rcu_organize_nocb_kthreads(struct rcu_state *rsp);
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
-static void __maybe_unused rcu_kick_nohz_cpu(int cpu);
static bool init_nocb_callback_list(struct rcu_data *rdp);
static void rcu_bind_gp_kthread(void);
static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 07d1ad175994..75a91d58b8f7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2645,23 +2645,6 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)

#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */

-/*
- * An adaptive-ticks CPU can potentially execute in kernel mode for an
- * arbitrarily long period of time with the scheduling-clock tick turned
- * off. RCU will be paying attention to this CPU because it is in the
- * kernel, but the CPU cannot be guaranteed to be executing the RCU state
- * machine because the scheduling-clock tick has been disabled. Therefore,
- * if an adaptive-ticks CPU is failing to respond to the current grace
- * period and has not be idle from an RCU perspective, kick it.
- */
-static void __maybe_unused rcu_kick_nohz_cpu(int cpu)
-{
-#ifdef CONFIG_NO_HZ_FULL
- if (tick_nohz_full_cpu(cpu))
- smp_send_reschedule(cpu);
-#endif /* #ifdef CONFIG_NO_HZ_FULL */
-}
-
/*
* Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
* grace-period kthread will do force_quiescent_state() processing?
--
2.17.1


2018-06-26 00:35:59

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 06/27] rcu: Mark task as .need_qs less aggressively

If any scheduling-clock interrupt interrupts an RCU-preempt read-side
critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
field is set. This causes the outermost rcu_read_unlock() to incur the
extra overhead of calling into rcu_read_unlock_special(). This commit
reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
if the grace period has been in effect for more than one second.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dbfe90191e19..0239cf8a4be6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -730,6 +730,7 @@ rcu_preempt_check_blocked_tasks(struct rcu_state *rsp, struct rcu_node *rnp)
*/
static void rcu_preempt_check_callbacks(void)
{
+ struct rcu_state *rsp = &rcu_preempt_state;
struct task_struct *t = current;

if (t->rcu_read_lock_nesting == 0) {
@@ -738,7 +739,9 @@ static void rcu_preempt_check_callbacks(void)
}
if (t->rcu_read_lock_nesting > 0 &&
__this_cpu_read(rcu_data_p->core_needs_qs) &&
- __this_cpu_read(rcu_data_p->cpu_no_qs.b.norm))
+ __this_cpu_read(rcu_data_p->cpu_no_qs.b.norm) &&
+ !t->rcu_read_unlock_special.b.need_qs &&
+ time_after(jiffies, rsp->gp_start + HZ))
t->rcu_read_unlock_special.b.need_qs = true;
}

--
2.17.1
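
The new time_after(jiffies, rsp->gp_start + HZ) test is wraparound-safe
because time_after() compares jiffies values using a signed subtraction.
A minimal user-space sketch of that comparison, where jiffies_time_after()
is just a local stand-in for the kernel macro and HZ is assumed to be 1000:

#include <stdio.h>

/* Wraparound-safe "a is after b", mirroring the kernel's time_after(). */
static int jiffies_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

int main(void)
{
	const unsigned long hz = 1000;			/* pretend HZ=1000 */
	unsigned long gp_start = (unsigned long)-200;	/* GP began just before jiffies wrapped */

	/* 300 ticks into the GP: not yet older than one second. */
	printf("at +300 ticks:  %s\n",
	       jiffies_time_after(gp_start + 300, gp_start + hz) ? "older than 1s" : "not yet");
	/* 1100 ticks into the GP: older than one second, despite the counter wrap. */
	printf("at +1100 ticks: %s\n",
	       jiffies_time_after(gp_start + 1100, gp_start + hz) ? "older than 1s" : "not yet");
	return 0;
}

So the .need_qs flag is set only once the current grace period has been in
effect for more than HZ jiffies, even across a jiffies wraparound.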


2018-06-26 00:36:21

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 14/27] rcu: Remove __maybe_unused from rcu_cpu_has_callbacks()

The rcu_cpu_has_callbacks() function is now used in all configurations,
so this commit removes the __maybe_unused.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 11fb6542b9d4..a95d49730cfc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3313,7 +3313,7 @@ static int rcu_pending(void)
* non-NULL, store an indication of whether all callbacks are lazy.
* (If there are no callbacks, all of them are deemed to be lazy.)
*/
-static bool __maybe_unused rcu_cpu_has_callbacks(bool *all_lazy)
+static bool rcu_cpu_has_callbacks(bool *all_lazy)
{
bool al = true;
bool hc = false;
--
2.17.1


2018-06-26 00:36:33

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 18/27] rcu: Add comment to the last sleep in the rcu tasks loop

From: "Steven Rostedt (VMware)" <[email protected]>

At the end of rcu_tasks_kthread() there's a lonely
schedule_timeout_uninterruptible() call with no apparent rationale for
its existence. But there is. It is to keep the thread from going into
a tight loop if there's some anomaly. That really needs a comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 4c7c49c106ee..39cb23d22109 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -814,6 +814,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
list = next;
cond_resched();
}
+ /* Paranoid sleep to keep this from entering a tight loop */
schedule_timeout_uninterruptible(HZ/10);
}
}
--
2.17.1


2018-06-26 00:36:42

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 22/27] rcu: Make rcu_seq_diff() more exact

The current implementation of rcu_seq_diff() follows tradition in
providing a rough-and-ready approximation of the number of elapsed grace
periods between the two rcu_seq values. However, this difference is
used to flag RCU-failure "near misses", which can be a valuable debugging
aid, so more exactitude would be an improvement. This commit therefore
improves the accuracy of rcu_seq_diff().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcu.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 60f089d08c47..46f12d0fe319 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -180,7 +180,20 @@ static inline bool rcu_seq_new_gp(unsigned long old, unsigned long new)
*/
static inline unsigned long rcu_seq_diff(unsigned long new, unsigned long old)
{
- return (new - old) >> RCU_SEQ_CTR_SHIFT;
+ unsigned long rnd_diff;
+
+ if (old == new)
+ return 0;
+ /*
+ * Compute the number of grace periods (still shifted up), plus
+ * one if either of new and old is not an exact grace period.
+ */
+ rnd_diff = (new & ~RCU_SEQ_STATE_MASK) -
+ ((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK) +
+ ((new & RCU_SEQ_STATE_MASK) || (old & RCU_SEQ_STATE_MASK));
+ if (ULONG_CMP_GE(RCU_SEQ_STATE_MASK, rnd_diff))
+ return 1; /* Definitely no grace period has elapsed. */
+ return ((rnd_diff - RCU_SEQ_STATE_MASK - 1) >> RCU_SEQ_CTR_SHIFT) + 2;
}

/*
--
2.17.1
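
One way to see what changed is to evaluate the old and new computations side
by side outside the kernel. The sketch below is a user-space approximation:
it assumes the two-state-bit encoding described in the earlier rcu_seq_snap()
comment and substitutes a plain >= for ULONG_CMP_GE(), so it is valid only
for the non-wrapping sample values shown:

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* Old rough-and-ready approximation: just discard the state bits. */
static unsigned long seq_diff_old(unsigned long new, unsigned long old)
{
	return (new - old) >> RCU_SEQ_CTR_SHIFT;
}

/* New, more exact computation from the patch above (no-wraparound sketch). */
static unsigned long seq_diff_new(unsigned long new, unsigned long old)
{
	unsigned long rnd_diff;

	if (old == new)
		return 0;
	rnd_diff = (new & ~RCU_SEQ_STATE_MASK) -
		   ((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK) +
		   ((new & RCU_SEQ_STATE_MASK) || (old & RCU_SEQ_STATE_MASK));
	if (RCU_SEQ_STATE_MASK >= rnd_diff)
		return 1;	/* Definitely no full grace period has elapsed. */
	return ((rnd_diff - RCU_SEQ_STATE_MASK - 1) >> RCU_SEQ_CTR_SHIFT) + 2;
}

int main(void)
{
	unsigned long pairs[][2] = { {13, 13}, {12, 13}, {12, 16}, {13, 20}, {12, 24} };

	for (int i = 0; i < (int)(sizeof(pairs) / sizeof(pairs[0])); i++)
		printf("old=%2lu new=%2lu old_diff=%lu new_diff=%lu\n",
		       pairs[i][0], pairs[i][1],
		       seq_diff_old(pairs[i][1], pairs[i][0]),
		       seq_diff_new(pairs[i][1], pairs[i][0]));
	return 0;
}

The old version simply shifts away the state bits, while the new version
accounts for partially elapsed grace periods at either end of the interval,
which is what lets rcutorture draw finer-grained near-miss distinctions.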


2018-06-26 00:36:55

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 20/27] rcu: Check the range of jiffies_till_{first,next}_fqs when setting them

From: Byungchul Park <[email protected]>

Currently, the ranges of jiffies_till_{first,next}_fqs are checked and
adjusted again and again in the rcu_gp_kthread() loop at runtime.

However, it is enough to check them only when they are set. This commit
therefore range-checks them at set time, via their sysfs/module-parameter
setters, instead of on every pass through the loop.

Signed-off-by: Byungchul Park <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bc1ed77a3df9..c59cca40ea92 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -510,8 +510,38 @@ static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
static bool rcu_kick_kthreads;

-module_param(jiffies_till_first_fqs, ulong, 0644);
-module_param(jiffies_till_next_fqs, ulong, 0644);
+static int param_set_first_fqs_jiffies(const char *val, const struct kernel_param *kp)
+{
+ ulong j;
+ int ret = kstrtoul(val, 0, &j);
+
+ if (!ret)
+ WRITE_ONCE(*(ulong *)kp->arg, (j > HZ) ? HZ : j);
+ return ret;
+}
+
+static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param *kp)
+{
+ ulong j;
+ int ret = kstrtoul(val, 0, &j);
+
+ if (!ret)
+ WRITE_ONCE(*(ulong *)kp->arg, (j > HZ) ? HZ : (j ?: 1));
+ return ret;
+}
+
+static struct kernel_param_ops first_fqs_jiffies_ops = {
+ .set = param_set_first_fqs_jiffies,
+ .get = param_get_ulong,
+};
+
+static struct kernel_param_ops next_fqs_jiffies_ops = {
+ .set = param_set_next_fqs_jiffies,
+ .get = param_get_ulong,
+};
+
+module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops, &jiffies_till_first_fqs, 0644);
+module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next_fqs, 0644);
module_param(rcu_kick_kthreads, bool, 0644);

/*
@@ -2180,10 +2210,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
/* Handle quiescent-state forcing. */
first_gp_fqs = true;
j = jiffies_till_first_fqs;
- if (j > HZ) {
- j = HZ;
- jiffies_till_first_fqs = HZ;
- }
ret = 0;
for (;;) {
if (!ret) {
@@ -2218,13 +2244,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
WRITE_ONCE(rsp->gp_activity, jiffies);
ret = 0; /* Force full wait till next FQS. */
j = jiffies_till_next_fqs;
- if (j > HZ) {
- j = HZ;
- jiffies_till_next_fqs = HZ;
- } else if (j < 1) {
- j = 1;
- jiffies_till_next_fqs = 1;
- }
} else {
/* Deal with stray signal. */
cond_resched_tasks_rcu_qs();
--
2.17.1
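
The net effect of the two new setters is simple clamping at write time:
jiffies_till_first_fqs is capped at HZ, and jiffies_till_next_fqs is
additionally given a floor of one jiffy. A user-space sketch of just that
clamping, with the kstrtoul() parsing and kernel_param plumbing omitted and
HZ arbitrarily set to 250 for illustration:

#include <stdio.h>

#define HZ 250	/* example tick rate; the real value is a kernel config choice */

/* Clamp semantics of param_set_first_fqs_jiffies(): cap at HZ. */
static unsigned long clamp_first_fqs(unsigned long j)
{
	return (j > HZ) ? HZ : j;
}

/* Clamp semantics of param_set_next_fqs_jiffies(): cap at HZ, floor of 1. */
static unsigned long clamp_next_fqs(unsigned long j)
{
	return (j > HZ) ? HZ : (j ? j : 1);
}

int main(void)
{
	unsigned long samples[] = { 0, 1, 100, 250, 10000 };

	for (int i = 0; i < (int)(sizeof(samples) / sizeof(samples[0])); i++)
		printf("j=%5lu -> first=%3lu next=%3lu\n", samples[i],
		       clamp_first_fqs(samples[i]), clamp_next_fqs(samples[i]));
	return 0;
}

Because the clamping now happens when the value is written, the grace-period
kthread loop can use the values directly without re-checking them each pass.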


2018-06-26 00:36:56

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 26/27] rculist: Improve documentation for list_for_each_entry_from_rcu()

From: NeilBrown <[email protected]>

Unfortunately, the patch that added list_for_each_entry_from_rcu() was not
the final version produced by the review process. It is functionally
correct, but its documentation was incomplete.

This patch adds this missing documentation which includes an update to
the documentation for list_for_each_entry_continue_rcu() to match the
documentation for the new list_for_each_entry_from_rcu(), and adds
list_for_each_entry_from_rcu() and the already existing
hlist_for_each_entry_from_rcu() to section 7 of whatisRCU.txt.

Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/whatisRCU.txt | 2 ++
include/linux/rculist.h | 19 ++++++++++++++++++-
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 94288f1b8759..c2a7facf7ff9 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -820,11 +820,13 @@ RCU list traversal:
list_next_rcu
list_for_each_entry_rcu
list_for_each_entry_continue_rcu
+ list_for_each_entry_from_rcu
hlist_first_rcu
hlist_next_rcu
hlist_pprev_rcu
hlist_for_each_entry_rcu
hlist_for_each_entry_rcu_bh
+ hlist_for_each_entry_from_rcu
hlist_for_each_entry_continue_rcu
hlist_for_each_entry_continue_rcu_bh
hlist_nulls_first_rcu
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 36df6ccbc874..4786c2235b98 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -396,7 +396,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
* @member: the name of the list_head within the struct.
*
* Continue to iterate over list of given type, continuing after
- * the current position.
+ * the current position which must have been in the list when the RCU read
+ * lock was taken.
+ * This would typically require either that you obtained the node from a
+ * previous walk of the list in the same RCU read-side critical section, or
+ * that you held some sort of non-RCU reference (such as a reference count)
+ * to keep the node alive *and* in the list.
+ *
+ * This iterator is similar to list_for_each_entry_from_rcu() except
+ * this starts after the given position and that one starts at the given
+ * position.
*/
#define list_for_each_entry_continue_rcu(pos, head, member) \
for (pos = list_entry_rcu(pos->member.next, typeof(*pos), member); \
@@ -411,6 +420,14 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
*
* Iterate over the tail of a list starting from a given position,
* which must have been in the list when the RCU read lock was taken.
+ * This would typically require either that you obtained the node from a
+ * previous walk of the list in the same RCU read-side critical section, or
+ * that you held some sort of non-RCU reference (such as a reference count)
+ * to keep the node alive *and* in the list.
+ *
+ * This iterator is similar to list_for_each_entry_continue_rcu() except
+ * this starts from the given position and that one starts from the position
+ * after the given position.
*/
#define list_for_each_entry_from_rcu(pos, head, member) \
for (; &(pos)->member != (head); \
--
2.17.1
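
A short usage sketch may help illustrate the distinction the updated comments
draw. This is hypothetical kernel-side code (struct foo and walk_from() are
made up for illustration), not something taken from the tree; it assumes the
caller holds a reference that keeps @start alive and on the list:

#include <linux/rculist.h>
#include <linux/printk.h>

struct foo {
	struct list_head list;
	int data;
};

/* Resume an RCU-protected walk at @start, which the caller keeps alive. */
static void walk_from(struct list_head *head, struct foo *start)
{
	struct foo *pos = start;

	rcu_read_lock();
	/* Visits @start itself, then everything after it. */
	list_for_each_entry_from_rcu(pos, head, list)
		pr_info("from: %d\n", pos->data);

	/* By contrast, this skips @start and begins at the next entry. */
	pos = start;
	list_for_each_entry_continue_rcu(pos, head, list)
		pr_info("continue: %d\n", pos->data);
	rcu_read_unlock();
}

In both cases, @start must either have been obtained during an earlier walk
within the same RCU read-side critical section or be pinned by some non-RCU
reference, exactly as the new documentation states.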


2018-06-26 00:37:59

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 12/27] rcu: Remove "inline" from rcu_torture_print_module_parms()

This function is in rcutorture.c, which is not an include file, so there
is no problem dropping the "inline", especially given that this function
is invoked only twice per rcutorture run. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcutorture.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 773ce12f9582..2cce2ca2bc2b 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1359,7 +1359,7 @@ rcu_torture_stats(void *arg)
return 0;
}

-static inline void
+static void
rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag)
{
pr_alert("%s" TORTURE_FLAG
--
2.17.1


2018-06-26 00:38:18

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 02/27] rcu: Improve rcu_note_voluntary_context_switch() reporting

From: Byungchul Park <[email protected]>

A call to cond_resched_tasks_rcu_qs() should report a TASKS_RCU quiescent
state regardless of whether or not the task actually gets rescheduled.
However, the quiescent state is currently not reported when the task does
reschedule, because __schedule() is then entered with preempt = true.
This commit therefore makes cond_resched_tasks_rcu_qs() report the
quiescent state unconditionally.

In addition, in TINY_RCU an rcu_bh quiescent state should also be reported
when the scheduling-clock interrupt arrives from userspace, but currently
is not. This commit makes that happen.

Lastly, in TREE_RCU, rcu_note_voluntary_context_switch() should be invoked
not only when the scheduling-clock interrupt arrives from userspace but
also when it arrives from idle, both of which are extended quiescent states.

Signed-off-by: Byungchul Park <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Simplify rcutiny portion given no RCU-tasks for !PREEMPT. ]
---
include/linux/rcupdate.h | 4 ++--
kernel/rcu/tiny.c | 4 +---
kernel/rcu/tree.c | 4 ++--
3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 67ec077c7ee5..5cab15e7ec83 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -194,8 +194,8 @@ static inline void exit_tasks_rcu_finish(void) { }
*/
#define cond_resched_tasks_rcu_qs() \
do { \
- if (!cond_resched()) \
- rcu_note_voluntary_context_switch_lite(current); \
+ rcu_note_voluntary_context_switch_lite(current); \
+ cond_resched(); \
} while (0)

/*
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a64eee0db39e..befc9321a89c 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -122,10 +122,8 @@ void rcu_check_callbacks(int user)
{
if (user)
rcu_sched_qs();
- else if (!in_softirq())
+ if (user || !in_softirq())
rcu_bh_qs();
- if (user)
- rcu_note_voluntary_context_switch(current);
}

/*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4b0aee4f9318..50a8f49e1763 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2645,6 +2645,7 @@ void rcu_check_callbacks(int user)

rcu_sched_qs();
rcu_bh_qs();
+ rcu_note_voluntary_context_switch(current);

} else if (!in_softirq()) {

@@ -2660,8 +2661,7 @@ void rcu_check_callbacks(int user)
rcu_preempt_check_callbacks();
if (rcu_pending())
invoke_rcu_core();
- if (user)
- rcu_note_voluntary_context_switch(current);
+
trace_rcu_utilization(TPS("End scheduler-tick"));
}

--
2.17.1


2018-06-26 00:38:20

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 03/27] rcu: rcupdate.h: Get rid of Sphinx warnings at rcu_pointer_handoff()

From: Mauro Carvalho Chehab <[email protected]>

The code example in rcupdate.h currently produces lots of warnings:

./include/linux/rcupdate.h:572: WARNING: Unexpected indentation.
./include/linux/rcupdate.h:576: WARNING: Unexpected indentation.
./include/linux/rcupdate.h:580: WARNING: Block quote ends without a blank line; unexpected unindent.
./include/linux/rcupdate.h:582: WARNING: Block quote ends without a blank line; unexpected unindent.
./include/linux/rcupdate.h:582: WARNING: Inline literal start-string without end-string.

This commit therefore changes it to a code-block.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5cab15e7ec83..dacc90358b33 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -566,8 +566,8 @@ static inline void rcu_preempt_sleep_check(void) { }
* This is simply an identity function, but it documents where a pointer
* is handed off from RCU to some other synchronization mechanism, for
* example, reference counting or locking. In C11, it would map to
- * kill_dependency(). It could be used as follows:
- * ``
+ * kill_dependency(). It could be used as follows::
+ *
* rcu_read_lock();
* p = rcu_dereference(gp);
* long_lived = is_long_lived(p);
@@ -578,7 +578,6 @@ static inline void rcu_preempt_sleep_check(void) { }
* p = rcu_pointer_handoff(p);
* }
* rcu_read_unlock();
- *``
*/
#define rcu_pointer_handoff(p) (p)

--
2.17.1


2018-06-26 00:38:22

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 15/27] rcu: Use RCU CPU stall timeout for rcu_check_gp_start_stall()

Currently, rcu_check_gp_start_stall() waits for one second after the first
request before complaining that a grace period has not yet started. This
was desirable while testing the conversion from ->future_gp_needed[] to
->gp_seq_needed, but it is a bit on the hair-trigger side for production
use under heavy load. This commit therefore makes this wait time be
exactly that of the RCU CPU stall warning, allowing easy adjustment of
both timeouts to suit the distribution or installation at hand.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a95d49730cfc..c5fec335bc31 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2753,6 +2753,7 @@ static void
rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
struct rcu_data *rdp)
{
+ const unsigned long gpssdelay = rcu_jiffies_till_stall_check() * HZ;
unsigned long flags;
unsigned long j;
struct rcu_node *rnp_root = rcu_get_root(rsp);
@@ -2762,8 +2763,8 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed))
return;
j = jiffies; /* Expensive access, and in common case don't get here. */
- if (time_before(j, READ_ONCE(rsp->gp_req_activity) + HZ) ||
- time_before(j, READ_ONCE(rsp->gp_activity) + HZ) ||
+ if (time_before(j, READ_ONCE(rsp->gp_req_activity) + gpssdelay) ||
+ time_before(j, READ_ONCE(rsp->gp_activity) + gpssdelay) ||
atomic_read(&warned))
return;

@@ -2771,8 +2772,8 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
j = jiffies;
if (rcu_gp_in_progress(rsp) ||
ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed) ||
- time_before(j, READ_ONCE(rsp->gp_req_activity) + HZ) ||
- time_before(j, READ_ONCE(rsp->gp_activity) + HZ) ||
+ time_before(j, READ_ONCE(rsp->gp_req_activity) + gpssdelay) ||
+ time_before(j, READ_ONCE(rsp->gp_activity) + gpssdelay) ||
atomic_read(&warned)) {
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
@@ -2783,18 +2784,18 @@ rcu_check_gp_start_stall(struct rcu_state *rsp, struct rcu_node *rnp,
j = jiffies;
if (rcu_gp_in_progress(rsp) ||
ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed) ||
- time_before(j, rsp->gp_req_activity + HZ) ||
- time_before(j, rsp->gp_activity + HZ) ||
+ time_before(j, rsp->gp_req_activity + gpssdelay) ||
+ time_before(j, rsp->gp_activity + gpssdelay) ||
atomic_xchg(&warned, 1)) {
raw_spin_unlock_rcu_node(rnp_root); /* irqs remain disabled. */
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x %s->state:%#lx\n",
+ pr_alert("%s: g%ld->%ld gar:%lu ga:%lu f%#x gs:%d %s->state:%#lx\n",
__func__, (long)READ_ONCE(rsp->gp_seq),
(long)READ_ONCE(rnp_root->gp_seq_needed),
j - rsp->gp_req_activity, j - rsp->gp_activity,
- rsp->gp_flags, rsp->name,
+ rsp->gp_flags, rsp->gp_state, rsp->name,
rsp->gp_kthread ? rsp->gp_kthread->state : 0x1ffffL);
WARN_ON(1);
raw_spin_unlock_rcu_node(rnp_root);
--
2.17.1


2018-06-26 00:38:29

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 17/27] rcu: Speed up calling of RCU tasks callbacks

From: "Steven Rostedt (VMware)" <[email protected]>

Joel Fernandes found that synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:

# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;

real 0m1.064s
user 0m0.000s
sys 0m0.004s

It takes a little over a second to perform the synchronization because
there is a loop that waits one second at a time for tasks to get through
their quiescent points whenever there is a holdout task that must be
waited for.

After discussion, we came up with a simple approach: wait for holdouts
starting with a short delay and lengthen the delay on each iteration of
the loop, but never wait for more than a full second at a time.

With the new patch we have:

# time echo '!__schedule_bug:traceon' > set_ftrace_filter;

real 0m0.131s
user 0m0.000s
sys 0m0.004s

Which drops it down to 13% of what the original wait time was.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Joel Fernandes (Google) <[email protected]>
Suggested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 5783bdf86e5a..4c7c49c106ee 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -668,6 +668,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
struct rcu_head *list;
struct rcu_head *next;
LIST_HEAD(rcu_tasks_holdouts);
+ int fract;

/* Run on housekeeping CPUs by default. Sysadm can move if desired. */
housekeeping_affine(current, HK_FLAG_RCU);
@@ -749,13 +750,25 @@ static int __noreturn rcu_tasks_kthread(void *arg)
* holdouts. When the list is empty, we are done.
*/
lastreport = jiffies;
- while (!list_empty(&rcu_tasks_holdouts)) {
+
+ /* Start off with HZ/10 wait and slowly back off to 1 HZ wait*/
+ fract = 10;
+
+ for (;;) {
bool firstreport;
bool needreport;
int rtst;
struct task_struct *t1;

- schedule_timeout_interruptible(HZ);
+ if (list_empty(&rcu_tasks_holdouts))
+ break;
+
+ /* Slowly back off waiting for holdouts */
+ schedule_timeout_interruptible(HZ/fract);
+
+ if (fract > 1)
+ fract--;
+
rtst = READ_ONCE(rcu_task_stall_timeout);
needreport = rtst > 0 &&
time_after(jiffies, lastreport + rtst);
--
2.17.1
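
The new fract logic turns the fixed one-second sleep into a back-off sequence
of HZ/10, HZ/9, ..., HZ/2, HZ, HZ, ... jiffies per pass through the holdout
loop. A trivial user-space sketch that prints the sequence, assuming HZ=1000
purely for illustration:

#include <stdio.h>

#define HZ 1000	/* example tick rate for illustration only */

int main(void)
{
	int fract = 10;	/* start with HZ/10 and slowly back off to a full HZ */

	for (int pass = 1; pass <= 12; pass++) {
		printf("pass %2d: wait %4d jiffies (HZ/%d)\n",
		       pass, HZ / fract, fract);
		if (fract > 1)
			fract--;	/* same post-sleep decrement as the patch */
	}
	return 0;
}

Short synchronize_rcu_tasks() calls thus complete after only a tenth of a
second or so, while long-running holdouts still see at most one wakeup per
second, matching the 13% result reported above.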


2018-06-26 00:38:30

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 13/27] rcu: Remove "inline" from rcu_perf_print_module_parms()

This function is in rcuperf.c, which is not an include file, so there
is no problem dropping the "inline", especially given that this function
is invoked only twice per rcuperf run. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/rcuperf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 00e395c0d7d0..3e86940245d9 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -536,7 +536,7 @@ rcu_perf_writer(void *arg)
return 0;
}

-static inline void
+static void
rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag)
{
pr_alert("%s" PERF_FLAG
--
2.17.1


2018-06-26 00:38:48

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 11/27] rcu: Remove "inline" from panic_on_rcu_stall() and rcu_blocking_is_gp()

These functions are in kernel/rcu/tree.c, which is not an include file,
so there is no problem dropping the "inline", especially given that these
functions are nowhere near a fastpath. This commit therefore delegates
the inlining decision to the compiler by dropping the "inline".

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 98f51157f735..11fb6542b9d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1342,7 +1342,7 @@ static void rcu_stall_kick_kthreads(struct rcu_state *rsp)
}
}

-static inline void panic_on_rcu_stall(void)
+static void panic_on_rcu_stall(void)
{
if (sysctl_panic_on_rcu_stall)
panic("RCU Stall\n");
@@ -3078,7 +3078,7 @@ EXPORT_SYMBOL_GPL(kfree_call_rcu);
* when there was in fact only one the whole time, as this just adds
* some overhead: RCU still operates correctly.
*/
-static inline int rcu_blocking_is_gp(void)
+static int rcu_blocking_is_gp(void)
{
int ret;

--
2.17.1


2018-06-26 00:38:55

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 07/27] rcu: Inline rcu_dynticks_momentary_idle() into its sole caller

The rcu_dynticks_momentary_idle() function is invoked only from
rcu_momentary_dyntick_idle(), and neither function is particularly
large. This commit therefore saves a few lines by inlining
rcu_dynticks_momentary_idle() into rcu_momentary_dyntick_idle().

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1ea971244512..98f51157f735 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -385,20 +385,6 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
return snap != rcu_dynticks_snap(rdtp);
}

-/*
- * Do a double-increment of the ->dynticks counter to emulate a
- * momentary idle-CPU quiescent state.
- */
-static void rcu_dynticks_momentary_idle(void)
-{
- struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
- &rdtp->dynticks);
-
- /* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
-}
-
/*
* Set the special (bottom) bit of the specified CPU so that it
* will take special action (such as flushing its TLB) on the
@@ -430,12 +416,17 @@ bool rcu_eqs_special_set(int cpu)
*
* We inform the RCU core by emulating a zero-duration dyntick-idle period.
*
- * The caller must have disabled interrupts.
+ * The caller must have disabled interrupts and must not be idle.
*/
static void rcu_momentary_dyntick_idle(void)
{
+ struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int special;
+
raw_cpu_write(rcu_dynticks.rcu_need_heavy_qs, false);
- rcu_dynticks_momentary_idle();
+ special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
+ /* It is illegal to call this from idle state. */
+ WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
}

/*
--
2.17.1


2018-06-26 00:39:08

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 25/27] srcu: Add grace-period number to rcutorture statistics printout

This commit adds the SRCU grace-period number to the rcutorture statistics
printout, which allows it to be compared to the rcutorture "Writer stall
state" message.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/srcutree.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index e526b56998af..6c9866a854b1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1268,7 +1268,8 @@ void srcu_torture_stats_print(struct srcu_struct *sp, char *tt, char *tf)
unsigned long s0 = 0, s1 = 0;

idx = sp->srcu_idx & 0x1;
- pr_alert("%s%s Tree SRCU per-CPU(idx=%d):", tt, tf, idx);
+ pr_alert("%s%s Tree SRCU g%ld per-CPU(idx=%d):",
+ tt, tf, rcu_seq_current(&sp->srcu_gp_seq), idx);
for_each_possible_cpu(cpu) {
unsigned long l0, l1;
unsigned long u0, u1;
--
2.17.1


2018-06-26 00:39:17

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 05/27] rcu: Improve RCU-tasks naming and comments

The naming and comments associated with some RCU-tasks code make
the faulty assumption that context switches due to cond_resched()
are voluntary. As several people pointed out, this is not the case.
This commit therefore updates function names and comments to better
reflect current reality.

Reported-by: Byungchul Park <[email protected]>
Reported-by: Joel Fernandes <[email protected]>
Reported-by: Steven Rostedt <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 12 ++++++------
include/linux/rcutiny.h | 2 +-
kernel/rcu/tree.c | 2 +-
kernel/rcu/update.c | 27 ++++++++++++++-------------
4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index dacc90358b33..75e5b393cf44 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -158,11 +158,11 @@ static inline void rcu_init_nohz(void) { }
} while (0)

/*
- * Note a voluntary context switch for RCU-tasks benefit. This is a
- * macro rather than an inline function to avoid #include hell.
+ * Note a quasi-voluntary context switch for RCU-tasks's benefit.
+ * This is a macro rather than an inline function to avoid #include hell.
*/
#ifdef CONFIG_TASKS_RCU
-#define rcu_note_voluntary_context_switch_lite(t) \
+#define rcu_tasks_qs(t) \
do { \
if (READ_ONCE((t)->rcu_tasks_holdout)) \
WRITE_ONCE((t)->rcu_tasks_holdout, false); \
@@ -170,14 +170,14 @@ static inline void rcu_init_nohz(void) { }
#define rcu_note_voluntary_context_switch(t) \
do { \
rcu_all_qs(); \
- rcu_note_voluntary_context_switch_lite(t); \
+ rcu_tasks_qs(t); \
} while (0)
void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
void synchronize_rcu_tasks(void);
void exit_tasks_rcu_start(void);
void exit_tasks_rcu_finish(void);
#else /* #ifdef CONFIG_TASKS_RCU */
-#define rcu_note_voluntary_context_switch_lite(t) do { } while (0)
+#define rcu_tasks_qs(t) do { } while (0)
#define rcu_note_voluntary_context_switch(t) rcu_all_qs()
#define call_rcu_tasks call_rcu_sched
#define synchronize_rcu_tasks synchronize_sched
@@ -194,7 +194,7 @@ static inline void exit_tasks_rcu_finish(void) { }
*/
#define cond_resched_tasks_rcu_qs() \
do { \
- rcu_note_voluntary_context_switch_lite(current); \
+ rcu_tasks_qs(current); \
cond_resched(); \
} while (0)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7b3c82e8a625..8d9a0ea8f0b5 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -93,7 +93,7 @@ static inline void kfree_call_rcu(struct rcu_head *head,
#define rcu_note_context_switch(preempt) \
do { \
rcu_sched_qs(); \
- rcu_note_voluntary_context_switch_lite(current); \
+ rcu_tasks_qs(current); \
} while (0)

static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index fcb0355a8084..1ea971244512 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -457,7 +457,7 @@ void rcu_note_context_switch(bool preempt)
rcu_momentary_dyntick_idle();
this_cpu_inc(rcu_dynticks.rcu_qs_ctr);
if (!preempt)
- rcu_note_voluntary_context_switch_lite(current);
+ rcu_tasks_qs(current);
out:
trace_rcu_utilization(TPS("End context switch"));
barrier(); /* Avoid RCU read-side critical sections leaking up. */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 4c230a60ece4..5783bdf86e5a 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -507,14 +507,15 @@ early_initcall(check_cpu_stall_init);
#ifdef CONFIG_TASKS_RCU

/*
- * Simple variant of RCU whose quiescent states are voluntary context switch,
- * user-space execution, and idle. As such, grace periods can take one good
- * long time. There are no read-side primitives similar to rcu_read_lock()
- * and rcu_read_unlock() because this implementation is intended to get
- * the system into a safe state for some of the manipulations involved in
- * tracing and the like. Finally, this implementation does not support
- * high call_rcu_tasks() rates from multiple CPUs. If this is required,
- * per-CPU callback lists will be needed.
+ * Simple variant of RCU whose quiescent states are voluntary context
+ * switch, cond_resched_rcu_qs(), user-space execution, and idle.
+ * As such, grace periods can take one good long time. There are no
+ * read-side primitives similar to rcu_read_lock() and rcu_read_unlock()
+ * because this implementation is intended to get the system into a safe
+ * state for some of the manipulations involved in tracing and the like.
+ * Finally, this implementation does not support high call_rcu_tasks()
+ * rates from multiple CPUs. If this is required, per-CPU callback lists
+ * will be needed.
*/

/* Global list of callbacks and associated lock. */
@@ -542,11 +543,11 @@ static struct task_struct *rcu_tasks_kthread_ptr;
* period elapses, in other words after all currently executing RCU
* read-side critical sections have completed. call_rcu_tasks() assumes
* that the read-side critical sections end at a voluntary context
- * switch (not a preemption!), entry into idle, or transition to usermode
- * execution. As such, there are no read-side primitives analogous to
- * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
- * to determine that all tasks have passed through a safe state, not so
- * much for data-strcuture synchronization.
+ * switch (not a preemption!), cond_resched_rcu_qs(), entry into idle,
+ * or transition to usermode execution. As such, there are no read-side
+ * primitives analogous to rcu_read_lock() and rcu_read_unlock() because
+ * this primitive is intended to determine that all tasks have passed
+ * through a safe state, not so much for data-strcuture synchronization.
*
* See the description of call_rcu() for more detailed information on
* memory ordering guarantees.
--
2.17.1


2018-06-26 00:39:24

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 23/27] MAINTAINERS: Update RCU, SRCU, and TORTURE-TEST entries

The RCU, SRCU, and TORTURE-TEST entries are missing some recent
changes, so this commit brings them up to date.

Reported-by: Andrea Parri <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
MAINTAINERS | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9d5eeff51b5f..27bbc72c6191 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12021,9 +12021,9 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
F: Documentation/RCU/
X: Documentation/RCU/torture.txt
F: include/linux/rcu*
-X: include/linux/srcu.h
+X: include/linux/srcu*.h
F: kernel/rcu/
-X: kernel/torture.c
+X: kernel/rcu/srcu*.c

REAL TIME CLOCK (RTC) SUBSYSTEM
M: Alessandro Zummo <[email protected]>
@@ -13060,8 +13060,8 @@ L: [email protected]
W: http://www.rdrop.com/users/paulmck/RCU/
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
-F: include/linux/srcu.h
-F: kernel/rcu/srcu.c
+F: include/linux/srcu*.h
+F: kernel/rcu/srcu*.c

SERIAL LOW-POWER INTER-CHIP MEDIA BUS (SLIMbus)
M: Srinivas Kandagatla <[email protected]>
@@ -14420,6 +14420,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
F: Documentation/RCU/torture.txt
F: kernel/torture.c
F: kernel/rcu/rcutorture.c
+F: kernel/rcu/rcuperf.c
F: kernel/locking/locktorture.c

TOSHIBA ACPI EXTRAS DRIVER
--
2.17.1


2018-06-26 00:39:53

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 04/27] rcu: Use pr_fmt to prefix "rcu: " to logging output

From: Joe Perches <[email protected]>

This commit uses pr_fmt() to prefix RCU's logging output with "rcu: ".
It also adjusts some whitespace while in the area.

Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Revert string-breaking %s as requested by Andy Shevchenko. ]
---
kernel/rcu/rcuperf.c | 7 +++----
kernel/rcu/rcutorture.c | 4 ++--
kernel/rcu/srcutree.c | 5 ++++-
kernel/rcu/tree.c | 8 +++++---
kernel/rcu/tree_plugin.h | 10 ++++++----
5 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index b080bc4a4f45..00e395c0d7d0 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -680,12 +680,11 @@ rcu_perf_init(void)
break;
}
if (i == ARRAY_SIZE(perf_ops)) {
- pr_alert("rcu-perf: invalid perf type: \"%s\"\n",
- perf_type);
+ pr_alert("rcu-perf: invalid perf type: \"%s\"\n", perf_type);
pr_alert("rcu-perf types:");
for (i = 0; i < ARRAY_SIZE(perf_ops); i++)
- pr_alert(" %s", perf_ops[i]->name);
- pr_alert("\n");
+ pr_cont(" %s", perf_ops[i]->name);
+ pr_cont("\n");
firsterr = -EINVAL;
goto unwind;
}
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index eb6d4915b4e6..773ce12f9582 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1755,8 +1755,8 @@ rcu_torture_init(void)
torture_type);
pr_alert("rcu-torture types:");
for (i = 0; i < ARRAY_SIZE(torture_ops); i++)
- pr_alert(" %s", torture_ops[i]->name);
- pr_alert("\n");
+ pr_cont(" %s", torture_ops[i]->name);
+ pr_cont("\n");
firsterr = -EINVAL;
goto unwind;
}
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index d6d6ea9738c0..e526b56998af 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -26,6 +26,8 @@
*
*/

+#define pr_fmt(fmt) "rcu: " fmt
+
#include <linux/export.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
@@ -390,7 +392,8 @@ void _cleanup_srcu_struct(struct srcu_struct *sp, bool quiesced)
}
if (WARN_ON(rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)) != SRCU_STATE_IDLE) ||
WARN_ON(srcu_readers_active(sp))) {
- pr_info("%s: Active srcu_struct %p state: %d\n", __func__, sp, rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)));
+ pr_info("%s: Active srcu_struct %p state: %d\n",
+ __func__, sp, rcu_seq_state(READ_ONCE(sp->srcu_gp_seq)));
return; /* Caller forgot to stop doing call_srcu()? */
}
free_percpu(sp->sda);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 50a8f49e1763..fcb0355a8084 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -27,6 +27,9 @@
* For detailed explanation of Read-Copy Update mechanism see -
* Documentation/RCU
*/
+
+#define pr_fmt(fmt) "rcu: " fmt
+
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/init.h>
@@ -1374,8 +1377,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gp_seq)
* See Documentation/RCU/stallwarn.txt for info on how to debug
* RCU CPU stall warnings.
*/
- pr_err("INFO: %s detected stalls on CPUs/tasks:",
- rsp->name);
+ pr_err("INFO: %s detected stalls on CPUs/tasks:", rsp->name);
print_cpu_stall_info_begin();
rcu_for_each_leaf_node(rsp, rnp) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
@@ -4046,7 +4048,7 @@ static void __init rcu_init_geometry(void)
if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
nr_cpu_ids == NR_CPUS)
return;
- pr_info("RCU: Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%u\n",
+ pr_info("Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%u\n",
rcu_fanout_leaf, nr_cpu_ids);

/*
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 54a251640f53..dbfe90191e19 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -74,8 +74,8 @@ static void __init rcu_bootup_announce_oddness(void)
pr_info("\tRCU event tracing is enabled.\n");
if ((IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 64) ||
(!IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 32))
- pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n",
- RCU_FANOUT);
+ pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d.\n",
+ RCU_FANOUT);
if (rcu_fanout_exact)
pr_info("\tHierarchical RCU autobalancing is disabled.\n");
if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ))
@@ -88,11 +88,13 @@ static void __init rcu_bootup_announce_oddness(void)
pr_info("\tBuild-time adjustment of leaf fanout to %d.\n",
RCU_FANOUT_LEAF);
if (rcu_fanout_leaf != RCU_FANOUT_LEAF)
- pr_info("\tBoot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
+ pr_info("\tBoot-time adjustment of leaf fanout to %d.\n",
+ rcu_fanout_leaf);
if (nr_cpu_ids != NR_CPUS)
pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%u.\n", NR_CPUS, nr_cpu_ids);
#ifdef CONFIG_RCU_BOOST
- pr_info("\tRCU priority boosting: priority %d delay %d ms.\n", kthread_prio, CONFIG_RCU_BOOST_DELAY);
+ pr_info("\tRCU priority boosting: priority %d delay %d ms.\n",
+ kthread_prio, CONFIG_RCU_BOOST_DELAY);
#endif
if (blimit != DEFAULT_RCU_BLIMIT)
pr_info("\tBoot-time adjustment of callback invocation limit to %ld.\n", blimit);
--
2.17.1
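
For readers unfamiliar with the mechanism: the pr_*() helpers wrap their
format string in pr_fmt(), so defining pr_fmt() at the top of a file gives
every message in that file a common prefix. A stripped-down user-space sketch
of the idea, with pr_info()/pr_alert() reduced to printf() stand-ins (the real
definitions live in include/linux/printk.h):

#include <stdio.h>

/* Must be defined before the pr_*() helpers below are used. */
#define pr_fmt(fmt) "rcu: " fmt

/* Minimal stand-ins for the kernel's pr_info()/pr_alert(). */
#define pr_info(fmt, ...)  printf(pr_fmt(fmt), ##__VA_ARGS__)
#define pr_alert(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)

int main(void)
{
	/* Both lines come out prefixed with "rcu: ". */
	pr_info("Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%u\n", 16, 8);
	pr_alert("Hello, world!\n");
	return 0;
}

This is also why the patch drops the hand-written "RCU: " prefix from
rcu_init_geometry(): with pr_fmt() in place it would otherwise be printed twice.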


2018-06-26 00:40:05

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 01/27] rcu: Make rcu_read_unlock_special() static

Because rcu_read_unlock_special() is no longer used outside of
kernel/rcu/tree_plugin.h, this commit makes it static.

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 1 -
kernel/rcu/tree_plugin.h | 3 ++-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 65163aa0bb04..67ec077c7ee5 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -64,7 +64,6 @@ void rcu_barrier_tasks(void);

void __rcu_read_lock(void);
void __rcu_read_unlock(void);
-void rcu_read_unlock_special(struct task_struct *t);
void synchronize_rcu(void);

/*
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 613372246a07..54a251640f53 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -127,6 +127,7 @@ static struct rcu_data __percpu *const rcu_data_p = &rcu_preempt_data;

static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
bool wake);
+static void rcu_read_unlock_special(struct task_struct *t);

/*
* Tell them what RCU they are running.
@@ -461,7 +462,7 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
* notify RCU core processing or task having blocked during the RCU
* read-side critical section.
*/
-void rcu_read_unlock_special(struct task_struct *t)
+static void rcu_read_unlock_special(struct task_struct *t)
{
bool empty_exp;
bool empty_norm;
--
2.17.1


2018-06-26 00:40:10

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 08/27] rcu: Clarify and correct the rcu_preempt_qs() header comment

The rcu_preempt_qs() function only applies to the CPU, not the task.
A task really is allowed to invoke this function while in an RCU-preempt
read-side critical section, but only if it has first added itself to
some leaf rcu_node structure's ->blkd_tasks list.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0239cf8a4be6..07d1ad175994 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -294,13 +294,17 @@ static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
}

/*
- * Record a preemptible-RCU quiescent state for the specified CPU. Note
- * that this just means that the task currently running on the CPU is
- * not in a quiescent state. There might be any number of tasks blocked
- * while in an RCU read-side critical section.
+ * Record a preemptible-RCU quiescent state for the specified CPU.
+ * Note that this does not necessarily mean that the task currently running
+ * on the CPU is in a quiescent state: Instead, it means that the current
+ * grace period need not wait on any RCU read-side critical section that
+ * starts later on this CPU. It also means that if the current task is
+ * in an RCU read-side critical section, it has already added itself to
+ * some leaf rcu_node structure's ->blkd_tasks list. In addition to the
+ * current task, there might be any number of other tasks blocked while
+ * in an RCU read-side critical section.
*
- * As with the other rcu_*_qs() functions, callers to this function
- * must disable preemption.
+ * Callers to this function must disable preemption.
*/
static void rcu_preempt_qs(void)
{
--
2.17.1


2018-06-26 00:40:10

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 21/27] doc: Update synchronize_rcu() definition in whatisRCU.txt

From: Andrea Parri <[email protected]>

The synchronize_rcu() definition based on RW-locks in whatisRCU.txt
does not meet the "Memory-Barrier Guarantees" in Requirements.html;
for example, the following SB-like test:

P0:                       P1:

WRITE_ONCE(x, 1);         WRITE_ONCE(y, 1);
synchronize_rcu();        smp_mb();
r0 = READ_ONCE(y);        r1 = READ_ONCE(x);

should not be allowed to reach the state "r0 = 0 AND r1 = 0", but
the current write_lock()+write_unlock() definition cannot ensure
this. This commit therefore inserts an smp_mb__after_spinlock()
in order to cause this synchronize_rcu() implementation to provide
this memory-barrier guarantee.

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Andrea Parri <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/RCU/whatisRCU.txt | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 65eb856526b7..94288f1b8759 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -588,6 +588,7 @@ It is extremely simple:
void synchronize_rcu(void)
{
write_lock(&rcu_gp_mutex);
+ smp_mb__after_spinlock();
write_unlock(&rcu_gp_mutex);
}

@@ -609,12 +610,15 @@ don't forget about them when submitting patches making use of RCU!]

The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
and release a global reader-writer lock. The synchronize_rcu()
-primitive write-acquires this same lock, then immediately releases
-it. This means that once synchronize_rcu() exits, all RCU read-side
-critical sections that were in progress before synchronize_rcu() was
-called are guaranteed to have completed -- there is no way that
-synchronize_rcu() would have been able to write-acquire the lock
-otherwise.
+primitive write-acquires this same lock, then releases it. This means
+that once synchronize_rcu() exits, all RCU read-side critical sections
+that were in progress before synchronize_rcu() was called are guaranteed
+to have completed -- there is no way that synchronize_rcu() would have
+been able to write-acquire the lock otherwise. The smp_mb__after_spinlock()
+promotes synchronize_rcu() to a full memory barrier in compliance with
+the "Memory-Barrier Guarantees" listed in:
+
+ Documentation/RCU/Design/Requirements/Requirements.html.

It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired. Note also that rcu_read_lock() is immune
--
2.17.1
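
For context, the toy implementation in whatisRCU.txt maps the RCU primitives
onto a single global reader-writer lock, roughly as sketched below
(abbreviated from memory; see the document itself for the authoritative
version). The point of the added smp_mb__after_spinlock() is that lock
acquisition and release alone do not make synchronize_rcu() a full memory
barrier, so P0's store to x and load from y in the test above could otherwise
be reordered around the empty critical section:

static DEFINE_RWLOCK(rcu_gp_mutex);

void rcu_read_lock(void)
{
	read_lock(&rcu_gp_mutex);
}

void rcu_read_unlock(void)
{
	read_unlock(&rcu_gp_mutex);
}

void synchronize_rcu(void)
{
	write_lock(&rcu_gp_mutex);
	smp_mb__after_spinlock();	/* promote to a full memory barrier */
	write_unlock(&rcu_gp_mutex);
}

With the barrier in place, the SB-like outcome "r0 = 0 AND r1 = 0" is
forbidden, matching the "Memory-Barrier Guarantees" in Requirements.html.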


2018-06-26 00:40:13

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 24/27] rcu: Print stall-warning NMI dyntick state in hexadecimal

The ->dynticks_nmi_nesting field records the nesting depth of both
interrupt and NMI handlers. Because the kernel can enter interrupts
and never leave them (and vice versa) and because NMIs can interrupt
manipulation of the ->dynticks_nmi_nesting field, the values in this
field must be both chosen and manipulated very carefully. As a result,
although the value is zero when the corresponding CPU is executing
neither an interrupt nor an NMI handler, it is 4,611,686,018,427,387,906
on 64-bit systems when there is a single level of interrupt/NMI handling
in progress.

This number is difficult to remember and interpret, so this commit
switches the output to hexadecimal, resulting in the much nicer
0x4000000000000002.
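
For illustration, a minimal user-space sketch (the 0x4000000000000000
"nonidle" base is an assumption inferred from the example value above)
that prints the same nesting value both ways on a 64-bit system:

#include <stdio.h>

int main(void)
{
        /* Hypothetical model, not kernel code: high-order marker bit plus
         * a nesting count of 2 for one level of interrupt/NMI handling. */
        unsigned long nonidle_base = 0x4000000000000000UL;
        unsigned long one_level = nonidle_base + 2;

        printf("decimal: %lu\n", one_level);   /* 4611686018427387906 */
        printf("hex:     %#lx\n", one_level);  /* 0x4000000000000002 */
        return 0;
}

The hexadecimal form keeps the marker bit and the nesting count visually
separate, which is the point of the %#lx conversion in the patch below.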

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2cc9bf0d363a..c1b17f5b9361 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1801,7 +1801,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
}
print_cpu_stall_fast_no_hz(fast_no_hz, cpu);
delta = rcu_seq_ctr(rdp->mynode->gp_seq - rdp->rcu_iw_gp_seq);
- pr_err("\t%d-%c%c%c%c: (%lu %s) idle=%03x/%ld/%ld softirq=%u/%u fqs=%ld %s\n",
+ pr_err("\t%d-%c%c%c%c: (%lu %s) idle=%03x/%ld/%#lx softirq=%u/%u fqs=%ld %s\n",
cpu,
"O."[!!cpu_online(cpu)],
"o."[!!(rdp->grpmask & rdp->mynode->qsmaskinit)],
--
2.17.1


2018-06-26 00:41:06

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 19/27] rcu: Add diagnostics for rcutorture writer stall warning

This commit adds any in-the-future ->gp_seq_needed fields to the
diagnostics for an rcutorture writer stall warning message.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c5fec335bc31..bc1ed77a3df9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -606,11 +606,32 @@ EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);
*/
void show_rcu_gp_kthreads(void)
{
+ int cpu;
+ struct rcu_data *rdp;
+ struct rcu_node *rnp;
struct rcu_state *rsp;

for_each_rcu_flavor(rsp) {
pr_info("%s: wait state: %d ->state: %#lx\n",
rsp->name, rsp->gp_state, rsp->gp_kthread->state);
+ rcu_for_each_node_breadth_first(rsp, rnp) {
+ if (ULONG_CMP_GE(rsp->gp_seq, rnp->gp_seq_needed))
+ continue;
+ pr_info("\trcu_node %d:%d ->gp_seq %lu ->gp_seq_needed %lu\n",
+ rnp->grplo, rnp->grphi, rnp->gp_seq,
+ rnp->gp_seq_needed);
+ if (!rcu_is_leaf_node(rnp))
+ continue;
+ for_each_leaf_node_possible_cpu(rnp, cpu) {
+ rdp = per_cpu_ptr(rsp->rda, cpu);
+ if (rdp->gpwrap ||
+ ULONG_CMP_GE(rsp->gp_seq,
+ rdp->gp_seq_needed))
+ continue;
+ pr_info("\tcpu %d ->gp_seq_needed %lu\n",
+ cpu, rdp->gp_seq_needed);
+ }
+ }
/* sched_show_task(rsp->gp_kthread); */
}
}
--
2.17.1


2018-06-26 17:09:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 06/27] rcu: Mark task as .need_qs less aggressively

On Mon, Jun 25, 2018 at 05:34:52PM -0700, Paul E. McKenney wrote:
> If any scheduling-clock interrupt interrupts an RCU-preempt read-side
> critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
> field is set. This causes the outermost rcu_read_unlock() to incur the
> extra overhead of calling into rcu_read_unlock_special(). This commit
> reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
> if the grace period has been in effect for more than one second.

Even less aggressive is never setting it at all.

Changelog fails to explain why not setting it every tick is correct, nor
why 1s is a 'safe' value to use.


2018-06-26 17:18:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> From: "Joel Fernandes (Google)" <[email protected]>
>
> rcu_seq_snap may be tricky to decipher. Lets document how it works with
> an example to make it easier.

Since you had me looking at those functions: why isn't rcu_seq_snap()
using smp_load_acquire() and rcu_seq_end() using smp_store_release()?
Their respective comments seem to suggest that would be sufficient.


2018-06-26 19:08:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 07:30:55PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > +/*
> > + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> > + *
> > + * This function returns the earliest value of the grace-period sequence number
> > + * that will indicate that a full grace period has elapsed since the current
> > + * time. Once the grace-period sequence number has reached this value, it will
> > + * be safe to invoke all callbacks that have been registered prior to the
> > + * current time. This value is the current grace-period number plus two to the
> > + * power of the number of low-order bits reserved for state, then rounded up to
> > + * the next value in which the state bits are all zero.
>
> If you complete that by saying _why_ you need to round up there, then
> the below verbiage is completely redundant.

I will leave this between you, Joel, and whoever else is interested.
The initial state worked for me. ;-)

> > + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> > + * the seq is used to track if a GP is in progress or not. Given this, it is
> > + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> > + * why with an example:
> > + *
> > + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> > + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> > + * to account for the shift due to 2 state bits. Now, if the current seq is
> > + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> > + * is already in progress so the next GP that a future call back will be queued
> > + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> > + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> > + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> > + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> > + * overflow didn't occur. This masking is needed because in case RCU was idle
> > + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> > + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> > + *
> > + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> > + * which can be generalized to:
> > + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> > + */
>
> Is the below not much simpler:
>
> > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > {
> > unsigned long s;
>
> s = smp_load_aquire(sp);
>
> /* Add one GP */
> s += 1 << RCU_SEQ_CTR_SHIFT;
>
> /* Complete any pending state by rounding up */
> s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
>
> return s;
> }

Seems equivalent to me, but with more lines. To say nothing of more
levels of lookup of macro definitions. ;-)

Thanx, Paul


2018-06-26 19:31:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 11:08:55AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 07:14:54PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > > From: "Joel Fernandes (Google)" <[email protected]>
> > >
> > > rcu_seq_snap may be tricky to decipher. Lets document how it works with
> > > an example to make it easier.
> >
> > Since you had me looking at them functions; why isn't rcu_seq_snap()
> > using smp_load_acquire() and rcu_seq_end() using smp_store_release() ?
> > Their respective comments seem to suggest that would be sufficent.
>
> I do not believe that this would suffice. Would it make sense to refer
> to Documentation/RCU/Design/Memory-Ordering in the comment header?

No, because I can't read that thing in an editor.

> Except that this would invite sprinkling this pathname far and wide...
>
> The key point is that these functions are part of the any-to-any
> memory-ordering guarantee that RCU grace periods provide.

Then the existing comment is misleading and really needs change.

2018-06-26 19:34:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 09:21:13PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 11:08:55AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 26, 2018 at 07:14:54PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > > > From: "Joel Fernandes (Google)" <[email protected]>
> > > >
> > > > rcu_seq_snap may be tricky to decipher. Lets document how it works with
> > > > an example to make it easier.
> > >
> > > Since you had me looking at them functions; why isn't rcu_seq_snap()
> > > using smp_load_acquire() and rcu_seq_end() using smp_store_release() ?
> > > Their respective comments seem to suggest that would be sufficent.
> >
> > I do not believe that this would suffice. Would it make sense to refer
> > to Documentation/RCU/Design/Memory-Ordering in the comment header?
>
> No, because I can't read that thing in an editor.
>
> > Except that this would invite sprinkling this pathname far and wide...
> >
> > The key point is that these functions are part of the any-to-any
> > memory-ordering guarantee that RCU grace periods provide.
>
> Then the existing comment is misleading and really needs change.

Would it be sufficient to add something like "The memory barrier is
required to support the many-to-many ordering guaranteed by RCU grace
periods"?

Thanx, Paul


2018-06-26 20:17:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 12:31:46PM -0700, Paul E. McKenney wrote:
> > Then the existing comment is misleading and really needs change.
>
> Would it be sufficient to add something like "The memory barrier is
> required to support the many-to-many ordering guaranteed by RCU grace
> periods"?

That would lead me to wonder why the barrier is after the load and not (also) before.

2018-06-26 21:38:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 06/27] rcu: Mark task as .need_qs less aggressively

On Tue, Jun 26, 2018 at 11:03:03AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 07:08:12PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 25, 2018 at 05:34:52PM -0700, Paul E. McKenney wrote:
> > > If any scheduling-clock interrupt interrupts an RCU-preempt read-side
> > > critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
> > > field is set. This causes the outermost rcu_read_unlock() to incur the
> > > extra overhead of calling into rcu_read_unlock_special(). This commit
> > > reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
> > > if the grace period has been in effect for more than one second.
> >
> > Even less agressive is never setting it at all.
>
> True, but if the CPU has been in an RCU read-side critical section for
> a full second (which is the case with high probability when .b.need_qs
> is set after this change), we might want to respond to the end of that
> critical section sooner rather than later.
>
> > Changelog fails to explain why not setting it every tick is correct, nor
> > why 1s is a 'safe' value to use.
>
> The RCU CPU stall warning cannot be set to less than 3s, so 1s is
> reasonable. It is a tradeoff -- setting it lower causes a greater
> fraction of RCU read-side critical sections to incur extra overhead at
> rcu_read_unlock() time, while setting it higher keeps a lazy approach
> to reporting the quiescent state to core RCU for longer critical sections.
>
> The upcoming RCU-bh/RCU-preempt/RCU-sched consolidation will raise
> contention and overhead, so this is one of several things done to
> lower overhead and contention to compensate for that.

And does the following updated commit log help?

Thanx, Paul

------------------------------------------------------------------------

commit aaf8af680740afc363583a6ed9549b08b613dd3d
Author: Paul E. McKenney <[email protected]>
Date: Wed May 16 14:41:41 2018 -0700

rcu: Mark task as .need_qs less aggressively

If any scheduling-clock interrupt interrupts an RCU-preempt read-side
critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
field is set. This causes the outermost rcu_read_unlock() to incur the
extra overhead of calling into rcu_read_unlock_special(). This commit
reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
if the grace period has been in effect for more than one second.

Why one second? Because this is comfortably smaller than the minimum
RCU CPU stall-warning timeout of three seconds, but long enough that the
.need_qs marking should happen quite rarely. And if your RCU read-side
critical section has run on-CPU for a full second, it is not unreasonable
to invest some CPU time in ending the grace period quickly.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dbfe90191e19..0239cf8a4be6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -730,6 +730,7 @@ rcu_preempt_check_blocked_tasks(struct rcu_state *rsp, struct rcu_node *rnp)
*/
static void rcu_preempt_check_callbacks(void)
{
+ struct rcu_state *rsp = &rcu_preempt_state;
struct task_struct *t = current;

if (t->rcu_read_lock_nesting == 0) {
@@ -738,7 +739,9 @@ static void rcu_preempt_check_callbacks(void)
}
if (t->rcu_read_lock_nesting > 0 &&
__this_cpu_read(rcu_data_p->core_needs_qs) &&
- __this_cpu_read(rcu_data_p->cpu_no_qs.b.norm))
+ __this_cpu_read(rcu_data_p->cpu_no_qs.b.norm) &&
+ !t->rcu_read_unlock_special.b.need_qs &&
+ time_after(jiffies, rsp->gp_start + HZ))
t->rcu_read_unlock_special.b.need_qs = true;
}



2018-06-26 22:32:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> +/*
> + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> + *
> + * This function returns the earliest value of the grace-period sequence number
> + * that will indicate that a full grace period has elapsed since the current
> + * time. Once the grace-period sequence number has reached this value, it will
> + * be safe to invoke all callbacks that have been registered prior to the
> + * current time. This value is the current grace-period number plus two to the
> + * power of the number of low-order bits reserved for state, then rounded up to
> + * the next value in which the state bits are all zero.

If you complete that by saying _why_ you need to round up there, then
the below verbiage is completely redundant.

> + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> + * the seq is used to track if a GP is in progress or not. Given this, it is
> + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> + * why with an example:
> + *
> + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> + * to account for the shift due to 2 state bits. Now, if the current seq is
> + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> + * is already in progress so the next GP that a future call back will be queued
> + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> + * overflow didn't occur. This masking is needed because in case RCU was idle
> + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> + *
> + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> + * which can be generalized to:
> + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> + */

Is the below not much simpler:

> static inline unsigned long rcu_seq_snap(unsigned long *sp)
> {
> unsigned long s;

s = smp_load_acquire(sp);

/* Add one GP */
s += 1 << RCU_SEQ_CTR_SHIFT;

/* Complete any pending state by rounding up */
s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);

return s;
}



2018-06-26 22:46:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 06/27] rcu: Mark task as .need_qs less aggressively

On Tue, Jun 26, 2018 at 07:08:12PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 05:34:52PM -0700, Paul E. McKenney wrote:
> > If any scheduling-clock interrupt interrupts an RCU-preempt read-side
> > critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
> > field is set. This causes the outermost rcu_read_unlock() to incur the
> > extra overhead of calling into rcu_read_unlock_special(). This commit
> > reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
> > if the grace period has been in effect for more than one second.
>
> Even less agressive is never setting it at all.

True, but if the CPU has been in an RCU read-side critical section for
a full second (which is the case with high probability when .b.need_qs
is set after this change), we might want to respond to the end of that
critical section sooner rather than later.

> Changelog fails to explain why not setting it every tick is correct, nor
> why 1s is a 'safe' value to use.

The RCU CPU stall warning cannot be set to less than 3s, so 1s is
reasonable. It is a tradeoff -- setting it lower causes a greater
fraction of RCU read-side critical sections to incur extra overhead at
rcu_read_unlock() time, while setting it higher keeps a lazy approach
to reporting the quiescent state to core RCU for longer critical sections.

The upcoming RCU-bh/RCU-preempt/RCU-sched consolidation will raise
contention and overhead, so this is one of several things done to
lower overhead and contention to compensate for that.
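
Schematically, the one-second check amounts to the following, shown here as
a stand-alone user-space model (time_after() is redefined locally; in the
kernel it is the wraparound-safe jiffies comparison, and HZ is the tick rate):

#include <stdbool.h>
#include <stdio.h>

#define HZ 1000UL  /* Assume a 1000-Hz tick for this model. */

/* Local stand-in for the kernel's wraparound-safe time_after(). */
#define time_after(a, b) ((long)((b) - (a)) < 0)

static bool mark_need_qs(unsigned long jiffies, unsigned long gp_start)
{
        /* Set .need_qs only once the grace period is over one second old. */
        return time_after(jiffies, gp_start + HZ);
}

int main(void)
{
        unsigned long gp_start = 5000;

        printf("%d\n", mark_need_qs(gp_start + HZ / 2, gp_start)); /* 0: GP 0.5s old */
        printf("%d\n", mark_need_qs(gp_start + 2 * HZ, gp_start)); /* 1: GP 2s old */
        return 0;
}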

Thanx, Paul


2018-06-26 23:27:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 07:14:54PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > rcu_seq_snap may be tricky to decipher. Lets document how it works with
> > an example to make it easier.
>
> Since you had me looking at them functions; why isn't rcu_seq_snap()
> using smp_load_acquire() and rcu_seq_end() using smp_store_release() ?
> Their respective comments seem to suggest that would be sufficent.

I do not believe that this would suffice. Would it make sense to refer
to Documentation/RCU/Design/Memory-Ordering in the comment header?

Except that this would invite sprinkling this pathname far and wide...

The key point is that these functions are part of the any-to-any
memory-ordering guarantee that RCU grace periods provide.

Thanx, Paul


2018-06-27 00:43:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 11:10:25AM -0700, Paul E. McKenney wrote:
> > Is the below not much simpler:
> >
> > > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > > {
> > > unsigned long s;
> >
> > s = smp_load_aquire(sp);
> >
> > /* Add one GP */
> > s += 1 << RCU_SEQ_CTR_SHIFT;
> >
> > /* Complete any pending state by rounding up */
> > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> >
> > return s;
> > }
>
> Seems equivalent to me, but with more lines. To say nothing of more
> levels of lookup of macro definitions. ;-)

But it does explain the various parts in the equation, or at least gives
a fair clue as to how the thing composes. And I find the alignment thing
far easier to read than an open-coded variant, but whatever.

2018-06-27 01:43:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 10:15:03PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 12:31:46PM -0700, Paul E. McKenney wrote:
> > > Then the existing comment is misleading and really needs change.
> >
> > Would it be sufficient to add something like "The memory barrier is
> > required to support the many-to-many ordering guaranteed by RCU grace
> > periods"?
>
> What would lead me to wonder why after the load and not (also) before.

The memory barriers on the other sides of these primitives are supplied
by the caller. For example, in srcu_gp_end(), there is a set of
spin_unlock_irq_rcu_node()/spin_lock_irq_rcu_node() pairs between the
rcu_seq_end() of the prior SRCU grace period and the rcu_seq_start()
of the next one.

However, these things aren't anywhere near a fastpath, so I could add
the smp_mb() calls on the other sides for readability, if that would
be useful.

Thanx, Paul


2018-06-27 05:09:48

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 07:30:55PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > +/*
> > + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> > + *
> > + * This function returns the earliest value of the grace-period sequence number
> > + * that will indicate that a full grace period has elapsed since the current
> > + * time. Once the grace-period sequence number has reached this value, it will
> > + * be safe to invoke all callbacks that have been registered prior to the
> > + * current time. This value is the current grace-period number plus two to the
> > + * power of the number of low-order bits reserved for state, then rounded up to
> > + * the next value in which the state bits are all zero.
>
> If you complete that by saying _why_ you need to round up there, then
> the below verbiage is completely redundant.
>
> > + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> > + * the seq is used to track if a GP is in progress or not. Given this, it is
> > + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> > + * why with an example:
> > + *
> > + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> > + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> > + * to account for the shift due to 2 state bits. Now, if the current seq is
> > + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> > + * is already in progress so the next GP that a future call back will be queued
> > + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> > + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> > + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> > + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> > + * overflow didn't occur. This masking is needed because in case RCU was idle
> > + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> > + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> > + *
> > + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> > + * which can be generalized to:
> > + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> > + */
>
> Is the below not much simpler:
>
> > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > {
> > unsigned long s;
>
> s = smp_load_aquire(sp);
>
> /* Add one GP */
> s += 1 << RCU_SEQ_CTR_SHIFT;
>
> /* Complete any pending state by rounding up */

I would suggest this comment be changed to "Add another GP if there was a
pending state".

> s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
>

I agree with Peter's suggestions for both the verbiage reduction in the
comments in the header, as the new code he is proposing is more
self-documenting. I believe I proposed a big comment just because the code
wasn't self-documenting or obvious previously so needed an explanation.

How would you like to proceed? Let me know what you guys decide, I am really
Ok with anything. If you guys agree, should I write a follow-up patch with
Peter's suggestion that applies on top of this one? Or do we want to drop
this one in favor of Peter's suggestion?

I guess we also have to conclude the other part about using memory barriers,
but I think that should be a separate patch.

thanks!

- Joel


2018-06-27 17:54:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Tue, Jun 26, 2018 at 09:39:13PM -0700, Joel Fernandes wrote:
> On Tue, Jun 26, 2018 at 07:30:55PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > > +/*
> > > + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> > > + *
> > > + * This function returns the earliest value of the grace-period sequence number
> > > + * that will indicate that a full grace period has elapsed since the current
> > > + * time. Once the grace-period sequence number has reached this value, it will
> > > + * be safe to invoke all callbacks that have been registered prior to the
> > > + * current time. This value is the current grace-period number plus two to the
> > > + * power of the number of low-order bits reserved for state, then rounded up to
> > > + * the next value in which the state bits are all zero.
> >
> > If you complete that by saying _why_ you need to round up there, then
> > the below verbiage is completely redundant.
> >
> > > + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> > > + * the seq is used to track if a GP is in progress or not. Given this, it is
> > > + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> > > + * why with an example:
> > > + *
> > > + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> > > + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> > > + * to account for the shift due to 2 state bits. Now, if the current seq is
> > > + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> > > + * is already in progress so the next GP that a future call back will be queued
> > > + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> > > + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> > > + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> > > + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> > > + * overflow didn't occur. This masking is needed because in case RCU was idle
> > > + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> > > + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> > > + *
> > > + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> > > + * which can be generalized to:
> > > + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> > > + */
> >
> > Is the below not much simpler:
> >
> > > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > > {
> > > unsigned long s;
> >
> > s = smp_load_aquire(sp);
> >
> > /* Add one GP */
> > s += 1 << RCU_SEQ_CTR_SHIFT;
> >
> > /* Complete any pending state by rounding up */
>
> I would suggest this comment be changed to "Add another GP if there was a
> pending state".
>
> > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> >
>
> I agree with Peter's suggestions for both the verbiage reduction in the
> comments in the header, as the new code he is proposing is more
> self-documenting. I believe I proposed a big comment just because the code
> wasn't self-documenting or obvious previously so needed an explanation.
>
> How would you like to proceed? Let me know what you guys decide, I am really
> Ok with anything. If you guys agree, should I write a follow-up patch with
> Peter's suggestion that applies on top of this one? Or do we want to drop
> this one in favor of Peter's suggestion?

Shortening the comment would be good, so please do that.

I cannot say that I am much of a fan of the suggested change to the
computation, but I don't feel all that strongly about it. If the two
of you agree on a formulation and get at least one other RCU maintainer
or reviewer to agree as well, I will take the change.

> I guess we also have to conclude the other part about using memory barriers,
> but I think that should be a separate patch.

It definitely should not be part of this patch. ;-)

Thanx, Paul


2018-06-27 18:28:23

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Wed, Jun 27, 2018 at 10:54:36AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 09:39:13PM -0700, Joel Fernandes wrote:
> > On Tue, Jun 26, 2018 at 07:30:55PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > > > +/*
> > > > + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> > > > + *
> > > > + * This function returns the earliest value of the grace-period sequence number
> > > > + * that will indicate that a full grace period has elapsed since the current
> > > > + * time. Once the grace-period sequence number has reached this value, it will
> > > > + * be safe to invoke all callbacks that have been registered prior to the
> > > > + * current time. This value is the current grace-period number plus two to the
> > > > + * power of the number of low-order bits reserved for state, then rounded up to
> > > > + * the next value in which the state bits are all zero.
> > >
> > > If you complete that by saying _why_ you need to round up there, then
> > > the below verbiage is completely redundant.
> > >
> > > > + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> > > > + * the seq is used to track if a GP is in progress or not. Given this, it is
> > > > + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> > > > + * why with an example:
> > > > + *
> > > > + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> > > > + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> > > > + * to account for the shift due to 2 state bits. Now, if the current seq is
> > > > + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> > > > + * is already in progress so the next GP that a future call back will be queued
> > > > + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> > > > + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> > > > + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> > > > + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> > > > + * overflow didn't occur. This masking is needed because in case RCU was idle
> > > > + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> > > > + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> > > > + *
> > > > + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> > > > + * which can be generalized to:
> > > > + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> > > > + */
> > >
> > > Is the below not much simpler:
> > >
> > > > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > > > {
> > > > unsigned long s;
> > >
> > > s = smp_load_aquire(sp);
> > >
> > > /* Add one GP */
> > > s += 1 << RCU_SEQ_CTR_SHIFT;
> > >
> > > /* Complete any pending state by rounding up */
> >
> > I would suggest this comment be changed to "Add another GP if there was a
> > pending state".
> >
> > > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> > >
> >
> > I agree with Peter's suggestions for both the verbiage reduction in the
> > comments in the header, as the new code he is proposing is more
> > self-documenting. I believe I proposed a big comment just because the code
> > wasn't self-documenting or obvious previously so needed an explanation.
> >
> > How would you like to proceed? Let me know what you guys decide, I am really
> > Ok with anything. If you guys agree, should I write a follow-up patch with
> > Peter's suggestion that applies on top of this one? Or do we want to drop
> > this one in favor of Peter's suggestion?
>
> Shortening the comment would be good, so please do that.
>
> I cannot say that I am much of a fan of the suggested change to the
> computation, but I don't feel all that strongly about it. If the two

Did you mean from a code-generation standpoint or from a higher-level coding standpoint?

From a code generation perspective, the code is identical, I did a quick
test to confirm that:

0000000000000000 <rcu_seq_snap_old>:
0: e8 00 00 00 00 callq 5 <rcu_seq_snap_old+0x5>
5: 48 8b 07 mov (%rdi),%rax
8: f0 83 44 24 fc 00 lock addl $0x0,-0x4(%rsp)
e: 48 83 c0 07 add $0x7,%rax
12: 48 83 e0 fc and $0xfffffffffffffffc,%rax
16: c3 retq
17: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
1e: 00 00

0000000000000020 <rcu_seq_snap_new>:
20: e8 00 00 00 00 callq 25 <rcu_seq_snap_new+0x5>
25: 48 8b 07 mov (%rdi),%rax
28: f0 83 44 24 fc 00 lock addl $0x0,-0x4(%rsp)
2e: 48 83 c0 07 add $0x7,%rax
32: 48 83 e0 fc and $0xfffffffffffffffc,%rax
36: c3 retq
37: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
3e: 00 00
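
For anyone wanting to reproduce the comparison, here is a minimal user-space
model of the two formulations (assuming RCU_SEQ_CTR_SHIFT == 2 and
RCU_SEQ_STATE_MASK == 3 as discussed above, and leaving out the memory barrier):

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)
#define __ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))

/* Current formulation: one open-coded expression. */
static unsigned long snap_old(unsigned long s)
{
	return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

/* Suggested formulation: add one GP, then round up any pending state. */
static unsigned long snap_new(unsigned long s)
{
	s += 1UL << RCU_SEQ_CTR_SHIFT;
	return __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
}

int main(void)
{
	/* seq 12 = GP 3 idle -> 16 = GP 4; seq 13 = GP 3 in progress -> 20 = GP 5. */
	for (unsigned long s = 12; s <= 13; s++)
		printf("seq %lu: old %lu, new %lu\n", s, snap_old(s), snap_new(s));
	return 0;
}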

> of you agree on a formulation and get at least one other RCU maintainer
> or reviewer to agree as well, I will take the change.
>

Cool, sounds good.

thanks!

- Joel


2018-06-27 20:19:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Wed, Jun 27, 2018 at 11:27:26AM -0700, Joel Fernandes wrote:
> On Wed, Jun 27, 2018 at 10:54:36AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 26, 2018 at 09:39:13PM -0700, Joel Fernandes wrote:
> > > On Tue, Jun 26, 2018 at 07:30:55PM +0200, Peter Zijlstra wrote:
> > > > On Mon, Jun 25, 2018 at 05:35:02PM -0700, Paul E. McKenney wrote:
> > > > > +/*
> > > > > + * rcu_seq_snap - Take a snapshot of the update side's sequence number.
> > > > > + *
> > > > > + * This function returns the earliest value of the grace-period sequence number
> > > > > + * that will indicate that a full grace period has elapsed since the current
> > > > > + * time. Once the grace-period sequence number has reached this value, it will
> > > > > + * be safe to invoke all callbacks that have been registered prior to the
> > > > > + * current time. This value is the current grace-period number plus two to the
> > > > > + * power of the number of low-order bits reserved for state, then rounded up to
> > > > > + * the next value in which the state bits are all zero.
> > > >
> > > > If you complete that by saying _why_ you need to round up there, then
> > > > the below verbiage is completely redundant.
> > > >
> > > > > + * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
> > > > > + * the seq is used to track if a GP is in progress or not. Given this, it is
> > > > > + * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
> > > > > + * why with an example:
> > > > > + *
> > > > > + * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
> > > > > + * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
> > > > > + * to account for the shift due to 2 state bits. Now, if the current seq is
> > > > > + * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
> > > > > + * is already in progress so the next GP that a future call back will be queued
> > > > > + * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
> > > > > + * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
> > > > > + * will cause the extra +1 to the GP, along with the usual +1 explained before.
> > > > > + * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
> > > > > + * overflow didn't occur. This masking is needed because in case RCU was idle
> > > > > + * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
> > > > > + * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
> > > > > + *
> > > > > + * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
> > > > > + * which can be generalized to:
> > > > > + * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
> > > > > + */
> > > >
> > > > Is the below not much simpler:
> > > >
> > > > > static inline unsigned long rcu_seq_snap(unsigned long *sp)
> > > > > {
> > > > > unsigned long s;
> > > >
> > > > s = smp_load_aquire(sp);
> > > >
> > > > /* Add one GP */
> > > > s += 1 << RCU_SEQ_CTR_SHIFT;
> > > >
> > > > /* Complete any pending state by rounding up */
> > >
> > > I would suggest this comment be changed to "Add another GP if there was a
> > > pending state".
> > >
> > > > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> > > >
> > >
> > > I agree with Peter's suggestions for both the verbiage reduction in the
> > > comments in the header, as the new code he is proposing is more
> > > self-documenting. I believe I proposed a big comment just because the code
> > > wasn't self-documenting or obvious previously so needed an explanation.
> > >
> > > How would you like to proceed? Let me know what you guys decide, I am really
> > > Ok with anything. If you guys agree, should I write a follow-up patch with
> > > Peter's suggestion that applies on top of this one? Or do we want to drop
> > > this one in favor of Peter's suggestion?
> >
> > Shortening the comment would be good, so please do that.
> >
> > I cannot say that I am much of a fan of the suggested change to the
> > computation, but I don't feel all that strongly about it. If the two
>
> Did you mean a code generation standpoint or from a higher level coding standpoint?

I mean from the viewpoint that this changes the source code from a
straightforward single-line computation that can be seen at a glance
into something using a macro with a "__" prefix that invokes yet another
macro that eventually does something.

And yes, once I looked up the macro definitions, I did see that it is
functionally equivalent to the original code. ;-)

> From a code generation perspective, the code is identical, I did a quick
> test to confirm that:
>
> 0000000000000000 <rcu_seq_snap_old>:
> 0: e8 00 00 00 00 callq 5 <rcu_seq_snap_old+0x5>
> 5: 48 8b 07 mov (%rdi),%rax
> 8: f0 83 44 24 fc 00 lock addl $0x0,-0x4(%rsp)
> e: 48 83 c0 07 add $0x7,%rax
> 12: 48 83 e0 fc and $0xfffffffffffffffc,%rax
> 16: c3 retq
> 17: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
> 1e: 00 00
>
> 0000000000000020 <rcu_seq_snap_new>:
> 20: e8 00 00 00 00 callq 25 <rcu_seq_snap_new+0x5>
> 25: 48 8b 07 mov (%rdi),%rax
> 28: f0 83 44 24 fc 00 lock addl $0x0,-0x4(%rsp)
> 2e: 48 83 c0 07 add $0x7,%rax
> 32: 48 83 e0 fc and $0xfffffffffffffffc,%rax
> 36: c3 retq
> 37: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
> 3e: 00 00

That is reassuring.

Thanx, Paul

> > of you agree on a formulation and get at least one other RCU maintainer
> > or reviewer to agree as well, I will take the change.
> >
>
> Cool, sounds good.
>
> thanks!
>
> - Joel
>


2018-06-28 05:20:17

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Wed, Jun 27, 2018 at 11:27:26AM -0700, Joel Fernandes wrote:
[..]
> > > > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> > > >
> > >
> > > I agree with Peter's suggestions for both the verbiage reduction in the
> > > comments in the header, as the new code he is proposing is more
> > > self-documenting. I believe I proposed a big comment just because the code
> > > wasn't self-documenting or obvious previously so needed an explanation.
> > >
> > > How would you like to proceed? Let me know what you guys decide, I am really
> > > Ok with anything. If you guys agree, should I write a follow-up patch with
> > > Peter's suggestion that applies on top of this one? Or do we want to drop
> > > this one in favor of Peter's suggestion?
> >
> > Shortening the comment would be good, so please do that.

Paul,

Do you want to fold the below patch into the original one? Or do you prefer I
resend the original patch fixed up?

Following is the patch on top of the current 'dev' branch in your tree, with the
excessive comments removed.

Thanks to Peter for suggesting!

---8<-----------------------

From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] rcu: Remove excessive commentary on rcu_seq_snap

There isn't a strong need to explain rcu_seq_snap in excessive detail
with an example. Remove the unnecessary and redundant comments.

Suggested-by: Peter Zijlstra <[email protected]>
Fixes: 9701945dd79e ("rcu: Add comment documenting how rcu_seq_snap works")
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/rcu/rcu.h | 22 ----------------------
1 file changed, 22 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 0af6ce6d8b66..4d04683c31b2 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -101,28 +101,6 @@ static inline void rcu_seq_end(unsigned long *sp)
* current time. This value is the current grace-period number plus two to the
* power of the number of low-order bits reserved for state, then rounded up to
* the next value in which the state bits are all zero.
- *
- * In the current design, RCU_SEQ_STATE_MASK=3 and the least significant bit of
- * the seq is used to track if a GP is in progress or not. Given this, it is
- * sufficient if we add (6+1) and mask with ~3 to get the next GP. Let's see
- * why with an example:
- *
- * Say the current seq is 12 which is 0b1100 (GP is 3 and state bits are 0b00).
- * To get to the next GP number of 4, we have to add 0b100 to this (0x1 << 2)
- * to account for the shift due to 2 state bits. Now, if the current seq is
- * 13 (GP is 3 and state bits are 0b01), then it means the current grace period
- * is already in progress so the next GP that a future call back will be queued
- * to run at is GP+2 = 5, not 4. To account for the extra +1, we just overflow
- * the 2 lower bits by adding 0b11. In case the lower bit was set, the overflow
- * will cause the extra +1 to the GP, along with the usual +1 explained before.
- * This gives us GP+2. Finally we mask the lower to bits by ~0x3 in case the
- * overflow didn't occur. This masking is needed because in case RCU was idle
- * (no GP in progress so lower 2 bits are 0b00), then the overflow of the lower
- * 2 state bits wouldn't occur, so we mask to zero out those lower 2 bits.
- *
- * In other words, the next seq can be obtained by (0b11 + 0b100) & (~0b11)
- * which can be generalized to:
- * seq + (RCU_SEQ_STATE_MASK + (RCU_SEQ_STATE_MASK + 1)) & (~RCU_SEQ_STATE_MASK)
*/
static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
--
2.18.0.rc2.346.g013aa6912e-goog


2018-06-28 19:36:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 16/27] rcu: Add comment documenting how rcu_seq_snap works

On Wed, Jun 27, 2018 at 10:10:28PM -0700, Joel Fernandes wrote:
> On Wed, Jun 27, 2018 at 11:27:26AM -0700, Joel Fernandes wrote:
> [..]
> > > > > s = __ALIGN_MASK(s, RCU_SEQ_STATE_MASK);
> > > > >
> > > >
> > > > I agree with Peter's suggestions for both the verbiage reduction in the
> > > > comments in the header, as the new code he is proposing is more
> > > > self-documenting. I believe I proposed a big comment just because the code
> > > > wasn't self-documenting or obvious previously so needed an explanation.
> > > >
> > > > How would you like to proceed? Let me know what you guys decide, I am really
> > > > Ok with anything. If you guys agree, should I write a follow-up patch with
> > > > Peter's suggestion that applies on top of this one? Or do we want to drop
> > > > this one in favor of Peter's suggestion?
> > >
> > > Shortening the comment would be good, so please do that.
>
> Paul,
>
> Do you want to fold the below patch into the original one? Or do you prefer I
> resent the original patch fixed up?
>
> Following is the patch ontop of current 'dev' branch in your tree, with the
> excessive comments removed.
>
> Thanks to Peter for suggesting!

I merged this into the current commit, with the result shown below.

Thank you both!

Thanx, Paul

-----------------------------------------------------------------------

commit c9a6ed70aad9f4571afba3e12e869f5fccc26a40
Author: Joel Fernandes (Google) <[email protected]>
Date: Tue May 22 23:38:13 2018 -0700

rcu: Add comment documenting how rcu_seq_snap works

rcu_seq_snap may be tricky to decipher. Lets document how it works with
an example to make it easier.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Shrink comment as suggested by Peter Zijlstra. ]

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index aa215d6355f8..89f13fffac73 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,17 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
}

-/* Take a snapshot of the update side's sequence number. */
+/*
+ * rcu_seq_snap - Take a snapshot of the update side's sequence number.
+ *
+ * This function returns the earliest value of the grace-period sequence number
+ * that will indicate that a full grace period has elapsed since the current
+ * time. Once the grace-period sequence number has reached this value, it will
+ * be safe to invoke all callbacks that have been registered prior to the
+ * current time. This value is the current grace-period number plus two to the
+ * power of the number of low-order bits reserved for state, then rounded up to
+ * the next value in which the state bits are all zero.
+ */
static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
unsigned long s;