2017-04-12 16:54:53

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/13] Miscellaneous fixes for 4.12

Hello!

This series contains the following fixes:

1. Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU.

2. Use "WARNING" tag on RCU's lockdep splats.

3. Update obsolete callback_head comment.

4. Make RCU_FANOUT_LEAF help text more explicit about skew_tick.

5. Remove obsolete comment from rcu_future_gp_cleanup() header.

6. Disable sparse warning emitted by hlist_add_tail_rcu(), courtesy
of Michael S. Tsirkin.

7. Add smp_mb__after_atomic() to sync_exp_work_done().

8. Improve comments for hotplug/suspend/hibernate functions.

9. Use static initialization for "srcu" in mm/mmu_notifier.c.

10. Use correct path for Kconfig fragment for duplicate rcutorture
test scenarios.

11. Use bool value directly for ->beenonline comparison, courtesy
of Nicholas Mc Guire.

12. Use true/false in assignment to bool variable rcu_nocb_poll,
courtesy of Nicholas Mc Guire.

13. Fix typo in PER_RCU_NODE_PERIOD header comment.

Thanx, Paul

------------------------------------------------------------------------

Documentation/RCU/00-INDEX | 2
Documentation/RCU/rculist_nulls.txt | 6 -
Documentation/RCU/whatisRCU.txt | 3
drivers/gpu/drm/i915/i915_gem.c | 2
drivers/gpu/drm/i915/i915_gem_request.h | 2
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2
fs/jbd2/journal.c | 2
fs/signalfd.c | 2
include/linux/dma-fence.h | 4
include/linux/rculist.h | 3
include/linux/slab.h | 6 -
include/linux/types.h | 2
include/net/sock.h | 2
init/Kconfig | 10 +
kernel/fork.c | 4
kernel/locking/lockdep.c | 86 +++++++--------
kernel/locking/rtmutex-debug.c | 9 -
kernel/rcu/tree.c | 49 ++++++--
kernel/rcu/tree_exp.h | 1
kernel/rcu/tree_plugin.h | 2
kernel/signal.c | 2
mm/kasan/kasan.c | 6 -
mm/kmemcheck.c | 2
mm/mmu_notifier.c | 14 --
mm/rmap.c | 4
mm/slab.c | 6 -
mm/slab.h | 4
mm/slab_common.c | 6 -
mm/slob.c | 6 -
mm/slub.c | 12 +-
net/dccp/ipv4.c | 2
net/dccp/ipv6.c | 2
net/ipv4/tcp_ipv4.c | 2
net/ipv6/tcp_ipv6.c | 2
net/llc/af_llc.c | 2
net/llc/llc_conn.c | 4
net/llc/llc_sap.c | 2
net/netfilter/nf_conntrack_core.c | 8 -
net/smc/af_smc.c | 2
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2
40 files changed, 160 insertions(+), 129 deletions(-)


2017-04-12 16:56:03

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 02/13] lockdep: Use "WARNING" tag on lockdep splats

This commit changes lockdep splats to begin lines with "WARNING" and
to use pr_warn() instead of printk(). This change eases scripted
analysis of kernel console output.
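
For reference (a hedged aside, not part of the patch): pr_warn() is simply the
KERN_WARNING-level printk() wrapper from include/linux/printk.h, roughly:

        #define pr_warn(fmt, ...) \
                printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)

so the splat headers are emitted at warning level and scripts can key on the
uniform "WARNING:" prefix.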

Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
---
kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
kernel/locking/rtmutex-debug.c | 9 +++--
2 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index a95e5d1f4a9c..e9d4f85b290c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: possible circular locking dependency detected ]\n");
+ pr_warn("======================================================\n");
+ pr_warn("WARNING: possible circular locking dependency detected\n");
print_kernel_ident();
- printk("-------------------------------------------------------\n");
+ pr_warn("------------------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(check_src);
@@ -1482,11 +1482,11 @@ print_bad_irq_dependency(struct task_struct *curr,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+ pr_warn("=====================================================\n");
+ pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
irqclass, irqclass);
print_kernel_ident();
- printk("------------------------------------------------------\n");
+ pr_warn("-----------------------------------------------------\n");
printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
curr->comm, task_pid_nr(curr),
curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
@@ -1711,10 +1711,10 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
return 0;

printk("\n");
- printk("=============================================\n");
- printk("[ INFO: possible recursive locking detected ]\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: possible recursive locking detected\n");
print_kernel_ident();
- printk("---------------------------------------------\n");
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(next);
@@ -2061,10 +2061,10 @@ static void print_collision(struct task_struct *curr,
struct lock_chain *chain)
{
printk("\n");
- printk("======================\n");
- printk("[chain_key collision ]\n");
+ pr_warn("============================\n");
+ pr_warn("WARNING: chain_key collision\n");
print_kernel_ident();
- printk("----------------------\n");
+ pr_warn("----------------------------\n");
printk("%s/%d: ", current->comm, task_pid_nr(current));
printk("Hash chain already cached but the contents don't match!\n");

@@ -2360,10 +2360,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ INFO: inconsistent lock state ]\n");
+ pr_warn("================================\n");
+ pr_warn("WARNING: inconsistent lock state\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("--------------------------------\n");

printk("inconsistent {%s} -> {%s} usage.\n",
usage_str[prev_bit], usage_str[new_bit]);
@@ -2425,10 +2425,10 @@ print_irq_inversion_bug(struct task_struct *curr,
return 0;

printk("\n");
- printk("=========================================================\n");
- printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+ pr_warn("========================================================\n");
+ pr_warn("WARNING: possible irq lock inversion dependency detected\n");
print_kernel_ident();
- printk("---------------------------------------------------------\n");
+ pr_warn("--------------------------------------------------------\n");
printk("%s/%d just changed the state of lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(this);
@@ -3170,10 +3170,10 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
return 0;

printk("\n");
- printk("==================================\n");
- printk("[ BUG: Nested lock was not taken ]\n");
+ pr_warn("==================================\n");
+ pr_warn("WARNING: Nested lock was not taken\n");
print_kernel_ident();
- printk("----------------------------------\n");
+ pr_warn("----------------------------------\n");

printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
print_lock(hlock);
@@ -3383,10 +3383,10 @@ print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: bad unlock balance detected! ]\n");
+ pr_warn("=====================================\n");
+ pr_warn("WARNING: bad unlock balance detected!\n");
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("-------------------------------------\n");
printk("%s/%d is trying to release lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -3880,10 +3880,10 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ BUG: bad contention detected! ]\n");
+ pr_warn("=================================\n");
+ pr_warn("WARNING: bad contention detected!\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("---------------------------------\n");
printk("%s/%d is trying to contend lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -4244,10 +4244,10 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
return;

printk("\n");
- printk("=========================\n");
- printk("[ BUG: held lock freed! ]\n");
+ pr_warn("=========================\n");
+ pr_warn("WARNING: held lock freed!\n");
print_kernel_ident();
- printk("-------------------------\n");
+ pr_warn("-------------------------\n");
printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
print_lock(hlock);
@@ -4302,11 +4302,11 @@ static void print_held_locks_bug(void)
return;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: %s/%d still has locks held! ]\n",
+ pr_warn("====================================\n");
+ pr_warn("WARNING: %s/%d still has locks held!\n",
current->comm, task_pid_nr(current));
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("------------------------------------\n");
lockdep_print_held_locks(current);
printk("\nstack backtrace:\n");
dump_stack();
@@ -4371,7 +4371,7 @@ void debug_show_all_locks(void)
} while_each_thread(g, p);

printk("\n");
- printk("=============================================\n\n");
+ pr_warn("=============================================\n\n");

if (unlock)
read_unlock(&tasklist_lock);
@@ -4401,10 +4401,10 @@ asmlinkage __visible void lockdep_sys_exit(void)
if (!debug_locks_off())
return;
printk("\n");
- printk("================================================\n");
- printk("[ BUG: lock held when returning to user space! ]\n");
+ pr_warn("================================================\n");
+ pr_warn("WARNING: lock held when returning to user space!\n");
print_kernel_ident();
- printk("------------------------------------------------\n");
+ pr_warn("------------------------------------------------\n");
printk("%s/%d is leaving the kernel with locks still held!\n",
curr->comm, curr->pid);
lockdep_print_held_locks(curr);
@@ -4421,13 +4421,13 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
#endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
/* Note: the following can be executed concurrently, so be careful. */
printk("\n");
- pr_err("===============================\n");
- pr_err("[ ERR: suspicious RCU usage. ]\n");
+ pr_warn("=============================\n");
+ pr_warn("WARNING: suspicious RCU usage\n");
print_kernel_ident();
- pr_err("-------------------------------\n");
- pr_err("%s:%d %s!\n", file, line, s);
- pr_err("\nother info that might help us debug this:\n\n");
- pr_err("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+ pr_warn("-----------------------------\n");
+ printk("%s:%d %s!\n", file, line, s);
+ printk("\nother info that might help us debug this:\n\n");
+ printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
!rcu_lockdep_current_cpu_online()
? "RCU used illegally from offline CPU!\n"
: !rcu_is_watching()
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 97ee9df32e0f..db4f55211b04 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -102,10 +102,11 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
return;
}

- printk("\n============================================\n");
- printk( "[ BUG: circular locking deadlock detected! ]\n");
- printk("%s\n", print_tainted());
- printk( "--------------------------------------------\n");
+ pr_warn("\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: circular locking deadlock detected!\n");
+ pr_warn("%s\n", print_tainted());
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is deadlocking current task %s/%d\n\n",
task->comm, task_pid_nr(task),
current->comm, task_pid_nr(current));
--
2.5.2

2017-04-12 16:56:05

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

The sync_exp_work_done() function needs to fully order the
counter-check operation against anything happening after the
corresponding grace period. This is a theoretical bug, as all current
architectures either provide full ordering for atomic operations
or implement smp_mb__before_atomic() as smp_mb(). However, a little
future-proofing is a good thing,
especially given that smp_mb__before_atomic() is only required to
provide acquire semantics rather than full ordering. This commit
therefore adds smp_mb__after_atomic() after the atomic_long_inc()
in sync_exp_work_done().
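
A hedged sketch of the resulting pattern (the condition name below is
illustrative; the real check lives in sync_exp_work_done()):

        if (counter_check_succeeded) {          /* illustrative condition */
                /* Ensure test happens before caller kfree(). */
                smp_mb__before_atomic(); /* ^^^ */
                atomic_long_inc(stat);
                smp_mb__after_atomic();  /* ^^^ */
                return true;
        }
        return false;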

Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_exp.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index a7b639ccd46e..e0cafa5f3269 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat,
/* Ensure test happens before caller kfree(). */
smp_mb__before_atomic(); /* ^^^ */
atomic_long_inc(stat);
+ smp_mb__after_atomic(); /* ^^^ */
return true;
}
return false;
--
2.5.2

2017-04-12 16:56:12

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH tip/core/rcu 03/13] types: Update obsolete callback_head comment

The comment header for callback_head (and thus for rcu_head) states that
the bottom two bits of a pointer to these structures must be zero. This
is obsolete: The new requirement is that only the bottom bit need be
zero. This commit therefore updates this comment.
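
For context, the structure in question is defined in include/linux/types.h
roughly as follows; the aligned attribute is what keeps the low-order bits of
a pointer to it clear:

        struct callback_head {
                struct callback_head *next;
                void (*func)(struct callback_head *head);
        } __attribute__((aligned(sizeof(void *))));
        #define rcu_head callback_head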

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 1e7bd24848fc..258099a4ed82 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -209,7 +209,7 @@ struct ustat {
* naturally due ABI requirements, but some architectures (like CRIS) have
* weird ABI and we need to ask it explicitly.
*
- * The alignment is required to guarantee that bits 0 and 1 of @next will be
+ * The alignment is required to guarantee that bit 0 of @next will be
* clear under normal conditions -- as long as we use call_rcu(),
* call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue callback.
*
--
2.5.2

2017-04-12 16:56:15

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 08/13] rcu: Improve comments for hotplug/suspend/hibernate functions

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bdaa69d23a8a..c4f195dd7c94 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3894,6 +3894,10 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * Invoked early in the CPU-online process, when pretty much all
+ * services are available. The incoming CPU is not present.
+ */
int rcutree_prepare_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3907,6 +3911,9 @@ int rcutree_prepare_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Update RCU priority boot kthread affinity for CPU-hotplug changes.
+ */
static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state_p->rda, cpu);
@@ -3914,6 +3921,10 @@ static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
rcu_boost_kthread_setaffinity(rdp->mynode, outgoing);
}

+/*
+ * Near the end of the CPU-online process. Pretty much all services
+ * enabled, and the CPU is now very much alive.
+ */
int rcutree_online_cpu(unsigned int cpu)
{
sync_sched_exp_online_cleanup(cpu);
@@ -3921,13 +3932,19 @@ int rcutree_online_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Near the beginning of the process. The CPU is still very much alive
+ * with pretty much all services enabled.
+ */
int rcutree_offline_cpu(unsigned int cpu)
{
rcutree_affinity_setting(cpu, cpu);
return 0;
}

-
+/*
+ * Near the end of the offline process. We do only tracing here.
+ */
int rcutree_dying_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3937,6 +3954,9 @@ int rcutree_dying_cpu(unsigned int cpu)
return 0;
}

+/*
+ * The outgoing CPU is gone and we are running elsewhere.
+ */
int rcutree_dead_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3954,6 +3974,10 @@ int rcutree_dead_cpu(unsigned int cpu)
* incoming CPUs are not allowed to use RCU read-side critical sections
* until this function is called. Failing to observe this restriction
* will result in lockdep splats.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the incoming CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
*/
void rcu_cpu_starting(unsigned int cpu)
{
@@ -3979,9 +4003,6 @@ void rcu_cpu_starting(unsigned int cpu)
* The CPU is exiting the idle loop into the arch_cpu_idle_dead()
* function. We now remove it from the rcu_node tree's ->qsmaskinit
* bit masks.
- * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
- * function. We now remove it from the rcu_node tree's ->qsmaskinit
- * bit masks.
*/
static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
{
@@ -3997,6 +4018,14 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * The outgoing function has no further need of RCU, so remove it from
+ * the list of CPUs that RCU must track.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the outgoing CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
+ */
void rcu_report_dead(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -4011,6 +4040,10 @@ void rcu_report_dead(unsigned int cpu)
}
#endif

+/*
+ * On non-huge systems, use expedited RCU grace periods to make suspend
+ * and hibernation run faster.
+ */
static int rcu_pm_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
--
2.5.2

2017-04-12 16:56:27

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 05/13] rcu: Remove obsolete comment from rcu_future_gp_cleanup() header

The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 50fee7689e71..bdaa69d23a8a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1793,9 +1793,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,

/*
* Clean up any old requests for the just-ended grace period. Also return
- * whether any additional grace periods have been requested. Also invoke
- * rcu_nocb_gp_cleanup() in order to wake up any no-callbacks kthreads
- * waiting for this grace period to complete.
+ * whether any additional grace periods have been requested.
*/
static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
{
--
2.5.2

2017-04-12 16:56:18

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 06/13] hlist_add_tail_rcu disable sparse warning

From: "Michael S. Tsirkin" <[email protected]>

sparse is unhappy about this code in hlist_add_tail_rcu:

struct hlist_node *i, *last = NULL;

for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
last = i;

This is because hlist_first_rcu and hlist_next_rcu return
__rcu pointers.

It's a false positive: this is a write-side primitive and so
does not need to be called in a read-side critical section.

The following trivial patch disables the warning
without changing the behaviour in any way.

Note: __hlist_for_each_rcu would also remove the warning, but it would be
confusing since it calls rcu_dereference() and is designed to run in an RCU
read-side critical section.
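
For background (a hedged aside, not part of the patch): under sparse
(__CHECKER__), the __rcu annotation places a pointer in a separate address
space, roughly:

        # define __rcu  __attribute__((noderef, address_space(4)))

Assigning such an annotated pointer to a plain struct hlist_node * is what
triggers the warning; iterating with plain h->first and i->next on this
write-side path sidesteps it.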

Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rculist.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 4f7a9561b8c4..b1fd8bf85fdc 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -509,7 +509,8 @@ static inline void hlist_add_tail_rcu(struct hlist_node *n,
{
struct hlist_node *i, *last = NULL;

- for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
+ /* Note: write side code, so rcu accessors are not needed. */
+ for (i = h->first; i; i = i->next)
last = i;

if (last) {
--
2.5.2

2017-04-12 16:56:31

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 12/13] rcu: Use true/false in assignment to bool

From: Nicholas Mc Guire <[email protected]>

This commit makes the parse_rcu_nocb_poll() function assign true
(rather than the constant 1) to the bool variable rcu_nocb_poll.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0a62a8f1caac..f4b7a9be1a44 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1709,7 +1709,7 @@ __setup("rcu_nocbs=", rcu_nocb_setup);

static int __init parse_rcu_nocb_poll(char *arg)
{
- rcu_nocb_poll = 1;
+ rcu_nocb_poll = true;
return 0;
}
early_param("rcu_nocb_poll", parse_rcu_nocb_poll);
--
2.5.2

2017-04-12 16:56:34

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 09/13] mm: Use static initialization for "srcu"

The MM-notifier code currently dynamically initializes the srcu_struct
named "srcu" at subsys_initcall() time, and includes a BUG_ON() to check
this initialization in do_mmu_notifier_register(). Unfortunately, there
is no foolproof way to verify that an srcu_struct has been initialized,
given the possibility of an srcu_struct being allocated on the stack or
on the heap. This means that creating an srcu_struct_is_initialized()
function is not a reasonable course of action. Nor is peppering
do_mmu_notifier_register() with SRCU-specific #ifdefs an attractive
alternative.

This commit therefore uses DEFINE_STATIC_SRCU() to initialize
this srcu_struct at compile time, thus eliminating both the
subsys_initcall()-time initialization and the runtime BUG_ON().
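
A minimal usage sketch (illustrative, not taken from the patch) of a
statically initialized SRCU domain:

        DEFINE_STATIC_SRCU(my_srcu);    /* usable immediately, no init_srcu_struct() */

        static void my_reader(void)
        {
                int idx;

                idx = srcu_read_lock(&my_srcu);
                /* ... access data protected by my_srcu ... */
                srcu_read_unlock(&my_srcu, idx);
        }

        /* Updaters would use synchronize_srcu(&my_srcu) or call_srcu(). */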

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Vegard Nossum <[email protected]>
---
mm/mmu_notifier.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a7652acd2ab9..54ca54562928 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -21,7 +21,7 @@
#include <linux/slab.h>

/* global SRCU for all MMs */
-static struct srcu_struct srcu;
+DEFINE_STATIC_SRCU(srcu);

/*
* This function allows mmu_notifier::release callback to delay a call to
@@ -252,12 +252,6 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,

BUG_ON(atomic_read(&mm->mm_users) <= 0);

- /*
- * Verify that mmu_notifier_init() already run and the global srcu is
- * initialized.
- */
- BUG_ON(!srcu.per_cpu_ref);
-
ret = -ENOMEM;
mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
if (unlikely(!mmu_notifier_mm))
@@ -406,9 +400,3 @@ void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
mmdrop(mm);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister_no_release);
-
-static int __init mmu_notifier_init(void)
-{
- return init_srcu_struct(&srcu);
-}
-subsys_initcall(mmu_notifier_init);
--
2.5.2

2017-04-12 16:56:22

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

If you set RCU_FANOUT_LEAF too high, you can get lock contention
on the leaf rcu_node, and you should boot with the skew_tick kernel
parameter set in order to avoid this lock contention. This commit
therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
this.
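
For example (a hedged pointer, not part of the patch): booting with
skew_tick=1 on the kernel command line staggers the per-CPU scheduling-clock
ticks rather than firing them simultaneously, which is what relieves the
leaf-level rcu_node lock contention described above.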

Signed-off-by: Paul E. McKenney <[email protected]>
---
init/Kconfig | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a92f27da4a27..946e561e67b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
initialization. These systems tend to run CPU-bound, and thus
are not helped by synchronized interrupts, and thus tend to
skew them, which reduces lock contention enough that large
- leaf-level fanouts work well.
+ leaf-level fanouts work well. That said, setting leaf-level
+ fanout to a large number will likely cause problematic
+ lock contention on the leaf-level rcu_node structures unless
+ you boot with the skew_tick kernel parameter.

Select a specific number if testing RCU itself.

- Select the maximum permissible value for large systems.
+ Select the maximum permissible value for large systems, but
+ please understand that you may also need to set the
+ skew_tick kernel boot parameter to avoid contention
+ on the rcu_node structure's locks.

Take the default if unsure.

--
2.5.2

2017-04-12 16:58:01

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 13/13] rcu: Fix typo in PER_RCU_NODE_PERIOD header comment

This commit just changes a "the the" to "the" to reduce repetition.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7c238604df18..b1679e8cc5ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -199,7 +199,7 @@ static const int gp_cleanup_delay;

/*
* Number of grace periods between delays, normalized by the duration of
- * the delay. The longer the the delay, the more the grace periods between
+ * the delay. The longer the delay, the more the grace periods between
* each delay. The reason for this normalization is that it means that,
* for non-zero delays, the overall slowdown of grace periods is constant
* regardless of the duration of the delay. This arrangement balances
--
2.5.2

2017-04-12 16:58:28

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 10/13] torture: Use correct path for Kconfig fragment for duplicates

Currently, the rcutorture scripting will give an error message if
running a duplicate scenario that also happens to have a non-existent
build directory (b1, b2, ... in the rcutorture directory). Worse yet, if
the build directory has already been created and used for a real build,
the script will silently grab the wrong Kconfig fragment, which could
cause confusion to the poor sap (me) analyzing old test results. At
least the actual test runs correctly...

This commit therefore accesses the Kconfig fragment from the results
directory corresponding to the first of the duplicate scenarios, for
which a build was actually carried out. This prevents both the messages
and at least one form of later confusion.

Signed-off-by: Paul E. McKenney <[email protected]>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index ea6e373edc27..93eede4e8fbe 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -170,7 +170,7 @@ qemu_append="`identify_qemu_append "$QEMU"`"
# Pull in Kconfig-fragment boot parameters
boot_args="`configfrag_boot_params "$boot_args" "$config_template"`"
# Generate kernel-version-specific boot parameters
-boot_args="`per_version_boot_params "$boot_args" $builddir/.config $seconds`"
+boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`"

if test -n "$TORTURE_BUILDONLY"
then
--
2.5.2

2017-04-12 16:58:53

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 11/13] rcu: Use bool value directly

From: Nicholas Mc Guire <[email protected]>

The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c4f195dd7c94..7c238604df18 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3085,7 +3085,7 @@ __rcu_process_callbacks(struct rcu_state *rsp)
bool needwake;
struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

- WARN_ON_ONCE(rdp->beenonline == 0);
+ WARN_ON_ONCE(!rdp->beenonline);

/* Update RCU state based on any recent quiescent states. */
rcu_check_quiescent_state(rsp, rdp);
--
2.5.2

2017-04-12 17:00:04

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.
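
A hedged sketch of the lookup discipline such caches require (helper names
are illustrative; the llc and conntrack hunks below show real instances):

        rcu_read_lock();
again:
        obj = lookup_in_rcu_list(key);          /* illustrative helper */
        if (obj) {
                if (!atomic_inc_not_zero(&obj->refcnt))
                        goto again;     /* being torn down, but still type-safe */
                if (obj->key != key) {  /* memory recycled for another object */
                        put_object(obj);        /* illustrative helper */
                        goto again;
                }
        }
        rcu_read_unlock();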

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
[ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]
---
Documentation/RCU/00-INDEX | 2 +-
Documentation/RCU/rculist_nulls.txt | 6 +++---
Documentation/RCU/whatisRCU.txt | 3 ++-
drivers/gpu/drm/i915/i915_gem.c | 2 +-
drivers/gpu/drm/i915/i915_gem_request.h | 2 +-
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2 +-
fs/jbd2/journal.c | 2 +-
fs/signalfd.c | 2 +-
include/linux/dma-fence.h | 4 ++--
include/linux/slab.h | 6 ++++--
include/net/sock.h | 2 +-
kernel/fork.c | 4 ++--
kernel/signal.c | 2 +-
mm/kasan/kasan.c | 6 +++---
mm/kmemcheck.c | 2 +-
mm/rmap.c | 4 ++--
mm/slab.c | 6 +++---
mm/slab.h | 4 ++--
mm/slab_common.c | 6 +++---
mm/slob.c | 6 +++---
mm/slub.c | 12 ++++++------
net/dccp/ipv4.c | 2 +-
net/dccp/ipv6.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
net/llc/af_llc.c | 2 +-
net/llc/llc_conn.c | 4 ++--
net/llc/llc_sap.c | 2 +-
net/netfilter/nf_conntrack_core.c | 8 ++++----
net/smc/af_smc.c | 2 +-
30 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index f773a264ae02..1672573b037a 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -17,7 +17,7 @@ rcu_dereference.txt
rcubarrier.txt
- RCU and Unloadable Modules
rculist_nulls.txt
- - RCU list primitives for use with SLAB_DESTROY_BY_RCU
+ - RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
rcuref.txt
- Reference-count design for elements of lists/arrays protected by RCU
rcu.txt
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 18f9651ff23d..8151f0195f76 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -1,5 +1,5 @@
Using hlist_nulls to protect read-mostly linked lists and
-objects using SLAB_DESTROY_BY_RCU allocations.
+objects using SLAB_TYPESAFE_BY_RCU allocations.

Please read the basics in Documentation/RCU/listRCU.txt

@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
to solve following problem :

A typical RCU linked list managing objects which are
-allocated with SLAB_DESTROY_BY_RCU kmem_cache can
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :

1) Lookup algo
@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
-But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
+But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)

if (put_last_reference_on(obj) {
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5cbd8b2395b8..91c912e86915 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -925,7 +925,8 @@ d. Do you need RCU grace periods to complete even in the face

e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
- If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
+ If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
+ named SLAB_DESTROY_BY_RCU). But please be careful!

f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6908123162d1..3b668895ac24 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4552,7 +4552,7 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
dev_priv->requests = KMEM_CACHE(drm_i915_gem_request,
SLAB_HWCACHE_ALIGN |
SLAB_RECLAIM_ACCOUNT |
- SLAB_DESTROY_BY_RCU);
+ SLAB_TYPESAFE_BY_RCU);
if (!dev_priv->requests)
goto err_vmas;

diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index ea511f06efaf..9ee2750e1dde 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -493,7 +493,7 @@ static inline struct drm_i915_gem_request *
__i915_gem_active_get_rcu(const struct i915_gem_active *active)
{
/* Performing a lockless retrieval of the active request is super
- * tricky. SLAB_DESTROY_BY_RCU merely guarantees that the backing
+ * tricky. SLAB_TYPESAFE_BY_RCU merely guarantees that the backing
* slab of request objects will not be freed whilst we hold the
* RCU read lock. It does not guarantee that the request itself
* will not be freed and then *reused*. Viz,
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
index 12647af5a336..e7fb47e84a93 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
@@ -1071,7 +1071,7 @@ int ldlm_init(void)
ldlm_lock_slab = kmem_cache_create("ldlm_locks",
sizeof(struct ldlm_lock), 0,
SLAB_HWCACHE_ALIGN |
- SLAB_DESTROY_BY_RCU, NULL);
+ SLAB_TYPESAFE_BY_RCU, NULL);
if (!ldlm_lock_slab) {
kmem_cache_destroy(ldlm_resource_slab);
return -ENOMEM;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..7f8f962454e5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2340,7 +2340,7 @@ static int jbd2_journal_init_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
- SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
+ SLAB_TEMPORARY | SLAB_TYPESAFE_BY_RCU,
NULL); /* ctor */
retval = 0;
if (!jbd2_journal_head_cache) {
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 270221fcef42..7e3d71109f51 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -38,7 +38,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
/*
* The lockless check can race with remove_wait_queue() in progress,
* but in this case its caller should run under rcu_read_lock() and
- * sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU, we can safely return.
*/
if (likely(!waitqueue_active(wqh)))
return;
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 6048fa404e57..a5195a7d6f77 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -229,7 +229,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence)
*
* Function returns NULL if no refcount could be obtained, or the fence.
* This function handles acquiring a reference to a fence that may be
- * reallocated within the RCU grace period (such as with SLAB_DESTROY_BY_RCU),
+ * reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU),
* so long as the caller is using RCU on the pointer to the fence.
*
* An alternative mechanism is to employ a seqlock to protect a bunch of
@@ -257,7 +257,7 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
* have successfully acquire a reference to it. If it no
* longer matches, we are holding a reference to some other
* reallocated pointer. This is possible if the allocator
- * is using a freelist like SLAB_DESTROY_BY_RCU where the
+ * is using a freelist like SLAB_TYPESAFE_BY_RCU where the
* fence remains valid for the RCU grace period, but it
* may be reallocated. When using such allocators, we are
* responsible for ensuring the reference we get is to
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3c37a8c51921..04a7f7993e67 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -28,7 +28,7 @@
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
/*
- * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
+ * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
*
* This delays freeing the SLAB page by a grace period, it does _NOT_
* delay object freeing. This means that if you do kmem_cache_free()
@@ -61,8 +61,10 @@
*
* rcu_read_lock before reading the address, then rcu_read_unlock after
* taking the spinlock within the structure expected at that address.
+ *
+ * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
*/
-#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
+#define SLAB_TYPESAFE_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */

diff --git a/include/net/sock.h b/include/net/sock.h
index 5e5997654db6..59cdccaa30e7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -993,7 +993,7 @@ struct smc_hashinfo;
struct module;

/*
- * caches using SLAB_DESTROY_BY_RCU should let .next pointer from nulls nodes
+ * caches using SLAB_TYPESAFE_BY_RCU should let .next pointer from nulls nodes
* un-modified. Special care is taken when initializing object to zero.
*/
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..9330ce24f1bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1313,7 +1313,7 @@ void __cleanup_sighand(struct sighand_struct *sighand)
if (atomic_dec_and_test(&sighand->count)) {
signalfd_cleanup(sighand);
/*
- * sighand_cachep is SLAB_DESTROY_BY_RCU so we can free it
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it
* without an RCU grace period, see __lock_task_sighand().
*/
kmem_cache_free(sighand_cachep, sighand);
@@ -2144,7 +2144,7 @@ void __init proc_caches_init(void)
{
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU|
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_NOTRACK|SLAB_ACCOUNT, sighand_ctor);
signal_cachep = kmem_cache_create("signal_cache",
sizeof(struct signal_struct), 0,
diff --git a/kernel/signal.c b/kernel/signal.c
index 7e59ebc2c25e..6df5f72158e4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1237,7 +1237,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
}
/*
* This sighand can be already freed and even reused, but
- * we rely on SLAB_DESTROY_BY_RCU and sighand_ctor() which
+ * we rely on SLAB_TYPESAFE_BY_RCU and sighand_ctor() which
* initializes ->siglock: this slab can't go away, it has
* the same object type, ->siglock can't be reinitialized.
*
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 98b27195e38b..4b20061102f6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -413,7 +413,7 @@ void kasan_cache_create(struct kmem_cache *cache, size_t *size,
*size += sizeof(struct kasan_alloc_meta);

/* Add free meta. */
- if (cache->flags & SLAB_DESTROY_BY_RCU || cache->ctor ||
+ if (cache->flags & SLAB_TYPESAFE_BY_RCU || cache->ctor ||
cache->object_size < sizeof(struct kasan_free_meta)) {
cache->kasan_info.free_meta_offset = *size;
*size += sizeof(struct kasan_free_meta);
@@ -561,7 +561,7 @@ static void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE);

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return;

kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
@@ -572,7 +572,7 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object)
s8 shadow_byte;

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;

shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
diff --git a/mm/kmemcheck.c b/mm/kmemcheck.c
index 5bf191756a4a..2d5959c5f7c5 100644
--- a/mm/kmemcheck.c
+++ b/mm/kmemcheck.c
@@ -95,7 +95,7 @@ void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object,
void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size)
{
/* TODO: RCU freeing is unsupported for now; hide false positives. */
- if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (!s->ctor && !(s->flags & SLAB_TYPESAFE_BY_RCU))
kmemcheck_mark_freed(object, size);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ed681ccc7b..8ffd59df8a3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -430,7 +430,7 @@ static void anon_vma_ctor(void *data)
void __init anon_vma_init(void)
{
anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
- 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
+ 0, SLAB_TYPESAFE_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
anon_vma_ctor);
anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain,
SLAB_PANIC|SLAB_ACCOUNT);
@@ -481,7 +481,7 @@ struct anon_vma *page_get_anon_vma(struct page *page)
* If this page is still mapped, then its anon_vma cannot have been
* freed. But if it has been unmapped, we have no security against the
* anon_vma structure being freed and reused (for another anon_vma:
- * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero()
+ * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
* above cannot corrupt).
*/
if (!page_mapped(page)) {
diff --git a/mm/slab.c b/mm/slab.c
index 807d86c76908..93c827864862 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1728,7 +1728,7 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)

freelist = page->freelist;
slab_destroy_debugcheck(cachep, page);
- if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cachep->flags & SLAB_TYPESAFE_BY_RCU))
call_rcu(&page->rcu_head, kmem_rcu_free);
else
kmem_freepages(cachep, page);
@@ -1924,7 +1924,7 @@ static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,

cachep->num = 0;

- if (cachep->ctor || flags & SLAB_DESTROY_BY_RCU)
+ if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
return false;

left = calculate_slab_order(cachep, size,
@@ -2030,7 +2030,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
- if (!(flags & SLAB_DESTROY_BY_RCU))
+ if (!(flags & SLAB_TYPESAFE_BY_RCU))
flags |= SLAB_POISON;
#endif
#endif
diff --git a/mm/slab.h b/mm/slab.h
index 65e7c3fcac72..9cfcf099709c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -126,7 +126,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,

/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
- SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
+ SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS )

#if defined(CONFIG_DEBUG_SLAB)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
@@ -415,7 +415,7 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
* back there or track user information then we can
* only use the space before that information.
*/
- if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ if (s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_STORE_USER))
return s->inuse;
/*
* Else we can use all the padding etc for the allocation
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 09d0e849b07f..01a0fe2eb332 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -39,7 +39,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
* Set of flags that will prevent slab merging
*/
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
- SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
+ SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
SLAB_FAILSLAB | SLAB_KASAN)

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
@@ -500,7 +500,7 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
struct kmem_cache *s, *s2;

/*
- * On destruction, SLAB_DESTROY_BY_RCU kmem_caches are put on the
+ * On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the
* @slab_caches_to_rcu_destroy list. The slab pages are freed
* through RCU and and the associated kmem_cache are dereferenced
* while freeing the pages, so the kmem_caches should be freed only
@@ -537,7 +537,7 @@ static int shutdown_cache(struct kmem_cache *s)
memcg_unlink_cache(s);
list_del(&s->list);

- if (s->flags & SLAB_DESTROY_BY_RCU) {
+ if (s->flags & SLAB_TYPESAFE_BY_RCU) {
list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
schedule_work(&slab_caches_to_rcu_destroy_work);
} else {
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4357ec..1bae78d71096 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -126,7 +126,7 @@ static inline void clear_slob_page_free(struct page *sp)

/*
* struct slob_rcu is inserted at the tail of allocated slob blocks, which
- * were created with a SLAB_DESTROY_BY_RCU slab. slob_rcu is used to free
+ * were created with a SLAB_TYPESAFE_BY_RCU slab. slob_rcu is used to free
* the block using call_rcu.
*/
struct slob_rcu {
@@ -524,7 +524,7 @@ EXPORT_SYMBOL(ksize);

int __kmem_cache_create(struct kmem_cache *c, unsigned long flags)
{
- if (flags & SLAB_DESTROY_BY_RCU) {
+ if (flags & SLAB_TYPESAFE_BY_RCU) {
/* leave room for rcu footer at the end of object */
c->size += sizeof(struct slob_rcu);
}
@@ -598,7 +598,7 @@ static void kmem_rcu_free(struct rcu_head *head)
void kmem_cache_free(struct kmem_cache *c, void *b)
{
kmemleak_free_recursive(b, c->flags);
- if (unlikely(c->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(c->flags & SLAB_TYPESAFE_BY_RCU)) {
struct slob_rcu *slob_rcu;
slob_rcu = b + (c->size - sizeof(struct slob_rcu));
slob_rcu->size = c->size;
diff --git a/mm/slub.c b/mm/slub.c
index 7f4bc7027ed5..57e5156f02be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1687,7 +1687,7 @@ static void rcu_free_slab(struct rcu_head *h)

static void free_slab(struct kmem_cache *s, struct page *page)
{
- if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
struct rcu_head *head;

if (need_reserve_slab_rcu) {
@@ -2963,7 +2963,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
* slab_free_freelist_hook() could have put the items into quarantine.
* If so, no need to free them.
*/
- if (s->flags & SLAB_KASAN && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (s->flags & SLAB_KASAN && !(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
do_slab_free(s, page, head, tail, cnt, addr);
}
@@ -3433,7 +3433,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
* the slab may touch the object after free or before allocation
* then we should never poison the object itself.
*/
- if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
+ if ((flags & SLAB_POISON) && !(flags & SLAB_TYPESAFE_BY_RCU) &&
!s->ctor)
s->flags |= __OBJECT_POISON;
else
@@ -3455,7 +3455,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
*/
s->inuse = size;

- if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
+ if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
s->ctor)) {
/*
* Relocate free pointer after the object if it is not
@@ -3537,7 +3537,7 @@ static int kmem_cache_open(struct kmem_cache *s, unsigned long flags)
s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
s->reserved = 0;

- if (need_reserve_slab_rcu && (s->flags & SLAB_DESTROY_BY_RCU))
+ if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU))
s->reserved = sizeof(struct rcu_head);

if (!calculate_sizes(s, -1))
@@ -5042,7 +5042,7 @@ SLAB_ATTR_RO(cache_dma);

static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU));
}
SLAB_ATTR_RO(destroy_by_rcu);

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 409d0cfd3447..90210a0e3888 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -950,7 +950,7 @@ static struct proto dccp_v4_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp_request_sock_ops,
.twsk_prot = &dccp_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 233b57367758..b4019a5e4551 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1012,7 +1012,7 @@ static struct proto dccp_v6_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp6_request_sock_ops,
.twsk_prot = &dccp6_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9a89b8deafae..82c89abeb989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2398,7 +2398,7 @@ struct proto tcp_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp_timewait_sock_ops,
.rsk_prot = &tcp_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 60a5295a7de6..bdbc4327ebee 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1919,7 +1919,7 @@ struct proto tcpv6_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp6_timewait_sock_ops,
.rsk_prot = &tcp6_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 06186d608a27..d096ca563054 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -142,7 +142,7 @@ static struct proto llc_proto = {
.name = "LLC",
.owner = THIS_MODULE,
.obj_size = sizeof(struct llc_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};

/**
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 8bc5a1bd2d45..9b02c13d258b 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -506,7 +506,7 @@ static struct sock *__llc_lookup_established(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_estab_match(sap, daddr, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
@@ -565,7 +565,7 @@ static struct sock *__llc_lookup_listener(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_listener_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c
index 5404d0d195cc..63b6ab056370 100644
--- a/net/llc/llc_sap.c
+++ b/net/llc/llc_sap.c
@@ -328,7 +328,7 @@ static struct sock *llc_lookup_dgram(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_dgram_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 071b97fcbefb..fdcdac7916b2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -914,7 +914,7 @@ static unsigned int early_drop_list(struct net *net,
continue;

/* kill only if still in same netns -- might have moved due to
- * SLAB_DESTROY_BY_RCU rules.
+ * SLAB_TYPESAFE_BY_RCU rules.
*
* We steal the timer reference. If that fails timer has
* already fired or someone else deleted it. Just drop ref
@@ -1069,7 +1069,7 @@ __nf_conntrack_alloc(struct net *net,

/*
* Do not use kmem_cache_zalloc(), as this cache uses
- * SLAB_DESTROY_BY_RCU.
+ * SLAB_TYPESAFE_BY_RCU.
*/
ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
if (ct == NULL)
@@ -1114,7 +1114,7 @@ void nf_conntrack_free(struct nf_conn *ct)
struct net *net = nf_ct_net(ct);

/* A freed object has refcnt == 0, that's
- * the golden rule for SLAB_DESTROY_BY_RCU
+ * the golden rule for SLAB_TYPESAFE_BY_RCU
*/
NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);

@@ -1878,7 +1878,7 @@ int nf_conntrack_init_start(void)
nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
sizeof(struct nf_conn),
NFCT_INFOMASK + 1,
- SLAB_DESTROY_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
+ SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
if (!nf_conntrack_cachep)
goto err_cachep;

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 85837ab90e89..d34bbd6d8f38 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -101,7 +101,7 @@ struct proto smc_proto = {
.unhash = smc_unhash_sk,
.obj_size = sizeof(struct smc_sock),
.h.smc_hash = &smc_v4_hashinfo,
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};
EXPORT_SYMBOL_GPL(smc_proto);

--
2.5.2

2017-04-13 09:13:05

by Peter Zijlstra

Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Wed, Apr 12, 2017 at 09:55:37AM -0700, Paul E. McKenney wrote:
> A group of Linux kernel hackers reported chasing a bug that resulted
> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
> guarantee, that is, that no block from such a slab would be reallocated
> during an RCU read-side critical section. Of course, that is not the
> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
> slab of blocks.

And that while we wrote a huge honking comment right along with it...

> [ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]

I cannot find any occurrence of "tomb" or "TOMB" in the actual patch,
confused?

2017-04-13 09:14:29

by Peter Zijlstra

Subject: Re: [PATCH tip/core/rcu 02/13] lockdep: Use "WARNING" tag on lockdep splats

On Wed, Apr 12, 2017 at 09:55:38AM -0700, Paul E. McKenney wrote:
> This commit changes lockdep splats to begin lines with "WARNING" and
> to use pr_warn() instead of printk(). This change eases scripted
> analysis of kernel console output.
>
> Reported-by: Dmitry Vyukov <[email protected]>
> Reported-by: Ingo Molnar <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Acked-by: Dmitry Vyukov <[email protected]>
> ---
> kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
> kernel/locking/rtmutex-debug.c | 9 +++--
> 2 files changed, 48 insertions(+), 47 deletions(-)
>
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index a95e5d1f4a9c..e9d4f85b290c 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
> return 0;
>
> printk("\n");
> - printk("======================================================\n");
> - printk("[ INFO: possible circular locking dependency detected ]\n");
> + pr_warn("======================================================\n");
> + pr_warn("WARNING: possible circular locking dependency detected\n");
> print_kernel_ident();
> - printk("-------------------------------------------------------\n");
> + pr_warn("------------------------------------------------------\n");
> printk("%s/%d is trying to acquire lock:\n",
> curr->comm, task_pid_nr(curr));
> print_lock(check_src);

Blergh, not a fan of this patch. Now we have an odd mix of pr_crap() and
printk().

2017-04-13 09:15:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 12, 2017 at 09:55:40AM -0700, Paul E. McKenney wrote:
> If you set RCU_FANOUT_LEAF too high, you can get lock contention
> on the leaf rcu_node, and you should boot with the skew_tick kernel
> parameter set in order to avoid this lock contention. This commit
> therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> this.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> init/Kconfig | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index a92f27da4a27..946e561e67b7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> initialization. These systems tend to run CPU-bound, and thus
> are not helped by synchronized interrupts, and thus tend to
> skew them, which reduces lock contention enough that large
> - leaf-level fanouts work well.
> + leaf-level fanouts work well. That said, setting leaf-level
> + fanout to a large number will likely cause problematic
> + lock contention on the leaf-level rcu_node structures unless
> + you boot with the skew_tick kernel parameter.

Why mention a way out of a problem you shouldn't have to begin with?

Just state it's bad and results in lock contention and leave it at that.
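
For readers following along: the workaround under debate is the existing
skew_tick= boot parameter, which offsets the periodic tick on each CPU so the
ticks do not all fire at the same instant. If one did choose to use it, it is
simply appended to the kernel command line, for example:

	skew_tick=1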

2017-04-13 09:18:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> However, a little future-proofing is a good thing,
> especially given that smp_mb__before_atomic() is only required to
> provide acquire semantics rather than full ordering. This commit
> therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> in sync_exp_work_done().

Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
provide full MB, no confusion about that.

We have other primitives for acquire/release.
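
For context, the code shape being discussed is roughly the sketch below (a
sketch only, with a hypothetical counter; not the sync_exp_work_done() source).
The question is whether the trailing barrier is needed, which hinges on the
smp_mb__{before,after}_atomic() pair providing a full barrier around the
non-value-returning atomic rather than mere acquire/release:

	static atomic_long_t nr_done;		/* hypothetical statistic counter */

	static void mark_done(void)
	{
		smp_mb__before_atomic();	/* order earlier accesses before the inc */
		atomic_long_inc(&nr_done);	/* non-value-returning: no implied ordering */
		smp_mb__after_atomic();		/* the barrier the patch proposed to add */
	}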

2017-04-13 09:34:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > However, a little future-proofing is a good thing,
> > especially given that smp_mb__before_atomic() is only required to
> > provide acquire semantics rather than full ordering. This commit
> > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > in sync_exp_work_done().
>
> Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must

s/away/aware/ typing hard

> provide full MB, no confusion about that.
>
> We have other primitives for acquire/release.

2017-04-13 11:07:09

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On 04/13/2017 11:12 AM, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 09:55:37AM -0700, Paul E. McKenney wrote:
>> A group of Linux kernel hackers reported chasing a bug that resulted
>> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
>> guarantee, that is, that no block from such a slab would be reallocated
>> during an RCU read-side critical section. Of course, that is not the
>> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
>> slab of blocks.
>
> And that while we wrote a huge honking comment right along with it...
>
>> [ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]
>
> I cannot find any occurrence of "tomb" or "TOMB" in the actual patch,
> confused?

It's the comments such as:

+ * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.

so that people who remember the old name can git grep its fate.


2017-04-13 16:00:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Thu, Apr 13, 2017 at 01:06:56PM +0200, Vlastimil Babka wrote:
> On 04/13/2017 11:12 AM, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 09:55:37AM -0700, Paul E. McKenney wrote:
> >> A group of Linux kernel hackers reported chasing a bug that resulted
> >> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
> >> guarantee, that is, that no block from such a slab would be reallocated
> >> during an RCU read-side critical section. Of course, that is not the
> >> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
> >> slab of blocks.
> >
> > And that while we wrote a huge honking comment right along with it...
> >
> >> [ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]
> >
> > I cannot find any occurrence of "tomb" or "TOMB" in the actual patch,
> > confused?
>
> It's the comments such as:
>
> + * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
>
> so that people who remember the old name can git grep its fate.

Exactly!

But I must confess that "tombstone" was an excessively obscure word
choice, even for native English speakers. I have reworded as follows:

[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]

Does that help?

Thanx, Paul

2017-04-13 16:01:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 02/13] lockdep: Use "WARNING" tag on lockdep splats

On Thu, Apr 13, 2017 at 11:14:18AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 09:55:38AM -0700, Paul E. McKenney wrote:
> > This commit changes lockdep splats to begin lines with "WARNING" and
> > to use pr_warn() instead of printk(). This change eases scripted
> > analysis of kernel console output.
> >
> > Reported-by: Dmitry Vyukov <[email protected]>
> > Reported-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Acked-by: Dmitry Vyukov <[email protected]>
> > ---
> > kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
> > kernel/locking/rtmutex-debug.c | 9 +++--
> > 2 files changed, 48 insertions(+), 47 deletions(-)
> >
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index a95e5d1f4a9c..e9d4f85b290c 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
> > return 0;
> >
> > printk("\n");
> > - printk("======================================================\n");
> > - printk("[ INFO: possible circular locking dependency detected ]\n");
> > + pr_warn("======================================================\n");
> > + pr_warn("WARNING: possible circular locking dependency detected\n");
> > print_kernel_ident();
> > - printk("-------------------------------------------------------\n");
> > + pr_warn("------------------------------------------------------\n");
> > printk("%s/%d is trying to acquire lock:\n",
> > curr->comm, task_pid_nr(curr));
> > print_lock(check_src);
>
> Blergh, not a fan of this patch. Now we have an odd mix of pr_crap() and
> printk().

Would you be OK with all of the lockdep messages being updated to
WARNING rather than just the RCU-related ones?

Thanx, Paul

2017-04-13 16:03:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 11:15:35AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 09:55:40AM -0700, Paul E. McKenney wrote:
> > If you set RCU_FANOUT_LEAF too high, you can get lock contention
> > on the leaf rcu_node, and you should boot with the skew_tick kernel
> > parameter set in order to avoid this lock contention. This commit
> > therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> > this.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > init/Kconfig | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index a92f27da4a27..946e561e67b7 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> > initialization. These systems tend to run CPU-bound, and thus
> > are not helped by synchronized interrupts, and thus tend to
> > skew them, which reduces lock contention enough that large
> > - leaf-level fanouts work well.
> > + leaf-level fanouts work well. That said, setting leaf-level
> > + fanout to a large number will likely cause problematic
> > + lock contention on the leaf-level rcu_node structures unless
> > + you boot with the skew_tick kernel parameter.
>
> Why mention a way out of a problem you shouldn't have to begin with?
>
> Just state its bad and result in lock contention and leave it at that.

To avoid people tuning huge machines having to wait for me to give
them an answer as to why they are suffering lock contention after
cranking up the value of RCU_FANOUT_LEAF.

Or am I missing your point?

Thanx, Paul

2017-04-13 16:10:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > However, a little future-proofing is a good thing,
> > especially given that smp_mb__before_atomic() is only required to
> > provide acquire semantics rather than full ordering. This commit
> > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > in sync_exp_work_done().
>
> Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
> provide full MB, no confusion about that.
>
> We have other primitives for acquire/release.

Hmmm... Rechecking atomic_ops.txt, it does appear that you are quite
correct. Adding Will and Dmitry on CC, but dropping this patch for now.

Thanx, Paul

2017-04-13 16:17:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Thu, Apr 13, 2017 at 01:06:56PM +0200, Vlastimil Babka wrote:
> On 04/13/2017 11:12 AM, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 09:55:37AM -0700, Paul E. McKenney wrote:
> >> A group of Linux kernel hackers reported chasing a bug that resulted
> >> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
> >> guarantee, that is, that no block from such a slab would be reallocated
> >> during an RCU read-side critical section. Of course, that is not the
> >> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
> >> slab of blocks.
> >
> > And that while we wrote a huge honking comment right along with it...
> >
> >> [ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]
> >
> > I cannot find any occurrence of "tomb" or "TOMB" in the actual patch,
> > confused?
>
> It's the comments such as:
>
> + * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
>
> so that people who remember the old name can git grep its fate.

git log -S SLAB_DESTROY_BY_RCU


2017-04-13 16:20:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 09:03:33AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 11:15:35AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 09:55:40AM -0700, Paul E. McKenney wrote:
> > > If you set RCU_FANOUT_LEAF too high, you can get lock contention
> > > on the leaf rcu_node, and you should boot with the skew_tick kernel
> > > parameter set in order to avoid this lock contention. This commit
> > > therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> > > this.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > ---
> > > init/Kconfig | 10 ++++++++--
> > > 1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index a92f27da4a27..946e561e67b7 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> > > initialization. These systems tend to run CPU-bound, and thus
> > > are not helped by synchronized interrupts, and thus tend to
> > > skew them, which reduces lock contention enough that large
> > > - leaf-level fanouts work well.
> > > + leaf-level fanouts work well. That said, setting leaf-level
> > > + fanout to a large number will likely cause problematic
> > > + lock contention on the leaf-level rcu_node structures unless
> > > + you boot with the skew_tick kernel parameter.
> >
> > Why mention a way out of a problem you shouldn't have to begin with?
> >
> > Just state its bad and result in lock contention and leave it at that.
>
> To avoid people tuning huge machines having to wait for me to give
> them an answer as to why they are suffering lock contention after
> cranking up the value of RCU_FANOUT_LEAF.
>
> Or am I missing your point?

Your answer should be: don't do that then. Not provide them a shady
workaround.

tick skew isn't pretty and has other problems (there's a reason it's not
on by default). You're then doing two things you shouldn't.

2017-04-13 16:24:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 09:10:42AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > > However, a little future-proofing is a good thing,
> > > especially given that smp_mb__before_atomic() is only required to
> > > provide acquire semantics rather than full ordering. This commit
> > > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > > in sync_exp_work_done().
> >
> > Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
> > provide full MB, no confusion about that.
> >
> > We have other primitives for acquire/release.
>
> Hmmm... Rechecking atomic_ops.txt, it does appear that you are quite
> correct. Adding Will and Dmitry on CC, but dropping this patch for now.

I'm afraid that document is woefully outdated. I'm surprised it says
anything on the subject.

2017-04-13 16:55:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 06:19:48PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 09:03:33AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 11:15:35AM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 09:55:40AM -0700, Paul E. McKenney wrote:
> > > > If you set RCU_FANOUT_LEAF too high, you can get lock contention
> > > > on the leaf rcu_node, and you should boot with the skew_tick kernel
> > > > parameter set in order to avoid this lock contention. This commit
> > > > therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> > > > this.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > ---
> > > > init/Kconfig | 10 ++++++++--
> > > > 1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > index a92f27da4a27..946e561e67b7 100644
> > > > --- a/init/Kconfig
> > > > +++ b/init/Kconfig
> > > > @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> > > > initialization. These systems tend to run CPU-bound, and thus
> > > > are not helped by synchronized interrupts, and thus tend to
> > > > skew them, which reduces lock contention enough that large
> > > > - leaf-level fanouts work well.
> > > > + leaf-level fanouts work well. That said, setting leaf-level
> > > > + fanout to a large number will likely cause problematic
> > > > + lock contention on the leaf-level rcu_node structures unless
> > > > + you boot with the skew_tick kernel parameter.
> > >
> > > Why mention a way out of a problem you shouldn't have to begin with?
> > >
> > > Just state its bad and result in lock contention and leave it at that.
> >
> > To avoid people tuning huge machines having to wait for me to give
> > them an answer as to why they are suffering lock contention after
> > cranking up the value of RCU_FANOUT_LEAF.
> >
> > Or am I missing your point?
>
> Your answer should be: don't do that then. Not provide them a shady work
> around.
>
> tick skew isn't pretty and has other problems (there's a reason its not
> on by default). You're then doing two things you shouldn't.

The tick skew problem that I know of is energy efficiency for light
workloads. This doesn't normally apply to the large heavily loaded
systems on which people skew ticks.

So what are the other problems?

Thanx, Paul

2017-04-13 16:58:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 06:24:09PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 09:10:42AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > > > However, a little future-proofing is a good thing,
> > > > especially given that smp_mb__before_atomic() is only required to
> > > > provide acquire semantics rather than full ordering. This commit
> > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > > > in sync_exp_work_done().
> > >
> > > Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
> > > provide full MB, no confusion about that.
> > >
> > > We have other primitives for acquire/release.
> >
> > Hmmm... Rechecking atomic_ops.txt, it does appear that you are quite
> > correct. Adding Will and Dmitry on CC, but dropping this patch for now.
>
> I'm afraid that document is woefully out dated. I'm surprised it says
> anything on the subject.

And there is some difference of opinion. Some believe that the
smp_mb__before_atomic() only guarantees acquire and smp_mb__after_atomic()
only guarantees release, but all current architectures provide full
ordering, as you noted and as stated in atomic_ops.txt.

How do we decide?

Once we do decide, atomic_ops.txt of course needs to be updated accordingly.

Thanx, Paul

2017-04-13 17:04:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 09:55:16AM -0700, Paul E. McKenney wrote:
> > > To avoid people tuning huge machines having to wait for me to give
> > > them an answer as to why they are suffering lock contention after
> > > cranking up the value of RCU_FANOUT_LEAF.

So is there a good reason to increase FANOUT_LEAF ?

> > > Or am I missing your point?
> >
> > Your answer should be: don't do that then. Not provide them a shady work
> > around.
> >
> > tick skew isn't pretty and has other problems (there's a reason its not
> > on by default). You're then doing two things you shouldn't.
>
> The tick skew problem that I know of is energy efficiency for light
> workloads. This doesn't normally apply to the large heavily loaded
> systems on which people skew ticks.
>
> So what are the other problems?

If the jiffy updater bounces between CPUs (as is not uncommon) the
duration of the jiffy becomes an average (I think we fixed it where it
could go too fast by always jumping to a CPU which has a short jiffy,
but I'm not sure).

This further complicates some of the jiffy based loops (which arguably
should go away anyway).

And I have vague memories of it actually causing lock contention, but
I've forgotten how that worked.

2017-04-13 17:10:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 09:57:55AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 06:24:09PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 09:10:42AM -0700, Paul E. McKenney wrote:
> > > On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > > > > However, a little future-proofing is a good thing,
> > > > > especially given that smp_mb__before_atomic() is only required to
> > > > > provide acquire semantics rather than full ordering. This commit
> > > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > > > > in sync_exp_work_done().
> > > >
> > > > Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
> > > > provide full MB, no confusion about that.
> > > >
> > > > We have other primitives for acquire/release.
> > >
> > > Hmmm... Rechecking atomic_ops.txt, it does appear that you are quite
> > > correct. Adding Will and Dmitry on CC, but dropping this patch for now.
> >
> > I'm afraid that document is woefully out dated. I'm surprised it says
> > anything on the subject.
>
> And there is some difference of opinion. Some believe that the
> smp_mb__before_atomic() only guarantees acquire and smp_mb__after_atomic()
> only guarantees release, but all current architectures provide full
> ordering, as you noted and as stated in atomic_ops.txt.

Which 'some' think it only provides acquire/release ?

I made very sure -- when I renamed/audited/wrote all this -- that they
indeed do a full memory barrier.

> How do we decide?

I say it's a full mb, always was.

People used it to create acquire/release _like_ constructs, because we
simply didn't have anything else.

Also, I think Linus once opined that acquire/release is part of a
store/load (hence smp_store_release/smp_load_acquire) and not a barrier.

> Once we do decide, atomic_ops.txt of course needs to be updated accordingly.

There was so much missing there that I didn't quite know where to start.
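
The dedicated primitives referred to above attach the ordering to the store or
load itself. A minimal sketch with hypothetical variables (not from any patch
in this series):

	static int data;
	static int flag;

	static void producer(void)
	{
		data = 42;
		smp_store_release(&flag, 1);	/* publish data */
	}

	static void consumer(void)
	{
		if (smp_load_acquire(&flag))	/* pairs with the release above */
			BUG_ON(data != 42);	/* guaranteed to see the store to data */
	}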

2017-04-13 17:24:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Thu, Apr 13, 2017 at 06:17:09PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 01:06:56PM +0200, Vlastimil Babka wrote:
> > On 04/13/2017 11:12 AM, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 09:55:37AM -0700, Paul E. McKenney wrote:
> > >> A group of Linux kernel hackers reported chasing a bug that resulted
> > >> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
> > >> guarantee, that is, that no block from such a slab would be reallocated
> > >> during an RCU read-side critical section. Of course, that is not the
> > >> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
> > >> slab of blocks.
> > >
> > > And that while we wrote a huge honking comment right along with it...
> > >
> > >> [ paulmck: Add "tombstone" comments as requested by Eric Dumazet. ]
> > >
> > > I cannot find any occurrence of "tomb" or "TOMB" in the actual patch,
> > > confused?
> >
> > It's the comments such as:
> >
> > + * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
> >
> > so that people who remember the old name can git grep its fate.
>
> git log -S SLAB_DESTROY_BY_RCU

My (perhaps naive) hope is that having more than one path to
the information will reduce the number of "Whatever happened to
SLAB_DESTROY_BY_RCU?" queries.

Thanx, Paul

2017-04-13 17:31:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 07:04:34PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 09:55:16AM -0700, Paul E. McKenney wrote:
> > > > To avoid people tuning huge machines having to wait for me to give
> > > > them an answer as to why they are suffering lock contention after
> > > > cranking up the value of RCU_FANOUT_LEAF.
>
> So is there a good reason to increase FANOUT_LEAF ?

Increasing it reduces the number of rcu_node structures, and thus the
number of cache misses during grace-period initialization and cleanup.
This has proven necessary in the past on large machines having long
memory latencies. And there are starting to be some pretty big machines
running in production, and even for typical commercial workloads.

> > > > Or am I missing your point?
> > >
> > > Your answer should be: don't do that then. Not provide them a shady work
> > > around.
> > >
> > > tick skew isn't pretty and has other problems (there's a reason its not
> > > on by default). You're then doing two things you shouldn't.
> >
> > The tick skew problem that I know of is energy efficiency for light
> > workloads. This doesn't normally apply to the large heavily loaded
> > systems on which people skew ticks.
> >
> > So what are the other problems?
>
> If the jiffy updater bounces between CPUs (as is not uncommon) the
> duration of the jiffy becomes an average (I think we fixed it where it
> could go too fast by always jumping to a CPU which has a short jiffy,
> but I'm not sure).
>
> This further complicates some of the jiffy based loops (which arguably
> should go away anyway).

Fair enough. And I am OK with jiffies going away as long as there
is a low-overhead rough-and-ready timing mechanism replacing it.

> And I have vague memories of it actually causing lock contention, but
> I've forgotten how that worked.

That is a new one on me. I can easily see how not skewing ticks could
cause serious lock contention, but am missing how skewed ticks would
do so.

Thanx, Paul
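
To put rough numbers on the rcu_node argument above (illustrative only): the
number of leaf rcu_node structures is about NR_CPUS / RCU_FANOUT_LEAF, so a
hypothetical 4096-CPU system has roughly 4096 / 16 = 256 leaves at the default
leaf fanout of 16, but only 4096 / 64 = 64 leaves with RCU_FANOUT_LEAF=64;
that many fewer structures for grace-period initialization and cleanup to
touch, at the cost of more CPUs contending for each remaining leaf's lock.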

2017-04-13 17:40:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 07:10:27PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 09:57:55AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 06:24:09PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 13, 2017 at 09:10:42AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Apr 13, 2017 at 11:18:32AM +0200, Peter Zijlstra wrote:
> > > > > On Wed, Apr 12, 2017 at 09:55:43AM -0700, Paul E. McKenney wrote:
> > > > > > However, a little future-proofing is a good thing,
> > > > > > especially given that smp_mb__before_atomic() is only required to
> > > > > > provide acquire semantics rather than full ordering. This commit
> > > > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc()
> > > > > > in sync_exp_work_done().
> > > > >
> > > > > Oh!? As far as I'm away the smp_mb__{before,after}_atomic() really must
> > > > > provide full MB, no confusion about that.
> > > > >
> > > > > We have other primitives for acquire/release.
> > > >
> > > > Hmmm... Rechecking atomic_ops.txt, it does appear that you are quite
> > > > correct. Adding Will and Dmitry on CC, but dropping this patch for now.
> > >
> > > I'm afraid that document is woefully out dated. I'm surprised it says
> > > anything on the subject.
> >
> > And there is some difference of opinion. Some believe that the
> > smp_mb__before_atomic() only guarantees acquire and smp_mb__after_atomic()
> > only guarantees release, but all current architectures provide full
> > ordering, as you noted and as stated in atomic_ops.txt.
>
> Which 'some' think it only provides acquire/release ?
>
> I made very sure -- when I renamed/audited/wrote all this -- that they
> indeed do a full memory barrier.
>
> > How do we decide?
>
> I say its a full mb, always was.
>
> People used it to create acquire/release _like_ constructs, because we
> simply didn't have anything else.
>
> Also, I think Linus once opined that acquire/release is part of a
> store/load (hence smp_store_release/smp_load_acquire) and not a barrier.
>
> > Once we do decide, atomic_ops.txt of course needs to be updated accordingly.
>
> There was so much missing there that I didn't quite know where to start.

Well, if there are no objections, I will fix up the smp_mb__before_atomic()
and smp_mb__after_atomic() pieces.

I suppose that one alternative is the new variant of kerneldoc, though
very few of these functions have comment headers, let alone kerneldoc
headers. Which reminds me, the question of spin_unlock_wait() and
spin_is_locked() semantics came up a bit ago. Here is what I believe
to be the case. Does this match others' expectations?

o spin_unlock_wait() semantics:

1. Any access in any critical section prior to the
spin_unlock_wait() is visible to all code following
(in program order) the spin_unlock_wait().

2. Any access prior (in program order) to the
spin_unlock_wait() is visible to any critical
section following the spin_unlock_wait().

o spin_is_locked() semantics: Half of spin_unlock_wait(),
but only if it returns false:

1. Any access in any critical section prior to the
spin_is_locked() is visible to all code following
(in program order) the spin_is_locked().

Thanx, Paul
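
A minimal sketch of guarantee 1 above, litmus-test style, with hypothetical
variables and helpers (not code from the kernel):

	static DEFINE_SPINLOCK(l);
	static int x;

	static void cpu0(void)			/* runs a critical section */
	{
		spin_lock(&l);
		x = 1;
		spin_unlock(&l);
	}

	static void cpu1(void)			/* waits the lock out */
	{
		int r1;

		spin_unlock_wait(&l);
		r1 = x;		/* if cpu0()'s critical section preceded the
				 * spin_unlock_wait(), guarantee 1 says r1 == 1 */
	}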

2017-04-13 17:46:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 10:31:00AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 07:04:34PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 09:55:16AM -0700, Paul E. McKenney wrote:
> > > > > To avoid people tuning huge machines having to wait for me to give
> > > > > them an answer as to why they are suffering lock contention after
> > > > > cranking up the value of RCU_FANOUT_LEAF.
> >
> > So is there a good reason to increase FANOUT_LEAF ?
>
> Increasing it reduces the number of rcu_node structures, and thus the
> number of cache misses during grace-period initialization and cleanup.
> This has proven necessary in the past on large machines having long
> memory latencies. And there are starting to be some pretty big machines
> running in production, and even for typical commerical workloads.

Is that perhaps a good moment to look at aligning the cpus in said nodes
with the cache topology?

2017-04-13 17:51:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 10:39:51AM -0700, Paul E. McKenney wrote:

> Well, if there are no objections, I will fix up the smp_mb__before_atomic()
> and smp_mb__after_atomic() pieces.

Feel free.

> I suppose that one alternative is the new variant of kerneldoc, though
> very few of these functions have comment headers, let alone kerneldoc
> headers. Which reminds me, the question of spin_unlock_wait() and
> spin_is_locked() semantics came up a bit ago. Here is what I believe
> to be the case. Does this match others' expectations?
>
> o spin_unlock_wait() semantics:
>
> 1. Any access in any critical section prior to the
> spin_unlock_wait() is visible to all code following
> (in program order) the spin_unlock_wait().
>
> 2. Any access prior (in program order) to the
> spin_unlock_wait() is visible to any critical
> section following the spin_unlock_wait().
>
> o spin_is_locked() semantics: Half of spin_unlock_wait(),
> but only if it returns false:
>
> 1. Any access in any critical section prior to the
> spin_is_locked() is visible to all code following
> (in program order) the spin_is_locked().

Urgh.. yes, those are a pain. The best advice is to not use them.

055ce0fd1b86 ("locking/qspinlock: Add comments")


2017-04-13 17:59:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 07:51:36PM +0200, Peter Zijlstra wrote:

> > I suppose that one alternative is the new variant of kerneldoc, though
> > very few of these functions have comment headers, let alone kerneldoc
> > headers. Which reminds me, the question of spin_unlock_wait() and
> > spin_is_locked() semantics came up a bit ago. Here is what I believe
> > to be the case. Does this match others' expectations?
> >
> > o spin_unlock_wait() semantics:
> >
> > 1. Any access in any critical section prior to the
> > spin_unlock_wait() is visible to all code following
> > (in program order) the spin_unlock_wait().
> >
> > 2. Any access prior (in program order) to the
> > spin_unlock_wait() is visible to any critical
> > section following the spin_unlock_wait().
> >
> > o spin_is_locked() semantics: Half of spin_unlock_wait(),
> > but only if it returns false:
> >
> > 1. Any access in any critical section prior to the
> > spin_is_locked() is visible to all code following
> > (in program order) the spin_is_locked().
>
> Urgh.. yes those are pain. The best advise is to not use them.
>
> 055ce0fd1b86 ("locking/qspinlock: Add comments")

The big problem with spin_unlock_wait(), aside from the icky barrier
semantics, is that it tends to end up prone to starvation. So where
spin_lock()+spin_unlock() have guaranteed fwd progress if the lock is
fair (ticket,queued,etc..) spin_unlock_wait() must often lack that
guarantee.

Equally, spin_unlock_wait() was intended to be 'cheap' and be a
read-only loop, but in order to satisfy the barrier requirements, it
ends up doing stores anyway (see for example the arm64 and ppc
implementations).



2017-04-13 18:19:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 07:46:31PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:31:00AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 07:04:34PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 13, 2017 at 09:55:16AM -0700, Paul E. McKenney wrote:
> > > > > > To avoid people tuning huge machines having to wait for me to give
> > > > > > them an answer as to why they are suffering lock contention after
> > > > > > cranking up the value of RCU_FANOUT_LEAF.
> > >
> > > So is there a good reason to increase FANOUT_LEAF ?
> >
> > Increasing it reduces the number of rcu_node structures, and thus the
> > number of cache misses during grace-period initialization and cleanup.
> > This has proven necessary in the past on large machines having long
> > memory latencies. And there are starting to be some pretty big machines
> > running in production, and even for typical commerical workloads.
>
> Is that perhaps a good moment to look at aligning the cpus in said nodes
> with the cache topology?

We have been here before, haven't we? Over and over again. ;-)

As always...

First get me some system-level data showing that the current layout is
causing a real problem. RCU's fastpath code doesn't come anywhere near
the rcu_node tree, so in the absence of such data, I of course remain
quite doubtful that there is a real need. And painfully aware of the
required increase in complexity.

But if there is a real need demonstrated by real system-level data,
I will of course make the needed changes, as I have done many times in
the past in response to other requests.

Thanx, Paul

2017-04-13 18:23:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 11:19:26AM -0700, Paul E. McKenney wrote:

> First get me some system-level data showing that the current layout is
> causing a real problem. RCU's fastpath code doesn't come anywhere near
> the rcu_node tree, so in the absence of such data, I of course remain
> quite doubtful that there is a real need. And painfully aware of the
> required increase in complexity.
>
> But if there is a real need demonstrated by real system-level data,
> I will of course make the needed changes, as I have done many times in
> the past in response to other requests.

I read what you wrote here:

> > > Increasing it reduces the number of rcu_node structures, and thus the
> > > number of cache misses during grace-period initialization and cleanup.
> > > This has proven necessary in the past on large machines having long
> > > memory latencies. And there are starting to be some pretty big machines
> > > running in production, and even for typical commerical workloads.

to mean you had exactly that pain. Or am I now totally not understanding
you?

2017-04-13 18:29:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 10:31:00AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 07:04:34PM +0200, Peter Zijlstra wrote:

> > And I have vague memories of it actually causing lock contention, but
> > I've forgotten how that worked.
>
> That is a new one on me. I can easily see how not skewing ticks could
> cause serious lock contention, but am missing how skewed ticks would
> do so.

It could've been something like cacheline bouncing. Where with a
synchronized tick, the (global) cacheline would get used by all CPUs on
a node before heading out to the next node etc.. Where with a skewed
tick, it would forever bounce around.

2017-04-13 18:42:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 08:23:09PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 11:19:26AM -0700, Paul E. McKenney wrote:
>
> > First get me some system-level data showing that the current layout is
> > causing a real problem. RCU's fastpath code doesn't come anywhere near
> > the rcu_node tree, so in the absence of such data, I of course remain
> > quite doubtful that there is a real need. And painfully aware of the
> > required increase in complexity.
> >
> > But if there is a real need demonstrated by real system-level data,
> > I will of course make the needed changes, as I have done many times in
> > the past in response to other requests.
>
> I read what you wrote here:
>
> > > > Increasing it reduces the number of rcu_node structures, and thus the
> > > > number of cache misses during grace-period initialization and cleanup.
> > > > This has proven necessary in the past on large machines having long
> > > > memory latencies. And there are starting to be some pretty big machines
> > > > running in production, and even for typical commerical workloads.
>
> to mean you had exactly that pain. Or am I now totally not understanding
> you?

I believe that you are missing the fact that RCU grace-period
initialization and cleanup walks through the rcu_node tree breadth
first, using rcu_for_each_node_breadth_first(). This macro (shown below)
implements this breadth-first walk using a simple sequential traversal of
the ->node[] array that provides the structures making up the rcu_node
tree. As you can see, this scan is completely independent of how CPU
numbers might be mapped to rcu_data slots in the leaf rcu_node structures.

Thanx, Paul

/*
 * Do a full breadth-first scan of the rcu_node structures for the
 * specified rcu_state structure.
 */
#define rcu_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[rcu_num_nodes]; (rnp)++)
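
An abridged sketch of how that walk is used (not the literal rcu_gp_init()
code): each iteration simply advances to the next element of the ->node[]
array, independent of which CPUs happen to map to which leaf.

	rcu_for_each_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irq_rcu_node(rnp);
		/* per-node grace-period bookkeeping goes here */
		raw_spin_unlock_irq_rcu_node(rnp);
	}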

2017-04-13 19:42:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 08:29:39PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:31:00AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 07:04:34PM +0200, Peter Zijlstra wrote:
>
> > > And I have vague memories of it actually causing lock contention, but
> > > I've forgotten how that worked.
> >
> > That is a new one on me. I can easily see how not skewing ticks could
> > cause serious lock contention, but am missing how skewed ticks would
> > do so.
>
> It could've been something like cacheline bouncing. Where with a
> synchronized tick, the (global) cacheline would get used by all CPUs on
> a node before heading out to the next node etc.. Where with a skewed
> tick, it would forever bounce around.

In other words, motivating the order of the skewed ticks to be guided
by hardware locality?

Thanx, Paul

2017-04-13 21:30:25

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Thu, Apr 13, 2017 at 9:17 AM, Peter Zijlstra <[email protected]> wrote:

> git log -S SLAB_DESTROY_BY_RCU

Maybe, but "git log -S" is damn slow at least here.

While "git grep" is _very_ fast

2017-04-14 08:46:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Thu, Apr 13, 2017 at 02:30:19PM -0700, Eric Dumazet wrote:
> On Thu, Apr 13, 2017 at 9:17 AM, Peter Zijlstra <[email protected]> wrote:
>
> > git log -S SLAB_DESTROY_BY_RCU
>
> Maybe, but "git log -S" is damn slow at least here.
>
> While "git grep" is _very_ fast

All true. But in general we don't leave endless markers around like
this.

For instance:

/* the function formerly known as smp_mb__before_clear_bit() */

is not part of the kernel tree. People that used that thing out of tree
get to deal with it in whatever way they see fit.

2017-04-14 13:40:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 01/13] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Fri, Apr 14, 2017 at 10:45:44AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 02:30:19PM -0700, Eric Dumazet wrote:
> > On Thu, Apr 13, 2017 at 9:17 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > > git log -S SLAB_DESTROY_BY_RCU
> >
> > Maybe, but "git log -S" is damn slow at least here.
> >
> > While "git grep" is _very_ fast
>
> All true. But in general we don't leave endless markers around like
> this.
>
> For instance:
>
> /* the function formerly known as smp_mb__before_clear_bit() */
>
> is not part of the kernel tree. People that used that thing out of tree
> get to deal with it in whatever way they see fit.

Sometimes we don't provide markers and sometimes we do:

$ git grep synchronize_kernel
Documentation/RCU/RTFP.txt:,Title="API change: synchronize_kernel() deprecated"
Documentation/RCU/RTFP.txt: Jon Corbet describes deprecation of synchronize_kernel()
kernel/rcu/tree.c: * synchronize_kernel() API. In contrast, synchronize_rcu() only

Given that it has been more than a decade, I could easily see my way to
removing this synchronize_kernel() tombstone in kernel/rcu/tree.c if
people are annoyed by it. But thus far, no one has complained.

So how long should we wait to remove the SLAB_DESTROY_BY_RCU tombstone?
I can easily add an event to my calendar to remind me to remove it.

Thanx, Paul

2017-04-17 23:27:24

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

Hello!

This v2 series contains the following fixes:

1. Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU.

2. Use "WARNING" tag on RCU's lockdep splats.

3. Update obsolete callback_head comment.

4. Make RCU_FANOUT_LEAF help text more explicit about skew_tick.

5. Remove obsolete comment from rcu_future_gp_cleanup() header.

6. Disable sparse warning emitted by hlist_add_tail_rcu(), courtesy
of Michael S. Tsirkin.

7. Improve comments for hotplug/suspend/hibernate functions.

8. Use correct path for Kconfig fragment for duplicate rcutorture
test scenarios.

9. Use bool value directly for ->beenonline comparison, courtesy
of Nicholas Mc Guire.

10. Use true/false in assignment to bool variable rcu_nocb_poll,
courtesy of Nicholas Mc Guire.

11. Fix typo in PER_RCU_NODE_PERIOD header comment.

Changes since v1:

o Applied review feedback from Peter Zijlstra, Vlastimil Babka,
and Eric Dumazet.

o Dropped v1 patch #7 ("Add smp_mb__after_atomic() to
sync_exp_work_done()"), as ensuing discussion confirmed that
smp_mb__before_atomic() guarantees a full barrier.

o Moved v1 patch #9 ("Use static initialization for "srcu" in
mm/mmu_notifier.c") to the srcu series because 0day Test Robot
showed that it needs to be there.

Thanx, Paul

------------------------------------------------------------------------

Documentation/RCU/00-INDEX | 2
Documentation/RCU/rculist_nulls.txt | 6 -
Documentation/RCU/whatisRCU.txt | 3
drivers/gpu/drm/i915/i915_gem.c | 2
drivers/gpu/drm/i915/i915_gem_request.h | 2
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2
fs/jbd2/journal.c | 2
fs/signalfd.c | 2
include/linux/dma-fence.h | 4
include/linux/rculist.h | 3
include/linux/slab.h | 6 -
include/linux/types.h | 2
include/net/sock.h | 2
init/Kconfig | 10 +
kernel/fork.c | 4
kernel/locking/lockdep.c | 86 +++++++--------
kernel/locking/rtmutex-debug.c | 9 -
kernel/rcu/tree.c | 49 ++++++--
kernel/rcu/tree_plugin.h | 2
kernel/signal.c | 2
mm/kasan/kasan.c | 6 -
mm/kmemcheck.c | 2
mm/rmap.c | 4
mm/slab.c | 6 -
mm/slab.h | 4
mm/slab_common.c | 6 -
mm/slob.c | 6 -
mm/slub.c | 12 +-
net/dccp/ipv4.c | 2
net/dccp/ipv6.c | 2
net/ipv4/tcp_ipv4.c | 2
net/ipv6/tcp_ipv6.c | 2
net/llc/af_llc.c | 2
net/llc/llc_conn.c | 4
net/llc/llc_sap.c | 2
net/netfilter/nf_conntrack_core.c | 8 -
net/smc/af_smc.c | 2
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2
38 files changed, 158 insertions(+), 116 deletions(-)

2017-04-17 23:29:06

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 03/11] types: Update obsolete callback_head comment

The comment header for callback_head (and thus for rcu_head) states that
the bottom two bits of a pointer to these structures must be zero. This
is obsolete: The new requirement is that only the bottom bit need be
zero. This commit therefore updates this comment.

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 1e7bd24848fc..258099a4ed82 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -209,7 +209,7 @@ struct ustat {
* naturally due ABI requirements, but some architectures (like CRIS) have
* weird ABI and we need to ask it explicitly.
*
- * The alignment is required to guarantee that bits 0 and 1 of @next will be
+ * The alignment is required to guarantee that bit 0 of @next will be
* clear under normal conditions -- as long as we use call_rcu(),
* call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue callback.
*
--
2.5.2
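
To make the alignment requirement concrete: with at least two-byte alignment,
bit 0 of any pointer to a callback_head is zero, leaving that bit free for
internal tagging. A small sketch with hypothetical helpers (not kernel API):

	static inline struct callback_head *cb_tag(struct callback_head *rhp)
	{
		BUILD_BUG_ON(__alignof__(struct callback_head) < 2);
		return (struct callback_head *)((unsigned long)rhp | 0x1UL);
	}

	static inline struct callback_head *cb_untag(struct callback_head *rhp)
	{
		return (struct callback_head *)((unsigned long)rhp & ~0x1UL);
	}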

2017-04-17 23:29:11

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 06/11] hlist_add_tail_rcu disable sparse warning

From: "Michael S. Tsirkin" <[email protected]>

sparse is unhappy about this code in hlist_add_tail_rcu:

	struct hlist_node *i, *last = NULL;

	for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
		last = i;

This is because hlist_first_rcu and hlist_next_rcu return
__rcu pointers.

It's a false positive - it's a write side primitive and so
does not need to be called in a read side critical section.

The following trivial patch disables the warning
without changing the behaviour in any way.

Note: __hlist_for_each_rcu would also remove the warning but it would be
confusing since it calls rcu_dereference and is designed to run in an RCU
read-side critical section.

Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rculist.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 4f7a9561b8c4..b1fd8bf85fdc 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -509,7 +509,8 @@ static inline void hlist_add_tail_rcu(struct hlist_node *n,
{
struct hlist_node *i, *last = NULL;

- for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
+ /* Note: write side code, so rcu accessors are not needed. */
+ for (i = h->first; i; i = i->next)
last = i;

if (last) {
--
2.5.2
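
For context on the note above, a minimal sketch of the write-side/read-side
split (hypothetical struct item, item_add(), item_present(); not code from the
patch): hlist_add_tail_rcu() runs on the update side under the writer's lock,
while readers traverse with the _rcu iterators under rcu_read_lock().

	struct item {				/* hypothetical */
		int			val;
		struct hlist_node	node;
	};

	static HLIST_HEAD(items);
	static DEFINE_SPINLOCK(items_lock);

	static void item_add(struct item *it)	/* update side */
	{
		spin_lock(&items_lock);
		hlist_add_tail_rcu(&it->node, &items);
		spin_unlock(&items_lock);
	}

	static bool item_present(int val)	/* read side */
	{
		struct item *it;
		bool found = false;

		rcu_read_lock();
		hlist_for_each_entry_rcu(it, &items, node) {
			if (it->val == val) {
				found = true;
				break;
			}
		}
		rcu_read_unlock();
		return found;
	}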

2017-04-17 23:29:08

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 11/11] rcu: Fix typo in PER_RCU_NODE_PERIOD header comment

This commit just changes a "the the" to "the" to reduce repetition.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7c238604df18..b1679e8cc5ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -199,7 +199,7 @@ static const int gp_cleanup_delay;

/*
* Number of grace periods between delays, normalized by the duration of
- * the delay. The longer the the delay, the more the grace periods between
+ * the delay. The longer the delay, the more the grace periods between
* each delay. The reason for this normalization is that it means that,
* for non-zero delays, the overall slowdown of grace periods is constant
* regardless of the duration of the delay. This arrangement balances
--
2.5.2

2017-04-17 23:29:43

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 07/11] rcu: Improve comments for hotplug/suspend/hibernate functions

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bdaa69d23a8a..c4f195dd7c94 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3894,6 +3894,10 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * Invoked early in the CPU-online process, when pretty much all
+ * services are available. The incoming CPU is not present.
+ */
int rcutree_prepare_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3907,6 +3911,9 @@ int rcutree_prepare_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Update RCU priority boot kthread affinity for CPU-hotplug changes.
+ */
static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state_p->rda, cpu);
@@ -3914,6 +3921,10 @@ static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
rcu_boost_kthread_setaffinity(rdp->mynode, outgoing);
}

+/*
+ * Near the end of the CPU-online process. Pretty much all services
+ * enabled, and the CPU is now very much alive.
+ */
int rcutree_online_cpu(unsigned int cpu)
{
sync_sched_exp_online_cleanup(cpu);
@@ -3921,13 +3932,19 @@ int rcutree_online_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Near the beginning of the process. The CPU is still very much alive
+ * with pretty much all services enabled.
+ */
int rcutree_offline_cpu(unsigned int cpu)
{
rcutree_affinity_setting(cpu, cpu);
return 0;
}

-
+/*
+ * Near the end of the offline process. We do only tracing here.
+ */
int rcutree_dying_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3937,6 +3954,9 @@ int rcutree_dying_cpu(unsigned int cpu)
return 0;
}

+/*
+ * The outgoing CPU is gone and we are running elsewhere.
+ */
int rcutree_dead_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3954,6 +3974,10 @@ int rcutree_dead_cpu(unsigned int cpu)
* incoming CPUs are not allowed to use RCU read-side critical sections
* until this function is called. Failing to observe this restriction
* will result in lockdep splats.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the incoming CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
*/
void rcu_cpu_starting(unsigned int cpu)
{
@@ -3979,9 +4003,6 @@ void rcu_cpu_starting(unsigned int cpu)
* The CPU is exiting the idle loop into the arch_cpu_idle_dead()
* function. We now remove it from the rcu_node tree's ->qsmaskinit
* bit masks.
- * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
- * function. We now remove it from the rcu_node tree's ->qsmaskinit
- * bit masks.
*/
static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
{
@@ -3997,6 +4018,14 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * The outgoing function has no further need of RCU, so remove it from
+ * the list of CPUs that RCU must track.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the outgoing CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
+ */
void rcu_report_dead(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -4011,6 +4040,10 @@ void rcu_report_dead(unsigned int cpu)
}
#endif

+/*
+ * On non-huge systems, use expedited RCU grace periods to make suspend
+ * and hibernation run faster.
+ */
static int rcu_pm_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
--
2.5.2

2017-04-17 23:29:48

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 02/11] lockdep: Use "WARNING" tag on lockdep splats

This commit changes lockdep splats to begin lines with "WARNING" and
to use pr_warn() instead of printk(). This change eases scripted
analysis of kernel console output.

Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
---
kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
kernel/locking/rtmutex-debug.c | 9 +++--
2 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index a95e5d1f4a9c..e9d4f85b290c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: possible circular locking dependency detected ]\n");
+ pr_warn("======================================================\n");
+ pr_warn("WARNING: possible circular locking dependency detected\n");
print_kernel_ident();
- printk("-------------------------------------------------------\n");
+ pr_warn("------------------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(check_src);
@@ -1482,11 +1482,11 @@ print_bad_irq_dependency(struct task_struct *curr,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+ pr_warn("=====================================================\n");
+ pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
irqclass, irqclass);
print_kernel_ident();
- printk("------------------------------------------------------\n");
+ pr_warn("-----------------------------------------------------\n");
printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
curr->comm, task_pid_nr(curr),
curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
@@ -1711,10 +1711,10 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
return 0;

printk("\n");
- printk("=============================================\n");
- printk("[ INFO: possible recursive locking detected ]\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: possible recursive locking detected\n");
print_kernel_ident();
- printk("---------------------------------------------\n");
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(next);
@@ -2061,10 +2061,10 @@ static void print_collision(struct task_struct *curr,
struct lock_chain *chain)
{
printk("\n");
- printk("======================\n");
- printk("[chain_key collision ]\n");
+ pr_warn("============================\n");
+ pr_warn("WARNING: chain_key collision\n");
print_kernel_ident();
- printk("----------------------\n");
+ pr_warn("----------------------------\n");
printk("%s/%d: ", current->comm, task_pid_nr(current));
printk("Hash chain already cached but the contents don't match!\n");

@@ -2360,10 +2360,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ INFO: inconsistent lock state ]\n");
+ pr_warn("================================\n");
+ pr_warn("WARNING: inconsistent lock state\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("--------------------------------\n");

printk("inconsistent {%s} -> {%s} usage.\n",
usage_str[prev_bit], usage_str[new_bit]);
@@ -2425,10 +2425,10 @@ print_irq_inversion_bug(struct task_struct *curr,
return 0;

printk("\n");
- printk("=========================================================\n");
- printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+ pr_warn("========================================================\n");
+ pr_warn("WARNING: possible irq lock inversion dependency detected\n");
print_kernel_ident();
- printk("---------------------------------------------------------\n");
+ pr_warn("--------------------------------------------------------\n");
printk("%s/%d just changed the state of lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(this);
@@ -3170,10 +3170,10 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
return 0;

printk("\n");
- printk("==================================\n");
- printk("[ BUG: Nested lock was not taken ]\n");
+ pr_warn("==================================\n");
+ pr_warn("WARNING: Nested lock was not taken\n");
print_kernel_ident();
- printk("----------------------------------\n");
+ pr_warn("----------------------------------\n");

printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
print_lock(hlock);
@@ -3383,10 +3383,10 @@ print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: bad unlock balance detected! ]\n");
+ pr_warn("=====================================\n");
+ pr_warn("WARNING: bad unlock balance detected!\n");
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("-------------------------------------\n");
printk("%s/%d is trying to release lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -3880,10 +3880,10 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ BUG: bad contention detected! ]\n");
+ pr_warn("=================================\n");
+ pr_warn("WARNING: bad contention detected!\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("---------------------------------\n");
printk("%s/%d is trying to contend lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -4244,10 +4244,10 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
return;

printk("\n");
- printk("=========================\n");
- printk("[ BUG: held lock freed! ]\n");
+ pr_warn("=========================\n");
+ pr_warn("WARNING: held lock freed!\n");
print_kernel_ident();
- printk("-------------------------\n");
+ pr_warn("-------------------------\n");
printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
print_lock(hlock);
@@ -4302,11 +4302,11 @@ static void print_held_locks_bug(void)
return;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: %s/%d still has locks held! ]\n",
+ pr_warn("====================================\n");
+ pr_warn("WARNING: %s/%d still has locks held!\n",
current->comm, task_pid_nr(current));
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("------------------------------------\n");
lockdep_print_held_locks(current);
printk("\nstack backtrace:\n");
dump_stack();
@@ -4371,7 +4371,7 @@ void debug_show_all_locks(void)
} while_each_thread(g, p);

printk("\n");
- printk("=============================================\n\n");
+ pr_warn("=============================================\n\n");

if (unlock)
read_unlock(&tasklist_lock);
@@ -4401,10 +4401,10 @@ asmlinkage __visible void lockdep_sys_exit(void)
if (!debug_locks_off())
return;
printk("\n");
- printk("================================================\n");
- printk("[ BUG: lock held when returning to user space! ]\n");
+ pr_warn("================================================\n");
+ pr_warn("WARNING: lock held when returning to user space!\n");
print_kernel_ident();
- printk("------------------------------------------------\n");
+ pr_warn("------------------------------------------------\n");
printk("%s/%d is leaving the kernel with locks still held!\n",
curr->comm, curr->pid);
lockdep_print_held_locks(curr);
@@ -4421,13 +4421,13 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
#endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
/* Note: the following can be executed concurrently, so be careful. */
printk("\n");
- pr_err("===============================\n");
- pr_err("[ ERR: suspicious RCU usage. ]\n");
+ pr_warn("=============================\n");
+ pr_warn("WARNING: suspicious RCU usage\n");
print_kernel_ident();
- pr_err("-------------------------------\n");
- pr_err("%s:%d %s!\n", file, line, s);
- pr_err("\nother info that might help us debug this:\n\n");
- pr_err("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+ pr_warn("-----------------------------\n");
+ printk("%s:%d %s!\n", file, line, s);
+ printk("\nother info that might help us debug this:\n\n");
+ printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
!rcu_lockdep_current_cpu_online()
? "RCU used illegally from offline CPU!\n"
: !rcu_is_watching()
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 97ee9df32e0f..db4f55211b04 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -102,10 +102,11 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
return;
}

- printk("\n============================================\n");
- printk( "[ BUG: circular locking deadlock detected! ]\n");
- printk("%s\n", print_tainted());
- printk( "--------------------------------------------\n");
+ pr_warn("\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: circular locking deadlock detected!\n");
+ pr_warn("%s\n", print_tainted());
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is deadlocking current task %s/%d\n\n",
task->comm, task_pid_nr(task),
current->comm, task_pid_nr(current));
--
2.5.2

2017-04-17 23:29:41

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 01/11] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.
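
For illustration, here is a hedged sketch of the access pattern this flag
actually supports (struct foo, foo_cachep, foo_put(), and the key field are
hypothetical, not taken from any of the files changed below): the reader
takes a reference with atomic_inc_not_zero() and then rechecks identity,
because the block it found may already have been freed and reallocated as
a different object of the same type.

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	atomic_t refcnt;
	int key;
};

static struct kmem_cache *foo_cachep;	/* created with SLAB_TYPESAFE_BY_RCU */

static void foo_put(struct foo *foo)
{
	/* Immediate free is fine: only the underlying slab page is RCU-deferred. */
	if (atomic_dec_and_test(&foo->refcnt))
		kmem_cache_free(foo_cachep, foo);
}

static struct foo *foo_get_checked(struct foo *candidate, int key)
{
	struct foo *ret = NULL;

	rcu_read_lock();
	if (atomic_inc_not_zero(&candidate->refcnt)) {
		if (candidate->key == key)
			ret = candidate;	/* still the object we wanted */
		else
			foo_put(candidate);	/* block was reused: drop the reference */
	}
	rcu_read_unlock();
	return ret;
}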

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
---
Documentation/RCU/00-INDEX | 2 +-
Documentation/RCU/rculist_nulls.txt | 6 +++---
Documentation/RCU/whatisRCU.txt | 3 ++-
drivers/gpu/drm/i915/i915_gem.c | 2 +-
drivers/gpu/drm/i915/i915_gem_request.h | 2 +-
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2 +-
fs/jbd2/journal.c | 2 +-
fs/signalfd.c | 2 +-
include/linux/dma-fence.h | 4 ++--
include/linux/slab.h | 6 ++++--
include/net/sock.h | 2 +-
kernel/fork.c | 4 ++--
kernel/signal.c | 2 +-
mm/kasan/kasan.c | 6 +++---
mm/kmemcheck.c | 2 +-
mm/rmap.c | 4 ++--
mm/slab.c | 6 +++---
mm/slab.h | 4 ++--
mm/slab_common.c | 6 +++---
mm/slob.c | 6 +++---
mm/slub.c | 12 ++++++------
net/dccp/ipv4.c | 2 +-
net/dccp/ipv6.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
net/llc/af_llc.c | 2 +-
net/llc/llc_conn.c | 4 ++--
net/llc/llc_sap.c | 2 +-
net/netfilter/nf_conntrack_core.c | 8 ++++----
net/smc/af_smc.c | 2 +-
30 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index f773a264ae02..1672573b037a 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -17,7 +17,7 @@ rcu_dereference.txt
rcubarrier.txt
- RCU and Unloadable Modules
rculist_nulls.txt
- - RCU list primitives for use with SLAB_DESTROY_BY_RCU
+ - RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
rcuref.txt
- Reference-count design for elements of lists/arrays protected by RCU
rcu.txt
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 18f9651ff23d..8151f0195f76 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -1,5 +1,5 @@
Using hlist_nulls to protect read-mostly linked lists and
-objects using SLAB_DESTROY_BY_RCU allocations.
+objects using SLAB_TYPESAFE_BY_RCU allocations.

Please read the basics in Documentation/RCU/listRCU.txt

@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
to solve following problem :

A typical RCU linked list managing objects which are
-allocated with SLAB_DESTROY_BY_RCU kmem_cache can
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :

1) Lookup algo
@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
-But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
+But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)

if (put_last_reference_on(obj) {
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5cbd8b2395b8..91c912e86915 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -925,7 +925,8 @@ d. Do you need RCU grace periods to complete even in the face

e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
- If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
+ If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
+ named SLAB_DESTROY_BY_RCU). But please be careful!

f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6908123162d1..3b668895ac24 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4552,7 +4552,7 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
dev_priv->requests = KMEM_CACHE(drm_i915_gem_request,
SLAB_HWCACHE_ALIGN |
SLAB_RECLAIM_ACCOUNT |
- SLAB_DESTROY_BY_RCU);
+ SLAB_TYPESAFE_BY_RCU);
if (!dev_priv->requests)
goto err_vmas;

diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index ea511f06efaf..9ee2750e1dde 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -493,7 +493,7 @@ static inline struct drm_i915_gem_request *
__i915_gem_active_get_rcu(const struct i915_gem_active *active)
{
/* Performing a lockless retrieval of the active request is super
- * tricky. SLAB_DESTROY_BY_RCU merely guarantees that the backing
+ * tricky. SLAB_TYPESAFE_BY_RCU merely guarantees that the backing
* slab of request objects will not be freed whilst we hold the
* RCU read lock. It does not guarantee that the request itself
* will not be freed and then *reused*. Viz,
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
index 12647af5a336..e7fb47e84a93 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
@@ -1071,7 +1071,7 @@ int ldlm_init(void)
ldlm_lock_slab = kmem_cache_create("ldlm_locks",
sizeof(struct ldlm_lock), 0,
SLAB_HWCACHE_ALIGN |
- SLAB_DESTROY_BY_RCU, NULL);
+ SLAB_TYPESAFE_BY_RCU, NULL);
if (!ldlm_lock_slab) {
kmem_cache_destroy(ldlm_resource_slab);
return -ENOMEM;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..7f8f962454e5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2340,7 +2340,7 @@ static int jbd2_journal_init_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
- SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
+ SLAB_TEMPORARY | SLAB_TYPESAFE_BY_RCU,
NULL); /* ctor */
retval = 0;
if (!jbd2_journal_head_cache) {
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 270221fcef42..7e3d71109f51 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -38,7 +38,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
/*
* The lockless check can race with remove_wait_queue() in progress,
* but in this case its caller should run under rcu_read_lock() and
- * sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU, we can safely return.
*/
if (likely(!waitqueue_active(wqh)))
return;
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 6048fa404e57..a5195a7d6f77 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -229,7 +229,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence)
*
* Function returns NULL if no refcount could be obtained, or the fence.
* This function handles acquiring a reference to a fence that may be
- * reallocated within the RCU grace period (such as with SLAB_DESTROY_BY_RCU),
+ * reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU),
* so long as the caller is using RCU on the pointer to the fence.
*
* An alternative mechanism is to employ a seqlock to protect a bunch of
@@ -257,7 +257,7 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
* have successfully acquire a reference to it. If it no
* longer matches, we are holding a reference to some other
* reallocated pointer. This is possible if the allocator
- * is using a freelist like SLAB_DESTROY_BY_RCU where the
+ * is using a freelist like SLAB_TYPESAFE_BY_RCU where the
* fence remains valid for the RCU grace period, but it
* may be reallocated. When using such allocators, we are
* responsible for ensuring the reference we get is to
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3c37a8c51921..04a7f7993e67 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -28,7 +28,7 @@
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
/*
- * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
+ * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
*
* This delays freeing the SLAB page by a grace period, it does _NOT_
* delay object freeing. This means that if you do kmem_cache_free()
@@ -61,8 +61,10 @@
*
* rcu_read_lock before reading the address, then rcu_read_unlock after
* taking the spinlock within the structure expected at that address.
+ *
+ * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
*/
-#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
+#define SLAB_TYPESAFE_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */

diff --git a/include/net/sock.h b/include/net/sock.h
index 5e5997654db6..59cdccaa30e7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -993,7 +993,7 @@ struct smc_hashinfo;
struct module;

/*
- * caches using SLAB_DESTROY_BY_RCU should let .next pointer from nulls nodes
+ * caches using SLAB_TYPESAFE_BY_RCU should let .next pointer from nulls nodes
* un-modified. Special care is taken when initializing object to zero.
*/
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..9330ce24f1bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1313,7 +1313,7 @@ void __cleanup_sighand(struct sighand_struct *sighand)
if (atomic_dec_and_test(&sighand->count)) {
signalfd_cleanup(sighand);
/*
- * sighand_cachep is SLAB_DESTROY_BY_RCU so we can free it
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it
* without an RCU grace period, see __lock_task_sighand().
*/
kmem_cache_free(sighand_cachep, sighand);
@@ -2144,7 +2144,7 @@ void __init proc_caches_init(void)
{
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU|
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_NOTRACK|SLAB_ACCOUNT, sighand_ctor);
signal_cachep = kmem_cache_create("signal_cache",
sizeof(struct signal_struct), 0,
diff --git a/kernel/signal.c b/kernel/signal.c
index 7e59ebc2c25e..6df5f72158e4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1237,7 +1237,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
}
/*
* This sighand can be already freed and even reused, but
- * we rely on SLAB_DESTROY_BY_RCU and sighand_ctor() which
+ * we rely on SLAB_TYPESAFE_BY_RCU and sighand_ctor() which
* initializes ->siglock: this slab can't go away, it has
* the same object type, ->siglock can't be reinitialized.
*
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 98b27195e38b..4b20061102f6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -413,7 +413,7 @@ void kasan_cache_create(struct kmem_cache *cache, size_t *size,
*size += sizeof(struct kasan_alloc_meta);

/* Add free meta. */
- if (cache->flags & SLAB_DESTROY_BY_RCU || cache->ctor ||
+ if (cache->flags & SLAB_TYPESAFE_BY_RCU || cache->ctor ||
cache->object_size < sizeof(struct kasan_free_meta)) {
cache->kasan_info.free_meta_offset = *size;
*size += sizeof(struct kasan_free_meta);
@@ -561,7 +561,7 @@ static void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE);

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return;

kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
@@ -572,7 +572,7 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object)
s8 shadow_byte;

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;

shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
diff --git a/mm/kmemcheck.c b/mm/kmemcheck.c
index 5bf191756a4a..2d5959c5f7c5 100644
--- a/mm/kmemcheck.c
+++ b/mm/kmemcheck.c
@@ -95,7 +95,7 @@ void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object,
void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size)
{
/* TODO: RCU freeing is unsupported for now; hide false positives. */
- if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (!s->ctor && !(s->flags & SLAB_TYPESAFE_BY_RCU))
kmemcheck_mark_freed(object, size);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ed681ccc7b..8ffd59df8a3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -430,7 +430,7 @@ static void anon_vma_ctor(void *data)
void __init anon_vma_init(void)
{
anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
- 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
+ 0, SLAB_TYPESAFE_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
anon_vma_ctor);
anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain,
SLAB_PANIC|SLAB_ACCOUNT);
@@ -481,7 +481,7 @@ struct anon_vma *page_get_anon_vma(struct page *page)
* If this page is still mapped, then its anon_vma cannot have been
* freed. But if it has been unmapped, we have no security against the
* anon_vma structure being freed and reused (for another anon_vma:
- * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero()
+ * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
* above cannot corrupt).
*/
if (!page_mapped(page)) {
diff --git a/mm/slab.c b/mm/slab.c
index 807d86c76908..93c827864862 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1728,7 +1728,7 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)

freelist = page->freelist;
slab_destroy_debugcheck(cachep, page);
- if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cachep->flags & SLAB_TYPESAFE_BY_RCU))
call_rcu(&page->rcu_head, kmem_rcu_free);
else
kmem_freepages(cachep, page);
@@ -1924,7 +1924,7 @@ static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,

cachep->num = 0;

- if (cachep->ctor || flags & SLAB_DESTROY_BY_RCU)
+ if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
return false;

left = calculate_slab_order(cachep, size,
@@ -2030,7 +2030,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
- if (!(flags & SLAB_DESTROY_BY_RCU))
+ if (!(flags & SLAB_TYPESAFE_BY_RCU))
flags |= SLAB_POISON;
#endif
#endif
diff --git a/mm/slab.h b/mm/slab.h
index 65e7c3fcac72..9cfcf099709c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -126,7 +126,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,

/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
- SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
+ SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS )

#if defined(CONFIG_DEBUG_SLAB)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
@@ -415,7 +415,7 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
* back there or track user information then we can
* only use the space before that information.
*/
- if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ if (s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_STORE_USER))
return s->inuse;
/*
* Else we can use all the padding etc for the allocation
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 09d0e849b07f..01a0fe2eb332 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -39,7 +39,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
* Set of flags that will prevent slab merging
*/
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
- SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
+ SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
SLAB_FAILSLAB | SLAB_KASAN)

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
@@ -500,7 +500,7 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
struct kmem_cache *s, *s2;

/*
- * On destruction, SLAB_DESTROY_BY_RCU kmem_caches are put on the
+ * On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the
* @slab_caches_to_rcu_destroy list. The slab pages are freed
* through RCU and and the associated kmem_cache are dereferenced
* while freeing the pages, so the kmem_caches should be freed only
@@ -537,7 +537,7 @@ static int shutdown_cache(struct kmem_cache *s)
memcg_unlink_cache(s);
list_del(&s->list);

- if (s->flags & SLAB_DESTROY_BY_RCU) {
+ if (s->flags & SLAB_TYPESAFE_BY_RCU) {
list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
schedule_work(&slab_caches_to_rcu_destroy_work);
} else {
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4357ec..1bae78d71096 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -126,7 +126,7 @@ static inline void clear_slob_page_free(struct page *sp)

/*
* struct slob_rcu is inserted at the tail of allocated slob blocks, which
- * were created with a SLAB_DESTROY_BY_RCU slab. slob_rcu is used to free
+ * were created with a SLAB_TYPESAFE_BY_RCU slab. slob_rcu is used to free
* the block using call_rcu.
*/
struct slob_rcu {
@@ -524,7 +524,7 @@ EXPORT_SYMBOL(ksize);

int __kmem_cache_create(struct kmem_cache *c, unsigned long flags)
{
- if (flags & SLAB_DESTROY_BY_RCU) {
+ if (flags & SLAB_TYPESAFE_BY_RCU) {
/* leave room for rcu footer at the end of object */
c->size += sizeof(struct slob_rcu);
}
@@ -598,7 +598,7 @@ static void kmem_rcu_free(struct rcu_head *head)
void kmem_cache_free(struct kmem_cache *c, void *b)
{
kmemleak_free_recursive(b, c->flags);
- if (unlikely(c->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(c->flags & SLAB_TYPESAFE_BY_RCU)) {
struct slob_rcu *slob_rcu;
slob_rcu = b + (c->size - sizeof(struct slob_rcu));
slob_rcu->size = c->size;
diff --git a/mm/slub.c b/mm/slub.c
index 7f4bc7027ed5..57e5156f02be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1687,7 +1687,7 @@ static void rcu_free_slab(struct rcu_head *h)

static void free_slab(struct kmem_cache *s, struct page *page)
{
- if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
struct rcu_head *head;

if (need_reserve_slab_rcu) {
@@ -2963,7 +2963,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
* slab_free_freelist_hook() could have put the items into quarantine.
* If so, no need to free them.
*/
- if (s->flags & SLAB_KASAN && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (s->flags & SLAB_KASAN && !(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
do_slab_free(s, page, head, tail, cnt, addr);
}
@@ -3433,7 +3433,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
* the slab may touch the object after free or before allocation
* then we should never poison the object itself.
*/
- if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
+ if ((flags & SLAB_POISON) && !(flags & SLAB_TYPESAFE_BY_RCU) &&
!s->ctor)
s->flags |= __OBJECT_POISON;
else
@@ -3455,7 +3455,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
*/
s->inuse = size;

- if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
+ if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
s->ctor)) {
/*
* Relocate free pointer after the object if it is not
@@ -3537,7 +3537,7 @@ static int kmem_cache_open(struct kmem_cache *s, unsigned long flags)
s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
s->reserved = 0;

- if (need_reserve_slab_rcu && (s->flags & SLAB_DESTROY_BY_RCU))
+ if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU))
s->reserved = sizeof(struct rcu_head);

if (!calculate_sizes(s, -1))
@@ -5042,7 +5042,7 @@ SLAB_ATTR_RO(cache_dma);

static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU));
}
SLAB_ATTR_RO(destroy_by_rcu);

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 409d0cfd3447..90210a0e3888 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -950,7 +950,7 @@ static struct proto dccp_v4_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp_request_sock_ops,
.twsk_prot = &dccp_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 233b57367758..b4019a5e4551 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1012,7 +1012,7 @@ static struct proto dccp_v6_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp6_request_sock_ops,
.twsk_prot = &dccp6_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9a89b8deafae..82c89abeb989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2398,7 +2398,7 @@ struct proto tcp_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp_timewait_sock_ops,
.rsk_prot = &tcp_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 60a5295a7de6..bdbc4327ebee 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1919,7 +1919,7 @@ struct proto tcpv6_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp6_timewait_sock_ops,
.rsk_prot = &tcp6_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 06186d608a27..d096ca563054 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -142,7 +142,7 @@ static struct proto llc_proto = {
.name = "LLC",
.owner = THIS_MODULE,
.obj_size = sizeof(struct llc_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};

/**
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 8bc5a1bd2d45..9b02c13d258b 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -506,7 +506,7 @@ static struct sock *__llc_lookup_established(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_estab_match(sap, daddr, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
@@ -565,7 +565,7 @@ static struct sock *__llc_lookup_listener(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_listener_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c
index 5404d0d195cc..63b6ab056370 100644
--- a/net/llc/llc_sap.c
+++ b/net/llc/llc_sap.c
@@ -328,7 +328,7 @@ static struct sock *llc_lookup_dgram(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_dgram_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 071b97fcbefb..fdcdac7916b2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -914,7 +914,7 @@ static unsigned int early_drop_list(struct net *net,
continue;

/* kill only if still in same netns -- might have moved due to
- * SLAB_DESTROY_BY_RCU rules.
+ * SLAB_TYPESAFE_BY_RCU rules.
*
* We steal the timer reference. If that fails timer has
* already fired or someone else deleted it. Just drop ref
@@ -1069,7 +1069,7 @@ __nf_conntrack_alloc(struct net *net,

/*
* Do not use kmem_cache_zalloc(), as this cache uses
- * SLAB_DESTROY_BY_RCU.
+ * SLAB_TYPESAFE_BY_RCU.
*/
ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
if (ct == NULL)
@@ -1114,7 +1114,7 @@ void nf_conntrack_free(struct nf_conn *ct)
struct net *net = nf_ct_net(ct);

/* A freed object has refcnt == 0, that's
- * the golden rule for SLAB_DESTROY_BY_RCU
+ * the golden rule for SLAB_TYPESAFE_BY_RCU
*/
NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);

@@ -1878,7 +1878,7 @@ int nf_conntrack_init_start(void)
nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
sizeof(struct nf_conn),
NFCT_INFOMASK + 1,
- SLAB_DESTROY_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
+ SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
if (!nf_conntrack_cachep)
goto err_cachep;

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 85837ab90e89..d34bbd6d8f38 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -101,7 +101,7 @@ struct proto smc_proto = {
.unhash = smc_unhash_sk,
.obj_size = sizeof(struct smc_sock),
.h.smc_hash = &smc_v4_hashinfo,
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};
EXPORT_SYMBOL_GPL(smc_proto);

--
2.5.2

2017-04-17 23:30:01

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 10/11] rcu: Use true/false in assignment to bool

From: Nicholas Mc Guire <[email protected]>

This commit makes the parse_rcu_nocb_poll() function assign true
(rather than the constant 1) to the bool variable rcu_nocb_poll.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0a62a8f1caac..f4b7a9be1a44 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1709,7 +1709,7 @@ __setup("rcu_nocbs=", rcu_nocb_setup);

static int __init parse_rcu_nocb_poll(char *arg)
{
- rcu_nocb_poll = 1;
+ rcu_nocb_poll = true;
return 0;
}
early_param("rcu_nocb_poll", parse_rcu_nocb_poll);
--
2.5.2

2017-04-17 23:30:42

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 09/11] rcu: Use bool value directly

From: Nicholas Mc Guire <[email protected]>

The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c4f195dd7c94..7c238604df18 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3085,7 +3085,7 @@ __rcu_process_callbacks(struct rcu_state *rsp)
bool needwake;
struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

- WARN_ON_ONCE(rdp->beenonline == 0);
+ WARN_ON_ONCE(!rdp->beenonline);

/* Update RCU state based on any recent quiescent states. */
rcu_check_quiescent_state(rsp, rdp);
--
2.5.2

2017-04-17 23:30:45

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 05/11] rcu: Remove obsolete comment from rcu_future_gp_cleanup() header

The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 50fee7689e71..bdaa69d23a8a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1793,9 +1793,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,

/*
* Clean up any old requests for the just-ended grace period. Also return
- * whether any additional grace periods have been requested. Also invoke
- * rcu_nocb_gp_cleanup() in order to wake up any no-callbacks kthreads
- * waiting for this grace period to complete.
+ * whether any additional grace periods have been requested.
*/
static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
{
--
2.5.2

2017-04-17 23:30:39

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 04/11] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

If you set RCU_FANOUT_LEAF too high, you can get lock contention
on the leaf rcu_node, and you should boot with the skew_tick kernel
parameter set in order to avoid this lock contention. This commit
therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
this.

Signed-off-by: Paul E. McKenney <[email protected]>
---
init/Kconfig | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a92f27da4a27..946e561e67b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
initialization. These systems tend to run CPU-bound, and thus
are not helped by synchronized interrupts, and thus tend to
skew them, which reduces lock contention enough that large
- leaf-level fanouts work well.
+ leaf-level fanouts work well. That said, setting leaf-level
+ fanout to a large number will likely cause problematic
+ lock contention on the leaf-level rcu_node structures unless
+ you boot with the skew_tick kernel parameter.

Select a specific number if testing RCU itself.

- Select the maximum permissible value for large systems.
+ Select the maximum permissible value for large systems, but
+ please understand that you may also need to set the
+ skew_tick kernel boot parameter to avoid contention
+ on the rcu_node structure's locks.

Take the default if unsure.

--
2.5.2

2017-04-17 23:31:20

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v2 tip/core/rcu 08/11] torture: Use correct path for Kconfig fragment for duplicates

Currently, the rcutorture scripting will give an error message if
running a duplicate scenario that happens also to have a non-existent
build directory (b1, b2, ... in the rcutorture directory). Worse yet, if
the build directory has already been created and used for a real build,
the script will silently grab the wrong Kconfig fragment, which could
cause confusion to the poor sap (me) analyzing old test results. At
least the actual test runs correctly...

This commit therefore accesses the Kconfig fragment from the results
directory corresponding to the first of the duplicate scenarios, for
which a build was actually carried out. This prevents both the messages
and at least one form of later confusion.

Signed-off-by: Paul E. McKenney <[email protected]>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index ea6e373edc27..93eede4e8fbe 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -170,7 +170,7 @@ qemu_append="`identify_qemu_append "$QEMU"`"
# Pull in Kconfig-fragment boot parameters
boot_args="`configfrag_boot_params "$boot_args" "$config_template"`"
# Generate kernel-version-specific boot parameters
-boot_args="`per_version_boot_params "$boot_args" $builddir/.config $seconds`"
+boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`"

if test -n "$TORTURE_BUILDONLY"
then
--
2.5.2

2017-04-18 00:14:32

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 01/11] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

On Mon, 17 Apr 2017, Paul E. McKenney wrote:

> A group of Linux kernel hackers reported chasing a bug that resulted
> from their assumption that SLAB_DESTROY_BY_RCU provided an existence
> guarantee, that is, that no block from such a slab would be reallocated
> during an RCU read-side critical section. Of course, that is not the
> case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
> slab of blocks.
>
> However, there is a phrase for this, namely "type safety". This commit
> therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
> to avoid future instances of this sort of confusion.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> [ paulmck: Add comments mentioning the old name, as requested by Eric
> Dumazet, in order to help people familiar with the old name find
> the new one. ]

Acked-by: David Rientjes <[email protected]>

2017-04-18 00:18:38

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 04/11] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Mon, Apr 17, 2017 at 04:28:51PM -0700, Paul E. McKenney wrote:
> If you set RCU_FANOUT_LEAF too high, you can get lock contention
> on the leaf rcu_node, and you should boot with the skew_tick kernel
> parameter set in order to avoid this lock contention. This commit
> therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> this.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> init/Kconfig | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index a92f27da4a27..946e561e67b7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> initialization. These systems tend to run CPU-bound, and thus
> are not helped by synchronized interrupts, and thus tend to
> skew them, which reduces lock contention enough that large
> - leaf-level fanouts work well.
> + leaf-level fanouts work well. That said, setting leaf-level
> + fanout to a large number will likely cause problematic
> + lock contention on the leaf-level rcu_node structures unless
> + you boot with the skew_tick kernel parameter.
>
> Select a specific number if testing RCU itself.
>
> - Select the maximum permissible value for large systems.
> + Select the maximum permissible value for large systems, but
> + please understand that you may also need to set the
> + skew_tick kernel boot parameter to avoid contention
> + on the rcu_node structure's locks.

Nit: the indentation seems wrong here.

2017-04-18 18:42:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 04/11] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Mon, Apr 17, 2017 at 05:18:25PM -0700, Josh Triplett wrote:
> On Mon, Apr 17, 2017 at 04:28:51PM -0700, Paul E. McKenney wrote:
> > If you set RCU_FANOUT_LEAF too high, you can get lock contention
> > on the leaf rcu_node, and you should boot with the skew_tick kernel
> > parameter set in order to avoid this lock contention. This commit
> > therefore upgrades the RCU_FANOUT_LEAF help text to explicitly state
> > this.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > init/Kconfig | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index a92f27da4a27..946e561e67b7 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
> > initialization. These systems tend to run CPU-bound, and thus
> > are not helped by synchronized interrupts, and thus tend to
> > skew them, which reduces lock contention enough that large
> > - leaf-level fanouts work well.
> > + leaf-level fanouts work well. That said, setting leaf-level
> > + fanout to a large number will likely cause problematic
> > + lock contention on the leaf-level rcu_node structures unless
> > + you boot with the skew_tick kernel parameter.
> >
> > Select a specific number if testing RCU itself.
> >
> > - Select the maximum permissible value for large systems.
> > + Select the maximum permissible value for large systems, but
> > + please understand that you may also need to set the
> > + skew_tick kernel boot parameter to avoid contention
> > + on the rcu_node structure's locks.
>
> Nit: the indentation seems wrong here.

Right you are! Fixed, thank you!

Thanx, Paul

2017-04-19 11:29:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12


So the thing Maz complained about is because KVM assumes
synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
This series 'breaks' that.

I've not looked hard enough at the new SRCU to see if its possible to
re-instate that feature.

2017-04-19 11:35:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
>
> So the thing Maz complained about is because KVM assumes
> synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> This series 'breaks' that.


Could've been call_srcu() instead. Looking at the code, it is the
sp->running case that triggers here and slows us down. That is, !running
will queue and insta-complete the callback, resulting in done=true and no
waiting.

>
> I've not looked hard enough at the new SRCU to see if its possible to
> re-instate that feature.

2017-04-19 11:48:25

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
>
> So the thing Maz complained about is because KVM assumes
> synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> This series 'breaks' that.

Why is such a behaviour change not mentioned in the cover letter?
I could not find anything in the patch descriptions that would
indicate a slowdown. How much slower did it get?

But indeed, there are several places at KVM startup which have been
reworked to srcu since normal rcu was too slow for several usecases.
(Mostly registering devices and related data structures at startup,
basically the qemu/kvm coldplug interaction)
>

> I've not looked hard enough at the new SRCU to see if its possible to
> re-instate that feature.
>


2017-04-19 12:09:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
> On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
> >
> > So the thing Maz complained about is because KVM assumes
> > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > This series 'breaks' that.
>
> Why is such a behaviour change not mentioned in the cover letter?
> I could not find anything in the patch descriptions that would
> indicate a slowdown. How much slower did it get?
>
> But indeed, there are several places at KVM startup which have been
> reworked to srcu since normal rcu was too slow for several usecases.
> (Mostly registering devices and related data structures at startup,
> basically the qemu/kvm coldplug interaction)

I suspect Paul is not considering this a 'normal' RCU feature, and
therefore didn't think about changing this.

I know I was fairly surprised by this requirement when I ran into it;
and only accidentally remembered it now that maz complained.

2017-04-19 12:51:46

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On 19/04/17 13:08, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
>> On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
>>>
>>> So the thing Maz complained about is because KVM assumes
>>> synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
>>> This series 'breaks' that.
>>
>> Why is such a behaviour change not mentioned in the cover letter?
>> I could not find anything in the patch descriptions that would
>> indicate a slowdown. How much slower did it get?
>>
>> But indeed, there are several places at KVM startup which have been
>> reworked to srcu since normal rcu was too slow for several usecases.
>> (Mostly registering devices and related data structures at startup,
>> basically the qemu/kvm coldplug interaction)
>
> I suspect Paul is not considering this a 'normal' RCU feature, and
> therefore didn't think about changing this.
>
> I know I was fairly surprised by this requirement when I ran into it;
> and only accidentally remembered it now that maz complained.

The issue I noticed yesterday has been addressed here:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev.2017.04.17a&id=6eec94fe40e294b04d32c8ef552e28fa6159bdad

and was triggered by the constant mapping/unmapping of memslots that
QEMU triggers when emulating a NOR flash that UEFI uses for storing its
variables.

So far, I'm not seeing any other spectacular regression introduced by
this series.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2017-04-19 13:03:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
>
> So the thing Maz complained about is because KVM assumes
> synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> This series 'breaks' that.
>
> I've not looked hard enough at the new SRCU to see if its possible to
> re-instate that feature.

And with the fix I gave Maz, the parallelized version is near enough
to being free as well. It was just a stupid bug on my part: I forgot
to check for expedited when scheduling callbacks.

Thanx, Paul

2017-04-19 13:16:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 06:02:45AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
> >
> > So the thing Maz complained about is because KVM assumes
> > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > This series 'breaks' that.
> >
> > I've not looked hard enough at the new SRCU to see if its possible to
> > re-instate that feature.
>
> And with the fix I gave Maz, the parallelized version is near enough
> to being free as well. It was just a stupid bug on my part: I forgot
> to check for expedited when scheduling callbacks.

Right, although for the old SRCU it was true for !expedited as well.
Just turns out the KVM memslots crud already uses
synchronize_srcu_expedited().

<rant>without a friggin' comment; hate @expedited</rant>

2017-04-19 13:22:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
> On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
> >
> > So the thing Maz complained about is because KVM assumes
> > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > This series 'breaks' that.
>
> Why is such a behaviour change not mentioned in the cover letter?
> I could not find anything in the patch descriptions that would
> indicate a slowdown. How much slower did it get?

It was an 8x slowdown in boot time of a guest OS running UEFI, from
five seconds to forty seconds. The fix restored the original boot time.

Why didn't I report the slowdown in my cover letter? Because I didn't
realize that I had created such a stupid bug! ;-)

Why didn't my testing reveal the bug? Because in my rcutorture testing,
the buggy code runs about as fast as the original, and the fixed new code
runs about an order of magnitude faster. This is because rcutorture's
performance statistics are mostly sensitive to throughput, while Marc's
boot-time run is mostly sensitive to latency.

> But indeed, there are several places at KVM startup which have been
> reworked to srcu since normal rcu was too slow for several usecases.
> (Mostly registering devices and related data structures at startup,
> basically the qemu/kvm coldplug interaction)

And here is the patch that restored Marc's boot speed. It simply changes
the original (buggy) fixed delay for no delay in the expedited case and
the same fixed delay in the non-expedited case.

Thanx, Paul

------------------------------------------------------------------------

commit 66ae176ab33dd3afa0b944d149fe8240e65743f9
Author: Paul E. McKenney <[email protected]>
Date: Tue Apr 18 10:28:31 2017 -0700

srcu: Expedite srcu_schedule_cbs_snp() callback invocation

Although Tree SRCU does reduce delays when there is at least one
synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
still waits for SRCU_INTERVAL before invoking callbacks. Since
synchronize_srcu_expedited() now posts a callback and waits for
that callback to do a wakeup, this destroys the expedited nature of
synchronize_srcu_expedited(). This destruction became apparent to
Marc Zyngier in the guise of a guest-OS bootup slowdown from five
seconds to no fewer than forty seconds.

This commit therefore invokes callbacks immediately at the end of the
grace period when there is at least one synchronize_srcu_expedited()
invocation pending. This brought Marc's guest-OS bootup times back
into the realm of reason.
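
For context, a hedged sketch of why the fixed delay hurts (simplified,
with hypothetical names; this is not the kernel's implementation): the
expedited path described above posts a callback and sleeps until that
callback fires, so any delay before callback invocation is added directly
to the caller's wait.

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>

struct example_sync_waiter {
	struct rcu_head head;		/* callback linkage */
	struct completion done;		/* the caller sleeps on this */
};

static void example_wakeup_cb(struct rcu_head *head)
{
	struct example_sync_waiter *w =
		container_of(head, struct example_sync_waiter, head);

	complete(&w->done);		/* delaying this callback delays the waiter */
}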

Reported-by: Marc Zyngier <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Tested-by: Marc Zyngier <[email protected]>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index f9c684d79faa..e11b89a363f7 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -442,7 +442,8 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *sp, struct srcu_node *snp)
int cpu;

for (cpu = snp->grplo; cpu <= snp->grphi; cpu++)
- srcu_schedule_cbs_sdp(per_cpu_ptr(sp->sda, cpu), SRCU_INTERVAL);
+ srcu_schedule_cbs_sdp(per_cpu_ptr(sp->sda, cpu),
+ atomic_read(&sp->srcu_exp_cnt) ? 0 : SRCU_INTERVAL);
}

/*

2017-04-19 13:22:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:

> I believe that you are missing the fact that RCU grace-period
> initialization and cleanup walks through the rcu_node tree breadth
> first, using rcu_for_each_node_breadth_first().

Indeed. That is the part I completely missed.

> This macro (shown below)
> implements this breadth-first walk using a simple sequential traversal of
> the ->node[] array that provides the structures making up the rcu_node
> tree. As you can see, this scan is completely independent of how CPU
> numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
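
[Editorial note: the macro referred to above was trimmed from the quote;
the following is a rough sketch of the idea only, with made-up names,
not the kernel's actual definition.]

/*
 * The rcu_node structures live in a flat ->node[] array laid out level
 * by level (root first, leaves last), so a breadth-first walk is just a
 * linear scan of that array, independent of CPU-to-leaf mapping.
 */
struct example_rcu_node { int grplo, grphi; };
struct example_rcu_state {
	struct example_rcu_node node[1 + 2 + 8];	/* e.g. root + interior + leaves */
	int num_nodes;
};

#define example_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[0] + (rsp)->num_nodes; (rnp)++)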

So this code is clearly not a hotpath, but still its performance
matters?

Seems like you cannot win here :/

2017-04-19 13:26:01

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On 04/19/2017 03:22 PM, Paul E. McKenney wrote:
> On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
>> On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
>>>
>>> So the thing Maz complained about is because KVM assumes
>>> synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
>>> This series 'breaks' that.
>>
>> Why is such a behaviour change not mentioned in the cover letter?
>> I could not find anything in the patch descriptions that would
>> indicate a slowdown. How much slower did it get?
>
> It was an 8x slowdown in boot time of a guest OS running UEFI, from
> five seconds to forty seconds. The fix restored the original boot time.
>
> Why didn't I report the slowdown in my cover letter? Because I didn't
> realize that I had created such a stupid bug! ;-)
>
> Why didn't my testing reveal the bug? Because in my rcutorture testing,
> the buggy code runs about as fast as the original, and the fixed new code
> runs about an order of magnitude faster. This is because rcutorture's
> performance statistics are mostly sensitive to throughput, while Marc's
> boot-time run is mostly sensitive to latency.
>
>> But indeed, there are several places at KVM startup which have been
>> reworked to srcu since normal rcu was too slow for several usecases.
>> (Mostly registering devices and related data structures at startup,
>> basically the qemu/kvm coldplug interaction)
>
> And here is the patch that restored Marc's boot speed. It simply changes
> the original (buggy) fixed delay for no delay in the expedited case and
> the same fixed delay in the non-expedited case.
>
> Thanx, Paul

Ok, so it was not a fundamental rework, it was just a bug.
Then nevermind :-)



2017-04-19 13:48:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 19, 2017 at 03:22:26PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:
>
> > I believe that you are missing the fact that RCU grace-period
> > initialization and cleanup walks through the rcu_node tree breadth
> > first, using rcu_for_each_node_breadth_first().
>
> Indeed. That is the part I completely missed.
>
> > This macro (shown below)
> > implements this breadth-first walk using a simple sequential traversal of
> > the ->node[] array that provides the structures making up the rcu_node
> > tree. As you can see, this scan is completely independent of how CPU
> > numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
>
> So this code is clearly not a hotpath, but still its performance
> matters?
>
> Seems like you cannot win here :/

So I sort of see what that code does, but I cannot quite grasp from the
comments near there _why_ it is doing this.

My thinking is that normal (active CPUs) will update their state at tick
time through the tree, and once the state reaches the root node, IOW all
CPUs agree they've observed that particular state, we advance the global
state, rinse repeat. That's how tree-rcu works.

NOHZ-idle stuff would be excluded entirely; that is, if we're allowed to
go idle we're up-to-date, and completely drop out of the state tracking.
When we become active again, we can simply sync the CPU's state to the
active state and go from there -- ignoring whatever happened in the
mean-time.

So why do we have to do machine-wide updates? How can we get to the end
of a grace period without all CPUs already agreeing that it's complete?

/me puzzled.

2017-04-19 14:47:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 02:08:47PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
> > On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
> > >
> > > So the thing Maz complained about is because KVM assumes
> > > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > > This series 'breaks' that.
> >
> > Why is such a behaviour change not mentioned in the cover letter?
> > I could not find anything in the patch descriptions that would
> > indicate a slowdown. How much slower did it get?
> >
> > But indeed, there are several places at KVM startup which have been
> > reworked to srcu since normal rcu was too slow for several usecases.
> > (Mostly registering devices and related data structures at startup,
> > basically the qemu/kvm coldplug interaction)
>
> I suspect Paul is not considering this a 'normal' RCU feature, and
> therefore didn't think about changing this.
>
> I know I was fairly surprised by this requirement when I ran into it;
> and only accidentally remembered it now that maz complained.

Indeed -- the natural thing to have done back when KVM's scalability was
first being worked on would have been to simply change synchronize_rcu()
to synchronize_rcu_expedited(). However, at that time, these things
did try_stop_cpus() and the like, which was really bad for latency.
Moving to SRCU avoided this problem. Of course, now that KVM uses
SRCU, why change unless there is a problem? Besides, I vaguely recall
some KVM cases where srcu_read_lock() is used from CPUs that look to
be idle or offline from RCU's perspective, and that sort of thing only
works for SRCU.

Which reminds me...

The RCU expedited primitives have been completely rewritten since then,
and no longer use try_stop_cpus(), no longer disturb idle CPUs, and no
longer disturb nohz_full CPUs running in userspace. In addition, there
is the rcupdate.rcu_normal kernel boot parameter for those who want to
completely avoid RCU expedited primitives.

So it seems to me to be time for the patch below. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 333d383fad42b4bdef3d27d91e940a6eafed3f91
Author: Paul E. McKenney <[email protected]>
Date: Wed Apr 19 07:37:45 2017 -0700

checkpatch: Remove checks for expedited grace periods

There was a time when the expedited grace-period primitives
(synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), and
synchronize_sched_expedited()) used rather antisocial kernel
facilities like try_stop_cpus(). However, they have since been
housebroken to use only single-CPU IPIs, and typically cause less
disturbance than a scheduling-clock interrupt. Furthermore, this
disturbance can be eliminated entirely using NO_HZ_FULL on the
one hand or the rcupdate.rcu_normal boot parameter on the other.

This commit therefore removes checkpatch's complaints about use
of the expedited RCU primitives.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index baa3c7be04ad..64bf2a091368 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -5511,23 +5511,6 @@ sub process {
}
}

-# Check for expedited grace periods that interrupt non-idle non-nohz
-# online CPUs. These expedited can therefore degrade real-time response
-# if used carelessly, and should be avoided where not absolutely
-# needed. It is always OK to use synchronize_rcu_expedited() and
-# synchronize_sched_expedited() at boot time (before real-time applications
-# start) and in error situations where real-time response is compromised in
-# any case. Note that synchronize_srcu_expedited() does -not- interrupt
-# other CPUs, so don't warn on uses of synchronize_srcu_expedited().
-# Of course, nothing comes for free, and srcu_read_lock() and
-# srcu_read_unlock() do contain full memory barriers in payment for
-# synchronize_srcu_expedited() non-interruption properties.
- if ($line =~ /\b(synchronize_rcu_expedited|synchronize_sched_expedited)\(/) {
- WARN("EXPEDITED_RCU_GRACE_PERIOD",
- "expedited RCU grace periods should be avoided where they can degrade real-time response\n" . $herecurr);
-
- }
-
# check of hardware specific defines
if ($line =~ m@^.\s*\#\s*if.*\b(__i386__|__powerpc64__|__sun__|__s390x__)\b@ && $realfile !~ m@include/asm-@) {
CHK("ARCH_DEFINES",

2017-04-19 14:51:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 19, 2017 at 03:22:26PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:
>
> > I believe that you are missing the fact that RCU grace-period
> > initialization and cleanup walks through the rcu_node tree breadth
> > first, using rcu_for_each_node_breadth_first().
>
> Indeed. That is the part I completely missed.
>
> > This macro (shown below)
> > implements this breadth-first walk using a simple sequential traversal of
> > the ->node[] array that provides the structures making up the rcu_node
> > tree. As you can see, this scan is completely independent of how CPU
> > numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
>
> So this code is clearly not a hotpath, but still its performance
> matters?
>
> Seems like you cannot win here :/

Welcome to my world!!! ;-)

But yes, running on 4096-CPU systems can put some serious stress on
some surprising areas. Especially when those systems have cache-miss
latencies well in excess of a microsecond, and the users are nevertheless
expecting scheduling latencies well below 100 microseconds.

It was a fun challenge, I grant you that!

Thanx, Paul

2017-04-19 14:52:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 07:47:30AM -0700, Paul E. McKenney wrote:
> The RCU expedited primitives have been completely rewritten since then,
> and no longer use try_stop_cpus(), no longer disturb idle CPUs, and no
> longer disturb nohz_full CPUs running in userspace. In addition, there
> is the rcupdate.rcu_normal kernel boot parameter for those who want to
> completely avoid RCU expedited primitives.
>
> So it seems to me to be time for the patch below. Thoughts?

So I forgot all the details again; but if I'm not mistaken it still
prods CPUs with IPIs (just not idle/nohz_full CPUs). So it's still not
ideal to sprinkle them around.

Which would still argue against using them too much.

2017-04-19 14:59:00

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 07:47:30AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 19, 2017 at 02:08:47PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 19, 2017 at 01:48:08PM +0200, Christian Borntraeger wrote:
> > > On 04/19/2017 01:28 PM, Peter Zijlstra wrote:
> > > >
> > > > So the thing Maz complained about is because KVM assumes
> > > > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > > > This series 'breaks' that.
> > >
> > > Why is such a behaviour change not mentioned in the cover letter?
> > > I could not find anything in the patch descriptions that would
> > > indicate a slowdown. How much slower did it get?
> > >
> > > But indeed, there are several places at KVM startup which have been
> > > reworked to srcu since normal rcu was too slow for several usecases.
> > > (Mostly registering devices and related data structures at startup,
> > > basically the qemu/kvm coldplug interaction)
> >
> > I suspect Paul is not considering this a 'normal' RCU feature, and
> > therefore didn't think about changing this.
> >
> > I know I was fairly surprised by this requirement when I ran into it;
> > and only accidentally remembered it now that maz complained.
>
> Indeed -- the natural thing to have done back when KVM's scalability was
> first being worked on would have been to simply change synchronize_rcu()
> to synchronize_rcu_expedited(). However, at that time, these things
> did try_stop_cpus() and the like, which was really bad for latency.
> Moving to SRCU avoided this problem. Of course, now that KVM uses
> SRCU, why change unless there is a problem? Besides, I vaguely recall
> some KVM cases where srcu_read_lock() is used from CPUs that look to
> be idle or offline from RCU's perspective, and that sort of thing only
> works for SRCU.
>
> Which reminds me...
>
> The RCU expedited primitives have been completely rewritten since then,
> and no longer use try_stop_cpus(), no longer disturb idle CPUs, and no
> longer disturb nohz_full CPUs running in userspace. In addition, there
> is the rcupdate.rcu_normal kernel boot parameter for those who want to
> completely avoid RCU expedited primitives.
>
> So it seems to me to be time for the patch below. Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 333d383fad42b4bdef3d27d91e940a6eafed3f91
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Apr 19 07:37:45 2017 -0700
>
> checkpatch: Remove checks for expedited grace periods
>
> There was a time when the expedited grace-period primitives
> (synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), and
> synchronize_sched_expedited()) used rather antisocial kernel
> facilities like try_stop_cpus(). However, they have since been
> housebroken to use only single-CPU IPIs, and typically cause less
> disturbance than a scheduling-clock interrupt. Furthermore, this
> disturbance can be eliminated entirely using NO_HZ_FULL on the
> one hand or the rcupdate.rcu_normal boot parameter on the other.
>
> This commit therefore removes checkpatch's complaints about use
> of the expedited RCU primitives.
>
> Signed-off-by: Paul E. McKenney <[email protected]>

Reviewed-by: Josh Triplett <[email protected]>

Still something to hesitate a bit before using, but not something
checkpatch should warn about.

2017-04-19 15:00:40

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 02/11] lockdep: Use "WARNING" tag on lockdep splats

On Mon, Apr 17, 2017 at 04:28:49PM -0700, Paul E. McKenney wrote:
> This commit changes lockdep splats to begin lines with "WARNING" and
> to use pr_warn() instead of printk(). This change eases scripted
> analysis of kernel console output.
>
> Reported-by: Dmitry Vyukov <[email protected]>
> Reported-by: Ingo Molnar <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Acked-by: Dmitry Vyukov <[email protected]>

Reviewed-by: Josh Triplett <[email protected]>

Any reason not to change the adjacent calls to printk (without a
priority) to pr_warn?

> kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
> kernel/locking/rtmutex-debug.c | 9 +++--
> 2 files changed, 48 insertions(+), 47 deletions(-)
>
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index a95e5d1f4a9c..e9d4f85b290c 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
> return 0;
>
> printk("\n");
> - printk("======================================================\n");
> - printk("[ INFO: possible circular locking dependency detected ]\n");
> + pr_warn("======================================================\n");
> + pr_warn("WARNING: possible circular locking dependency detected\n");
> print_kernel_ident();
> - printk("-------------------------------------------------------\n");
> + pr_warn("------------------------------------------------------\n");
> printk("%s/%d is trying to acquire lock:\n",
> curr->comm, task_pid_nr(curr));
> print_lock(check_src);
> @@ -1482,11 +1482,11 @@ print_bad_irq_dependency(struct task_struct *curr,
> return 0;
>
> printk("\n");
> - printk("======================================================\n");
> - printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
> + pr_warn("=====================================================\n");
> + pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
> irqclass, irqclass);
> print_kernel_ident();
> - printk("------------------------------------------------------\n");
> + pr_warn("-----------------------------------------------------\n");
> printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
> curr->comm, task_pid_nr(curr),
> curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
> @@ -1711,10 +1711,10 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
> return 0;
>
> printk("\n");
> - printk("=============================================\n");
> - printk("[ INFO: possible recursive locking detected ]\n");
> + pr_warn("============================================\n");
> + pr_warn("WARNING: possible recursive locking detected\n");
> print_kernel_ident();
> - printk("---------------------------------------------\n");
> + pr_warn("--------------------------------------------\n");
> printk("%s/%d is trying to acquire lock:\n",
> curr->comm, task_pid_nr(curr));
> print_lock(next);
> @@ -2061,10 +2061,10 @@ static void print_collision(struct task_struct *curr,
> struct lock_chain *chain)
> {
> printk("\n");
> - printk("======================\n");
> - printk("[chain_key collision ]\n");
> + pr_warn("============================\n");
> + pr_warn("WARNING: chain_key collision\n");
> print_kernel_ident();
> - printk("----------------------\n");
> + pr_warn("----------------------------\n");
> printk("%s/%d: ", current->comm, task_pid_nr(current));
> printk("Hash chain already cached but the contents don't match!\n");
>
> @@ -2360,10 +2360,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
> return 0;
>
> printk("\n");
> - printk("=================================\n");
> - printk("[ INFO: inconsistent lock state ]\n");
> + pr_warn("================================\n");
> + pr_warn("WARNING: inconsistent lock state\n");
> print_kernel_ident();
> - printk("---------------------------------\n");
> + pr_warn("--------------------------------\n");
>
> printk("inconsistent {%s} -> {%s} usage.\n",
> usage_str[prev_bit], usage_str[new_bit]);
> @@ -2425,10 +2425,10 @@ print_irq_inversion_bug(struct task_struct *curr,
> return 0;
>
> printk("\n");
> - printk("=========================================================\n");
> - printk("[ INFO: possible irq lock inversion dependency detected ]\n");
> + pr_warn("========================================================\n");
> + pr_warn("WARNING: possible irq lock inversion dependency detected\n");
> print_kernel_ident();
> - printk("---------------------------------------------------------\n");
> + pr_warn("--------------------------------------------------------\n");
> printk("%s/%d just changed the state of lock:\n",
> curr->comm, task_pid_nr(curr));
> print_lock(this);
> @@ -3170,10 +3170,10 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
> return 0;
>
> printk("\n");
> - printk("==================================\n");
> - printk("[ BUG: Nested lock was not taken ]\n");
> + pr_warn("==================================\n");
> + pr_warn("WARNING: Nested lock was not taken\n");
> print_kernel_ident();
> - printk("----------------------------------\n");
> + pr_warn("----------------------------------\n");
>
> printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
> print_lock(hlock);
> @@ -3383,10 +3383,10 @@ print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
> return 0;
>
> printk("\n");
> - printk("=====================================\n");
> - printk("[ BUG: bad unlock balance detected! ]\n");
> + pr_warn("=====================================\n");
> + pr_warn("WARNING: bad unlock balance detected!\n");
> print_kernel_ident();
> - printk("-------------------------------------\n");
> + pr_warn("-------------------------------------\n");
> printk("%s/%d is trying to release lock (",
> curr->comm, task_pid_nr(curr));
> print_lockdep_cache(lock);
> @@ -3880,10 +3880,10 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
> return 0;
>
> printk("\n");
> - printk("=================================\n");
> - printk("[ BUG: bad contention detected! ]\n");
> + pr_warn("=================================\n");
> + pr_warn("WARNING: bad contention detected!\n");
> print_kernel_ident();
> - printk("---------------------------------\n");
> + pr_warn("---------------------------------\n");
> printk("%s/%d is trying to contend lock (",
> curr->comm, task_pid_nr(curr));
> print_lockdep_cache(lock);
> @@ -4244,10 +4244,10 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
> return;
>
> printk("\n");
> - printk("=========================\n");
> - printk("[ BUG: held lock freed! ]\n");
> + pr_warn("=========================\n");
> + pr_warn("WARNING: held lock freed!\n");
> print_kernel_ident();
> - printk("-------------------------\n");
> + pr_warn("-------------------------\n");
> printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
> curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
> print_lock(hlock);
> @@ -4302,11 +4302,11 @@ static void print_held_locks_bug(void)
> return;
>
> printk("\n");
> - printk("=====================================\n");
> - printk("[ BUG: %s/%d still has locks held! ]\n",
> + pr_warn("====================================\n");
> + pr_warn("WARNING: %s/%d still has locks held!\n",
> current->comm, task_pid_nr(current));
> print_kernel_ident();
> - printk("-------------------------------------\n");
> + pr_warn("------------------------------------\n");
> lockdep_print_held_locks(current);
> printk("\nstack backtrace:\n");
> dump_stack();
> @@ -4371,7 +4371,7 @@ void debug_show_all_locks(void)
> } while_each_thread(g, p);
>
> printk("\n");
> - printk("=============================================\n\n");
> + pr_warn("=============================================\n\n");
>
> if (unlock)
> read_unlock(&tasklist_lock);
> @@ -4401,10 +4401,10 @@ asmlinkage __visible void lockdep_sys_exit(void)
> if (!debug_locks_off())
> return;
> printk("\n");
> - printk("================================================\n");
> - printk("[ BUG: lock held when returning to user space! ]\n");
> + pr_warn("================================================\n");
> + pr_warn("WARNING: lock held when returning to user space!\n");
> print_kernel_ident();
> - printk("------------------------------------------------\n");
> + pr_warn("------------------------------------------------\n");
> printk("%s/%d is leaving the kernel with locks still held!\n",
> curr->comm, curr->pid);
> lockdep_print_held_locks(curr);
> @@ -4421,13 +4421,13 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
> #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
> /* Note: the following can be executed concurrently, so be careful. */
> printk("\n");
> - pr_err("===============================\n");
> - pr_err("[ ERR: suspicious RCU usage. ]\n");
> + pr_warn("=============================\n");
> + pr_warn("WARNING: suspicious RCU usage\n");
> print_kernel_ident();
> - pr_err("-------------------------------\n");
> - pr_err("%s:%d %s!\n", file, line, s);
> - pr_err("\nother info that might help us debug this:\n\n");
> - pr_err("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
> + pr_warn("-----------------------------\n");
> + printk("%s:%d %s!\n", file, line, s);
> + printk("\nother info that might help us debug this:\n\n");
> + printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
> !rcu_lockdep_current_cpu_online()
> ? "RCU used illegally from offline CPU!\n"
> : !rcu_is_watching()
> diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
> index 97ee9df32e0f..db4f55211b04 100644
> --- a/kernel/locking/rtmutex-debug.c
> +++ b/kernel/locking/rtmutex-debug.c
> @@ -102,10 +102,11 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
> return;
> }
>
> - printk("\n============================================\n");
> - printk( "[ BUG: circular locking deadlock detected! ]\n");
> - printk("%s\n", print_tainted());
> - printk( "--------------------------------------------\n");
> + pr_warn("\n");
> + pr_warn("============================================\n");
> + pr_warn("WARNING: circular locking deadlock detected!\n");
> + pr_warn("%s\n", print_tainted());
> + pr_warn("--------------------------------------------\n");
> printk("%s/%d is deadlocking current task %s/%d\n\n",
> task->comm, task_pid_nr(task),
> current->comm, task_pid_nr(current));
> --
> 2.5.2
>

2017-04-19 15:03:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 07:58:44AM -0700, Josh Triplett wrote:
>
> Still something to hesitate a bit before using, but not something
> checkpatch should warn about.

How else will you get people to hesitate?

2017-04-19 15:08:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 19, 2017 at 03:48:35PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 03:22:26PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 11:42:32AM -0700, Paul E. McKenney wrote:
> >
> > > I believe that you are missing the fact that RCU grace-period
> > > initialization and cleanup walks through the rcu_node tree breadth
> > > first, using rcu_for_each_node_breadth_first().
> >
> > Indeed. That is the part I completely missed.
> >
> > > This macro (shown below)
> > > implements this breadth-first walk using a simple sequential traversal of
> > > the ->node[] array that provides the structures making up the rcu_node
> > > tree. As you can see, this scan is completely independent of how CPU
> > > numbers might be mapped to rcu_data slots in the leaf rcu_node structures.
> >
> > So this code is clearly not a hotpath, but still its performance
> > matters?
> >
> > Seems like you cannot win here :/
>
> So I sort of see what that code does, but I cannot quite grasp from the
> comments near there _why_ it is doing this.
>
> My thinking is that normal (active CPUs) will update their state at tick
> time through the tree, and once the state reaches the root node, IOW all
> CPUs agree they've observed that particular state, we advance the global
> state, rinse repeat. That's how tree-rcu works.
>
> NOHZ-idle stuff would be excluded entirely; that is, if we're allowed to
> go idle we're up-to-date, and completely drop out of the state tracking.
> When we become active again, we can simply sync the CPU's state to the
> active state and go from there -- ignoring whatever happened in the
> mean-time.
>
> So why do we have to do machine-wide updates? How can we get to the end
> of a grace period without all CPUs already agreeing that it's complete?
>
> /me puzzled.

This is a decent overall summary of how RCU grace periods work, but there
are quite a few corner cases that complicate things. In this email,
I will focus on just one of them, starting with CPUs returning from
NOHZ-idle state.

In theory, you are correct when you say that we could have CPUs sync up
with current RCU state immediately upon return from idle. In practice,
people are already screaming at me about the single CPU-local atomic
operation and memory barriers, so adding code on the idle-exit fastpath
to acquire the leaf rcu_node structure's lock and grab the current
state would do nothing but cause Marc Zyngier and many others to report
performance bugs to me.

And even that would not be completely sufficient. After all, the state
in the leaf rcu_node structure will be out of date during grace-period
initialization and cleanup. So to -completely- synchronize state for
the incoming CPU, I would have to acquire the root rcu_node structure's
lock and look at the live state. Needless to say, the performance and
scalability implications of acquiring a global lock on each and every
idle exit event are not going to be at all pretty.

This means that even non-idle CPUs must necessarily be allowed to have
differing opinions about which grace period is currently in effect. We
simply cannot have total agreement on when a given grace period starts
or ends, because such agreement is just too expensive. Therefore, when a
grace period begins, the grace-period kthread scans the rcu_node tree,
propagating this transition through it. And similarly when a grace
period ends.

Because the rcu_node tree is mapped into a dense array, and because
the scan proceeds in index order, the scan operation is pretty much
best-case for the cache hardware. But on large machines with large
cache-miss latencies, it can still inflict a bit of pain -- almost all
of which has been addressed by the switch to grace-period kthreads.
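
For reference, the breadth-first walk in question is, in paraphrase, just an
index-order scan of the dense ->node[] array (a sketch of the macro, not
necessarily the exact in-tree definition):

/* Breadth-first order falls out of the array layout, so the "tree walk"
 * is a simple, cache-friendly linear scan. */
#define rcu_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[rcu_num_nodes]; (rnp)++)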

Hey, you asked!!! ;-)

Thanx, Paul

2017-04-19 15:13:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 04:52:15PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 07:47:30AM -0700, Paul E. McKenney wrote:
> > The RCU expedited primitives have been completely rewritten since then,
> > and no longer use try_stop_cpus(), no longer disturb idle CPUs, and no
> > longer disturb nohz_full CPUs running in userspace. In addition, there
> > is the rcupdate.rcu_normal kernel boot parameter for those who want to
> > completely avoid RCU expedited primitives.
> >
> > So it seems to me to be time for the patch below. Thoughts?
>
> So I forgot all the details again; but if I'm not mistaken it still
> prods CPUs with IPIs (just not idle/nohz_full CPUs). So it's still not
> ideal to sprinkle them around.
>
> Which would still argue against using them too much.

True, but we have any number of things in the kernel that do IPIs,
including simple wakeups. Adding checkpatch warnings for all of them
seems silly, as does singling out only one of them. Hence the patch.

Thanx, Paul

2017-04-19 15:17:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 05:03:09PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 07:58:44AM -0700, Josh Triplett wrote:
> >
> > Still something to hesitate a bit before using, but not something
> > checkpatch should warn about.
>
> How else will you get people to hesitate?

The fact that checkpatch has been warning about it for quite some time
should suffice for the near future. Longer term, there is no more reason
to complain about synchronize_rcu_expedited() than there is to complain
about smp_call_function().

Thanx, Paul

2017-04-19 15:37:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 03:15:53PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 06:02:45AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
> > >
> > > So the thing Maz complained about is because KVM assumes
> > > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > > This series 'breaks' that.
> > >
> > > I've not looked hard enough at the new SRCU to see if its possible to
> > > re-instate that feature.
> >
> > And with the fix I gave Maz, the parallelized version is near enough
> > to being free as well. It was just a stupid bug on my part: I forgot
> > to check for expedited when scheduling callbacks.
>
> Right, although for the old SRCU it was true for !expedited as well.

Which is all good fun until someone does a call_srcu() on each and
every munmap() syscall. ;-)

> Just turns out the KVM memslots crud already uses
> synchronize_srcu_expedited().
>
> <rant>without a friggin' comment; hate @expedited</rant>

And I won't even try defend the old try_stop_cpus()-based expedited
algorithm in today's context, even if it did seem to be a good idea
at the time. That said, back at that time, the expectation was that
expedited grace periods would only be used for very rare boot-time
configuration changes, at which time who cares? But there are a lot
more expedited use cases these days, so the implementation had to change.

But the current code is much better housebroken. ;-)

Thanx, Paul

2017-04-19 15:40:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 19, 2017 at 08:08:09AM -0700, Paul E. McKenney wrote:
> And even that would not be completely sufficient. After all, the state
> in the leaf rcu_node structure will be out of date during grace-period
> initialization and cleanup. So to -completely- synchronize state for
> the incoming CPU, I would have to acquire the root rcu_node structure's
> lock and look at the live state. Needless to say, the performance and
> scalability implications of acquiring a global lock on each and every
> idle exit event are not going to be at all pretty.

Arguably you could use a seqlock to read the global state. Will still
ponder things a bit more, esp. those bugs you pointed me at from just
reading gpnum.
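
A hypothetical sketch of that suggestion (the seqcount and the helper below
are illustrative only, not existing kernel code): the grace-period kthread
would bump a seqcount around updates to the global state, letting idle-exit
read a consistent snapshot without taking any lock.

static seqcount_t rcu_gp_state_seq;	/* illustrative only */

static unsigned long rcu_read_global_gpnum(struct rcu_state *rsp)
{
	unsigned int seq;
	unsigned long gpnum;

	do {
		seq = read_seqcount_begin(&rcu_gp_state_seq);
		gpnum = READ_ONCE(rsp->gpnum);
	} while (read_seqcount_retry(&rcu_gp_state_seq, seq));

	return gpnum;
}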

2017-04-19 15:43:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 08:37:03AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 19, 2017 at 03:15:53PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 19, 2017 at 06:02:45AM -0700, Paul E. McKenney wrote:
> > > On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
> > > >
> > > > So the thing Maz complained about is because KVM assumes
> > > > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > > > This series 'breaks' that.
> > > >
> > > > I've not looked hard enough at the new SRCU to see if its possible to
> > > > re-instate that feature.
> > >
> > > And with the fix I gave Maz, the parallelized version is near enough
> > > to being free as well. It was just a stupid bug on my part: I forgot
> > > to check for expedited when scheduling callbacks.
> >
> > Right, although for the old SRCU it was true for !expedited as well.
>
> Which is all good fun until someone does a call_srcu() on each and
> every munmap() syscall. ;-)

Well, that being a different SRCU domain doesn't affect the KVM memslot
domain thingy ;-)

> But the current code is much better housebroken. ;-)

It is. But a workload that manages to hit sync_expedited in a loop on
all CPUs is still O(n^2) work. And the more sync_expedited instances we
have, the more likely that becomes.

2017-04-19 16:13:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

On Wed, Apr 19, 2017 at 05:43:43PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 08:37:03AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 19, 2017 at 03:15:53PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 19, 2017 at 06:02:45AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Apr 19, 2017 at 01:28:45PM +0200, Peter Zijlstra wrote:
> > > > >
> > > > > So the thing Maz complained about is because KVM assumes
> > > > > synchronize_srcu() is 'free' when there is no srcu_read_lock() activity.
> > > > > This series 'breaks' that.
> > > > >
> > > > > I've not looked hard enough at the new SRCU to see if its possible to
> > > > > re-instate that feature.
> > > >
> > > > And with the fix I gave Maz, the parallelized version is near enough
> > > > to being free as well. It was just a stupid bug on my part: I forgot
> > > > to check for expedited when scheduling callbacks.
> > >
> > > Right, although for the old SRCU it was true for !expedited as well.
> >
> > Which is all good fun until someone does a call_srcu() on each and
> > every munmap() syscall. ;-)
>
> Well, that being a different SRCU domain doesn't affect the KVM memslot
> domain thingy ;-)

Other than the excessive quantities of CPU time consumed...

> > But the current code is much better housebroken. ;-)
>
> It is. But a workload that manages to hit sync_expedited in a loop on
> all CPUs is still O(n^2) work. And the more sync_expedited instances we
> have, the more likely that becomes.

In most cases, it shouldn't be -that- hard to loop through the CPUs and
then do a single sync_expedited at the end of the loop.
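
A sketch of that batching pattern (the per-CPU update function is
hypothetical): move the expedited grace period out of the per-CPU loop so
that a single grace period covers all of the updates.

static void update_all_cpus_batched(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		update_one_cpu(cpu);	/* hypothetical per-CPU update */

	/* One expedited grace period covers every update above. */
	synchronize_rcu_expedited();
}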

Thanx, Paul

2017-04-19 16:14:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 04/13] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

On Wed, Apr 19, 2017 at 05:40:40PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 08:08:09AM -0700, Paul E. McKenney wrote:
> > And even that would not be completely sufficient. After all, the state
> > in the leaf rcu_node structure will be out of date during grace-period
> > initialization and cleanup. So to -completely- synchronize state for
> > the incoming CPU, I would have to acquire the root rcu_node structure's
> > lock and look at the live state. Needless to say, the performance and
> > scalability implications of acquiring a global lock on each and every
> idle exit event are not going to be at all pretty.
>
> Arguably you could use a seqlock to read the global state. Will still
> ponder things a bit more, esp. those bugs you pointed me at from just
> reading gpnum.

Looking forward to hearing what you come up with!

Thanx, Paul

2017-04-19 16:26:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 02/11] lockdep: Use "WARNING" tag on lockdep splats

On Wed, Apr 19, 2017 at 08:00:22AM -0700, Josh Triplett wrote:
> On Mon, Apr 17, 2017 at 04:28:49PM -0700, Paul E. McKenney wrote:
> > This commit changes lockdep splats to begin lines with "WARNING" and
> > to use pr_warn() instead of printk(). This change eases scripted
> > analysis of kernel console output.
> >
> > Reported-by: Dmitry Vyukov <[email protected]>
> > Reported-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Acked-by: Dmitry Vyukov <[email protected]>
>
> Reviewed-by: Josh Triplett <[email protected]>

Thank you!

> Any reason not to change the adjacent calls to printk (without a
> priority) to pr_warn?

There was some discussion of changing them all throughout the file,
not sure where we left that.

Thanx, Paul

> > kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
> > kernel/locking/rtmutex-debug.c | 9 +++--
> > 2 files changed, 48 insertions(+), 47 deletions(-)
> >
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index a95e5d1f4a9c..e9d4f85b290c 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
> > return 0;
> >
> > printk("\n");
> > - printk("======================================================\n");
> > - printk("[ INFO: possible circular locking dependency detected ]\n");
> > + pr_warn("======================================================\n");
> > + pr_warn("WARNING: possible circular locking dependency detected\n");
> > print_kernel_ident();
> > - printk("-------------------------------------------------------\n");
> > + pr_warn("------------------------------------------------------\n");
> > printk("%s/%d is trying to acquire lock:\n",
> > curr->comm, task_pid_nr(curr));
> > print_lock(check_src);
> > @@ -1482,11 +1482,11 @@ print_bad_irq_dependency(struct task_struct *curr,
> > return 0;
> >
> > printk("\n");
> > - printk("======================================================\n");
> > - printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
> > + pr_warn("=====================================================\n");
> > + pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
> > irqclass, irqclass);
> > print_kernel_ident();
> > - printk("------------------------------------------------------\n");
> > + pr_warn("-----------------------------------------------------\n");
> > printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
> > curr->comm, task_pid_nr(curr),
> > curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
> > @@ -1711,10 +1711,10 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
> > return 0;
> >
> > printk("\n");
> > - printk("=============================================\n");
> > - printk("[ INFO: possible recursive locking detected ]\n");
> > + pr_warn("============================================\n");
> > + pr_warn("WARNING: possible recursive locking detected\n");
> > print_kernel_ident();
> > - printk("---------------------------------------------\n");
> > + pr_warn("--------------------------------------------\n");
> > printk("%s/%d is trying to acquire lock:\n",
> > curr->comm, task_pid_nr(curr));
> > print_lock(next);
> > @@ -2061,10 +2061,10 @@ static void print_collision(struct task_struct *curr,
> > struct lock_chain *chain)
> > {
> > printk("\n");
> > - printk("======================\n");
> > - printk("[chain_key collision ]\n");
> > + pr_warn("============================\n");
> > + pr_warn("WARNING: chain_key collision\n");
> > print_kernel_ident();
> > - printk("----------------------\n");
> > + pr_warn("----------------------------\n");
> > printk("%s/%d: ", current->comm, task_pid_nr(current));
> > printk("Hash chain already cached but the contents don't match!\n");
> >
> > @@ -2360,10 +2360,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
> > return 0;
> >
> > printk("\n");
> > - printk("=================================\n");
> > - printk("[ INFO: inconsistent lock state ]\n");
> > + pr_warn("================================\n");
> > + pr_warn("WARNING: inconsistent lock state\n");
> > print_kernel_ident();
> > - printk("---------------------------------\n");
> > + pr_warn("--------------------------------\n");
> >
> > printk("inconsistent {%s} -> {%s} usage.\n",
> > usage_str[prev_bit], usage_str[new_bit]);
> > @@ -2425,10 +2425,10 @@ print_irq_inversion_bug(struct task_struct *curr,
> > return 0;
> >
> > printk("\n");
> > - printk("=========================================================\n");
> > - printk("[ INFO: possible irq lock inversion dependency detected ]\n");
> > + pr_warn("========================================================\n");
> > + pr_warn("WARNING: possible irq lock inversion dependency detected\n");
> > print_kernel_ident();
> > - printk("---------------------------------------------------------\n");
> > + pr_warn("--------------------------------------------------------\n");
> > printk("%s/%d just changed the state of lock:\n",
> > curr->comm, task_pid_nr(curr));
> > print_lock(this);
> > @@ -3170,10 +3170,10 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
> > return 0;
> >
> > printk("\n");
> > - printk("==================================\n");
> > - printk("[ BUG: Nested lock was not taken ]\n");
> > + pr_warn("==================================\n");
> > + pr_warn("WARNING: Nested lock was not taken\n");
> > print_kernel_ident();
> > - printk("----------------------------------\n");
> > + pr_warn("----------------------------------\n");
> >
> > printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
> > print_lock(hlock);
> > @@ -3383,10 +3383,10 @@ print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
> > return 0;
> >
> > printk("\n");
> > - printk("=====================================\n");
> > - printk("[ BUG: bad unlock balance detected! ]\n");
> > + pr_warn("=====================================\n");
> > + pr_warn("WARNING: bad unlock balance detected!\n");
> > print_kernel_ident();
> > - printk("-------------------------------------\n");
> > + pr_warn("-------------------------------------\n");
> > printk("%s/%d is trying to release lock (",
> > curr->comm, task_pid_nr(curr));
> > print_lockdep_cache(lock);
> > @@ -3880,10 +3880,10 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
> > return 0;
> >
> > printk("\n");
> > - printk("=================================\n");
> > - printk("[ BUG: bad contention detected! ]\n");
> > + pr_warn("=================================\n");
> > + pr_warn("WARNING: bad contention detected!\n");
> > print_kernel_ident();
> > - printk("---------------------------------\n");
> > + pr_warn("---------------------------------\n");
> > printk("%s/%d is trying to contend lock (",
> > curr->comm, task_pid_nr(curr));
> > print_lockdep_cache(lock);
> > @@ -4244,10 +4244,10 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
> > return;
> >
> > printk("\n");
> > - printk("=========================\n");
> > - printk("[ BUG: held lock freed! ]\n");
> > + pr_warn("=========================\n");
> > + pr_warn("WARNING: held lock freed!\n");
> > print_kernel_ident();
> > - printk("-------------------------\n");
> > + pr_warn("-------------------------\n");
> > printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
> > curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
> > print_lock(hlock);
> > @@ -4302,11 +4302,11 @@ static void print_held_locks_bug(void)
> > return;
> >
> > printk("\n");
> > - printk("=====================================\n");
> > - printk("[ BUG: %s/%d still has locks held! ]\n",
> > + pr_warn("====================================\n");
> > + pr_warn("WARNING: %s/%d still has locks held!\n",
> > current->comm, task_pid_nr(current));
> > print_kernel_ident();
> > - printk("-------------------------------------\n");
> > + pr_warn("------------------------------------\n");
> > lockdep_print_held_locks(current);
> > printk("\nstack backtrace:\n");
> > dump_stack();
> > @@ -4371,7 +4371,7 @@ void debug_show_all_locks(void)
> > } while_each_thread(g, p);
> >
> > printk("\n");
> > - printk("=============================================\n\n");
> > + pr_warn("=============================================\n\n");
> >
> > if (unlock)
> > read_unlock(&tasklist_lock);
> > @@ -4401,10 +4401,10 @@ asmlinkage __visible void lockdep_sys_exit(void)
> > if (!debug_locks_off())
> > return;
> > printk("\n");
> > - printk("================================================\n");
> > - printk("[ BUG: lock held when returning to user space! ]\n");
> > + pr_warn("================================================\n");
> > + pr_warn("WARNING: lock held when returning to user space!\n");
> > print_kernel_ident();
> > - printk("------------------------------------------------\n");
> > + pr_warn("------------------------------------------------\n");
> > printk("%s/%d is leaving the kernel with locks still held!\n",
> > curr->comm, curr->pid);
> > lockdep_print_held_locks(curr);
> > @@ -4421,13 +4421,13 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
> > #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
> > /* Note: the following can be executed concurrently, so be careful. */
> > printk("\n");
> > - pr_err("===============================\n");
> > - pr_err("[ ERR: suspicious RCU usage. ]\n");
> > + pr_warn("=============================\n");
> > + pr_warn("WARNING: suspicious RCU usage\n");
> > print_kernel_ident();
> > - pr_err("-------------------------------\n");
> > - pr_err("%s:%d %s!\n", file, line, s);
> > - pr_err("\nother info that might help us debug this:\n\n");
> > - pr_err("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
> > + pr_warn("-----------------------------\n");
> > + printk("%s:%d %s!\n", file, line, s);
> > + printk("\nother info that might help us debug this:\n\n");
> > + printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
> > !rcu_lockdep_current_cpu_online()
> > ? "RCU used illegally from offline CPU!\n"
> > : !rcu_is_watching()
> > diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
> > index 97ee9df32e0f..db4f55211b04 100644
> > --- a/kernel/locking/rtmutex-debug.c
> > +++ b/kernel/locking/rtmutex-debug.c
> > @@ -102,10 +102,11 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
> > return;
> > }
> >
> > - printk("\n============================================\n");
> > - printk( "[ BUG: circular locking deadlock detected! ]\n");
> > - printk("%s\n", print_tainted());
> > - printk( "--------------------------------------------\n");
> > + pr_warn("\n");
> > + pr_warn("============================================\n");
> > + pr_warn("WARNING: circular locking deadlock detected!\n");
> > + pr_warn("%s\n", print_tainted());
> > + pr_warn("--------------------------------------------\n");
> > printk("%s/%d is deadlocking current task %s/%d\n\n",
> > task->comm, task_pid_nr(task),
> > current->comm, task_pid_nr(current));
> > --
> > 2.5.2
> >
>

2017-04-19 16:46:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 tip/core/rcu 0/13] Miscellaneous fixes for 4.12

Hello!

This v3 series contains the following fixes:

1. Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU.

2. Use "WARNING" tag on RCU's lockdep splats.

3. Update obsolete callback_head comment.

4. Make RCU_FANOUT_LEAF help text more explicit about skew_tick.

5. Remove obsolete comment from rcu_future_gp_cleanup() header.

6. Disable sparse warning emitted by hlist_add_tail_rcu(), courtesy
of Michael S. Tsirkin.

7. Improve comments for hotplug/suspend/hibernate functions.

8. Use correct path for Kconfig fragment for duplicate rcutorture
test scenarios.

9. Use bool value directly for ->beenonline comparison, courtesy
of Nicholas Mc Guire.

10. Use true/false in assignment to bool variable rcu_nocb_poll,
courtesy of Nicholas Mc Guire.

11. Fix typo in PER_RCU_NODE_PERIOD header comment.

Changes since v2:

o Fixed indentation in RCU_FANOUT_LEAF Kconfig option help text,
as noted by Josh Triplett.

o Applied feedback from David Rientjes.

Changes since v1:

o Applied review feedback from Peter Zijlstra, Vlastimil Babka,
and Eric Dumazet.

o Dropped v1 patch #7 ("Add smp_mb__after_atomic() to
sync_exp_work_done()"), as ensuing discussion confirmed that
smp_mb__before_atomic() guarantees a full barrier.

o Moved v1 patch #9 ("Use static initialization for "srcu" in
mm/mmu_notifier.c") to the srcu series because 0day Test Robot
showed that it needs to be there.

Thanx, Paul

------------------------------------------------------------------------

Documentation/RCU/00-INDEX | 2
Documentation/RCU/rculist_nulls.txt | 6 -
Documentation/RCU/whatisRCU.txt | 3
drivers/gpu/drm/i915/i915_gem.c | 2
drivers/gpu/drm/i915/i915_gem_request.h | 2
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2
fs/jbd2/journal.c | 2
fs/signalfd.c | 2
include/linux/dma-fence.h | 4
include/linux/rculist.h | 3
include/linux/slab.h | 6 -
include/linux/types.h | 2
include/net/sock.h | 2
init/Kconfig | 10 +
kernel/fork.c | 4
kernel/locking/lockdep.c | 86 +++++++--------
kernel/locking/rtmutex-debug.c | 9 -
kernel/rcu/tree.c | 49 ++++++--
kernel/rcu/tree_plugin.h | 2
kernel/signal.c | 2
mm/kasan/kasan.c | 6 -
mm/kmemcheck.c | 2
mm/rmap.c | 4
mm/slab.c | 6 -
mm/slab.h | 4
mm/slab_common.c | 6 -
mm/slob.c | 6 -
mm/slub.c | 12 +-
net/dccp/ipv4.c | 2
net/dccp/ipv6.c | 2
net/ipv4/tcp_ipv4.c | 2
net/ipv6/tcp_ipv6.c | 2
net/llc/af_llc.c | 2
net/llc/llc_conn.c | 4
net/llc/llc_sap.c | 2
net/netfilter/nf_conntrack_core.c | 8 -
net/smc/af_smc.c | 2
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2
38 files changed, 158 insertions(+), 116 deletions(-)

2017-04-19 16:46:49

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 10/11] rcu: Use true/false in assignment to bool

From: Nicholas Mc Guire <[email protected]>

This commit makes the parse_rcu_nocb_poll() function assign true
(rather than the constant 1) to the bool variable rcu_nocb_poll.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0a62a8f1caac..f4b7a9be1a44 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1709,7 +1709,7 @@ __setup("rcu_nocbs=", rcu_nocb_setup);

static int __init parse_rcu_nocb_poll(char *arg)
{
- rcu_nocb_poll = 1;
+ rcu_nocb_poll = true;
return 0;
}
early_param("rcu_nocb_poll", parse_rcu_nocb_poll);
--
2.5.2

2017-04-19 16:46:53

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 09/11] rcu: Use bool value directly

From: Nicholas Mc Guire <[email protected]>

The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.

Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c4f195dd7c94..7c238604df18 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3085,7 +3085,7 @@ __rcu_process_callbacks(struct rcu_state *rsp)
bool needwake;
struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

- WARN_ON_ONCE(rdp->beenonline == 0);
+ WARN_ON_ONCE(!rdp->beenonline);

/* Update RCU state based on any recent quiescent states. */
rcu_check_quiescent_state(rsp, rdp);
--
2.5.2

2017-04-19 16:46:57

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 03/11] types: Update obsolete callback_head comment

The comment header for callback_head (and thus for rcu_head) states that
the bottom two bits of a pointer to these structures must be zero. This
is obsolete: The new requirement is that only the bottom bit need be
zero. This commit therefore updates this comment.

Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 1e7bd24848fc..258099a4ed82 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -209,7 +209,7 @@ struct ustat {
* naturally due ABI requirements, but some architectures (like CRIS) have
* weird ABI and we need to ask it explicitly.
*
- * The alignment is required to guarantee that bits 0 and 1 of @next will be
+ * The alignment is required to guarantee that bit 0 of @next will be
* clear under normal conditions -- as long as we use call_rcu(),
* call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue callback.
*
--
2.5.2
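
Illustration only (not part of the patch above): the updated requirement is
that a queued callback_head pointer be at least two-byte aligned, so that
bit 0 of ->next remains clear and available for internal use. A hypothetical
sanity check for that alignment:

static inline void check_callback_head_alignment(struct callback_head *head)
{
	/* Bit 0 of the pointer must be clear for queued callbacks. */
	WARN_ON_ONCE((unsigned long)head & 0x1UL);
}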

2017-04-19 16:47:07

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 02/11] lockdep: Use "WARNING" tag on lockdep splats

This commit changes lockdep splats to begin lines with "WARNING" and
to use pr_warn() instead of printk(). This change eases scripted
analysis of kernel console output.

Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/locking/lockdep.c | 86 +++++++++++++++++++++---------------------
kernel/locking/rtmutex-debug.c | 9 +++--
2 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index a95e5d1f4a9c..e9d4f85b290c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1144,10 +1144,10 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: possible circular locking dependency detected ]\n");
+ pr_warn("======================================================\n");
+ pr_warn("WARNING: possible circular locking dependency detected\n");
print_kernel_ident();
- printk("-------------------------------------------------------\n");
+ pr_warn("------------------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(check_src);
@@ -1482,11 +1482,11 @@ print_bad_irq_dependency(struct task_struct *curr,
return 0;

printk("\n");
- printk("======================================================\n");
- printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+ pr_warn("=====================================================\n");
+ pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
irqclass, irqclass);
print_kernel_ident();
- printk("------------------------------------------------------\n");
+ pr_warn("-----------------------------------------------------\n");
printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
curr->comm, task_pid_nr(curr),
curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
@@ -1711,10 +1711,10 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
return 0;

printk("\n");
- printk("=============================================\n");
- printk("[ INFO: possible recursive locking detected ]\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: possible recursive locking detected\n");
print_kernel_ident();
- printk("---------------------------------------------\n");
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is trying to acquire lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(next);
@@ -2061,10 +2061,10 @@ static void print_collision(struct task_struct *curr,
struct lock_chain *chain)
{
printk("\n");
- printk("======================\n");
- printk("[chain_key collision ]\n");
+ pr_warn("============================\n");
+ pr_warn("WARNING: chain_key collision\n");
print_kernel_ident();
- printk("----------------------\n");
+ pr_warn("----------------------------\n");
printk("%s/%d: ", current->comm, task_pid_nr(current));
printk("Hash chain already cached but the contents don't match!\n");

@@ -2360,10 +2360,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ INFO: inconsistent lock state ]\n");
+ pr_warn("================================\n");
+ pr_warn("WARNING: inconsistent lock state\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("--------------------------------\n");

printk("inconsistent {%s} -> {%s} usage.\n",
usage_str[prev_bit], usage_str[new_bit]);
@@ -2425,10 +2425,10 @@ print_irq_inversion_bug(struct task_struct *curr,
return 0;

printk("\n");
- printk("=========================================================\n");
- printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+ pr_warn("========================================================\n");
+ pr_warn("WARNING: possible irq lock inversion dependency detected\n");
print_kernel_ident();
- printk("---------------------------------------------------------\n");
+ pr_warn("--------------------------------------------------------\n");
printk("%s/%d just changed the state of lock:\n",
curr->comm, task_pid_nr(curr));
print_lock(this);
@@ -3170,10 +3170,10 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
return 0;

printk("\n");
- printk("==================================\n");
- printk("[ BUG: Nested lock was not taken ]\n");
+ pr_warn("==================================\n");
+ pr_warn("WARNING: Nested lock was not taken\n");
print_kernel_ident();
- printk("----------------------------------\n");
+ pr_warn("----------------------------------\n");

printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
print_lock(hlock);
@@ -3383,10 +3383,10 @@ print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: bad unlock balance detected! ]\n");
+ pr_warn("=====================================\n");
+ pr_warn("WARNING: bad unlock balance detected!\n");
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("-------------------------------------\n");
printk("%s/%d is trying to release lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -3880,10 +3880,10 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;

printk("\n");
- printk("=================================\n");
- printk("[ BUG: bad contention detected! ]\n");
+ pr_warn("=================================\n");
+ pr_warn("WARNING: bad contention detected!\n");
print_kernel_ident();
- printk("---------------------------------\n");
+ pr_warn("---------------------------------\n");
printk("%s/%d is trying to contend lock (",
curr->comm, task_pid_nr(curr));
print_lockdep_cache(lock);
@@ -4244,10 +4244,10 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
return;

printk("\n");
- printk("=========================\n");
- printk("[ BUG: held lock freed! ]\n");
+ pr_warn("=========================\n");
+ pr_warn("WARNING: held lock freed!\n");
print_kernel_ident();
- printk("-------------------------\n");
+ pr_warn("-------------------------\n");
printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
print_lock(hlock);
@@ -4302,11 +4302,11 @@ static void print_held_locks_bug(void)
return;

printk("\n");
- printk("=====================================\n");
- printk("[ BUG: %s/%d still has locks held! ]\n",
+ pr_warn("====================================\n");
+ pr_warn("WARNING: %s/%d still has locks held!\n",
current->comm, task_pid_nr(current));
print_kernel_ident();
- printk("-------------------------------------\n");
+ pr_warn("------------------------------------\n");
lockdep_print_held_locks(current);
printk("\nstack backtrace:\n");
dump_stack();
@@ -4371,7 +4371,7 @@ void debug_show_all_locks(void)
} while_each_thread(g, p);

printk("\n");
- printk("=============================================\n\n");
+ pr_warn("=============================================\n\n");

if (unlock)
read_unlock(&tasklist_lock);
@@ -4401,10 +4401,10 @@ asmlinkage __visible void lockdep_sys_exit(void)
if (!debug_locks_off())
return;
printk("\n");
- printk("================================================\n");
- printk("[ BUG: lock held when returning to user space! ]\n");
+ pr_warn("================================================\n");
+ pr_warn("WARNING: lock held when returning to user space!\n");
print_kernel_ident();
- printk("------------------------------------------------\n");
+ pr_warn("------------------------------------------------\n");
printk("%s/%d is leaving the kernel with locks still held!\n",
curr->comm, curr->pid);
lockdep_print_held_locks(curr);
@@ -4421,13 +4421,13 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
#endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
/* Note: the following can be executed concurrently, so be careful. */
printk("\n");
- pr_err("===============================\n");
- pr_err("[ ERR: suspicious RCU usage. ]\n");
+ pr_warn("=============================\n");
+ pr_warn("WARNING: suspicious RCU usage\n");
print_kernel_ident();
- pr_err("-------------------------------\n");
- pr_err("%s:%d %s!\n", file, line, s);
- pr_err("\nother info that might help us debug this:\n\n");
- pr_err("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+ pr_warn("-----------------------------\n");
+ printk("%s:%d %s!\n", file, line, s);
+ printk("\nother info that might help us debug this:\n\n");
+ printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
!rcu_lockdep_current_cpu_online()
? "RCU used illegally from offline CPU!\n"
: !rcu_is_watching()
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 97ee9df32e0f..db4f55211b04 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -102,10 +102,11 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
return;
}

- printk("\n============================================\n");
- printk( "[ BUG: circular locking deadlock detected! ]\n");
- printk("%s\n", print_tainted());
- printk( "--------------------------------------------\n");
+ pr_warn("\n");
+ pr_warn("============================================\n");
+ pr_warn("WARNING: circular locking deadlock detected!\n");
+ pr_warn("%s\n", print_tainted());
+ pr_warn("--------------------------------------------\n");
printk("%s/%d is deadlocking current task %s/%d\n\n",
task->comm, task_pid_nr(task),
current->comm, task_pid_nr(current));
--
2.5.2

2017-04-19 16:47:11

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 11/11] rcu: Fix typo in PER_RCU_NODE_PERIOD header comment

This commit just changes a "the the" to "the" to reduce repetition.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7c238604df18..b1679e8cc5ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -199,7 +199,7 @@ static const int gp_cleanup_delay;

/*
* Number of grace periods between delays, normalized by the duration of
- * the delay. The longer the the delay, the more the grace periods between
+ * the delay. The longer the delay, the more the grace periods between
* each delay. The reason for this normalization is that it means that,
* for non-zero delays, the overall slowdown of grace periods is constant
* regardless of the duration of the delay. This arrangement balances
--
2.5.2

2017-04-19 16:47:04

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 06/11] hlist_add_tail_rcu disable sparse warning

From: "Michael S. Tsirkin" <[email protected]>

sparse is unhappy about this code in hlist_add_tail_rcu:

struct hlist_node *i, *last = NULL;

for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
last = i;

This is because hlist_first_rcu and hlist_next_rcu return
__rcu pointers.

It's a false positive - it's a write side primitive and so
does not need to be called in a read side critical section.

The following trivial patch disables the warning
without changing the behaviour in any way.

Note: __hlist_for_each_rcu would also remove the warning, but it would be
confusing since it calls rcu_dereference and is designed to run in an RCU
read-side critical section.
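
For illustration only (hypothetical structure, list, and lock names, not
part of this patch), the intended usage is that writers call
hlist_add_tail_rcu() under their own exclusion while readers traverse
the list under rcu_read_lock():

struct foo {
	int key;
	struct hlist_node node;
};

static HLIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

/* Writer side: external locking, no RCU accessors needed to walk the list. */
static void foo_add(struct foo *new_foo)
{
	spin_lock(&foo_lock);
	hlist_add_tail_rcu(&new_foo->node, &foo_list);
	spin_unlock(&foo_lock);
}

/* Reader side: traversal is protected by RCU. */
static bool foo_present(int key)
{
	struct foo *f;
	bool found = false;

	rcu_read_lock();
	hlist_for_each_entry_rcu(f, &foo_list, node) {
		if (f->key == key) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}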

Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rculist.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 4f7a9561b8c4..b1fd8bf85fdc 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -509,7 +509,8 @@ static inline void hlist_add_tail_rcu(struct hlist_node *n,
{
struct hlist_node *i, *last = NULL;

- for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
+ /* Note: write side code, so rcu accessors are not needed. */
+ for (i = h->first; i; i = i->next)
last = i;

if (last) {
--
2.5.2

2017-04-19 16:47:01

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 08/11] torture: Use correct path for Kconfig fragment for duplicates

Currently, the rcutorture scripting will give an error message if
running a duplicate scenario that happens also to have a non-existent
build directory (b1, b2, ... in the rcutorture directory). Worse yet, if
the build directory has already been created and used for a real build,
the script will silently grab the wrong Kconfig fragment, which could
cause confusion to the poor sap (me) analyzing old test results. At
least the actual test runs correctly...

This commit therefore accesses the Kconfig fragment from the results
directory corresponding to the first of the duplicate scenarios, for
which a build was actually carried out. This prevents both the messages
and at least one form of later confusion.

Signed-off-by: Paul E. McKenney <[email protected]>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index ea6e373edc27..93eede4e8fbe 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -170,7 +170,7 @@ qemu_append="`identify_qemu_append "$QEMU"`"
# Pull in Kconfig-fragment boot parameters
boot_args="`configfrag_boot_params "$boot_args" "$config_template"`"
# Generate kernel-version-specific boot parameters
-boot_args="`per_version_boot_params "$boot_args" $builddir/.config $seconds`"
+boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`"

if test -n "$TORTURE_BUILDONLY"
then
--
2.5.2

2017-04-19 16:47:51

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 07/11] rcu: Improve comments for hotplug/suspend/hibernate functions

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bdaa69d23a8a..c4f195dd7c94 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3894,6 +3894,10 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * Invoked early in the CPU-online process, when pretty much all
+ * services are available. The incoming CPU is not present.
+ */
int rcutree_prepare_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3907,6 +3911,9 @@ int rcutree_prepare_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Update RCU priority boot kthread affinity for CPU-hotplug changes.
+ */
static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state_p->rda, cpu);
@@ -3914,6 +3921,10 @@ static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
rcu_boost_kthread_setaffinity(rdp->mynode, outgoing);
}

+/*
+ * Near the end of the CPU-online process. Pretty much all services
+ * enabled, and the CPU is now very much alive.
+ */
int rcutree_online_cpu(unsigned int cpu)
{
sync_sched_exp_online_cleanup(cpu);
@@ -3921,13 +3932,19 @@ int rcutree_online_cpu(unsigned int cpu)
return 0;
}

+/*
+ * Near the beginning of the process. The CPU is still very much alive
+ * with pretty much all services enabled.
+ */
int rcutree_offline_cpu(unsigned int cpu)
{
rcutree_affinity_setting(cpu, cpu);
return 0;
}

-
+/*
+ * Near the end of the offline process. We do only tracing here.
+ */
int rcutree_dying_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3937,6 +3954,9 @@ int rcutree_dying_cpu(unsigned int cpu)
return 0;
}

+/*
+ * The outgoing CPU is gone and we are running elsewhere.
+ */
int rcutree_dead_cpu(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -3954,6 +3974,10 @@ int rcutree_dead_cpu(unsigned int cpu)
* incoming CPUs are not allowed to use RCU read-side critical sections
* until this function is called. Failing to observe this restriction
* will result in lockdep splats.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the incoming CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
*/
void rcu_cpu_starting(unsigned int cpu)
{
@@ -3979,9 +4003,6 @@ void rcu_cpu_starting(unsigned int cpu)
* The CPU is exiting the idle loop into the arch_cpu_idle_dead()
* function. We now remove it from the rcu_node tree's ->qsmaskinit
* bit masks.
- * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
- * function. We now remove it from the rcu_node tree's ->qsmaskinit
- * bit masks.
*/
static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
{
@@ -3997,6 +4018,14 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}

+/*
+ * The outgoing function has no further need of RCU, so remove it from
+ * the list of CPUs that RCU must track.
+ *
+ * Note that this function is special in that it is invoked directly
+ * from the outgoing CPU rather than from the cpuhp_step mechanism.
+ * This is because this function must be invoked at a precise location.
+ */
void rcu_report_dead(unsigned int cpu)
{
struct rcu_state *rsp;
@@ -4011,6 +4040,10 @@ void rcu_report_dead(unsigned int cpu)
}
#endif

+/*
+ * On non-huge systems, use expedited RCU grace periods to make suspend
+ * and hibernation run faster.
+ */
static int rcu_pm_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
--
2.5.2

2017-04-19 16:48:22

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 05/11] rcu: Remove obsolete comment from rcu_future_gp_cleanup() header

The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.

Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 50fee7689e71..bdaa69d23a8a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1793,9 +1793,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,

/*
* Clean up any old requests for the just-ended grace period. Also return
- * whether any additional grace periods have been requested. Also invoke
- * rcu_nocb_gp_cleanup() in order to wake up any no-callbacks kthreads
- * waiting for this grace period to complete.
+ * whether any additional grace periods have been requested.
*/
static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
{
--
2.5.2

2017-04-19 16:57:19

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 04/11] rcu: Make RCU_FANOUT_LEAF help text more explicit about skew_tick

If you set RCU_FANOUT_LEAF too high, you can get lock contention
on the leaf-level rcu_node structures, which you can avoid by booting
with the skew_tick kernel parameter set. This commit therefore
upgrades the RCU_FANOUT_LEAF help text to state this explicitly.

Signed-off-by: Paul E. McKenney <[email protected]>
---
init/Kconfig | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a92f27da4a27..c549618c72f0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,11 +612,17 @@ config RCU_FANOUT_LEAF
initialization. These systems tend to run CPU-bound, and thus
are not helped by synchronized interrupts, and thus tend to
skew them, which reduces lock contention enough that large
- leaf-level fanouts work well.
+ leaf-level fanouts work well. That said, setting leaf-level
+ fanout to a large number will likely cause problematic
+ lock contention on the leaf-level rcu_node structures unless
+ you boot with the skew_tick kernel parameter.

Select a specific number if testing RCU itself.

- Select the maximum permissible value for large systems.
+ Select the maximum permissible value for large systems, but
+ please understand that you may also need to set the skew_tick
+ kernel boot parameter to avoid contention on the rcu_node
+ structure's locks.

Take the default if unsure.

--
2.5.2

2017-04-19 16:57:23

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v3 tip/core/rcu 01/11] mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.
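
For illustration only (hypothetical structure, field, and helper names,
not taken from any file touched below), the pattern that
SLAB_TYPESAFE_BY_RCU supports is a lockless lookup that takes a
reference and then revalidates identity, because the object may be
freed and immediately reallocated (though it never changes type) while
the reader holds rcu_read_lock():

struct foo {
	int key;
	atomic_t refcnt;
	struct hlist_node node;
};

/* Cache whose objects may be reused, but never change type. */
foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
			       SLAB_TYPESAFE_BY_RCU, NULL);

/* Lockless lookup: the slot may be reused, but always as a struct foo. */
rcu_read_lock();
hlist_for_each_entry_rcu(f, head, node) {
	if (f->key != key)
		continue;
	if (!atomic_inc_not_zero(&f->refcnt))
		continue;	/* object is being freed, skip it */
	if (f->key != key) {	/* reallocated under a different identity */
		foo_put(f);	/* hypothetical reference-release helper */
		continue;
	}
	break;			/* f is a valid, referenced match */
}
rcu_read_unlock();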

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes <[email protected]>
---
Documentation/RCU/00-INDEX | 2 +-
Documentation/RCU/rculist_nulls.txt | 6 +++---
Documentation/RCU/whatisRCU.txt | 3 ++-
drivers/gpu/drm/i915/i915_gem.c | 2 +-
drivers/gpu/drm/i915/i915_gem_request.h | 2 +-
drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c | 2 +-
fs/jbd2/journal.c | 2 +-
fs/signalfd.c | 2 +-
include/linux/dma-fence.h | 4 ++--
include/linux/slab.h | 6 ++++--
include/net/sock.h | 2 +-
kernel/fork.c | 4 ++--
kernel/signal.c | 2 +-
mm/kasan/kasan.c | 6 +++---
mm/kmemcheck.c | 2 +-
mm/rmap.c | 4 ++--
mm/slab.c | 6 +++---
mm/slab.h | 4 ++--
mm/slab_common.c | 6 +++---
mm/slob.c | 6 +++---
mm/slub.c | 12 ++++++------
net/dccp/ipv4.c | 2 +-
net/dccp/ipv6.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
net/llc/af_llc.c | 2 +-
net/llc/llc_conn.c | 4 ++--
net/llc/llc_sap.c | 2 +-
net/netfilter/nf_conntrack_core.c | 8 ++++----
net/smc/af_smc.c | 2 +-
30 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index f773a264ae02..1672573b037a 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -17,7 +17,7 @@ rcu_dereference.txt
rcubarrier.txt
- RCU and Unloadable Modules
rculist_nulls.txt
- - RCU list primitives for use with SLAB_DESTROY_BY_RCU
+ - RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
rcuref.txt
- Reference-count design for elements of lists/arrays protected by RCU
rcu.txt
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 18f9651ff23d..8151f0195f76 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -1,5 +1,5 @@
Using hlist_nulls to protect read-mostly linked lists and
-objects using SLAB_DESTROY_BY_RCU allocations.
+objects using SLAB_TYPESAFE_BY_RCU allocations.

Please read the basics in Documentation/RCU/listRCU.txt

@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
to solve following problem :

A typical RCU linked list managing objects which are
-allocated with SLAB_DESTROY_BY_RCU kmem_cache can
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :

1) Lookup algo
@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
-But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
+But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)

if (put_last_reference_on(obj) {
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5cbd8b2395b8..91c912e86915 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -925,7 +925,8 @@ d. Do you need RCU grace periods to complete even in the face

e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
- If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
+ If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
+ named SLAB_DESTROY_BY_RCU). But please be careful!

f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6908123162d1..3b668895ac24 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4552,7 +4552,7 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
dev_priv->requests = KMEM_CACHE(drm_i915_gem_request,
SLAB_HWCACHE_ALIGN |
SLAB_RECLAIM_ACCOUNT |
- SLAB_DESTROY_BY_RCU);
+ SLAB_TYPESAFE_BY_RCU);
if (!dev_priv->requests)
goto err_vmas;

diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index ea511f06efaf..9ee2750e1dde 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -493,7 +493,7 @@ static inline struct drm_i915_gem_request *
__i915_gem_active_get_rcu(const struct i915_gem_active *active)
{
/* Performing a lockless retrieval of the active request is super
- * tricky. SLAB_DESTROY_BY_RCU merely guarantees that the backing
+ * tricky. SLAB_TYPESAFE_BY_RCU merely guarantees that the backing
* slab of request objects will not be freed whilst we hold the
* RCU read lock. It does not guarantee that the request itself
* will not be freed and then *reused*. Viz,
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
index 12647af5a336..e7fb47e84a93 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lockd.c
@@ -1071,7 +1071,7 @@ int ldlm_init(void)
ldlm_lock_slab = kmem_cache_create("ldlm_locks",
sizeof(struct ldlm_lock), 0,
SLAB_HWCACHE_ALIGN |
- SLAB_DESTROY_BY_RCU, NULL);
+ SLAB_TYPESAFE_BY_RCU, NULL);
if (!ldlm_lock_slab) {
kmem_cache_destroy(ldlm_resource_slab);
return -ENOMEM;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..7f8f962454e5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2340,7 +2340,7 @@ static int jbd2_journal_init_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
- SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
+ SLAB_TEMPORARY | SLAB_TYPESAFE_BY_RCU,
NULL); /* ctor */
retval = 0;
if (!jbd2_journal_head_cache) {
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 270221fcef42..7e3d71109f51 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -38,7 +38,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
/*
* The lockless check can race with remove_wait_queue() in progress,
* but in this case its caller should run under rcu_read_lock() and
- * sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU, we can safely return.
*/
if (likely(!waitqueue_active(wqh)))
return;
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 6048fa404e57..a5195a7d6f77 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -229,7 +229,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence)
*
* Function returns NULL if no refcount could be obtained, or the fence.
* This function handles acquiring a reference to a fence that may be
- * reallocated within the RCU grace period (such as with SLAB_DESTROY_BY_RCU),
+ * reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU),
* so long as the caller is using RCU on the pointer to the fence.
*
* An alternative mechanism is to employ a seqlock to protect a bunch of
@@ -257,7 +257,7 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
* have successfully acquire a reference to it. If it no
* longer matches, we are holding a reference to some other
* reallocated pointer. This is possible if the allocator
- * is using a freelist like SLAB_DESTROY_BY_RCU where the
+ * is using a freelist like SLAB_TYPESAFE_BY_RCU where the
* fence remains valid for the RCU grace period, but it
* may be reallocated. When using such allocators, we are
* responsible for ensuring the reference we get is to
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3c37a8c51921..04a7f7993e67 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -28,7 +28,7 @@
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
/*
- * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
+ * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
*
* This delays freeing the SLAB page by a grace period, it does _NOT_
* delay object freeing. This means that if you do kmem_cache_free()
@@ -61,8 +61,10 @@
*
* rcu_read_lock before reading the address, then rcu_read_unlock after
* taking the spinlock within the structure expected at that address.
+ *
+ * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
*/
-#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
+#define SLAB_TYPESAFE_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */

diff --git a/include/net/sock.h b/include/net/sock.h
index 5e5997654db6..59cdccaa30e7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -993,7 +993,7 @@ struct smc_hashinfo;
struct module;

/*
- * caches using SLAB_DESTROY_BY_RCU should let .next pointer from nulls nodes
+ * caches using SLAB_TYPESAFE_BY_RCU should let .next pointer from nulls nodes
* un-modified. Special care is taken when initializing object to zero.
*/
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80e93d..9330ce24f1bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1313,7 +1313,7 @@ void __cleanup_sighand(struct sighand_struct *sighand)
if (atomic_dec_and_test(&sighand->count)) {
signalfd_cleanup(sighand);
/*
- * sighand_cachep is SLAB_DESTROY_BY_RCU so we can free it
+ * sighand_cachep is SLAB_TYPESAFE_BY_RCU so we can free it
* without an RCU grace period, see __lock_task_sighand().
*/
kmem_cache_free(sighand_cachep, sighand);
@@ -2144,7 +2144,7 @@ void __init proc_caches_init(void)
{
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU|
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
SLAB_NOTRACK|SLAB_ACCOUNT, sighand_ctor);
signal_cachep = kmem_cache_create("signal_cache",
sizeof(struct signal_struct), 0,
diff --git a/kernel/signal.c b/kernel/signal.c
index 7e59ebc2c25e..6df5f72158e4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1237,7 +1237,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
}
/*
* This sighand can be already freed and even reused, but
- * we rely on SLAB_DESTROY_BY_RCU and sighand_ctor() which
+ * we rely on SLAB_TYPESAFE_BY_RCU and sighand_ctor() which
* initializes ->siglock: this slab can't go away, it has
* the same object type, ->siglock can't be reinitialized.
*
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 98b27195e38b..4b20061102f6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -413,7 +413,7 @@ void kasan_cache_create(struct kmem_cache *cache, size_t *size,
*size += sizeof(struct kasan_alloc_meta);

/* Add free meta. */
- if (cache->flags & SLAB_DESTROY_BY_RCU || cache->ctor ||
+ if (cache->flags & SLAB_TYPESAFE_BY_RCU || cache->ctor ||
cache->object_size < sizeof(struct kasan_free_meta)) {
cache->kasan_info.free_meta_offset = *size;
*size += sizeof(struct kasan_free_meta);
@@ -561,7 +561,7 @@ static void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE);

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return;

kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
@@ -572,7 +572,7 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object)
s8 shadow_byte;

/* RCU slabs could be legally used after free within the RCU period */
- if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;

shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
diff --git a/mm/kmemcheck.c b/mm/kmemcheck.c
index 5bf191756a4a..2d5959c5f7c5 100644
--- a/mm/kmemcheck.c
+++ b/mm/kmemcheck.c
@@ -95,7 +95,7 @@ void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object,
void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size)
{
/* TODO: RCU freeing is unsupported for now; hide false positives. */
- if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (!s->ctor && !(s->flags & SLAB_TYPESAFE_BY_RCU))
kmemcheck_mark_freed(object, size);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ed681ccc7b..8ffd59df8a3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -430,7 +430,7 @@ static void anon_vma_ctor(void *data)
void __init anon_vma_init(void)
{
anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
- 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
+ 0, SLAB_TYPESAFE_BY_RCU|SLAB_PANIC|SLAB_ACCOUNT,
anon_vma_ctor);
anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain,
SLAB_PANIC|SLAB_ACCOUNT);
@@ -481,7 +481,7 @@ struct anon_vma *page_get_anon_vma(struct page *page)
* If this page is still mapped, then its anon_vma cannot have been
* freed. But if it has been unmapped, we have no security against the
* anon_vma structure being freed and reused (for another anon_vma:
- * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero()
+ * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
* above cannot corrupt).
*/
if (!page_mapped(page)) {
diff --git a/mm/slab.c b/mm/slab.c
index 807d86c76908..93c827864862 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1728,7 +1728,7 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)

freelist = page->freelist;
slab_destroy_debugcheck(cachep, page);
- if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
+ if (unlikely(cachep->flags & SLAB_TYPESAFE_BY_RCU))
call_rcu(&page->rcu_head, kmem_rcu_free);
else
kmem_freepages(cachep, page);
@@ -1924,7 +1924,7 @@ static bool set_objfreelist_slab_cache(struct kmem_cache *cachep,

cachep->num = 0;

- if (cachep->ctor || flags & SLAB_DESTROY_BY_RCU)
+ if (cachep->ctor || flags & SLAB_TYPESAFE_BY_RCU)
return false;

left = calculate_slab_order(cachep, size,
@@ -2030,7 +2030,7 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
2 * sizeof(unsigned long long)))
flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
- if (!(flags & SLAB_DESTROY_BY_RCU))
+ if (!(flags & SLAB_TYPESAFE_BY_RCU))
flags |= SLAB_POISON;
#endif
#endif
diff --git a/mm/slab.h b/mm/slab.h
index 65e7c3fcac72..9cfcf099709c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -126,7 +126,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,

/* Legal flag mask for kmem_cache_create(), for various configurations */
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
- SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
+ SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS )

#if defined(CONFIG_DEBUG_SLAB)
#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
@@ -415,7 +415,7 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
* back there or track user information then we can
* only use the space before that information.
*/
- if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ if (s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_STORE_USER))
return s->inuse;
/*
* Else we can use all the padding etc for the allocation
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 09d0e849b07f..01a0fe2eb332 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -39,7 +39,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
* Set of flags that will prevent slab merging
*/
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
- SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
+ SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
SLAB_FAILSLAB | SLAB_KASAN)

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
@@ -500,7 +500,7 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
struct kmem_cache *s, *s2;

/*
- * On destruction, SLAB_DESTROY_BY_RCU kmem_caches are put on the
+ * On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the
* @slab_caches_to_rcu_destroy list. The slab pages are freed
* through RCU and and the associated kmem_cache are dereferenced
* while freeing the pages, so the kmem_caches should be freed only
@@ -537,7 +537,7 @@ static int shutdown_cache(struct kmem_cache *s)
memcg_unlink_cache(s);
list_del(&s->list);

- if (s->flags & SLAB_DESTROY_BY_RCU) {
+ if (s->flags & SLAB_TYPESAFE_BY_RCU) {
list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
schedule_work(&slab_caches_to_rcu_destroy_work);
} else {
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4357ec..1bae78d71096 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -126,7 +126,7 @@ static inline void clear_slob_page_free(struct page *sp)

/*
* struct slob_rcu is inserted at the tail of allocated slob blocks, which
- * were created with a SLAB_DESTROY_BY_RCU slab. slob_rcu is used to free
+ * were created with a SLAB_TYPESAFE_BY_RCU slab. slob_rcu is used to free
* the block using call_rcu.
*/
struct slob_rcu {
@@ -524,7 +524,7 @@ EXPORT_SYMBOL(ksize);

int __kmem_cache_create(struct kmem_cache *c, unsigned long flags)
{
- if (flags & SLAB_DESTROY_BY_RCU) {
+ if (flags & SLAB_TYPESAFE_BY_RCU) {
/* leave room for rcu footer at the end of object */
c->size += sizeof(struct slob_rcu);
}
@@ -598,7 +598,7 @@ static void kmem_rcu_free(struct rcu_head *head)
void kmem_cache_free(struct kmem_cache *c, void *b)
{
kmemleak_free_recursive(b, c->flags);
- if (unlikely(c->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(c->flags & SLAB_TYPESAFE_BY_RCU)) {
struct slob_rcu *slob_rcu;
slob_rcu = b + (c->size - sizeof(struct slob_rcu));
slob_rcu->size = c->size;
diff --git a/mm/slub.c b/mm/slub.c
index 7f4bc7027ed5..57e5156f02be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1687,7 +1687,7 @@ static void rcu_free_slab(struct rcu_head *h)

static void free_slab(struct kmem_cache *s, struct page *page)
{
- if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
+ if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
struct rcu_head *head;

if (need_reserve_slab_rcu) {
@@ -2963,7 +2963,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
* slab_free_freelist_hook() could have put the items into quarantine.
* If so, no need to free them.
*/
- if (s->flags & SLAB_KASAN && !(s->flags & SLAB_DESTROY_BY_RCU))
+ if (s->flags & SLAB_KASAN && !(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
do_slab_free(s, page, head, tail, cnt, addr);
}
@@ -3433,7 +3433,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
* the slab may touch the object after free or before allocation
* then we should never poison the object itself.
*/
- if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
+ if ((flags & SLAB_POISON) && !(flags & SLAB_TYPESAFE_BY_RCU) &&
!s->ctor)
s->flags |= __OBJECT_POISON;
else
@@ -3455,7 +3455,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
*/
s->inuse = size;

- if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
+ if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
s->ctor)) {
/*
* Relocate free pointer after the object if it is not
@@ -3537,7 +3537,7 @@ static int kmem_cache_open(struct kmem_cache *s, unsigned long flags)
s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
s->reserved = 0;

- if (need_reserve_slab_rcu && (s->flags & SLAB_DESTROY_BY_RCU))
+ if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU))
s->reserved = sizeof(struct rcu_head);

if (!calculate_sizes(s, -1))
@@ -5042,7 +5042,7 @@ SLAB_ATTR_RO(cache_dma);

static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU));
}
SLAB_ATTR_RO(destroy_by_rcu);

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 409d0cfd3447..90210a0e3888 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -950,7 +950,7 @@ static struct proto dccp_v4_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp_request_sock_ops,
.twsk_prot = &dccp_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 233b57367758..b4019a5e4551 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1012,7 +1012,7 @@ static struct proto dccp_v6_prot = {
.orphan_count = &dccp_orphan_count,
.max_header = MAX_DCCP_HEADER,
.obj_size = sizeof(struct dccp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.rsk_prot = &dccp6_request_sock_ops,
.twsk_prot = &dccp6_timewait_sock_ops,
.h.hashinfo = &dccp_hashinfo,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9a89b8deafae..82c89abeb989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2398,7 +2398,7 @@ struct proto tcp_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp_timewait_sock_ops,
.rsk_prot = &tcp_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 60a5295a7de6..bdbc4327ebee 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1919,7 +1919,7 @@ struct proto tcpv6_prot = {
.sysctl_rmem = sysctl_tcp_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp6_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
.twsk_prot = &tcp6_timewait_sock_ops,
.rsk_prot = &tcp6_request_sock_ops,
.h.hashinfo = &tcp_hashinfo,
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 06186d608a27..d096ca563054 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -142,7 +142,7 @@ static struct proto llc_proto = {
.name = "LLC",
.owner = THIS_MODULE,
.obj_size = sizeof(struct llc_sock),
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};

/**
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 8bc5a1bd2d45..9b02c13d258b 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -506,7 +506,7 @@ static struct sock *__llc_lookup_established(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_estab_match(sap, daddr, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
@@ -565,7 +565,7 @@ static struct sock *__llc_lookup_listener(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_listener_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/llc/llc_sap.c b/net/llc/llc_sap.c
index 5404d0d195cc..63b6ab056370 100644
--- a/net/llc/llc_sap.c
+++ b/net/llc/llc_sap.c
@@ -328,7 +328,7 @@ static struct sock *llc_lookup_dgram(struct llc_sap *sap,
again:
sk_nulls_for_each_rcu(rc, node, laddr_hb) {
if (llc_dgram_match(sap, laddr, rc)) {
- /* Extra checks required by SLAB_DESTROY_BY_RCU */
+ /* Extra checks required by SLAB_TYPESAFE_BY_RCU */
if (unlikely(!atomic_inc_not_zero(&rc->sk_refcnt)))
goto again;
if (unlikely(llc_sk(rc)->sap != sap ||
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 071b97fcbefb..fdcdac7916b2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -914,7 +914,7 @@ static unsigned int early_drop_list(struct net *net,
continue;

/* kill only if still in same netns -- might have moved due to
- * SLAB_DESTROY_BY_RCU rules.
+ * SLAB_TYPESAFE_BY_RCU rules.
*
* We steal the timer reference. If that fails timer has
* already fired or someone else deleted it. Just drop ref
@@ -1069,7 +1069,7 @@ __nf_conntrack_alloc(struct net *net,

/*
* Do not use kmem_cache_zalloc(), as this cache uses
- * SLAB_DESTROY_BY_RCU.
+ * SLAB_TYPESAFE_BY_RCU.
*/
ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
if (ct == NULL)
@@ -1114,7 +1114,7 @@ void nf_conntrack_free(struct nf_conn *ct)
struct net *net = nf_ct_net(ct);

/* A freed object has refcnt == 0, that's
- * the golden rule for SLAB_DESTROY_BY_RCU
+ * the golden rule for SLAB_TYPESAFE_BY_RCU
*/
NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);

@@ -1878,7 +1878,7 @@ int nf_conntrack_init_start(void)
nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
sizeof(struct nf_conn),
NFCT_INFOMASK + 1,
- SLAB_DESTROY_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
+ SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN, NULL);
if (!nf_conntrack_cachep)
goto err_cachep;

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 85837ab90e89..d34bbd6d8f38 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -101,7 +101,7 @@ struct proto smc_proto = {
.unhash = smc_unhash_sk,
.obj_size = sizeof(struct smc_sock),
.h.smc_hash = &smc_v4_hashinfo,
- .slab_flags = SLAB_DESTROY_BY_RCU,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
};
EXPORT_SYMBOL_GPL(smc_proto);

--
2.5.2

2017-04-19 23:24:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 07:51:36PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:39:51AM -0700, Paul E. McKenney wrote:
>
> > Well, if there are no objections, I will fix up the smp_mb__before_atomic()
> > and smp_mb__after_atomic() pieces.
>
> Feel free.

How about if I add this in the atomic_ops.txt description of these
two primitives?

Preceding a non-value-returning read-modify-write atomic
operation with smp_mb__before_atomic() and following it with
smp_mb__after_atomic() provides the same full ordering that is
provided by value-returning read-modify-write atomic operations.
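
For instance (an illustrative sketch, not proposed atomic_ops.txt text),
wrapping a non-value-returning atomic this way orders the surrounding
accesses just as a value-returning atomic would:

	WRITE_ONCE(x, 1);		/* ordered before the atomic_inc() */
	smp_mb__before_atomic();
	atomic_inc(&counter);		/* non-value-returning RMW */
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);		/* ordered after the atomic_inc() */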

> > I suppose that one alternative is the new variant of kerneldoc, though
> > very few of these functions have comment headers, let alone kerneldoc
> > headers. Which reminds me, the question of spin_unlock_wait() and
> > spin_is_locked() semantics came up a bit ago. Here is what I believe
> > to be the case. Does this match others' expectations?
> >
> > o spin_unlock_wait() semantics:
> >
> > 1. Any access in any critical section prior to the
> > spin_unlock_wait() is visible to all code following
> > (in program order) the spin_unlock_wait().
> >
> > 2. Any access prior (in program order) to the
> > spin_unlock_wait() is visible to any critical
> > section following the spin_unlock_wait().
> >
> > o spin_is_locked() semantics: Half of spin_unlock_wait(),
> > but only if it returns false:
> >
> > 1. Any access in any critical section prior to the
> > spin_unlock_wait() is visible to all code following
> > (in program order) the spin_unlock_wait().
>
> Urgh.. yes those are pain. The best advise is to not use them.
>
> 055ce0fd1b86 ("locking/qspinlock: Add comments")

Ah, I must confess that I missed that one. Would you be OK with the
following patch, which adds a docbook header comment for both of them?

Thanx, Paul

------------------------------------------------------------------------

commit 5789953adc360b4d3685dc89513655e6bfb83980
Author: Paul E. McKenney <[email protected]>
Date: Wed Apr 19 16:20:07 2017 -0700

atomics: Add header comment for spin_unlock_wait() and spin_is_locked()

There is material describing the ordering guarantees provided by
spin_unlock_wait() and spin_is_locked(), but it is not necessarily
easy to find. This commit therefore adds a docbook header comment
to both functions informally describing their semantics.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 59248dcc6ef3..2647dc7f3ea9 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -369,11 +369,49 @@ static __always_inline int spin_trylock_irq(spinlock_t *lock)
raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
})

+/**
+ * spin_unlock_wait - Interpose between successive critical sections
+ * @lock: the spinlock whose critical sections are to be interposed.
+ *
+ * Semantically this is equivalent to a spin_lock() immediately
+ * followed by a spin_unlock(). However, most architectures have
+ * more efficient implementations in which the spin_unlock_wait()
+ * cannot block concurrent lock acquisition, and in some cases
+ * where spin_unlock_wait() does not write to the lock variable.
+ * Nevertheless, spin_unlock_wait() can have high overhead, so if
+ * you feel the need to use it, please check to see if there is
+ * a better way to get your job done.
+ *
+ * The ordering guarantees provided by spin_unlock_wait() are:
+ *
+ * 1. All accesses preceding the spin_unlock_wait() happen before
+ * any accesses in later critical sections for this same lock.
+ * 2. All accesses following the spin_unlock_wait() happen after
+ * any accesses in earlier critical sections for this same lock.
+ */
static __always_inline void spin_unlock_wait(spinlock_t *lock)
{
raw_spin_unlock_wait(&lock->rlock);
}

+/**
+ * spin_is_locked - Conditionally interpose after prior critical sections
+ * @lock: the spinlock whose critical sections are to be interposed.
+ *
+ * Semantically this is equivalent to a spin_trylock(), and, if
+ * the spin_trylock() succeeds, immediately followed by a (mythical)
+ * spin_unlock_relaxed(). The return value from spin_trylock() is returned
+ * by spin_is_locked(). Note that all current architectures have extremely
+ * efficient implementations in which the spin_is_locked() does not even
+ * write to the lock variable.
+ *
+ * A successful spin_is_locked() primitive in some sense "takes its place"
+ * after some critical section for the lock in question. Any accesses
+ * following a successful spin_is_locked() call will therefore happen
+ * after any accesses by any of the preceding critical section for that
+ * same lock. Note however, that spin_is_locked() provides absolutely no
+ * ordering guarantees for code preceding the call to that spin_is_locked().
+ */
static __always_inline int spin_is_locked(spinlock_t *lock)
{
return raw_spin_is_locked(&lock->rlock);

2017-04-19 23:25:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 13, 2017 at 07:59:07PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 07:51:36PM +0200, Peter Zijlstra wrote:
>
> > > I suppose that one alternative is the new variant of kerneldoc, though
> > > very few of these functions have comment headers, let alone kerneldoc
> > > headers. Which reminds me, the question of spin_unlock_wait() and
> > > spin_is_locked() semantics came up a bit ago. Here is what I believe
> > > to be the case. Does this match others' expectations?
> > >
> > > o spin_unlock_wait() semantics:
> > >
> > > 1. Any access in any critical section prior to the
> > > spin_unlock_wait() is visible to all code following
> > > (in program order) the spin_unlock_wait().
> > >
> > > 2. Any access prior (in program order) to the
> > > spin_unlock_wait() is visible to any critical
> > > section following the spin_unlock_wait().
> > >
> > > o spin_is_locked() semantics: Half of spin_unlock_wait(),
> > > but only if it returns false:
> > >
> > > 1. Any access in any critical section prior to the
> > > spin_unlock_wait() is visible to all code following
> > > (in program order) the spin_unlock_wait().
> >
> > Urgh.. yes those are pain. The best advise is to not use them.
> >
> > 055ce0fd1b86 ("locking/qspinlock: Add comments")
>
> The big problem with spin_unlock_wait(), aside from the icky barrier
> semantics, is that it tends to end up prone to starvation. So where
> spin_lock()+spin_unlock() have guaranteed fwd progress if the lock is
> fair (ticket,queued,etc..) spin_unlock_wait() must often lack that
> guarantee.
>
> Equally, spin_unlock_wait() was intended to be 'cheap' and be a
> read-only loop, but in order to satisfy the barrier requirements, it
> ends up doing stores anyway (see for example the arm64 and ppc
> implementations).

Good points, and my proposed patch includes verbiage urging the use
of something else to get the job done. Does that work?

Thanx, Paul

2017-04-20 11:18:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Wed, Apr 19, 2017 at 04:23:52PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 13, 2017 at 07:51:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 13, 2017 at 10:39:51AM -0700, Paul E. McKenney wrote:
> >
> > > Well, if there are no objections, I will fix up the smp_mb__before_atomic()
> > > and smp_mb__after_atomic() pieces.
> >
> > Feel free.
>
> How about if I add this in the atomic_ops.txt description of these
> two primitives?
>
> Preceding a non-value-returning read-modify-write atomic
> operation with smp_mb__before_atomic() and following it with
> smp_mb__after_atomic() provides the same full ordering that is
> provided by value-returning read-modify-write atomic operations.

That seems correct. It also already seems a direct implication of the
extant text though. But as you're wont to say, people need repetition
and pointing out the obvious etc..

The way I read that document, specifically this:

"For example, smp_mb__before_atomic() can be used like so:

obj->dead = 1;
smp_mb__before_atomic();
atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation."

Leaves no question that these operations must be full barriers.

And therefore, your paragraph that basically states that:

smp_mb__before_atomic();
atomic_inc_return_relaxed();
smp_mb__after_atomic();

equals:

atomic_inc_return();

is implied, no?

> commit 5789953adc360b4d3685dc89513655e6bfb83980
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Apr 19 16:20:07 2017 -0700
>
> atomics: Add header comment for spin_unlock_wait() and spin_is_locked()
>
> There is material describing the ordering guarantees provided by
> spin_unlock_wait() and spin_is_locked(), but it is not necessarily
> easy to find. This commit therefore adds a docbook header comment
> to both functions informally describing their semantics.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 59248dcc6ef3..2647dc7f3ea9 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -369,11 +369,49 @@ static __always_inline int spin_trylock_irq(spinlock_t *lock)
> raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
> })
>
> +/**
> + * spin_unlock_wait - Interpose between successive critical sections
> + * @lock: the spinlock whose critical sections are to be interposed.
> + *
> + * Semantically this is equivalent to a spin_lock() immediately
> + * followed by a spin_unlock(). However, most architectures have
> + * more efficient implementations in which the spin_unlock_wait()
> + * cannot block concurrent lock acquisition, and in some cases
> + * where spin_unlock_wait() does not write to the lock variable.
> + * Nevertheless, spin_unlock_wait() can have high overhead, so if
> + * you feel the need to use it, please check to see if there is
> + * a better way to get your job done.
> + *
> + * The ordering guarantees provided by spin_unlock_wait() are:
> + *
> + * 1. All accesses preceding the spin_unlock_wait() happen before
> + * any accesses in later critical sections for this same lock.
> + * 2. All accesses following the spin_unlock_wait() happen after
> + * any accesses in earlier critical sections for this same lock.
> + */
> static __always_inline void spin_unlock_wait(spinlock_t *lock)
> {
> raw_spin_unlock_wait(&lock->rlock);
> }

ACK

>
> +/**
> + * spin_is_locked - Conditionally interpose after prior critical sections
> + * @lock: the spinlock whose critical sections are to be interposed.
> + *
> + * Semantically this is equivalent to a spin_trylock(), and, if
> + * the spin_trylock() succeeds, immediately followed by a (mythical)
> + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
> + * by spin_is_locked(). Note that all current architectures have extremely
> + * efficient implementations in which the spin_is_locked() does not even
> + * write to the lock variable.
> + *
> + * A successful spin_is_locked() primitive in some sense "takes its place"
> + * after some critical section for the lock in question. Any accesses
> + * following a successful spin_is_locked() call will therefore happen
> + * after any accesses by any of the preceding critical section for that
> + * same lock. Note however, that spin_is_locked() provides absolutely no
> + * ordering guarantees for code preceding the call to that spin_is_locked().
> + */
> static __always_inline int spin_is_locked(spinlock_t *lock)
> {
> return raw_spin_is_locked(&lock->rlock);

I'm currently confused on this one. The case listed in the qspinlock code
doesn't appear to exist in the kernel anymore (or at least, I'm having
trouble finding it).

That said, I'm also not sure spin_is_locked() provides an acquire, as
that comment has an explicit smp_acquire__after_ctrl_dep();

2017-04-20 15:05:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 20, 2017 at 01:17:43PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 19, 2017 at 04:23:52PM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 13, 2017 at 07:51:36PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 13, 2017 at 10:39:51AM -0700, Paul E. McKenney wrote:
> > >
> > > > Well, if there are no objections, I will fix up the smp_mb__before_atomic()
> > > > and smp_mb__after_atomic() pieces.
> > >
> > > Feel free.
> >
> > How about if I add this in the atomic_ops.txt description of these
> > two primitives?
> >
> > Preceding a non-value-returning read-modify-write atomic
> > operation with smp_mb__before_atomic() and following it with
> > smp_mb__after_atomic() provides the same full ordering that is
> > provided by value-returning read-modify-write atomic operations.
>
> That seems correct. It also already seems a direct implication of the
> extant text though. But as you're wont to say, people need repetition
> and pointing out the obvious etc..

Especially given that it never is obvious until you understand it.
At which point you don't need the documentation. Therefore, documentation
is mostly useful to people who are missing a few pieces of the overall
puzzle. Which we all were at some time in the past. ;-)

> The way I read that document, specifically this:
>
> "For example, smp_mb__before_atomic() can be used like so:
>
> obj->dead = 1;
> smp_mb__before_atomic();
> atomic_dec(&obj->ref_count);
>
> It makes sure that all memory operations preceding the atomic_dec()
> call are strongly ordered with respect to the atomic counter
> operation."
>
> Leaves no question that these operations must be full barriers.
>
> And therefore, your paragraph that basically states that:
>
> smp_mb__before_atomic();
> atomic_inc_return_relaxed();
> smp_mb__after_atomic();
>
> equals:
>
> atomic_inc_return();
>
> is implied, no?

That is a reasonable argument, but some very intelligent people didn't
make that leap when reading it, so more redundancy appears to be needed.
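
To make that equivalence concrete, here is a minimal sketch along the
lines of the atomic_ops.txt example quoted above (the struct and
function names are illustrative only, not proposed kernel code):

	struct obj {
		int dead;
		atomic_t ref_count;
	};

	static void obj_mark_dead_and_put(struct obj *obj)
	{
		obj->dead = 1;
		smp_mb__before_atomic();	/* Order the ->dead store before the dec. */
		atomic_dec(&obj->ref_count);	/* Non-value-returning RMW atomic. */
		smp_mb__after_atomic();		/* Order the dec before later accesses. */
	}

	/* Provides the same ordering: value-returning RMW atomics are
	 * fully ordered all by themselves. */
	static void obj_mark_dead_and_put_alt(struct obj *obj)
	{
		obj->dead = 1;
		(void)atomic_dec_return(&obj->ref_count);
	}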

> > commit 5789953adc360b4d3685dc89513655e6bfb83980
> > Author: Paul E. McKenney <[email protected]>
> > Date: Wed Apr 19 16:20:07 2017 -0700
> >
> > atomics: Add header comment to spin_unlock_wait() and spin_is_locked()
> >
> > There is material describing the ordering guarantees provided by
> > spin_unlock_wait() and spin_is_locked(), but it is not necessarily
> > easy to find. This commit therefore adds a docbook header comment
> > to both functions informally describing their semantics.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> > index 59248dcc6ef3..2647dc7f3ea9 100644
> > --- a/include/linux/spinlock.h
> > +++ b/include/linux/spinlock.h
> > @@ -369,11 +369,49 @@ static __always_inline int spin_trylock_irq(spinlock_t *lock)
> > raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
> > })
> >
> > +/**
> > + * spin_unlock_wait - Interpose between successive critical sections
> > + * @lock: the spinlock whose critical sections are to be interposed.
> > + *
> > + * Semantically this is equivalent to a spin_lock() immediately
> > + * followed by a spin_unlock(). However, most architectures have
> > + * more efficient implementations in which the spin_unlock_wait()
> > + * cannot block concurrent lock acquisition, and in some cases
> > + * where spin_unlock_wait() does not write to the lock variable.
> > + * Nevertheless, spin_unlock_wait() can have high overhead, so if
> > + * you feel the need to use it, please check to see if there is
> > + * a better way to get your job done.
> > + *
> > + * The ordering guarantees provided by spin_unlock_wait() are:
> > + *
> > + * 1. All accesses preceding the spin_unlock_wait() happen before
> > + * any accesses in later critical sections for this same lock.
> > + * 2. All accesses following the spin_unlock_wait() happen after
> > + * any accesses in earlier critical sections for this same lock.
> > + */
> > static __always_inline void spin_unlock_wait(spinlock_t *lock)
> > {
> > raw_spin_unlock_wait(&lock->rlock);
> > }
>
> ACK

Very good, adding your Acked-by.
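
As a purely hypothetical illustration of the two guarantees in that
header comment, a shutdown path might use spin_unlock_wait() like the
sketch below; the structure and field names are made up, and the sketch
assumes the ordering described in the comment:

	struct dev_state {
		spinlock_t lock;
		bool going_away;
	};

	static void dev_do_io(struct dev_state *st)
	{
		spin_lock(&st->lock);
		if (!st->going_away) {
			/* ... issue I/O ... */
		}
		spin_unlock(&st->lock);
	}

	static void dev_shutdown(struct dev_state *st)
	{
		WRITE_ONCE(st->going_away, true);

		/* Guarantee 1: the store above happens before accesses in
		 * later critical sections for ->lock.  Guarantee 2: the
		 * teardown below happens after accesses in earlier critical
		 * sections for ->lock. */
		spin_unlock_wait(&st->lock);

		/* ... tear down state used under ->lock ... */
	}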

> > +/**
> > + * spin_is_locked - Conditionally interpose after prior critical sections
> > + * @lock: the spinlock whose critical sections are to be interposed.
> > + *
> > + * Semantically this is equivalent to a spin_trylock(), and, if
> > + * the spin_trylock() succeeds, immediately followed by a (mythical)
> > + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
> > + * by spin_is_locked(). Note that all current architectures have extremely
> > + * efficient implementations in which the spin_is_locked() does not even
> > + * write to the lock variable.
> > + *
> > + * A successful spin_is_locked() primitive in some sense "takes its place"
> > + * after some critical section for the lock in question. Any accesses
> > + * following a successful spin_is_locked() call will therefore happen
> > + * after any accesses by any of the preceding critical section for that
> > + * same lock. Note however, that spin_is_locked() provides absolutely no
> > + * ordering guarantees for code preceding the call to that spin_is_locked().
> > + */
> > static __always_inline int spin_is_locked(spinlock_t *lock)
> > {
> > return raw_spin_is_locked(&lock->rlock);
>
> I'm currently confused on this one. The case listed in the qspinlock code
> doesn't appear to exist in the kernel anymore (or at least, I'm having
> trouble finding it).
>
> That said, I'm also not sure spin_is_locked() provides an acquire, as
> that comment has an explicit smp_acquire__after_ctrl_dep();

OK, I have dropped this portion of the patch for the moment.

Going forward, exactly what semantics do you believe spin_is_locked()
provides?

Do any of the current implementations need to change to provide the
semantics expected by the various use cases?

Thanx, Paul

2017-04-20 15:08:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 20, 2017 at 08:03:21AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 20, 2017 at 01:17:43PM +0200, Peter Zijlstra wrote:

> > > +/**
> > > + * spin_is_locked - Conditionally interpose after prior critical sections
> > > + * @lock: the spinlock whose critical sections are to be interposed.
> > > + *
> > > + * Semantically this is equivalent to a spin_trylock(), and, if
> > > + * the spin_trylock() succeeds, immediately followed by a (mythical)
> > > + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
> > > + * by spin_is_locked(). Note that all current architectures have extremely
> > > + * efficient implementations in which the spin_is_locked() does not even
> > > + * write to the lock variable.
> > > + *
> > > + * A successful spin_is_locked() primitive in some sense "takes its place"
> > > + * after some critical section for the lock in question. Any accesses
> > > + * following a successful spin_is_locked() call will therefore happen
> > > + * after any accesses by any of the preceding critical section for that
> > > + * same lock. Note however, that spin_is_locked() provides absolutely no
> > > + * ordering guarantees for code preceding the call to that spin_is_locked().
> > > + */
> > > static __always_inline int spin_is_locked(spinlock_t *lock)
> > > {
> > > return raw_spin_is_locked(&lock->rlock);
> >
> > I'm currently confused on this one. The case listed in the qspinlock code
> > doesn't appear to exist in the kernel anymore (or at least, I'm having
> > trouble finding it).
> >
> > That said, I'm also not sure spin_is_locked() provides an acquire, as
> > that comment has an explicit smp_acquire__after_ctrl_dep();
>
> OK, I have dropped this portion of the patch for the moment.
>
> Going forward, exactly what semantics do you believe spin_is_locked()
> provides?
>
> Do any of the current implementations need to change to provide the
> semantics expected by the various use cases?

I don't have anything other than the comment I wrote back then. I would
have to go audit all spin_is_locked() implementations and users (again).

2017-06-09 22:56:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Thu, Apr 20, 2017 at 05:08:26PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 20, 2017 at 08:03:21AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 20, 2017 at 01:17:43PM +0200, Peter Zijlstra wrote:
>
> > > > +/**
> > > > + * spin_is_locked - Conditionally interpose after prior critical sections
> > > > + * @lock: the spinlock whose critical sections are to be interposed.
> > > > + *
> > > > + * Semantically this is equivalent to a spin_trylock(), and, if
> > > > + * the spin_trylock() succeeds, immediately followed by a (mythical)
> > > > + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
> > > > + * by spin_is_locked(). Note that all current architectures have extremely
> > > > + * efficient implementations in which the spin_is_locked() does not even
> > > > + * write to the lock variable.
> > > > + *
> > > > + * A successful spin_is_locked() primitive in some sense "takes its place"
> > > > + * after some critical section for the lock in question. Any accesses
> > > > + * following a successful spin_is_locked() call will therefore happen
> > > > + * after any accesses by any of the preceding critical section for that
> > > > + * same lock. Note however, that spin_is_locked() provides absolutely no
> > > > + * ordering guarantees for code preceding the call to that spin_is_locked().
> > > > + */
> > > > static __always_inline int spin_is_locked(spinlock_t *lock)
> > > > {
> > > > return raw_spin_is_locked(&lock->rlock);
> > >
> > > I'm currently confused on this one. The case listed in the qspinlock code
> > > doesn't appear to exist in the kernel anymore (or at least, I'm having
> > > trouble finding it).
> > >
> > > That said, I'm also not sure spin_is_locked() provides an acquire, as
> > > that comment has an explicit smp_acquire__after_ctrl_dep();
> >
> > OK, I have dropped this portion of the patch for the moment.
> >
> > Going forward, exactly what semantics do you believe spin_is_locked()
> > provides?
> >
> > Do any of the current implementations need to change to provide the
> > semantics expected by the various use cases?
>
> I don't have anything other than the comment I wrote back then. I would
> have to go audit all spin_is_locked() implementations and users (again).

And Andrea (CCed) and I did a review of the v4.11 uses of
spin_is_locked(), and none of the current uses requires any particular
ordering.

There is one very strange use of spin_is_locked() in __fnic_set_state_flags()
in drivers/scsi/fnic/fnic_scsi.c. This code checks spin_is_locked(),
and then acquires the lock only if it wasn't held. I am having a very
hard time imagining a situation where this would do something useful.
My guess is that the author thought that spin_is_locked() meant that
the current CPU holds the lock, when it instead means that some CPU
(possibly the current one, possibly not) holds the lock.

Adding the FNIC guys on CC so that they can enlighten me.

Ignoring the FNIC use case for the moment, anyone believe that
spin_is_locked() needs to provide any ordering guarantees?

Thanx, Paul

2017-06-12 14:52:10

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Sat, Jun 10, 2017 at 12:56 AM, Paul E. McKenney
<[email protected]> wrote:
>> > > > +/**
>> > > > + * spin_is_locked - Conditionally interpose after prior critical sections
>> > > > + * @lock: the spinlock whose critical sections are to be interposed.
>> > > > + *
>> > > > + * Semantically this is equivalent to a spin_trylock(), and, if
>> > > > + * the spin_trylock() succeeds, immediately followed by a (mythical)
>> > > > + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
>> > > > + * by spin_is_locked(). Note that all current architectures have extremely
>> > > > + * efficient implementations in which the spin_is_locked() does not even
>> > > > + * write to the lock variable.
>> > > > + *
>> > > > + * A successful spin_is_locked() primitive in some sense "takes its place"
>> > > > + * after some critical section for the lock in question. Any accesses
>> > > > + * following a successful spin_is_locked() call will therefore happen
>> > > > + * after any accesses by any of the preceding critical section for that
>> > > > + * same lock. Note however, that spin_is_locked() provides absolutely no
>> > > > + * ordering guarantees for code preceding the call to that spin_is_locked().
>> > > > + */
>> > > > static __always_inline int spin_is_locked(spinlock_t *lock)
>> > > > {
>> > > > return raw_spin_is_locked(&lock->rlock);
>> > >
>> > > I'm currently confused on this one. The case listed in the qspinlock code
>> > > doesn't appear to exist in the kernel anymore (or at least, I'm having
>> > > trouble finding it).
>> > >
>> > > That said, I'm also not sure spin_is_locked() provides an acquire, as
>> > > that comment has an explicit smp_acquire__after_ctrl_dep();
>> >
>> > OK, I have dropped this portion of the patch for the moment.
>> >
>> > Going forward, exactly what semantics do you believe spin_is_locked()
>> > provides?
>> >
>> > Do any of the current implementations need to change to provide the
>> > semantics expected by the various use cases?
>>
>> I don't have anything other than the comment I wrote back then. I would
>> have to go audit all spin_is_locked() implementations and users (again).
>
> And Andrea (CCed) and I did a review of the v4.11 uses of
> spin_is_locked(), and none of the current uses requires any particular
> ordering.
>
> There is one very strange use of spin_is_locked() in __fnic_set_state_flags()
> in drivers/scsi/fnic/fnic_scsi.c. This code checks spin_is_locked(),
> and then acquires the lock only if it wasn't held. I am having a very
> hard time imagining a situation where this would do something useful.
> My guess is that the author thought that spin_is_locked() meant that
> the current CPU holds the lock, when it instead means that some CPU
> (possibly the current one, possibly not) holds the lock.
>
> Adding the FNIC guys on CC so that they can enlighten me.
>
> Ignoring the FNIC use case for the moment, anyone believe that
> spin_is_locked() needs to provide any ordering guarantees?


Not providing any ordering guarantees for spin_is_locked() sounds good to me.
Restricting all types of mutexes/locks to the simple canonical use
case (protecting a critical section of code) makes it easier to reason
about code, enables a bunch of possible static/dynamic correctness
checking, and relieves the lock/unlock functions of providing unnecessary
ordering (i.e. acquire in spin_is_locked() pairing with release in
spin_lock()).
Tricky uses of is_locked and try_lock can resort to atomic operations
(or maybe be removed).
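
As one hypothetical way to express such an ordering without leaning on
the lock primitives, an explicitly published flag with release/acquire
ordering can stand in for an is_locked-based inference (all names below
are illustrative):

	static int data;
	static int ready;

	/* Publisher: the release orders the data store before the flag store. */
	static void publish(void)
	{
		data = 42;
		smp_store_release(&ready, 1);
	}

	/* Consumer: the acquire orders the flag load before the data read. */
	static int consume(void)
	{
		if (smp_load_acquire(&ready))
			return data;
		return -1;
	}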

2017-06-12 21:54:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 07/13] rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

On Mon, Jun 12, 2017 at 04:51:43PM +0200, Dmitry Vyukov wrote:
> On Sat, Jun 10, 2017 at 12:56 AM, Paul E. McKenney
> <[email protected]> wrote:
> >> > > > +/**
> >> > > > + * spin_is_locked - Conditionally interpose after prior critical sections
> >> > > > + * @lock: the spinlock whose critical sections are to be interposed.
> >> > > > + *
> >> > > > + * Semantically this is equivalent to a spin_trylock(), and, if
> >> > > > + * the spin_trylock() succeeds, immediately followed by a (mythical)
> >> > > > + * spin_unlock_relaxed(). The return value from spin_trylock() is returned
> >> > > > + * by spin_is_locked(). Note that all current architectures have extremely
> >> > > > + * efficient implementations in which the spin_is_locked() does not even
> >> > > > + * write to the lock variable.
> >> > > > + *
> >> > > > + * A successful spin_is_locked() primitive in some sense "takes its place"
> >> > > > + * after some critical section for the lock in question. Any accesses
> >> > > > + * following a successful spin_is_locked() call will therefore happen
> >> > > > + * after any accesses by any of the preceding critical section for that
> >> > > > + * same lock. Note however, that spin_is_locked() provides absolutely no
> >> > > > + * ordering guarantees for code preceding the call to that spin_is_locked().
> >> > > > + */
> >> > > > static __always_inline int spin_is_locked(spinlock_t *lock)
> >> > > > {
> >> > > > return raw_spin_is_locked(&lock->rlock);
> >> > >
> >> > > I'm currently confused on this one. The case listed in the qspinlock code
> >> > > doesn't appear to exist in the kernel anymore (or at least, I'm having
> >> > > trouble finding it).
> >> > >
> >> > > That said, I'm also not sure spin_is_locked() provides an acquire, as
> >> > > that comment has an explicit smp_acquire__after_ctrl_dep();
> >> >
> >> > OK, I have dropped this portion of the patch for the moment.
> >> >
> >> > Going forward, exactly what semantics do you believe spin_is_locked()
> >> > provides?
> >> >
> >> > Do any of the current implementations need to change to provide the
> >> > semantics expected by the various use cases?
> >>
> >> I don't have anything other than the comment I wrote back then. I would
> >> have to go audit all spin_is_locked() implementations and users (again).
> >
> > And Andrea (CCed) and I did a review of the v4.11 uses of
> > spin_is_locked(), and none of the current uses requires any particular
> > ordering.
> >
> > There is one very strange use of spin_is_locked() in __fnic_set_state_flags()
> > in drivers/scsi/fnic/fnic_scsi.c. This code checks spin_is_locked(),
> > and then acquires the lock only if it wasn't held. I am having a very
> > hard time imagining a situation where this would do something useful.
> > My guess is that the author thought that spin_is_locked() meant that
> > the current CPU holds the lock, when it instead means that some CPU
> > (possibly the current one, possibly not) holds the lock.
> >
> > Adding the FNIC guys on CC so that they can enlighten me.

And if my guess is correct, the usual fix is to use a variable to track
which CPU is holding the lock, with -1 indicating that no one holds it.
Then the spin_is_locked() check becomes a test to see if the value of
this variable is equal to the ID of the current CPU, but with preemption
disabled across the test. If this makes no sense, let me know, and I
can supply a prototype patch.
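
In case it helps, here is a minimal sketch of that idea (all structure,
field, and function names are hypothetical, not taken from the fnic
driver):

	struct foo {
		spinlock_t lock;
		int owner_cpu;		/* -1 when nobody holds ->lock */
	};

	static void foo_lock(struct foo *f)
	{
		spin_lock(&f->lock);
		f->owner_cpu = smp_processor_id();
	}

	static void foo_unlock(struct foo *f)
	{
		f->owner_cpu = -1;
		spin_unlock(&f->lock);
	}

	/* Replacement for the spin_is_locked() check: true only if the
	 * current CPU holds ->lock.  Preemption is disabled across the
	 * test so that smp_processor_id() stays stable. */
	static bool foo_held_by_this_cpu(struct foo *f)
	{
		bool ret;

		preempt_disable();
		ret = READ_ONCE(f->owner_cpu) == smp_processor_id();
		preempt_enable();
		return ret;
	}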

> > Ignoring the FNIC use case for the moment, anyone believe that
> > spin_is_locked() needs to provide any ordering guarantees?
>
> Not providing any ordering guarantees for spin_is_locked() sounds good to me.
> Restricting all types of mutexes/locks to the simple canonical use
> case (protecting a critical section of code) makes it easier to reason
> about code, enables a bunch of possible static/dynamic correctness
> checking, and relieves the lock/unlock functions of providing unnecessary
> ordering (i.e. acquire in spin_is_locked() pairing with release in
> spin_lock()).
> Tricky uses of is_locked and try_lock can resort to atomic operations
> (or maybe be removed).

One vote in favor of dropping ordering guarantees. Any objections?

Thanx, Paul