2019-11-02 12:49:24

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth

Rather than using the generic version of preempt_count, x86 uses
a special preempt_count implementation that takes advantage of
x86's ability to access/add/decrement/flip bits of a percpu counter
in a single instruction. This makes the preempt_count operations
really cheap.

For x86, rcu_preempt_depth can gain the same advantage by using
the same technique.

After the patchset:
- No function call when using rcu_read_[un]lock().
  This is a minor improvement; other architectures could also achieve
  it by moving ->rcu_read_lock_nesting and ->rcu_read_unlock_special
  to thread_info, but inlined rcu_read_[un]lock() generally generates
  more instructions and a larger footprint on other architectures.
- Only a single instruction for rcu_read_lock().
- Only 2 instructions for the fast path of rcu_read_unlock()
  (see the illustrative sketch below).
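
For illustration, the expected fast paths compile down to roughly the
following on x86 (an illustrative sketch only; actual operands and
addresses depend on the build):

	rcu_read_lock():
		incl   %gs:__rcu_preempt_depth

	rcu_read_unlock(), fast path:
		decl   %gs:__rcu_preempt_depth
		je     <call rcu_read_unlock_special>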

Patch 4 simplifies rcu_read_unlock() by avoiding the use of negative
->rcu_read_lock_nesting, and Patch 7 introduces the percpu
rcu_preempt_depth. The other patches are preparation.

Changes from v1:
- dropped patches 1/2 of v1
- dropped the already-merged patches

The v1 approach of using special.b.deferred_qs to avoid wakeups has been
changed to use preempt_count, and special.b.deferred_qs is removed.

Lai Jiangshan (7):
rcu: use preempt_count to test whether scheduler locks is held
rcu: cleanup rcu_preempt_deferred_qs()
rcu: remove useless special.b.deferred_qs
rcu: don't use negative ->rcu_read_lock_nesting
rcu: wrap usages of rcu_read_lock_nesting
rcu: clear the special.b.need_qs in rcu_note_context_switch()
x86,rcu: use percpu rcu_preempt_depth

arch/x86/Kconfig | 2 +
arch/x86/include/asm/rcu_preempt_depth.h | 87 +++++++++++++++++++
arch/x86/kernel/cpu/common.c | 7 ++
arch/x86/kernel/process_32.c | 2 +
arch/x86/kernel/process_64.c | 2 +
include/linux/rcupdate.h | 24 ++++++
include/linux/sched.h | 2 +-
init/init_task.c | 2 +-
kernel/fork.c | 2 +-
kernel/rcu/Kconfig | 3 +
kernel/rcu/tree_exp.h | 35 ++------
kernel/rcu/tree_plugin.h | 101 ++++++++++++++---------
12 files changed, 196 insertions(+), 73 deletions(-)
create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

--
2.20.1


2019-11-02 12:50:44

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting

Negative ->rcu_read_lock_nesting was introduced to prevent
scheduler deadlock. But now, with the help of the deferred-qs
mechanism, we can defer the quiescent state rather than insist on
reporting it and risk deadlock. So negative ->rcu_read_lock_nesting
is no longer needed and rcu_read_unlock() can be simplified.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcu/tree_exp.h | 29 ++---------------------------
kernel/rcu/tree_plugin.h | 14 +++-----------
2 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d314f6d..afa9f573b00f 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -609,11 +609,11 @@ static void rcu_exp_handler(void *unused)
* report the quiescent state, otherwise defer.
*/
if (!t->rcu_read_lock_nesting) {
+ rdp->exp_deferred_qs = true;
if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
rcu_dynticks_curr_cpu_in_eqs()) {
- rcu_report_exp_rdp(rdp);
+ rcu_preempt_deferred_qs(t);
} else {
- rdp->exp_deferred_qs = true;
set_tsk_need_resched(t);
set_preempt_need_resched();
}
@@ -641,31 +641,6 @@ static void rcu_exp_handler(void *unused)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
-
- /*
- * The final and least likely case is where the interrupted
- * code was just about to or just finished exiting the RCU-preempt
- * read-side critical section, and no, we can't tell which.
- * So either way, set ->deferred_qs to flag later code that
- * a quiescent state is required.
- *
- * If the CPU is fully enabled (or if some buggy RCU-preempt
- * read-side critical section is being used from idle), just
- * invoke rcu_preempt_deferred_qs() to immediately report the
- * quiescent state. We cannot use rcu_read_unlock_special()
- * because we are in an interrupt handler, which will cause that
- * function to take an early exit without doing anything.
- *
- * Otherwise, force a context switch after the CPU enables everything.
- */
- rdp->exp_deferred_qs = true;
- if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
- WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
- rcu_preempt_deferred_qs(t);
- } else {
- set_tsk_need_resched(t);
- set_preempt_need_resched();
- }
}

/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f2fd7d687bdb..98644966c808 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
}

/* Bias and limit values for ->rcu_read_lock_nesting. */
-#define RCU_NEST_BIAS INT_MAX
-#define RCU_NEST_NMAX (-INT_MAX / 2)
#define RCU_NEST_PMAX (INT_MAX / 2)

/*
@@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
{
struct task_struct *t = current;

- if (t->rcu_read_lock_nesting != 1) {
- --t->rcu_read_lock_nesting;
- } else {
+ if (--t->rcu_read_lock_nesting == 0) {
barrier(); /* critical section before exit code. */
- t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
- barrier(); /* assign before ->rcu_read_unlock_special load */
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
- barrier(); /* ->rcu_read_unlock_special load before assign */
- t->rcu_read_lock_nesting = 0;
}
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
int rrln = t->rcu_read_lock_nesting;

- WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
+ WARN_ON_ONCE(rrln < 0);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -539,7 +531,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
{
return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
READ_ONCE(t->rcu_read_unlock_special.s)) &&
- t->rcu_read_lock_nesting <= 0;
+ !t->rcu_read_lock_nesting;
}

/*
--
2.20.1

2019-11-02 12:51:24

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch()

It is better to clear special.b.need_qs either when it is
replaced by special.b.blocked or when it has really fulfilled its
goal in rcu_preempt_deferred_qs_irqrestore().

This makes rcu_qs() easier to understand, and also prepares for
the percpu rcu_preempt_depth patch, which requires the rcu_special
bits to be cleared in irq-disabled context.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcu/tree_plugin.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index eb5906c55c8d..e16c3867d2ff 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -264,8 +264,6 @@ static void rcu_qs(void)
__this_cpu_read(rcu_data.gp_seq),
TPS("cpuqs"));
__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
- barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
- WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
}
}

@@ -297,6 +295,7 @@ void rcu_note_context_switch(bool preempt)
/* Possibly blocking in an RCU read-side critical section. */
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp);
+ t->rcu_read_unlock_special.b.need_qs = false;
t->rcu_read_unlock_special.b.blocked = true;
t->rcu_blocked_node = rnp;

--
2.20.1

2019-11-02 12:51:55

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held

Ever since preemption was introduced to the Linux kernel,
irq-disabled spinlocks have always been held with preemption
disabled. One of the reasons is that sometimes we need
to use spin_unlock(), which does preempt_enable(), to unlock
an irq-disabled spinlock while keeping irqs disabled. So
preempt_count can be used to test whether scheduler locks
are possibly held.
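
For illustration, the pattern in question looks roughly like this
(a minimal sketch; some_lock is a placeholder, not any particular
scheduler lock):

	unsigned long flags;

	spin_lock_irqsave(&some_lock, flags);	/* irqs off, preempt_count raised */
	/* ... */
	spin_unlock(&some_lock);		/* preempt_enable(), irqs still disabled */
	/* ... */
	local_irq_restore(flags);

While such a lock is held, the preemption bits of preempt_count() are
nonzero, which is what the check in this patch relies on.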

CC: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcu/tree_plugin.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0982e9886103..aba5896d67e3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
tick_nohz_full_cpu(rdp->cpu);
// Need to defer quiescent state until everything is enabled.
if (irqs_were_disabled && use_softirq &&
- (in_interrupt() ||
- (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
+ (in_interrupt() || (exp && !preempt_bh_were_disabled))) {
// Using softirq, safe to awaken, and we get
// no help from enabling irqs, unlike bh/preempt.
+ // in_interrupt(): raise_softirq_irqoff() is
+ // guaranteed not to do a wakeup
+ // !preempt_bh_were_disabled: scheduler locks cannot
+ // be held, since spinlocks are always held with
+ // preempt_disable(), so the wakeup will be safe.
raise_softirq_irqoff(RCU_SOFTIRQ);
} else {
// Enabling BH or preempt does reschedule, so...
--
2.20.1

2019-11-02 12:52:44

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
is that accessing per-cpu variables is a lot cheaper than accessing
task_struct or thread_info variables.

We need to save/restore the actual rcu_preempt_depth on context switch.
We also place the per-cpu rcu_preempt_depth close to the __preempt_count
and current_task variables.

This follows the same idea as the per-cpu __preempt_count.

No function call when using rcu_read_[un]lock().
A single instruction for rcu_read_lock().
2 instructions for the fast path of rcu_read_unlock().
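
As a worked illustration of the folded RCU_NEED_SPECIAL bit (this only
restates the intent of the code below; values are in hex):

	0x80000000	INIT_RCU_PREEMPT_DEPTH: bit set, depth 0
	rcu_read_lock():		incl -> 0x80000001
	rcu_read_unlock():		decl -> 0x80000000, nonzero, fast path
	rcu_read_lock():		incl -> 0x80000001
	set_rcu_preempt_need_special():	bit cleared -> 0x00000001
	rcu_read_unlock():		decl -> 0x00000000, zero, calls rcu_read_unlock_special()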

CC: Peter Zijlstra <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/Kconfig | 2 +
arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 7 ++
arch/x86/kernel/process_32.c | 2 +
arch/x86/kernel/process_64.c | 2 +
include/linux/rcupdate.h | 24 +++++++
init/init_task.c | 2 +-
kernel/fork.c | 2 +-
kernel/rcu/Kconfig | 3 +
kernel/rcu/tree_exp.h | 2 +
kernel/rcu/tree_plugin.h | 37 +++++++---
11 files changed, 158 insertions(+), 12 deletions(-)
create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d6e1faa28c58..af9fedc0fdc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
select MODULES_USE_ELF_REL
select OLD_SIGACTION
select GENERIC_VDSO_32
+ select ARCH_HAVE_RCU_PREEMPT_DEEPTH

config X86_64
def_bool y
@@ -31,6 +32,7 @@ config X86_64
select NEED_DMA_MAP_STATE
select SWIOTLB
select ARCH_HAS_SYSCALL_WRAPPER
+ select ARCH_HAVE_RCU_PREEMPT_DEEPTH

config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
new file mode 100644
index 000000000000..88010ad59c20
--- /dev/null
+++ b/arch/x86/include/asm/rcu_preempt_depth.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_RCU_PREEMPT_DEPTH_H
+#define __ASM_RCU_PREEMPT_DEPTH_H
+
+#include <asm/rmwcc.h>
+#include <asm/percpu.h>
+
+#ifdef CONFIG_PREEMPT_RCU
+DECLARE_PER_CPU(int, __rcu_preempt_depth);
+
+/*
+ * We use the RCU_NEED_SPECIAL bit as an inverted need_special
+ * such that a decrement hitting 0 means we can and should do
+ * rcu_read_unlock_special().
+ */
+#define RCU_NEED_SPECIAL 0x80000000
+
+#define INIT_RCU_PREEMPT_DEPTH (RCU_NEED_SPECIAL)
+
+/* We mask off the RCU_NEED_SPECIAL bit so that this returns the real depth */
+static __always_inline int rcu_preempt_depth(void)
+{
+ return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
+}
+
+static __always_inline void rcu_preempt_depth_set(int pc)
+{
+ int old, new;
+
+ do {
+ old = raw_cpu_read_4(__rcu_preempt_depth);
+ new = (old & RCU_NEED_SPECIAL) |
+ (pc & ~RCU_NEED_SPECIAL);
+ } while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
+}
+
+/*
+ * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
+ * rcu_read_unlock() can decrement and test for needing to do special
+ * with a single instruction.
+ *
+ * We invert the actual bit, so that when the decrement hits 0 we know
+ * both that we just exited the outermost rcu_read_lock() critical
+ * section and that we need to do the special work (the bit is cleared),
+ * unless it needs to be deferred.
+ */
+
+static inline void set_rcu_preempt_need_special(void)
+{
+ raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
+}
+
+/*
+ * irqs need to be disabled when clearing any bits of ->rcu_read_unlock_special
+ * and calling this function. Otherwise it may undo the work done
+ * by set_rcu_preempt_need_special() in an interrupt.
+ */
+static inline void clear_rcu_preempt_need_special(void)
+{
+ raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
+}
+
+static __always_inline void rcu_preempt_depth_inc(void)
+{
+ raw_cpu_add_4(__rcu_preempt_depth, 1);
+}
+
+static __always_inline bool rcu_preempt_depth_dec_and_test(void)
+{
+ return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
+}
+
+/* must be macros to avoid header recursion hell */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
+ prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
+ this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
+ } while (0)
+
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH \
+ DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH; \
+ EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH /* empty */
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9ae7d1bcd4f4..0151737e196c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -46,6 +46,7 @@
#include <asm/asm.h>
#include <asm/bugs.h>
#include <asm/cpu.h>
+#include <asm/rcu_preempt_depth.h>
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/pat.h>
@@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);

+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
@@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);

+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
/*
* On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
* the top of the kernel stack. Use an extra percpu variable to track the
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..ab1f20353663 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -51,6 +51,7 @@
#include <asm/cpu.h>
#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/resctrl_sched.h>
@@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (prev->gs | next->gs)
lazy_load_gs(next->gs);

+ save_restore_rcu_preempt_depth(prev_p, next_p);
this_cpu_write(current_task, next_p);

switch_fpu_finish(next_fpu);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index af64519b2695..2e1c6e829d30 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -50,6 +50,7 @@
#include <asm/ia32.h>
#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
#include <asm/switch_to.h>
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
@@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

x86_fsgsbase_load(prev, next);

+ save_restore_rcu_preempt_depth(prev_p, next_p);
/*
* Switch the PDA and FPU contexts.
*/
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a35daab95d14..0d2abf08b694 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -41,6 +41,29 @@ void synchronize_rcu(void);

#ifdef CONFIG_PREEMPT_RCU

+#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
+#include <asm/rcu_preempt_depth.h>
+
+#ifndef CONFIG_PROVE_LOCKING
+extern void rcu_read_unlock_special(void);
+
+static inline void __rcu_read_lock(void)
+{
+ rcu_preempt_depth_inc();
+}
+
+static inline void __rcu_read_unlock(void)
+{
+ if (unlikely(rcu_preempt_depth_dec_and_test()))
+ rcu_read_unlock_special();
+}
+#else
+void __rcu_read_lock(void);
+void __rcu_read_unlock(void);
+#endif
+
+#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+#define INIT_RCU_PREEMPT_DEPTH (0)
void __rcu_read_lock(void);
void __rcu_read_unlock(void);

@@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
* types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
*/
#define rcu_preempt_depth() (current->rcu_read_lock_nesting)
+#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */

#else /* #ifdef CONFIG_PREEMPT_RCU */

diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..0a91e38fba37 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,7 +130,7 @@ struct task_struct init_task
.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
#endif
#ifdef CONFIG_PREEMPT_RCU
- .rcu_read_lock_nesting = 0,
+ .rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
.rcu_read_unlock_special.s = 0,
.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
.rcu_blocked_node = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..7368d4ccb857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
static inline void rcu_copy_process(struct task_struct *p)
{
#ifdef CONFIG_PREEMPT_RCU
- p->rcu_read_lock_nesting = 0;
+ p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
p->rcu_read_unlock_special.s = 0;
p->rcu_blocked_node = NULL;
INIT_LIST_HEAD(&p->rcu_node_entry);
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940fef17c..d2ecca49a1a4 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -14,6 +14,9 @@ config TREE_RCU
thousands of CPUs. It also scales down nicely to
smaller systems.

+config ARCH_HAVE_RCU_PREEMPT_DEEPTH
+ def_bool n
+
config PREEMPT_RCU
bool
default y if PREEMPTION
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index dcb2124203cf..f919881832d4 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
}

#ifdef CONFIG_PREEMPT_RCU
+static inline void set_rcu_preempt_need_special(void);

/*
* Remote handler for smp_call_function_single(). If there is an
@@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
if (rnp->expmask & rdp->grpmask) {
rdp->exp_deferred_qs = true;
t->rcu_read_unlock_special.b.exp_hint = true;
+ set_rcu_preempt_need_special();
}
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e16c3867d2ff..e6774a7ab16b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
#ifdef CONFIG_PREEMPT_RCU

static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
-static void rcu_read_unlock_special(struct task_struct *t);
+void rcu_read_unlock_special(void);

/*
* Tell them what RCU they are running.
@@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
t->rcu_read_unlock_special.b.need_qs = false;
t->rcu_read_unlock_special.b.blocked = true;
t->rcu_blocked_node = rnp;
+ set_rcu_preempt_need_special();

/*
* Verify the CPU's sanity, trace the preemption, and
@@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
/* Bias and limit values for ->rcu_read_lock_nesting. */
#define RCU_NEST_PMAX (INT_MAX / 2)

+#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
static inline void rcu_preempt_depth_inc(void)
{
current->rcu_read_lock_nesting++;
@@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)

static inline bool rcu_preempt_depth_dec_and_test(void)
{
- return --current->rcu_read_lock_nesting == 0;
+ if (--current->rcu_read_lock_nesting == 0) {
+ /* check special after decrementing ->rcu_read_lock_nesting */
+ barrier();
+ return READ_ONCE(current->rcu_read_unlock_special.s);
+ }
+ return 0;
}

static inline void rcu_preempt_depth_set(int val)
@@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
current->rcu_read_lock_nesting = val;
}

+static inline void clear_rcu_preempt_need_special(void) {}
+static inline void set_rcu_preempt_need_special(void) {}
+
+#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+
+#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
/*
* Preemptible RCU implementation for rcu_read_lock().
* Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
*/
void __rcu_read_unlock(void)
{
- struct task_struct *t = current;
-
if (rcu_preempt_depth_dec_and_test()) {
- barrier(); /* critical section before exit code. */
- if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
- rcu_read_unlock_special(t);
+ rcu_read_unlock_special();
}
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
- WARN_ON_ONCE(rcu_preempt_depth() < 0);
+ WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
+ rcu_preempt_depth() > RCU_NEST_PMAX);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
+#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */

/*
* Advance a ->blkd_tasks-list pointer to the next entry, instead
@@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
return;
}
t->rcu_read_unlock_special.s = 0;
+ clear_rcu_preempt_need_special();
if (special.b.need_qs)
rcu_qs();

@@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
* notify RCU core processing or task having blocked during the RCU
* read-side critical section.
*/
-static void rcu_read_unlock_special(struct task_struct *t)
+void rcu_read_unlock_special(void)
{
+ struct task_struct *t = current;
unsigned long flags;
bool preempt_bh_were_disabled =
!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
@@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
}
rcu_preempt_deferred_qs_irqrestore(t, flags);
}
+EXPORT_SYMBOL_GPL(rcu_read_unlock_special);

/*
* Check that the list of blocked tasks for the newly completed grace
@@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
__this_cpu_read(rcu_data.core_needs_qs) &&
__this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
!t->rcu_read_unlock_special.b.need_qs &&
- time_after(jiffies, rcu_state.gp_start + HZ))
+ time_after(jiffies, rcu_state.gp_start + HZ)) {
t->rcu_read_unlock_special.b.need_qs = true;
+ set_rcu_preempt_need_special();
+ }
}

/*
@@ -714,6 +730,7 @@ void exit_rcu(void)
rcu_preempt_depth_set(1);
barrier();
WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
+ set_rcu_preempt_need_special();
} else if (unlikely(rcu_preempt_depth())) {
rcu_preempt_depth_set(1);
} else {
--
2.20.1

2019-11-02 15:06:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth

On Sat, Nov 02, 2019 at 12:45:52PM +0000, Lai Jiangshan wrote:
> Rather than using the generic version of preempt_count, x86 uses
> a special version of preempt_count implementation that take advantage
> of x86 (single instruction to access/add/decl/flip-bits of percpu
> counter). It makes the preempt_count operations really cheap.
>
> For x86, rcu_preempt_depth can also take the same advantage by
> using the same technique.
>
> After the patchset:
> - No function call when using rcu_read_[un]lock().
> It is minor improvement, other arch can also achieve it via
> moving ->rcu_read_lock_nesting and ->rcu_read_unlock_special
> to thread_info, but inlined rcu_read_[un]lock() generates
> more instructions and footprint in other arch generally.
> - Only single instruction for rcu_read_lock().
> - Only 2 instructions for the fast path of rcu_read_unlock().
>
> Patch4 simplifies rcu_read_unlock() by avoid using negative
> ->rcu_read_lock_nesting, Patch7 introduces the percpu rcu_preempt_depth.
> Other patches are for preparation.
>
> changed from v1:
> drop patch1/2 of the v1
> drop merged patches
>
> Using special.b.deferred_qs to avoid wakeup in v1 is changed to using
> preempt_count. And special.b.deferred_qs is removed.
>
> Lai Jiangshan (7):
> rcu: use preempt_count to test whether scheduler locks is held
> rcu: cleanup rcu_preempt_deferred_qs()
> rcu: remove useless special.b.deferred_qs
> rcu: don't use negative ->rcu_read_lock_nesting
> rcu: wrap usages of rcu_read_lock_nesting
> rcu: clear the special.b.need_qs in rcu_note_context_switch()
> x86,rcu: use percpu rcu_preempt_depth

I am getting on a plane shortly, so what I have done is start a longish
rcutorture test on these. I will take a look at them later on.

And I presume that answers to my questions on some of the patches in
this series will be forthcoming. ;-)

Thanx, Paul

> arch/x86/Kconfig | 2 +
> arch/x86/include/asm/rcu_preempt_depth.h | 87 +++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 7 ++
> arch/x86/kernel/process_32.c | 2 +
> arch/x86/kernel/process_64.c | 2 +
> include/linux/rcupdate.h | 24 ++++++
> include/linux/sched.h | 2 +-
> init/init_task.c | 2 +-
> kernel/fork.c | 2 +-
> kernel/rcu/Kconfig | 3 +
> kernel/rcu/tree_exp.h | 35 ++------
> kernel/rcu/tree_plugin.h | 101 ++++++++++++++---------
> 12 files changed, 196 insertions(+), 73 deletions(-)
> create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>
> --
> 2.20.1
>

2019-11-02 16:32:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

Resending again because your mail has the craziest header:

Reply-To: 20191101162109.GN20975@paulmck-ThinkPad-P72

and hitting Reply-to-all creates only confusion. WTF?

On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.
>
> We need to save/restore the actual rcu_preempt_depth when switch.
> We also place the per-cpu rcu_preempt_depth close to __preempt_count
> and current_task variable.
>
> Using the idea of per-cpu __preempt_count.
>
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().
>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/Kconfig | 2 +
> arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 7 ++
> arch/x86/kernel/process_32.c | 2 +
> arch/x86/kernel/process_64.c | 2 +
> include/linux/rcupdate.h | 24 +++++++
> init/init_task.c | 2 +-
> kernel/fork.c | 2 +-
> kernel/rcu/Kconfig | 3 +
> kernel/rcu/tree_exp.h | 2 +
> kernel/rcu/tree_plugin.h | 37 +++++++---
> 11 files changed, 158 insertions(+), 12 deletions(-)
> create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d6e1faa28c58..af9fedc0fdc4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -18,6 +18,7 @@ config X86_32
> select MODULES_USE_ELF_REL
> select OLD_SIGACTION
> select GENERIC_VDSO_32
> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH

WTF is a "DEEPTH"?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2019-11-03 04:38:58

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth



On 2019/11/3 12:30 AM, Borislav Petkov wrote:
> Resending again because your mail has the craziest header:
>
> Reply-To: 20191101162109.GN20975@paulmck-ThinkPad-P72
>
> and hitting Reply-to-all creates only confusion. WTF?


Sorry, I sent the emails via git send-email with the wrong
argument "--reply-to"; it should have been "--in-reply-to".

>
> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
>>
>> We need to save/restore the actual rcu_preempt_depth when switch.
>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>> and current_task variable.
>>
>> Using the idea of per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
>>
>> CC: Peter Zijlstra <[email protected]>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> arch/x86/Kconfig | 2 +
>> arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>> arch/x86/kernel/cpu/common.c | 7 ++
>> arch/x86/kernel/process_32.c | 2 +
>> arch/x86/kernel/process_64.c | 2 +
>> include/linux/rcupdate.h | 24 +++++++
>> init/init_task.c | 2 +-
>> kernel/fork.c | 2 +-
>> kernel/rcu/Kconfig | 3 +
>> kernel/rcu/tree_exp.h | 2 +
>> kernel/rcu/tree_plugin.h | 37 +++++++---
>> 11 files changed, 158 insertions(+), 12 deletions(-)
>> create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index d6e1faa28c58..af9fedc0fdc4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -18,6 +18,7 @@ config X86_32
>> select MODULES_USE_ELF_REL
>> select OLD_SIGACTION
>> select GENERIC_VDSO_32
>> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>
> WTF is a "DEEPTH"?
>

Sorry, bad English.

by Sebastian Andrzej Siewior

Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On 2019-11-02 12:45:59 [+0000], Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.

Is there a benchmark saying how much we gain from this?

> We need to save/restore the actual rcu_preempt_depth when switch.
> We also place the per-cpu rcu_preempt_depth close to __preempt_count
> and current_task variable.
>
> Using the idea of per-cpu __preempt_count.
>
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().

I think these were not inlined due to the header requirements.

Boris pointed out one thing; there is also DEFINE_PERCPU_RCU_PREEMP_DEPTH.

Sebastian

2019-11-04 11:42:35

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth



On 2019/11/4 5:25 PM, Sebastian Andrzej Siewior wrote:
> On 2019-11-02 12:45:59 [+0000], Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
>
> Is there a benchmark saying how much we gain from this?

Hello

Maybe I can write a tight loop for testing, but I don't
think anyone will be interested in it.

I'm also trying to find some good real tests. I need
some suggestions here.

>
>> We need to save/restore the actual rcu_preempt_depth when switch.
>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>> and current_task variable.
>>
>> Using the idea of per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
>
> I think these were not inlined due to the header requirements.

objdump -D -S kernel/workqueue.o shows (selected fragments):


	raw_cpu_add_4(__rcu_preempt_depth, 1);
     d8f:	65 ff 05 00 00 00 00	incl   %gs:0x0(%rip)        # d96 <work_busy+0x16>

......

	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
     dd8:	65 ff 0d 00 00 00 00	decl   %gs:0x0(%rip)        # ddf <work_busy+0x5f>
	if (unlikely(rcu_preempt_depth_dec_and_test()))
     ddf:	74 26			je     e07 <work_busy+0x87>

......

		rcu_read_unlock_special();
     e07:	e8 00 00 00 00		callq  e0c <work_busy+0x8c>

>
> Boris pointed one thing, there is also DEFINE_PERCPU_RCU_PREEMP_DEPTH.
>

Thanks for pointing out.

Best regards
Lai

by Sebastian Andrzej Siewior

Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On 2019-11-04 19:41:20 [+0800], Lai Jiangshan wrote:
> > Is there a benchmark saying how much we gain from this?
>
> Hello
>
> Maybe I can write a tight loop for testing, but I don't
> think anyone will be interesting in it.
>
> I'm also trying to find some good real tests. I need
> some suggestions here.

There is rcutorture, but I don't know how much of a performance test
that is; Paul would know.

A micro benchmark is one thing. Are there any visible changes for
userland workloads like building a kernel or hackbench?

I don't argue that incrementing a per-CPU variable is more efficient
than reading a per-CPU variable, adding an offset and then incrementing
it. I was just curious to see if there are any numbers on it.

> > > No function call when using rcu_read_[un]lock().
> > > Single instruction for rcu_read_lock().
> > > 2 instructions for fast path of rcu_read_unlock().
> >
> > I think these were not inlined due to the header requirements.
>
> objdump -D -S kernel/workqueue.o shows (selected fractions):

That was not what I meant. Inlining the current rcu_read_lock() would
mean including the definition of struct task_struct (and everything
down the road) in the RCU headers, which doesn't work.

> Best regards
> Lai

Sebastian

2019-11-15 16:57:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held

On Sat, Nov 02, 2019 at 12:45:53PM +0000, Lai Jiangshan wrote:
> Ever since preemption was introduced to linux kernel,
> irq disabled spinlocks are always held with preemption
> disabled. One of the reason is that sometimes we need
> to use spin_unlock() which will do preempt_enable()
> to unlock the irq disabled spinlock with keeping irq
> disabled. So preempt_count can be used to test whether
> scheduler locks is possible held.
>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>

Again, your point that RCU flavor consolidation allows some
simplifications is an excellent one, so thank you again.

And sorry to be slow, but the interaction with the rest of RCU must
be taken into account. Therefore, doing this patch series justice
does require some time.

> ---
> kernel/rcu/tree_plugin.h | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0982e9886103..aba5896d67e3 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
> tick_nohz_full_cpu(rdp->cpu);
> // Need to defer quiescent state until everything is enabled.
> if (irqs_were_disabled && use_softirq &&
> - (in_interrupt() ||
> - (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> + (in_interrupt() || (exp && !preempt_bh_were_disabled))) {

My concern here is that this relies on a side-effect of the _irq locking
primitives. What if someone similar to you comes along and is able to
show significant performance benefits from making raw_spin_lock_irqsave()
and friends leave preempt_count alone? After all, these primitives
disable interrupts, so the bits in preempt_count can be argued to have
no effect.

But this patch is not central to simplifying __rcu_read_unlock().
Plus RCU now re-enables the scheduler clock tick on nohz_full CPUs that
are blocking normal grace periods, which gives additional flexibility
on this code path -- one of the big concerns when this was written was
that in a PREEMPT=y kernel, a nohz_full CPU spinning in kernel code might
never pass through a quiescent state. And expedited grace periods need
to be fast on average, not worst case.

So another approach might be to:

1. Simplify the above expression to only do raise_softirq_irqoff()
if we are actually in an interrupt handler (sketched below).

2. Make expedited grace periods re-enable the scheduler-clock
interrupt on CPUs that are slow to pass through quiescent states.
(Taking care to disable them again, which might require
coordination with the similar logic in normal grace periods.)
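
For item 1, the condition would reduce to something like the following
(purely illustrative, not a tested patch):

	if (irqs_were_disabled && use_softirq && in_interrupt()) {
		// Using softirq, safe to awaken.
		raise_softirq_irqoff(RCU_SOFTIRQ);
	} else {
		// Enabling BH or preempt does reschedule, so...
		...
	}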

As a second step, it might still be possible to continue using
raise_softirq_irqoff() in some of the non-interrupt-handler cases
involving __rcu_read_unlock() with interrupts disabled.

Thoughts?

Thanx, Paul

> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> + // in_interrupt(): raise_softirq_irqoff() is
> + // guaranteed not to not do wakeup
> + // !preempt_bh_were_disabled: scheduler locks cannot
> + // be held, since spinlocks are always held with
> + // preempt_disable(), so the wakeup will be safe.
> raise_softirq_irqoff(RCU_SOFTIRQ);
> } else {
> // Enabling BH or preempt does reschedule, so...
> --
> 2.20.1
>

2019-11-16 15:49:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch()

On Sat, Nov 02, 2019 at 12:45:58PM +0000, Lai Jiangshan wrote:
> It is better to clear the special.b.need_qs when it is
> replaced by special.b.blocked or it is really fulfill its
> goal in rcu_preempt_deferred_qs_irqrestore().
>
> It makes rcu_qs() easier to be understood, and also prepares for
> the percpu rcu_preempt_depth patch, which reqires rcu_special
> bits to be clearred in irq-disable context.
>
> Signed-off-by: Lai Jiangshan <[email protected]>

This one is a (possible) optimization.

Right now, when the CPU actually passes through the quiescent state,
we clear this field. The quiescent state is not reported until later.
Waiting to clear it until later might cause extra unneeded quiescent-state
work to happen. But your point is that the current code might leave
->rcu_read_unlock_special momentarily zero, causing possible trouble
with the remainder of this series, right?

Hmmm...

The "right" way to do this would be to have another flag saying
"quiescent state encountered but not yet reported". This would keep
->rcu_read_unlock_special non-zero throughout, and would avoid useless
work looking for additional unneeded quiescent states. Or perhaps have
need_qs be zero for don't need, one for need, and two for have but not
yet reported.
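
In code, that last option might look something like this (purely
illustrative; the names are invented and ->rcu_read_unlock_special.b.need_qs
would have to grow beyond a single bit):

	#define RCU_NEED_QS_NONE	0	/* no quiescent state needed */
	#define RCU_NEED_QS_NEEDED	1	/* quiescent state needed */
	#define RCU_NEED_QS_PENDING	2	/* encountered but not yet reported */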

Thoughts? Other approaches?

Thanx, Paul

> ---
> kernel/rcu/tree_plugin.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index eb5906c55c8d..e16c3867d2ff 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -264,8 +264,6 @@ static void rcu_qs(void)
> __this_cpu_read(rcu_data.gp_seq),
> TPS("cpuqs"));
> __this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
> - barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
> - WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
> }
> }
>
> @@ -297,6 +295,7 @@ void rcu_note_context_switch(bool preempt)
> /* Possibly blocking in an RCU read-side critical section. */
> rnp = rdp->mynode;
> raw_spin_lock_rcu_node(rnp);
> + t->rcu_read_unlock_special.b.need_qs = false;
> t->rcu_read_unlock_special.b.blocked = true;
> t->rcu_blocked_node = rnp;
>
> --
> 2.20.1
>

2019-11-16 15:52:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.
>
> We need to save/restore the actual rcu_preempt_depth when switch.
> We also place the per-cpu rcu_preempt_depth close to __preempt_count
> and current_task variable.
>
> Using the idea of per-cpu __preempt_count.
>
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().
>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>

Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
just as nice for #include/inlining, and a lot less per-architecture code?

Or am I missing some issue with the task-info approach?

Thanx, Paul


2019-11-18 02:04:23

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth



On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
>>
>> We need to save/restore the actual rcu_preempt_depth when switch.
>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>> and current_task variable.
>>
>> Using the idea of per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
>>
>> CC: Peter Zijlstra <[email protected]>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>
> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> just as nice for #include/inlining, and a lot less per-architecture code?
>
> Or am I missing some issue with the task-info approach?

struct thread_info itself is per-architecture definition.
All the arches would have to be touched if RCU's nesting-depth counter
is put int struct thread_info. Though the inlining functions can be
defined in include/asm-generic/ so that it serves for all arches
and X86 can have its own implementation in arch/x86/include/asm/.


>
> Thanx, Paul
>
>> ---
>> arch/x86/Kconfig | 2 +
>> arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>> arch/x86/kernel/cpu/common.c | 7 ++
>> arch/x86/kernel/process_32.c | 2 +
>> arch/x86/kernel/process_64.c | 2 +
>> include/linux/rcupdate.h | 24 +++++++
>> init/init_task.c | 2 +-
>> kernel/fork.c | 2 +-
>> kernel/rcu/Kconfig | 3 +
>> kernel/rcu/tree_exp.h | 2 +
>> kernel/rcu/tree_plugin.h | 37 +++++++---
>> 11 files changed, 158 insertions(+), 12 deletions(-)
>> create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index d6e1faa28c58..af9fedc0fdc4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -18,6 +18,7 @@ config X86_32
>> select MODULES_USE_ELF_REL
>> select OLD_SIGACTION
>> select GENERIC_VDSO_32
>> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>
>> config X86_64
>> def_bool y
>> @@ -31,6 +32,7 @@ config X86_64
>> select NEED_DMA_MAP_STATE
>> select SWIOTLB
>> select ARCH_HAS_SYSCALL_WRAPPER
>> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>
>> config FORCE_DYNAMIC_FTRACE
>> def_bool y
>> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
>> new file mode 100644
>> index 000000000000..88010ad59c20
>> --- /dev/null
>> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
>> @@ -0,0 +1,87 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
>> +#define __ASM_RCU_PREEMPT_DEPTH_H
>> +
>> +#include <asm/rmwcc.h>
>> +#include <asm/percpu.h>
>> +
>> +#ifdef CONFIG_PREEMPT_RCU
>> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
>> +
>> +/*
>> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
>> + * such that a decrement hitting 0 means we can and should do
>> + * rcu_read_unlock_special().
>> + */
>> +#define RCU_NEED_SPECIAL 0x80000000
>> +
>> +#define INIT_RCU_PREEMPT_DEPTH (RCU_NEED_SPECIAL)
>> +
>> +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
>> +static __always_inline int rcu_preempt_depth(void)
>> +{
>> + return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>> +}
>> +
>> +static __always_inline void rcu_preempt_depth_set(int pc)
>> +{
>> + int old, new;
>> +
>> + do {
>> + old = raw_cpu_read_4(__rcu_preempt_depth);
>> + new = (old & RCU_NEED_SPECIAL) |
>> + (pc & ~RCU_NEED_SPECIAL);
>> + } while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>> +}
>> +
>> +/*
>> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
>> + * rcu_read_unlock() can decrement and test for needing to do special
>> + * with a single instruction.
>> + *
>> + * We invert the actual bit, so that when the decrement hits 0 we know
>> + * both that it just exited the outermost rcu_read_lock() critical section and
>> + * that we need to do special work (the bit is cleared) if it doesn't need to be
>> + * deferred.
>> + */
>> +
>> +static inline void set_rcu_preempt_need_special(void)
>> +{
>> + raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
>> +}
>> +
>> +/*
>> + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
>> + * and calling this function. Otherwise it may clear the work done
>> + * by set_rcu_preempt_need_special() in interrupt.
>> + */
>> +static inline void clear_rcu_preempt_need_special(void)
>> +{
>> + raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
>> +}
>> +
>> +static __always_inline void rcu_preempt_depth_inc(void)
>> +{
>> + raw_cpu_add_4(__rcu_preempt_depth, 1);
>> +}
>> +
>> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
>> +{
>> + return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
>> +}
>> +
>> +/* must be macros to avoid header recursion hell */
>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
>> + prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
>> + this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
>> + } while (0)
>> +
>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH \
>> + DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH; \
>> + EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
>> +#else /* #ifdef CONFIG_PREEMPT_RCU */
>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH /* empty */
>> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>> +
>> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 9ae7d1bcd4f4..0151737e196c 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -46,6 +46,7 @@
>> #include <asm/asm.h>
>> #include <asm/bugs.h>
>> #include <asm/cpu.h>
>> +#include <asm/rcu_preempt_depth.h>
>> #include <asm/mce.h>
>> #include <asm/msr.h>
>> #include <asm/pat.h>
>> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>> DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>> EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>
>> +/* close to __preempt_count */
>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>> +
>> /* May not be marked __init: used by software suspend */
>> void syscall_init(void)
>> {
>> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>> DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>> EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>
>> +/* close to __preempt_count */
>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>> +
>> /*
>> * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>> * the top of the kernel stack. Use an extra percpu variable to track the
>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>> index b8ceec4974fe..ab1f20353663 100644
>> --- a/arch/x86/kernel/process_32.c
>> +++ b/arch/x86/kernel/process_32.c
>> @@ -51,6 +51,7 @@
>> #include <asm/cpu.h>
>> #include <asm/syscalls.h>
>> #include <asm/debugreg.h>
>> +#include <asm/rcu_preempt_depth.h>
>> #include <asm/switch_to.h>
>> #include <asm/vm86.h>
>> #include <asm/resctrl_sched.h>
>> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>> if (prev->gs | next->gs)
>> lazy_load_gs(next->gs);
>>
>> + save_restore_rcu_preempt_depth(prev_p, next_p);
>> this_cpu_write(current_task, next_p);
>>
>> switch_fpu_finish(next_fpu);
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index af64519b2695..2e1c6e829d30 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -50,6 +50,7 @@
>> #include <asm/ia32.h>
>> #include <asm/syscalls.h>
>> #include <asm/debugreg.h>
>> +#include <asm/rcu_preempt_depth.h>
>> #include <asm/switch_to.h>
>> #include <asm/xen/hypervisor.h>
>> #include <asm/vdso.h>
>> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>
>> x86_fsgsbase_load(prev, next);
>>
>> + save_restore_rcu_preempt_depth(prev_p, next_p);
>> /*
>> * Switch the PDA and FPU contexts.
>> */
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index a35daab95d14..0d2abf08b694 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>>
>> #ifdef CONFIG_PREEMPT_RCU
>>
>> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>> +#include <asm/rcu_preempt_depth.h>
>> +
>> +#ifndef CONFIG_PROVE_LOCKING
>> +extern void rcu_read_unlock_special(void);
>> +
>> +static inline void __rcu_read_lock(void)
>> +{
>> + rcu_preempt_depth_inc();
>> +}
>> +
>> +static inline void __rcu_read_unlock(void)
>> +{
>> + if (unlikely(rcu_preempt_depth_dec_and_test()))
>> + rcu_read_unlock_special();
>> +}
>> +#else
>> +void __rcu_read_lock(void);
>> +void __rcu_read_unlock(void);
>> +#endif
>> +
>> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>> +#define INIT_RCU_PREEMPT_DEPTH (0)
>> void __rcu_read_lock(void);
>> void __rcu_read_unlock(void);
>>
>> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>> * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>> */
>> #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
>> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>
>> #else /* #ifdef CONFIG_PREEMPT_RCU */
>>
>> diff --git a/init/init_task.c b/init/init_task.c
>> index 9e5cbe5eab7b..0a91e38fba37 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -130,7 +130,7 @@ struct task_struct init_task
>> .perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>> #endif
>> #ifdef CONFIG_PREEMPT_RCU
>> - .rcu_read_lock_nesting = 0,
>> + .rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>> .rcu_read_unlock_special.s = 0,
>> .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>> .rcu_blocked_node = NULL,
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f9572f416126..7368d4ccb857 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>> static inline void rcu_copy_process(struct task_struct *p)
>> {
>> #ifdef CONFIG_PREEMPT_RCU
>> - p->rcu_read_lock_nesting = 0;
>> + p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>> p->rcu_read_unlock_special.s = 0;
>> p->rcu_blocked_node = NULL;
>> INIT_LIST_HEAD(&p->rcu_node_entry);
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index 1cc940fef17c..d2ecca49a1a4 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -14,6 +14,9 @@ config TREE_RCU
>> thousands of CPUs. It also scales down nicely to
>> smaller systems.
>>
>> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
>> + def_bool n
>> +
>> config PREEMPT_RCU
>> bool
>> default y if PREEMPTION
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index dcb2124203cf..f919881832d4 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>> }
>>
>> #ifdef CONFIG_PREEMPT_RCU
>> +static inline void set_rcu_preempt_need_special(void);
>>
>> /*
>> * Remote handler for smp_call_function_single(). If there is an
>> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>> if (rnp->expmask & rdp->grpmask) {
>> rdp->exp_deferred_qs = true;
>> t->rcu_read_unlock_special.b.exp_hint = true;
>> + set_rcu_preempt_need_special();
>> }
>> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> return;
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index e16c3867d2ff..e6774a7ab16b 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>> #ifdef CONFIG_PREEMPT_RCU
>>
>> static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
>> -static void rcu_read_unlock_special(struct task_struct *t);
>> +void rcu_read_unlock_special(void);
>>
>> /*
>> * Tell them what RCU they are running.
>> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>> t->rcu_read_unlock_special.b.need_qs = false;
>> t->rcu_read_unlock_special.b.blocked = true;
>> t->rcu_blocked_node = rnp;
>> + set_rcu_preempt_need_special();
>>
>> /*
>> * Verify the CPU's sanity, trace the preemption, and
>> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>> /* Bias and limit values for ->rcu_read_lock_nesting. */
>> #define RCU_NEST_PMAX (INT_MAX / 2)
>>
>> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>> static inline void rcu_preempt_depth_inc(void)
>> {
>> current->rcu_read_lock_nesting++;
>> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>>
>> static inline bool rcu_preempt_depth_dec_and_test(void)
>> {
>> - return --current->rcu_read_lock_nesting == 0;
>> + if (--current->rcu_read_lock_nesting == 0) {
>> + /* check special after dec ->rcu_read_lock_nesting */
>> + barrier();
>> + return READ_ONCE(current->rcu_read_unlock_special.s);
>> + }
>> + return 0;
>> }
>>
>> static inline void rcu_preempt_depth_set(int val)
>> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>> current->rcu_read_lock_nesting = val;
>> }
>>
>> +static inline void clear_rcu_preempt_need_special(void) {}
>> +static inline void set_rcu_preempt_need_special(void) {}
>> +
>> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>> +
>> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>> /*
>> * Preemptible RCU implementation for rcu_read_lock().
>> * Just increment ->rcu_read_lock_nesting, shared state will be updated
>> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>> */
>> void __rcu_read_unlock(void)
>> {
>> - struct task_struct *t = current;
>> -
>> if (rcu_preempt_depth_dec_and_test()) {
>> - barrier(); /* critical section before exit code. */
>> - if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>> - rcu_read_unlock_special(t);
>> + rcu_read_unlock_special();
>> }
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>> - WARN_ON_ONCE(rcu_preempt_depth() < 0);
>> + WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
>> + rcu_preempt_depth() > RCU_NEST_PMAX);
>> }
>> }
>> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>>
>> /*
>> * Advance a ->blkd_tasks-list pointer to the next entry, instead
>> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>> return;
>> }
>> t->rcu_read_unlock_special.s = 0;
>> + clear_rcu_preempt_need_special();
>> if (special.b.need_qs)
>> rcu_qs();
>>
>> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>> * notify RCU core processing or task having blocked during the RCU
>> * read-side critical section.
>> */
>> -static void rcu_read_unlock_special(struct task_struct *t)
>> +void rcu_read_unlock_special(void)
>> {
>> + struct task_struct *t = current;
>> unsigned long flags;
>> bool preempt_bh_were_disabled =
>> !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
>> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>> }
>> rcu_preempt_deferred_qs_irqrestore(t, flags);
>> }
>> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>>
>> /*
>> * Check that the list of blocked tasks for the newly completed grace
>> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>> __this_cpu_read(rcu_data.core_needs_qs) &&
>> __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>> !t->rcu_read_unlock_special.b.need_qs &&
>> - time_after(jiffies, rcu_state.gp_start + HZ))
>> + time_after(jiffies, rcu_state.gp_start + HZ)) {
>> t->rcu_read_unlock_special.b.need_qs = true;
>> + set_rcu_preempt_need_special();
>> + }
>> }
>>
>> /*
>> @@ -714,6 +730,7 @@ void exit_rcu(void)
>> rcu_preempt_depth_set(1);
>> barrier();
>> WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
>> + set_rcu_preempt_need_special();
>> } else if (unlikely(rcu_preempt_depth())) {
>> rcu_preempt_depth_set(1);
>> } else {
>> --
>> 2.20.1
>>

2019-11-18 15:01:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
>
>
> On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > task_struct or thread_info variables.
> > >
> > > We need to save/restore the actual rcu_preempt_depth when switch.
> > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > and current_task variable.
> > >
> > > Using the idea of per-cpu __preempt_count.
> > >
> > > No function call when using rcu_read_[un]lock().
> > > Single instruction for rcu_read_lock().
> > > 2 instructions for fast path of rcu_read_unlock().
> > >
> > > CC: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Lai Jiangshan <[email protected]>
> >
> > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > just as nice for #include/inlining, and a lot less per-architecture code?
> >
> > Or am I missing some issue with the task-info approach?
>
> struct thread_info itself is per-architecture definition.
> All the arches would have to be touched if RCU's nesting-depth counter
> is put int struct thread_info. Though the inlining functions can be
> defined in include/asm-generic/ so that it serves for all arches
> and X86 can have its own implementation in arch/x86/include/asm/.

True enough.

But doesn't the per-CPU code require per-architecture changes to copy
to and from the per-CPU variable? If that code is simpler and smaller than
the thread_info access code, I will be -very- surprised.
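
For reference, the per-architecture piece in question is the
context-switch copy that patch 7 adds for x86, called from
__switch_to(); the thread_info approach would need no such hook:

/* from patch 7, arch/x86/include/asm/rcu_preempt_depth.h */
/* must be a macro to avoid header recursion hell */
#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
        prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
        this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
} while (0)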


Thanx, Paul


2019-11-19 02:01:32

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth



On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
>>> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>>>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>>>> is that accessing per-cpu variables is a lot cheaper than accessing
>>>> task_struct or thread_info variables.
>>>>
>>>> We need to save/restore the actual rcu_preempt_depth when switch.
>>>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>>>> and current_task variable.
>>>>
>>>> Using the idea of per-cpu __preempt_count.
>>>>
>>>> No function call when using rcu_read_[un]lock().
>>>> Single instruction for rcu_read_lock().
>>>> 2 instructions for fast path of rcu_read_unlock().
>>>>
>>>> CC: Peter Zijlstra <[email protected]>
>>>> Signed-off-by: Lai Jiangshan <[email protected]>
>>>
>>> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
>>> just as nice for #include/inlining, and a lot less per-architecture code?
>>>
>>> Or am I missing some issue with the task-info approach?
>>
>> struct thread_info itself is per-architecture definition.
>> All the arches would have to be touched if RCU's nesting-depth counter
>> is put int struct thread_info. Though the inlining functions can be
>> defined in include/asm-generic/ so that it serves for all arches
>> and X86 can have its own implementation in arch/x86/include/asm/.
>
> True enough.
>
> But doesn't the per-CPU code require per-architecture changes to copy
> to and from the per-CPU variable? If that code simpler and smaller than
> the thread_info access code, I will be -very- surprised.
>

The per-CPU code is not simpler. And my code touches X86 ONLY, so
it requires an additional CONFIG option and some more "#if"s in the
RCU code, which adds a little more complexity.

As far as I understand, rcu_read_lock_nesting can only be made
per-CPU on X86; it is not possible on other arches.

Putting rcu_read_lock_nesting in struct thread_info and putting
the inline functions in include/asm-generic/ are OK for me. It would
be a well-shaped framework, reduce function calls, and still allow
x86 to have its own implementation.
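
To illustrate the idea, the include/linux/rcupdate.h side of such a
framework might be no more than the sketch below. The inline bodies
are the ones patch 7 already uses for x86, just made unconditional,
with the depth helpers always coming from <asm/rcu_preempt_depth.h>
(which would fall back to a hypothetical asm-generic version when the
arch does not provide its own):

/* sketch of include/linux/rcupdate.h under the framework */
#include <asm/rcu_preempt_depth.h>  /* arch version or asm-generic fallback */

void rcu_read_unlock_special(void);

static inline void __rcu_read_lock(void)
{
        rcu_preempt_depth_inc();
}

static inline void __rcu_read_unlock(void)
{
        if (unlikely(rcu_preempt_depth_dec_and_test()))
                rcu_read_unlock_special();
}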

The framework will only become more appealing once the x86 percpu
rcu_read_lock_nesting is implemented. This series implements
the x86 percpu rcu_read_lock_nesting first and avoids touching too
many files. The framework can be implemented later.

Thanks
Lai



2019-11-19 21:18:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
>
>
> On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
> > On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
> > >
> > >
> > > On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> > > > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > > > task_struct or thread_info variables.
> > > > >
> > > > > We need to save/restore the actual rcu_preempt_depth when switch.
> > > > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > > > and current_task variable.
> > > > >
> > > > > Using the idea of per-cpu __preempt_count.
> > > > >
> > > > > No function call when using rcu_read_[un]lock().
> > > > > Single instruction for rcu_read_lock().
> > > > > 2 instructions for fast path of rcu_read_unlock().
> > > > >
> > > > > CC: Peter Zijlstra <[email protected]>
> > > > > Signed-off-by: Lai Jiangshan <[email protected]>
> > > >
> > > > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > > > just as nice for #include/inlining, and a lot less per-architecture code?
> > > >
> > > > Or am I missing some issue with the task-info approach?
> > >
> > > struct thread_info itself is per-architecture definition.
> > > All the arches would have to be touched if RCU's nesting-depth counter
> > > is put int struct thread_info. Though the inlining functions can be
> > > defined in include/asm-generic/ so that it serves for all arches
> > > and X86 can have its own implementation in arch/x86/include/asm/.
> >
> > True enough.
> >
> > But doesn't the per-CPU code require per-architecture changes to copy
> > to and from the per-CPU variable? If that code simpler and smaller than
> > the thread_info access code, I will be -very- surprised.
>
> The per-CPU code is not simpler. And my code touch X86 ONLY so that
> it requires an additional CONFIG and some more "#if" in rcu code
> which adds little more complicity.
>
> As far as I understand, the rcu_read_lock_nesting can only be possible
> put in per CPU in X86, not possible in other ARCHs.
>
> Putting the rcu_read_lock_nesting in struct thread_info and putting
> the inlining functions in include/asm-generic/ are OK for me. It will
> be a good-shaped framework and reduce function calls, and still allow
> x86 has its own implementation.

In addition, even for x86, the struct thread_info approach would avoid
adding more instructions to the context-switch path.

> The framework will only be more appealing when the x86 percpu
> rcu_read_lock_nesting is implemented. This series implements
> the x86 percpu rcu_read_lock_nesting first and avoid touch too much
> files. The framework can be implemented later

If I remember correctly, you mentioned earlier that the struct
thread_info approach needs just as few instructions for rcu_read_lock()
and rcu_read_unlock() as the per-CPU approach does. If that is the
case, and if the struct thread_info approach avoids adding code to the
context-switch path, why would the x86 maintainers be at all happy with
the per-CPU approach? What am I missing here?

Thanx, Paul

> Thanks
> Lai
>
>
> >
> > Thanx, Paul
> >
> > > > > ---
> > > > > arch/x86/Kconfig | 2 +
> > > > > arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> > > > > arch/x86/kernel/cpu/common.c | 7 ++
> > > > > arch/x86/kernel/process_32.c | 2 +
> > > > > arch/x86/kernel/process_64.c | 2 +
> > > > > include/linux/rcupdate.h | 24 +++++++
> > > > > init/init_task.c | 2 +-
> > > > > kernel/fork.c | 2 +-
> > > > > kernel/rcu/Kconfig | 3 +
> > > > > kernel/rcu/tree_exp.h | 2 +
> > > > > kernel/rcu/tree_plugin.h | 37 +++++++---
> > > > > 11 files changed, 158 insertions(+), 12 deletions(-)
> > > > > create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> > > > >
> > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > index d6e1faa28c58..af9fedc0fdc4 100644
> > > > > --- a/arch/x86/Kconfig
> > > > > +++ b/arch/x86/Kconfig
> > > > > @@ -18,6 +18,7 @@ config X86_32
> > > > > select MODULES_USE_ELF_REL
> > > > > select OLD_SIGACTION
> > > > > select GENERIC_VDSO_32
> > > > > + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > config X86_64
> > > > > def_bool y
> > > > > @@ -31,6 +32,7 @@ config X86_64
> > > > > select NEED_DMA_MAP_STATE
> > > > > select SWIOTLB
> > > > > select ARCH_HAS_SYSCALL_WRAPPER
> > > > > + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > config FORCE_DYNAMIC_FTRACE
> > > > > def_bool y
> > > > > diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > new file mode 100644
> > > > > index 000000000000..88010ad59c20
> > > > > --- /dev/null
> > > > > +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > @@ -0,0 +1,87 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> > > > > +#define __ASM_RCU_PREEMPT_DEPTH_H
> > > > > +
> > > > > +#include <asm/rmwcc.h>
> > > > > +#include <asm/percpu.h>
> > > > > +
> > > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > > +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> > > > > +
> > > > > +/*
> > > > > + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> > > > > + * such that a decrement hitting 0 means we can and should do
> > > > > + * rcu_read_unlock_special().
> > > > > + */
> > > > > +#define RCU_NEED_SPECIAL 0x80000000
> > > > > +
> > > > > +#define INIT_RCU_PREEMPT_DEPTH (RCU_NEED_SPECIAL)
> > > > > +
> > > > > +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> > > > > +static __always_inline int rcu_preempt_depth(void)
> > > > > +{
> > > > > + return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > > > +}
> > > > > +
> > > > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > > > +{
> > > > > + int old, new;
> > > > > +
> > > > > + do {
> > > > > + old = raw_cpu_read_4(__rcu_preempt_depth);
> > > > > + new = (old & RCU_NEED_SPECIAL) |
> > > > > + (pc & ~RCU_NEED_SPECIAL);
> > > > > + } while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> > > > > + * rcu_read_unlock() can decrement and test for needing to do special
> > > > > + * with a single instruction.
> > > > > + *
> > > > > + * We invert the actual bit, so that when the decrement hits 0 we know
> > > > > + * both it just exited the outmost rcu_read_lock() critical section and
> > > > > + * we need to do specail (the bit is cleared) if it doesn't need to be
> > > > > + * deferred.
> > > > > + */
> > > > > +
> > > > > +static inline void set_rcu_preempt_need_special(void)
> > > > > +{
> > > > > + raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
> > > > > + * and calling this function. Otherwise it may clear the work done
> > > > > + * by set_rcu_preempt_need_special() in interrupt.
> > > > > + */
> > > > > +static inline void clear_rcu_preempt_need_special(void)
> > > > > +{
> > > > > + raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> > > > > +}
> > > > > +
> > > > > +static __always_inline void rcu_preempt_depth_inc(void)
> > > > > +{
> > > > > + raw_cpu_add_4(__rcu_preempt_depth, 1);
> > > > > +}
> > > > > +
> > > > > +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > +{
> > > > > + return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> > > > > +}
> > > > > +
> > > > > +/* must be macros to avoid header recursion hell */
> > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
> > > > > + prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
> > > > > + this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
> > > > > + } while (0)
> > > > > +
> > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH \
> > > > > + DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH; \
> > > > > + EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> > > > > +#else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH /* empty */
> > > > > +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> > > > > +
> > > > > +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> > > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > > index 9ae7d1bcd4f4..0151737e196c 100644
> > > > > --- a/arch/x86/kernel/cpu/common.c
> > > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > > @@ -46,6 +46,7 @@
> > > > > #include <asm/asm.h>
> > > > > #include <asm/bugs.h>
> > > > > #include <asm/cpu.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > #include <asm/mce.h>
> > > > > #include <asm/msr.h>
> > > > > #include <asm/pat.h>
> > > > > @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
> > > > > DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > +/* close to __preempt_count */
> > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > +
> > > > > /* May not be marked __init: used by software suspend */
> > > > > void syscall_init(void)
> > > > > {
> > > > > @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
> > > > > DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > +/* close to __preempt_count */
> > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > +
> > > > > /*
> > > > > * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
> > > > > * the top of the kernel stack. Use an extra percpu variable to track the
> > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > > > > index b8ceec4974fe..ab1f20353663 100644
> > > > > --- a/arch/x86/kernel/process_32.c
> > > > > +++ b/arch/x86/kernel/process_32.c
> > > > > @@ -51,6 +51,7 @@
> > > > > #include <asm/cpu.h>
> > > > > #include <asm/syscalls.h>
> > > > > #include <asm/debugreg.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > #include <asm/switch_to.h>
> > > > > #include <asm/vm86.h>
> > > > > #include <asm/resctrl_sched.h>
> > > > > @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > if (prev->gs | next->gs)
> > > > > lazy_load_gs(next->gs);
> > > > > + save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > this_cpu_write(current_task, next_p);
> > > > > switch_fpu_finish(next_fpu);
> > > > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > > > index af64519b2695..2e1c6e829d30 100644
> > > > > --- a/arch/x86/kernel/process_64.c
> > > > > +++ b/arch/x86/kernel/process_64.c
> > > > > @@ -50,6 +50,7 @@
> > > > > #include <asm/ia32.h>
> > > > > #include <asm/syscalls.h>
> > > > > #include <asm/debugreg.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > #include <asm/switch_to.h>
> > > > > #include <asm/xen/hypervisor.h>
> > > > > #include <asm/vdso.h>
> > > > > @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > x86_fsgsbase_load(prev, next);
> > > > > + save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > /*
> > > > > * Switch the PDA and FPU contexts.
> > > > > */
> > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > > index a35daab95d14..0d2abf08b694 100644
> > > > > --- a/include/linux/rcupdate.h
> > > > > +++ b/include/linux/rcupdate.h
> > > > > @@ -41,6 +41,29 @@ void synchronize_rcu(void);
> > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > +
> > > > > +#ifndef CONFIG_PROVE_LOCKING
> > > > > +extern void rcu_read_unlock_special(void);
> > > > > +
> > > > > +static inline void __rcu_read_lock(void)
> > > > > +{
> > > > > + rcu_preempt_depth_inc();
> > > > > +}
> > > > > +
> > > > > +static inline void __rcu_read_unlock(void)
> > > > > +{
> > > > > + if (unlikely(rcu_preempt_depth_dec_and_test()))
> > > > > + rcu_read_unlock_special();
> > > > > +}
> > > > > +#else
> > > > > +void __rcu_read_lock(void);
> > > > > +void __rcu_read_unlock(void);
> > > > > +#endif
> > > > > +
> > > > > +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > +#define INIT_RCU_PREEMPT_DEPTH (0)
> > > > > void __rcu_read_lock(void);
> > > > > void __rcu_read_unlock(void);
> > > > > @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
> > > > > * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
> > > > > */
> > > > > #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> > > > > +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > #else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > diff --git a/init/init_task.c b/init/init_task.c
> > > > > index 9e5cbe5eab7b..0a91e38fba37 100644
> > > > > --- a/init/init_task.c
> > > > > +++ b/init/init_task.c
> > > > > @@ -130,7 +130,7 @@ struct task_struct init_task
> > > > > .perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
> > > > > #endif
> > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > - .rcu_read_lock_nesting = 0,
> > > > > + .rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
> > > > > .rcu_read_unlock_special.s = 0,
> > > > > .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
> > > > > .rcu_blocked_node = NULL,
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index f9572f416126..7368d4ccb857 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
> > > > > static inline void rcu_copy_process(struct task_struct *p)
> > > > > {
> > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > - p->rcu_read_lock_nesting = 0;
> > > > > + p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
> > > > > p->rcu_read_unlock_special.s = 0;
> > > > > p->rcu_blocked_node = NULL;
> > > > > INIT_LIST_HEAD(&p->rcu_node_entry);
> > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > index 1cc940fef17c..d2ecca49a1a4 100644
> > > > > --- a/kernel/rcu/Kconfig
> > > > > +++ b/kernel/rcu/Kconfig
> > > > > @@ -14,6 +14,9 @@ config TREE_RCU
> > > > > thousands of CPUs. It also scales down nicely to
> > > > > smaller systems.
> > > > > +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > + def_bool n
> > > > > +
> > > > > config PREEMPT_RCU
> > > > > bool
> > > > > default y if PREEMPTION
> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > index dcb2124203cf..f919881832d4 100644
> > > > > --- a/kernel/rcu/tree_exp.h
> > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > > }
> > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > +static inline void set_rcu_preempt_need_special(void);
> > > > > /*
> > > > > * Remote handler for smp_call_function_single(). If there is an
> > > > > @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
> > > > > if (rnp->expmask & rdp->grpmask) {
> > > > > rdp->exp_deferred_qs = true;
> > > > > t->rcu_read_unlock_special.b.exp_hint = true;
> > > > > + set_rcu_preempt_need_special();
> > > > > }
> > > > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > > > return;
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index e16c3867d2ff..e6774a7ab16b 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > -static void rcu_read_unlock_special(struct task_struct *t);
> > > > > +void rcu_read_unlock_special(void);
> > > > > /*
> > > > > * Tell them what RCU they are running.
> > > > > @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
> > > > > t->rcu_read_unlock_special.b.need_qs = false;
> > > > > t->rcu_read_unlock_special.b.blocked = true;
> > > > > t->rcu_blocked_node = rnp;
> > > > > + set_rcu_preempt_need_special();
> > > > > /*
> > > > > * Verify the CPU's sanity, trace the preemption, and
> > > > > @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > > > /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > > > #define RCU_NEST_PMAX (INT_MAX / 2)
> > > > > +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > static inline void rcu_preempt_depth_inc(void)
> > > > > {
> > > > > current->rcu_read_lock_nesting++;
> > > > > @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
> > > > > static inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > {
> > > > > - return --current->rcu_read_lock_nesting == 0;
> > > > > + if (--current->rcu_read_lock_nesting == 0) {
> > > > > + /* check speical after dec ->rcu_read_lock_nesting */
> > > > > + barrier();
> > > > > + return READ_ONCE(current->rcu_read_unlock_special.s);
> > > > > + }
> > > > > + return 0;
> > > > > }
> > > > > static inline void rcu_preempt_depth_set(int val)
> > > > > @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
> > > > > current->rcu_read_lock_nesting = val;
> > > > > }
> > > > > +static inline void clear_rcu_preempt_need_special(void) {}
> > > > > +static inline void set_rcu_preempt_need_special(void) {}
> > > > > +
> > > > > +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > +
> > > > > +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
> > > > > /*
> > > > > * Preemptible RCU implementation for rcu_read_lock().
> > > > > * Just increment ->rcu_read_lock_nesting, shared state will be updated
> > > > > @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > > > > */
> > > > > void __rcu_read_unlock(void)
> > > > > {
> > > > > - struct task_struct *t = current;
> > > > > -
> > > > > if (rcu_preempt_depth_dec_and_test()) {
> > > > > - barrier(); /* critical section before exit code. */
> > > > > - if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > > - rcu_read_unlock_special(t);
> > > > > + rcu_read_unlock_special();
> > > > > }
> > > > > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > > > > - WARN_ON_ONCE(rcu_preempt_depth() < 0);
> > > > > + WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> > > > > + rcu_preempt_depth() > RCU_NEST_PMAX);
> > > > > }
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > > > > +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
> > > > > /*
> > > > > * Advance a ->blkd_tasks-list pointer to the next entry, instead
> > > > > @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > > > return;
> > > > > }
> > > > > t->rcu_read_unlock_special.s = 0;
> > > > > + clear_rcu_preempt_need_special();
> > > > > if (special.b.need_qs)
> > > > > rcu_qs();
> > > > > @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > > * notify RCU core processing or task having blocked during the RCU
> > > > > * read-side critical section.
> > > > > */
> > > > > -static void rcu_read_unlock_special(struct task_struct *t)
> > > > > +void rcu_read_unlock_special(void)
> > > > > {
> > > > > + struct task_struct *t = current;
> > > > > unsigned long flags;
> > > > > bool preempt_bh_were_disabled =
> > > > > !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > > }
> > > > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > > }
> > > > > +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
> > > > > /*
> > > > > * Check that the list of blocked tasks for the newly completed grace
> > > > > @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
> > > > > __this_cpu_read(rcu_data.core_needs_qs) &&
> > > > > __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > > > > !t->rcu_read_unlock_special.b.need_qs &&
> > > > > - time_after(jiffies, rcu_state.gp_start + HZ))
> > > > > + time_after(jiffies, rcu_state.gp_start + HZ)) {
> > > > > t->rcu_read_unlock_special.b.need_qs = true;
> > > > > + set_rcu_preempt_need_special();
> > > > > + }
> > > > > }
> > > > > /*
> > > > > @@ -714,6 +730,7 @@ void exit_rcu(void)
> > > > > rcu_preempt_depth_set(1);
> > > > > barrier();
> > > > > WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> > > > > + set_rcu_preempt_need_special();
> > > > > } else if (unlikely(rcu_preempt_depth())) {
> > > > > rcu_preempt_depth_set(1);
> > > > > } else {
> > > > > --
> > > > > 2.20.1
> > > > >

2019-11-20 02:50:34

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth



On 2019/11/20 5:14 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
>>>>
>>>>
>>>> On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
>>>>> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>>>>>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>>>>>> is that accessing per-cpu variables is a lot cheaper than accessing
>>>>>> task_struct or thread_info variables.
>>>>>>
>>>>>> We need to save/restore the actual rcu_preempt_depth when switch.
>>>>>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>>>>>> and current_task variable.
>>>>>>
>>>>>> Using the idea of per-cpu __preempt_count.
>>>>>>
>>>>>> No function call when using rcu_read_[un]lock().
>>>>>> Single instruction for rcu_read_lock().
>>>>>> 2 instructions for fast path of rcu_read_unlock().
>>>>>>
>>>>>> CC: Peter Zijlstra <[email protected]>
>>>>>> Signed-off-by: Lai Jiangshan <[email protected]>
>>>>>
>>>>> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
>>>>> just as nice for #include/inlining, and a lot less per-architecture code?
>>>>>
>>>>> Or am I missing some issue with the task-info approach?
>>>>
>>>> struct thread_info itself is a per-architecture definition.
>>>> All the arches would have to be touched if RCU's nesting-depth counter
>>>> is put into struct thread_info. Though the inlining functions can be
>>>> defined in include/asm-generic/ so that they serve all arches,
>>>> and x86 can have its own implementation in arch/x86/include/asm/.
>>>
>>> True enough.
>>>
>>> But doesn't the per-CPU code require per-architecture changes to copy
>>> to and from the per-CPU variable? If that code is simpler and smaller than
>>> the thread_info access code, I will be -very- surprised.
>>
>> The per-CPU code is not simpler. And my code touches x86 ONLY, so
>> it requires an additional CONFIG and some more "#if" in the RCU code,
>> which adds a little more complexity.
>>
>> As far as I understand, the rcu_read_lock_nesting can only possibly be
>> put in a per-CPU variable on x86, not on the other arches.
>>
>> Putting the rcu_read_lock_nesting in struct thread_info and putting
>> the inlining functions in include/asm-generic/ are OK for me. It would
>> be a well-shaped framework, reduce function calls, and still allow
>> x86 to have its own implementation.
>
> In addition, even for x86, the struct thread_info approach would avoid
> adding more instructions to the context-switch path.

For all arches, there are no additional instructions in the
context-switch path with the struct thread_info approach.

The x86 per-CPU approach adds only 2 loads and 2 stores to the
context-switch path.
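
Concretely, those are just the two halves of save_restore_rcu_preempt_depth()
quoted above, repeated here with the accesses annotated:

        /* prev task: one percpu load plus one task_struct store */
        prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);
        /* next task: one task_struct load plus one percpu store */
        this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);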

But there are tens of rcu_read_[un]lock() calls between context switches,
if not hundreds or thousands. rcu_read_[un]lock() is far more frequent
than a context switch. Every rcu_read_[un]lock() pair needs about 20
instructions now; reducing the pair to 3 instructions with the per-CPU
approach is a win in my opinion.
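
For reference, the per-CPU read-side fast path in this series is just the
inlines below (the same code as in the patch, repeated with the expected
instruction counts annotated):

        static inline void __rcu_read_lock(void)
        {
                /* one percpu increment, e.g. "incl %gs:__rcu_preempt_depth" */
                rcu_preempt_depth_inc();
        }

        static inline void __rcu_read_unlock(void)
        {
                /*
                 * One percpu "decl" plus one conditional branch on the zero
                 * flag. The counter hits 0 only when leaving the outermost
                 * section *and* the inverted RCU_NEED_SPECIAL bit has been
                 * cleared, so the call below is the unlikely slow path.
                 */
                if (unlikely(rcu_preempt_depth_dec_and_test()))
                        rcu_read_unlock_special();
        }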

>
>> The framework will only be more appealing when the x86 percpu
>> rcu_read_lock_nesting is implemented. This series implements
>> the x86 percpu rcu_read_lock_nesting first and avoids touching too
>> many files. The framework can be implemented later.
>
> If I remember correctly, you mentioned earlier that the struct
> thread_info approach was just as few instructions for rcu_read_lock()
> and rcu_read_unlock() as is the per-CPU approach. If that is the
> case, and if the struct thread_info approach avoids adding code to the
> context-switch path, why would the x86 maintainers be at all happy with
> the per-CPU approach? What am I missing here?


The struct thread_info approach would save only the function calls compared
with the current struct task_struct approach. The struct thread_info
approach and the current struct task_struct approach are otherwise the
same, except that the latter has to make function calls to avoid
C-header-file inclusion hell.

There are no additional instructions in the context-switch path with the
struct thread_info approach. But I'm not sure whether the other arch
maintainers would be at all happy with the struct thread_info approach,
since the benefit is not as large as on x86.

And maybe GCC LTO can eliminate those function calls too, so that the
struct thread_info approach and the current struct task_struct
approach would end up exactly the same. But I know nothing about
LTO.

The struct thread_info approach would still need 12 (= 5 + 7) instructions
for an rcu_read_[un]lock() pair, whereas the per-CPU approach needs only 3
instructions for every pair (fast path).
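
A generic thread_info-based variant could look roughly like this (purely a
hypothetical sketch: the rcu_preempt_depth field in struct thread_info is an
assumption, not something this series adds):

        /* hypothetical include/asm-generic/rcu_preempt_depth.h */
        static inline void __rcu_read_lock(void)
        {
                current_thread_info()->rcu_preempt_depth++;     /* assumed field */
        }

        static inline void __rcu_read_unlock(void)
        {
                if (--current_thread_info()->rcu_preempt_depth == 0) {
                        /* check special only after the depth update */
                        barrier();
                        if (unlikely(READ_ONCE(current->rcu_read_unlock_special.s)))
                                rcu_read_unlock_special();
                }
        }

Even inlined, that is a current/thread_info dereference, the depth update,
and a separate load and test of ->rcu_read_unlock_special, which is where
the extra instructions come from.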

On x86, on one hand, this patchset markedly reduces the number of
instructions needed for the rcu_read_[un]lock() pair (12 basic instructions
+ 2 function calls => 3 instructions).

On the other hand, it can be compared with the preempt_disable/enable()
pair, which was also converted away from the struct thread_info approach
by Peter Zijlstra <[email protected]> in commit c2daa3bed53a back in 2013.
Reducing the overhead of the rcu_read_[un]lock() pair is as important to
me as reducing that of the preempt_disable/enable() pair. Especially since
preempt_disable/enable() is discouraged except where really needed, while
the usage of rcu_read_[un]lock() keeps increasing.
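
The x86 per-CPU preempt_count fast path from that commit is the direct model
for this series; roughly (simplified from arch/x86/include/asm/preempt.h of
the same era):

        static __always_inline void __preempt_count_add(int val)
        {
                raw_cpu_add_4(__preempt_count, val);
        }

        static __always_inline bool __preempt_count_dec_and_test(void)
        {
                return GEN_UNARY_RMWcc("decl", __preempt_count, e,
                                       __percpu_arg([var]));
        }

rcu_preempt_depth_inc() and rcu_preempt_depth_dec_and_test() in this patch
mirror these helpers one to one.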

Thanks
Lai

>
> Thanx, Paul
>
>> Thanks
>> Lai
>>
>>
>>>
>>> Thanx, Paul
>>>
>>>>>> ---
>>>>>> arch/x86/Kconfig | 2 +
>>>>>> arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>>>>>> arch/x86/kernel/cpu/common.c | 7 ++
>>>>>> arch/x86/kernel/process_32.c | 2 +
>>>>>> arch/x86/kernel/process_64.c | 2 +
>>>>>> include/linux/rcupdate.h | 24 +++++++
>>>>>> init/init_task.c | 2 +-
>>>>>> kernel/fork.c | 2 +-
>>>>>> kernel/rcu/Kconfig | 3 +
>>>>>> kernel/rcu/tree_exp.h | 2 +
>>>>>> kernel/rcu/tree_plugin.h | 37 +++++++---
>>>>>> 11 files changed, 158 insertions(+), 12 deletions(-)
>>>>>> create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>>>>>
>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>> index d6e1faa28c58..af9fedc0fdc4 100644
>>>>>> --- a/arch/x86/Kconfig
>>>>>> +++ b/arch/x86/Kconfig
>>>>>> @@ -18,6 +18,7 @@ config X86_32
>>>>>> select MODULES_USE_ELF_REL
>>>>>> select OLD_SIGACTION
>>>>>> select GENERIC_VDSO_32
>>>>>> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> config X86_64
>>>>>> def_bool y
>>>>>> @@ -31,6 +32,7 @@ config X86_64
>>>>>> select NEED_DMA_MAP_STATE
>>>>>> select SWIOTLB
>>>>>> select ARCH_HAS_SYSCALL_WRAPPER
>>>>>> + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> config FORCE_DYNAMIC_FTRACE
>>>>>> def_bool y
>>>>>> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..88010ad59c20
>>>>>> --- /dev/null
>>>>>> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
>>>>>> @@ -0,0 +1,87 @@
>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
>>>>>> +#define __ASM_RCU_PREEMPT_DEPTH_H
>>>>>> +
>>>>>> +#include <asm/rmwcc.h>
>>>>>> +#include <asm/percpu.h>
>>>>>> +
>>>>>> +#ifdef CONFIG_PREEMPT_RCU
>>>>>> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
>>>>>> +
>>>>>> +/*
>>>>>> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
>>>>>> + * such that a decrement hitting 0 means we can and should do
>>>>>> + * rcu_read_unlock_special().
>>>>>> + */
>>>>>> +#define RCU_NEED_SPECIAL 0x80000000
>>>>>> +
>>>>>> +#define INIT_RCU_PREEMPT_DEPTH (RCU_NEED_SPECIAL)
>>>>>> +
>>>>>> +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
>>>>>> +static __always_inline int rcu_preempt_depth(void)
>>>>>> +{
>>>>>> + return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline void rcu_preempt_depth_set(int pc)
>>>>>> +{
>>>>>> + int old, new;
>>>>>> +
>>>>>> + do {
>>>>>> + old = raw_cpu_read_4(__rcu_preempt_depth);
>>>>>> + new = (old & RCU_NEED_SPECIAL) |
>>>>>> + (pc & ~RCU_NEED_SPECIAL);
>>>>>> + } while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
>>>>>> + * rcu_read_unlock() can decrement and test for needing to do special
>>>>>> + * with a single instruction.
>>>>>> + *
>>>>>> + * We invert the actual bit, so that when the decrement hits 0 we know
>>>>>> + * both it just exited the outmost rcu_read_lock() critical section and
>>>>>> + * we need to do specail (the bit is cleared) if it doesn't need to be
>>>>>> + * deferred.
>>>>>> + */
>>>>>> +
>>>>>> +static inline void set_rcu_preempt_need_special(void)
>>>>>> +{
>>>>>> + raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
>>>>>> + * and calling this function. Otherwise it may clear the work done
>>>>>> + * by set_rcu_preempt_need_special() in interrupt.
>>>>>> + */
>>>>>> +static inline void clear_rcu_preempt_need_special(void)
>>>>>> +{
>>>>>> + raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline void rcu_preempt_depth_inc(void)
>>>>>> +{
>>>>>> + raw_cpu_add_4(__rcu_preempt_depth, 1);
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
>>>>>> +{
>>>>>> + return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
>>>>>> +}
>>>>>> +
>>>>>> +/* must be macros to avoid header recursion hell */
>>>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
>>>>>> + prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
>>>>>> + this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
>>>>>> + } while (0)
>>>>>> +
>>>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH \
>>>>>> + DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH; \
>>>>>> + EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
>>>>>> +#else /* #ifdef CONFIG_PREEMPT_RCU */
>>>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
>>>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH /* empty */
>>>>>> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>>>>> +
>>>>>> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
>>>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>>>> index 9ae7d1bcd4f4..0151737e196c 100644
>>>>>> --- a/arch/x86/kernel/cpu/common.c
>>>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>>>> @@ -46,6 +46,7 @@
>>>>>> #include <asm/asm.h>
>>>>>> #include <asm/bugs.h>
>>>>>> #include <asm/cpu.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>> #include <asm/mce.h>
>>>>>> #include <asm/msr.h>
>>>>>> #include <asm/pat.h>
>>>>>> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>>>>>> DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>>> EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>>>> +/* close to __preempt_count */
>>>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>>>> +
>>>>>> /* May not be marked __init: used by software suspend */
>>>>>> void syscall_init(void)
>>>>>> {
>>>>>> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>>>>>> DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>>> EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>>>> +/* close to __preempt_count */
>>>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>>>> +
>>>>>> /*
>>>>>> * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>>>>>> * the top of the kernel stack. Use an extra percpu variable to track the
>>>>>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>>>>>> index b8ceec4974fe..ab1f20353663 100644
>>>>>> --- a/arch/x86/kernel/process_32.c
>>>>>> +++ b/arch/x86/kernel/process_32.c
>>>>>> @@ -51,6 +51,7 @@
>>>>>> #include <asm/cpu.h>
>>>>>> #include <asm/syscalls.h>
>>>>>> #include <asm/debugreg.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>> #include <asm/switch_to.h>
>>>>>> #include <asm/vm86.h>
>>>>>> #include <asm/resctrl_sched.h>
>>>>>> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>>> if (prev->gs | next->gs)
>>>>>> lazy_load_gs(next->gs);
>>>>>> + save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>>> this_cpu_write(current_task, next_p);
>>>>>> switch_fpu_finish(next_fpu);
>>>>>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>>>>>> index af64519b2695..2e1c6e829d30 100644
>>>>>> --- a/arch/x86/kernel/process_64.c
>>>>>> +++ b/arch/x86/kernel/process_64.c
>>>>>> @@ -50,6 +50,7 @@
>>>>>> #include <asm/ia32.h>
>>>>>> #include <asm/syscalls.h>
>>>>>> #include <asm/debugreg.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>> #include <asm/switch_to.h>
>>>>>> #include <asm/xen/hypervisor.h>
>>>>>> #include <asm/vdso.h>
>>>>>> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>>> x86_fsgsbase_load(prev, next);
>>>>>> + save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>>> /*
>>>>>> * Switch the PDA and FPU contexts.
>>>>>> */
>>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>>>>> index a35daab95d14..0d2abf08b694 100644
>>>>>> --- a/include/linux/rcupdate.h
>>>>>> +++ b/include/linux/rcupdate.h
>>>>>> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>>>>>> #ifdef CONFIG_PREEMPT_RCU
>>>>>> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>> +
>>>>>> +#ifndef CONFIG_PROVE_LOCKING
>>>>>> +extern void rcu_read_unlock_special(void);
>>>>>> +
>>>>>> +static inline void __rcu_read_lock(void)
>>>>>> +{
>>>>>> + rcu_preempt_depth_inc();
>>>>>> +}
>>>>>> +
>>>>>> +static inline void __rcu_read_unlock(void)
>>>>>> +{
>>>>>> + if (unlikely(rcu_preempt_depth_dec_and_test()))
>>>>>> + rcu_read_unlock_special();
>>>>>> +}
>>>>>> +#else
>>>>>> +void __rcu_read_lock(void);
>>>>>> +void __rcu_read_unlock(void);
>>>>>> +#endif
>>>>>> +
>>>>>> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>> +#define INIT_RCU_PREEMPT_DEPTH (0)
>>>>>> void __rcu_read_lock(void);
>>>>>> void __rcu_read_unlock(void);
>>>>>> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>>>>>> * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>>>>>> */
>>>>>> #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
>>>>>> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>> #else /* #ifdef CONFIG_PREEMPT_RCU */
>>>>>> diff --git a/init/init_task.c b/init/init_task.c
>>>>>> index 9e5cbe5eab7b..0a91e38fba37 100644
>>>>>> --- a/init/init_task.c
>>>>>> +++ b/init/init_task.c
>>>>>> @@ -130,7 +130,7 @@ struct task_struct init_task
>>>>>> .perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>>>>>> #endif
>>>>>> #ifdef CONFIG_PREEMPT_RCU
>>>>>> - .rcu_read_lock_nesting = 0,
>>>>>> + .rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>>>>>> .rcu_read_unlock_special.s = 0,
>>>>>> .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>>>>>> .rcu_blocked_node = NULL,
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index f9572f416126..7368d4ccb857 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>>>>>> static inline void rcu_copy_process(struct task_struct *p)
>>>>>> {
>>>>>> #ifdef CONFIG_PREEMPT_RCU
>>>>>> - p->rcu_read_lock_nesting = 0;
>>>>>> + p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>>>>>> p->rcu_read_unlock_special.s = 0;
>>>>>> p->rcu_blocked_node = NULL;
>>>>>> INIT_LIST_HEAD(&p->rcu_node_entry);
>>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>>>>> index 1cc940fef17c..d2ecca49a1a4 100644
>>>>>> --- a/kernel/rcu/Kconfig
>>>>>> +++ b/kernel/rcu/Kconfig
>>>>>> @@ -14,6 +14,9 @@ config TREE_RCU
>>>>>> thousands of CPUs. It also scales down nicely to
>>>>>> smaller systems.
>>>>>> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> + def_bool n
>>>>>> +
>>>>>> config PREEMPT_RCU
>>>>>> bool
>>>>>> default y if PREEMPTION
>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>> index dcb2124203cf..f919881832d4 100644
>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>>>>>> }
>>>>>> #ifdef CONFIG_PREEMPT_RCU
>>>>>> +static inline void set_rcu_preempt_need_special(void);
>>>>>> /*
>>>>>> * Remote handler for smp_call_function_single(). If there is an
>>>>>> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>>>>>> if (rnp->expmask & rdp->grpmask) {
>>>>>> rdp->exp_deferred_qs = true;
>>>>>> t->rcu_read_unlock_special.b.exp_hint = true;
>>>>>> + set_rcu_preempt_need_special();
>>>>>> }
>>>>>> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>>>>> return;
>>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>>> index e16c3867d2ff..e6774a7ab16b 100644
>>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>>> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>>>>>> #ifdef CONFIG_PREEMPT_RCU
>>>>>> static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
>>>>>> -static void rcu_read_unlock_special(struct task_struct *t);
>>>>>> +void rcu_read_unlock_special(void);
>>>>>> /*
>>>>>> * Tell them what RCU they are running.
>>>>>> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>>>>>> t->rcu_read_unlock_special.b.need_qs = false;
>>>>>> t->rcu_read_unlock_special.b.blocked = true;
>>>>>> t->rcu_blocked_node = rnp;
>>>>>> + set_rcu_preempt_need_special();
>>>>>> /*
>>>>>> * Verify the CPU's sanity, trace the preemption, and
>>>>>> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>>>>> /* Bias and limit values for ->rcu_read_lock_nesting. */
>>>>>> #define RCU_NEST_PMAX (INT_MAX / 2)
>>>>>> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> static inline void rcu_preempt_depth_inc(void)
>>>>>> {
>>>>>> current->rcu_read_lock_nesting++;
>>>>>> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>>>>>> static inline bool rcu_preempt_depth_dec_and_test(void)
>>>>>> {
>>>>>> - return --current->rcu_read_lock_nesting == 0;
>>>>>> + if (--current->rcu_read_lock_nesting == 0) {
>>>>>> + /* check speical after dec ->rcu_read_lock_nesting */
>>>>>> + barrier();
>>>>>> + return READ_ONCE(current->rcu_read_unlock_special.s);
>>>>>> + }
>>>>>> + return 0;
>>>>>> }
>>>>>> static inline void rcu_preempt_depth_set(int val)
>>>>>> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>>>>>> current->rcu_read_lock_nesting = val;
>>>>>> }
>>>>>> +static inline void clear_rcu_preempt_need_special(void) {}
>>>>>> +static inline void set_rcu_preempt_need_special(void) {}
>>>>>> +
>>>>>> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>> +
>>>>>> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>>>>>> /*
>>>>>> * Preemptible RCU implementation for rcu_read_lock().
>>>>>> * Just increment ->rcu_read_lock_nesting, shared state will be updated
>>>>>> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>>>>>> */
>>>>>> void __rcu_read_unlock(void)
>>>>>> {
>>>>>> - struct task_struct *t = current;
>>>>>> -
>>>>>> if (rcu_preempt_depth_dec_and_test()) {
>>>>>> - barrier(); /* critical section before exit code. */
>>>>>> - if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>>>>>> - rcu_read_unlock_special(t);
>>>>>> + rcu_read_unlock_special();
>>>>>> }
>>>>>> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>>>>>> - WARN_ON_ONCE(rcu_preempt_depth() < 0);
>>>>>> + WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
>>>>>> + rcu_preempt_depth() > RCU_NEST_PMAX);
>>>>>> }
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>>>>>> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>>>>>> /*
>>>>>> * Advance a ->blkd_tasks-list pointer to the next entry, instead
>>>>>> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>>>> return;
>>>>>> }
>>>>>> t->rcu_read_unlock_special.s = 0;
>>>>>> + clear_rcu_preempt_need_special();
>>>>>> if (special.b.need_qs)
>>>>>> rcu_qs();
>>>>>> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>>>>> * notify RCU core processing or task having blocked during the RCU
>>>>>> * read-side critical section.
>>>>>> */
>>>>>> -static void rcu_read_unlock_special(struct task_struct *t)
>>>>>> +void rcu_read_unlock_special(void)
>>>>>> {
>>>>>> + struct task_struct *t = current;
>>>>>> unsigned long flags;
>>>>>> bool preempt_bh_were_disabled =
>>>>>> !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
>>>>>> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>>>>> }
>>>>>> rcu_preempt_deferred_qs_irqrestore(t, flags);
>>>>>> }
>>>>>> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>>>>>> /*
>>>>>> * Check that the list of blocked tasks for the newly completed grace
>>>>>> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>>>>>> __this_cpu_read(rcu_data.core_needs_qs) &&
>>>>>> __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>>>>>> !t->rcu_read_unlock_special.b.need_qs &&
>>>>>> - time_after(jiffies, rcu_state.gp_start + HZ))
>>>>>> + time_after(jiffies, rcu_state.gp_start + HZ)) {
>>>>>> t->rcu_read_unlock_special.b.need_qs = true;
>>>>>> + set_rcu_preempt_need_special();
>>>>>> + }
>>>>>> }
>>>>>> /*
>>>>>> @@ -714,6 +730,7 @@ void exit_rcu(void)
>>>>>> rcu_preempt_depth_set(1);
>>>>>> barrier();
>>>>>> WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
>>>>>> + set_rcu_preempt_need_special();
>>>>>> } else if (unlikely(rcu_preempt_depth())) {
>>>>>> rcu_preempt_depth_set(1);
>>>>>> } else {
>>>>>> --
>>>>>> 2.20.1
>>>>>>

2019-11-21 04:06:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth

On Wed, Nov 20, 2019 at 10:47:53AM +0800, Lai Jiangshan wrote:
>
>
> On 2019/11/20 5:14 AM, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
> > >
> > >
> > > On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
> > > > On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
> > > > >
> > > > >
> > > > > On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > > > > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > > > > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > > > > > task_struct or thread_info variables.
> > > > > > >
> > > > > > > We need to save/restore the actual rcu_preempt_depth when switch.
> > > > > > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > > > > > and current_task variable.
> > > > > > >
> > > > > > > Using the idea of per-cpu __preempt_count.
> > > > > > >
> > > > > > > No function call when using rcu_read_[un]lock().
> > > > > > > Single instruction for rcu_read_lock().
> > > > > > > 2 instructions for fast path of rcu_read_unlock().
> > > > > > >
> > > > > > > CC: Peter Zijlstra <[email protected]>
> > > > > > > Signed-off-by: Lai Jiangshan <[email protected]>
> > > > > >
> > > > > > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > > > > > just as nice for #include/inlining, and a lot less per-architecture code?
> > > > > >
> > > > > > Or am I missing some issue with the task-info approach?
> > > > >
> > > > > struct thread_info itself is a per-architecture definition.
> > > > > All the arches would have to be touched if RCU's nesting-depth counter
> > > > > is put into struct thread_info. Though the inlining functions can be
> > > > > defined in include/asm-generic/ so that they serve all arches,
> > > > > and x86 can have its own implementation in arch/x86/include/asm/.
> > > >
> > > > True enough.
> > > >
> > > > But doesn't the per-CPU code require per-architecture changes to copy
> > > > to and from the per-CPU variable? If that code is simpler and smaller than
> > > > the thread_info access code, I will be -very- surprised.
> > >
> > > The per-CPU code is not simpler. And my code touches x86 ONLY, so
> > > it requires an additional CONFIG and some more "#if" in the RCU code,
> > > which adds a little more complexity.
> > >
> > > As far as I understand, the rcu_read_lock_nesting can only possibly be
> > > put in a per-CPU variable on x86, not on the other arches.
> > >
> > > Putting the rcu_read_lock_nesting in struct thread_info and putting
> > > the inlining functions in include/asm-generic/ are OK for me. It would
> > > be a well-shaped framework, reduce function calls, and still allow
> > > x86 to have its own implementation.
> >
> > In addition, even for x86, the struct thread_info approach would avoid
> > adding more instructions to the context-switch path.
>
> For all arches, there are no additional instructions in the
> context-switch path with the struct thread_info approach.
>
> The x86 per-CPU approach adds only 2 loads and 2 stores to the
> context-switch path.
>
> But there are tens of rcu_read_[un]lock() calls between context switches,
> if not hundreds or thousands. rcu_read_[un]lock() is far more frequent
> than a context switch. Every rcu_read_[un]lock() pair needs about 20
> instructions now; reducing the pair to 3 instructions with the per-CPU
> approach is a win in my opinion.
>
> > > The framework will only be more appealing when the x86 percpu
> > > rcu_read_lock_nesting is implemented. This series implements
> > > the x86 percpu rcu_read_lock_nesting first and avoids touching too
> > > many files. The framework can be implemented later.
> >
> > If I remember correctly, you mentioned earlier that the struct
> > thread_info approach was just as few instructions for rcu_read_lock()
> > and rcu_read_unlock() as is the per-CPU approach. If that is the
> > case, and if the struct thread_info approach avoids adding code to the
> > context-switch path, why would the x86 maintainers be at all happy with
> > the per-CPU approach? What am I missing here?
>
> The struct thread_info approach would save only the function calls compared
> with the current struct task_struct approach. The struct thread_info
> approach and the current struct task_struct approach are otherwise the
> same, except that the latter has to make function calls to avoid
> C-header-file inclusion hell.
>
> There are no additional instructions in the context-switch path with the
> struct thread_info approach. But I'm not sure whether the other arch
> maintainers would be at all happy with the struct thread_info approach,
> since the benefit is not as large as on x86.
>
> And maybe GCC LTO can eliminate those function calls too, so that the
> struct thread_info approach and the current struct task_struct
> approach would end up exactly the same. But I know nothing about
> LTO.
>
> The struct thread_info approach would still need 12 (= 5 + 7) instructions
> for an rcu_read_[un]lock() pair, whereas the per-CPU approach needs only 3
> instructions for every pair (fast path).

OK, that does sound like a potential benefit. Let's get the preparatory
work (needed in either case) done. At the end of the day, the final
decision will be with the x86 maintainers and the scheduler maintainer.

But I have to ask... Is this difference in number of instructions
visible at the system level in any way?

> On x86, on one hand, this patchset markedly reduces the number of
> instructions needed for the rcu_read_[un]lock() pair (12 basic instructions
> + 2 function calls => 3 instructions).
>
> On the other hand, it can be compared with the preempt_disable/enable()
> pair, which was also converted away from the struct thread_info approach
> by Peter Zijlstra <[email protected]> in commit c2daa3bed53a back in 2013.
> Reducing the overhead of the rcu_read_[un]lock() pair is as important to
> me as reducing that of the preempt_disable/enable() pair. Especially since
> preempt_disable/enable() is discouraged except where really needed, while
> the usage of rcu_read_[un]lock() keeps increasing.

Then perhaps when the time comes we will get an ack from Peter.

But again, I need to get expedited grace periods fixed up for nohz_full
CPUs in either case, and again this will have priority next week.

Thanx, Paul

> Thanks
> Lai
>
> >
> > Thanx, Paul
> >
> > > Thanks
> > > Lai
> > >
> > >
> > > >
> > > > Thanx, Paul
> > > >
> > > > > > > ---
> > > > > > > arch/x86/Kconfig | 2 +
> > > > > > > arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> > > > > > > arch/x86/kernel/cpu/common.c | 7 ++
> > > > > > > arch/x86/kernel/process_32.c | 2 +
> > > > > > > arch/x86/kernel/process_64.c | 2 +
> > > > > > > include/linux/rcupdate.h | 24 +++++++
> > > > > > > init/init_task.c | 2 +-
> > > > > > > kernel/fork.c | 2 +-
> > > > > > > kernel/rcu/Kconfig | 3 +
> > > > > > > kernel/rcu/tree_exp.h | 2 +
> > > > > > > kernel/rcu/tree_plugin.h | 37 +++++++---
> > > > > > > 11 files changed, 158 insertions(+), 12 deletions(-)
> > > > > > > create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > >
> > > > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > > > index d6e1faa28c58..af9fedc0fdc4 100644
> > > > > > > --- a/arch/x86/Kconfig
> > > > > > > +++ b/arch/x86/Kconfig
> > > > > > > @@ -18,6 +18,7 @@ config X86_32
> > > > > > > select MODULES_USE_ELF_REL
> > > > > > > select OLD_SIGACTION
> > > > > > > select GENERIC_VDSO_32
> > > > > > > + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > config X86_64
> > > > > > > def_bool y
> > > > > > > @@ -31,6 +32,7 @@ config X86_64
> > > > > > > select NEED_DMA_MAP_STATE
> > > > > > > select SWIOTLB
> > > > > > > select ARCH_HAS_SYSCALL_WRAPPER
> > > > > > > + select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > config FORCE_DYNAMIC_FTRACE
> > > > > > > def_bool y
> > > > > > > diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..88010ad59c20
> > > > > > > --- /dev/null
> > > > > > > +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > > @@ -0,0 +1,87 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> > > > > > > +#define __ASM_RCU_PREEMPT_DEPTH_H
> > > > > > > +
> > > > > > > +#include <asm/rmwcc.h>
> > > > > > > +#include <asm/percpu.h>
> > > > > > > +
> > > > > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > > > > +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> > > > > > > + * such that a decrement hitting 0 means we can and should do
> > > > > > > + * rcu_read_unlock_special().
> > > > > > > + */
> > > > > > > +#define RCU_NEED_SPECIAL 0x80000000
> > > > > > > +
> > > > > > > +#define INIT_RCU_PREEMPT_DEPTH (RCU_NEED_SPECIAL)
> > > > > > > +
> > > > > > > +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> > > > > > > +static __always_inline int rcu_preempt_depth(void)
> > > > > > > +{
> > > > > > > + return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > > > > > +{
> > > > > > > + int old, new;
> > > > > > > +
> > > > > > > + do {
> > > > > > > + old = raw_cpu_read_4(__rcu_preempt_depth);
> > > > > > > + new = (old & RCU_NEED_SPECIAL) |
> > > > > > > + (pc & ~RCU_NEED_SPECIAL);
> > > > > > > + } while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> > > > > > > + * rcu_read_unlock() can decrement and test for needing to do special
> > > > > > > + * with a single instruction.
> > > > > > > + *
> > > > > > > + * We invert the actual bit, so that when the decrement hits 0 we know
> > > > > > > + * both it just exited the outmost rcu_read_lock() critical section and
> > > > > > > + * we need to do specail (the bit is cleared) if it doesn't need to be
> > > > > > > + * deferred.
> > > > > > > + */
> > > > > > > +
> > > > > > > +static inline void set_rcu_preempt_need_special(void)
> > > > > > > +{
> > > > > > > + raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
> > > > > > > + * and calling this function. Otherwise it may clear the work done
> > > > > > > + * by set_rcu_preempt_need_special() in interrupt.
> > > > > > > + */
> > > > > > > +static inline void clear_rcu_preempt_need_special(void)
> > > > > > > +{
> > > > > > > + raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline void rcu_preempt_depth_inc(void)
> > > > > > > +{
> > > > > > > + raw_cpu_add_4(__rcu_preempt_depth, 1);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > > > +{
> > > > > > > + return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* must be macros to avoid header recursion hell */
> > > > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do { \
> > > > > > > + prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth); \
> > > > > > > + this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting); \
> > > > > > > + } while (0)
> > > > > > > +
> > > > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH \
> > > > > > > + DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH; \
> > > > > > > + EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> > > > > > > +#else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> > > > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH /* empty */
> > > > > > > +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > +
> > > > > > > +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> > > > > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > > > > index 9ae7d1bcd4f4..0151737e196c 100644
> > > > > > > --- a/arch/x86/kernel/cpu/common.c
> > > > > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > > > > @@ -46,6 +46,7 @@
> > > > > > > #include <asm/asm.h>
> > > > > > > #include <asm/bugs.h>
> > > > > > > #include <asm/cpu.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > > #include <asm/mce.h>
> > > > > > > #include <asm/msr.h>
> > > > > > > #include <asm/pat.h>
> > > > > > > @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
> > > > > > > DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > > > EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > > > +/* close to __preempt_count */
> > > > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > > > +
> > > > > > > /* May not be marked __init: used by software suspend */
> > > > > > > void syscall_init(void)
> > > > > > > {
> > > > > > > @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
> > > > > > > DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > > > EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > > > +/* close to __preempt_count */
> > > > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > > > +
> > > > > > > /*
> > > > > > > * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
> > > > > > > * the top of the kernel stack. Use an extra percpu variable to track the
> > > > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > > > > > > index b8ceec4974fe..ab1f20353663 100644
> > > > > > > --- a/arch/x86/kernel/process_32.c
> > > > > > > +++ b/arch/x86/kernel/process_32.c
> > > > > > > @@ -51,6 +51,7 @@
> > > > > > > #include <asm/cpu.h>
> > > > > > > #include <asm/syscalls.h>
> > > > > > > #include <asm/debugreg.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > > #include <asm/switch_to.h>
> > > > > > > #include <asm/vm86.h>
> > > > > > > #include <asm/resctrl_sched.h>
> > > > > > > @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > > > if (prev->gs | next->gs)
> > > > > > > lazy_load_gs(next->gs);
> > > > > > > + save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > > > this_cpu_write(current_task, next_p);
> > > > > > > switch_fpu_finish(next_fpu);
> > > > > > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > > > > > index af64519b2695..2e1c6e829d30 100644
> > > > > > > --- a/arch/x86/kernel/process_64.c
> > > > > > > +++ b/arch/x86/kernel/process_64.c
> > > > > > > @@ -50,6 +50,7 @@
> > > > > > > #include <asm/ia32.h>
> > > > > > > #include <asm/syscalls.h>
> > > > > > > #include <asm/debugreg.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > > #include <asm/switch_to.h>
> > > > > > > #include <asm/xen/hypervisor.h>
> > > > > > > #include <asm/vdso.h>
> > > > > > > @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > > > x86_fsgsbase_load(prev, next);
> > > > > > > + save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > > > /*
> > > > > > > * Switch the PDA and FPU contexts.
> > > > > > > */
> > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > > > > index a35daab95d14..0d2abf08b694 100644
> > > > > > > --- a/include/linux/rcupdate.h
> > > > > > > +++ b/include/linux/rcupdate.h
> > > > > > > @@ -41,6 +41,29 @@ void synchronize_rcu(void);
> > > > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > > > +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > > +
> > > > > > > +#ifndef CONFIG_PROVE_LOCKING
> > > > > > > +extern void rcu_read_unlock_special(void);
> > > > > > > +
> > > > > > > +static inline void __rcu_read_lock(void)
> > > > > > > +{
> > > > > > > + rcu_preempt_depth_inc();
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline void __rcu_read_unlock(void)
> > > > > > > +{
> > > > > > > + if (unlikely(rcu_preempt_depth_dec_and_test()))
> > > > > > > + rcu_read_unlock_special();
> > > > > > > +}
> > > > > > > +#else
> > > > > > > +void __rcu_read_lock(void);
> > > > > > > +void __rcu_read_unlock(void);
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > > +#define INIT_RCU_PREEMPT_DEPTH (0)
> > > > > > > void __rcu_read_lock(void);
> > > > > > > void __rcu_read_unlock(void);
> > > > > > > @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
> > > > > > > * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
> > > > > > > */
> > > > > > > #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> > > > > > > +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > > #else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > diff --git a/init/init_task.c b/init/init_task.c
> > > > > > > index 9e5cbe5eab7b..0a91e38fba37 100644
> > > > > > > --- a/init/init_task.c
> > > > > > > +++ b/init/init_task.c
> > > > > > > @@ -130,7 +130,7 @@ struct task_struct init_task
> > > > > > > .perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
> > > > > > > #endif
> > > > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > > > - .rcu_read_lock_nesting = 0,
> > > > > > > + .rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
> > > > > > > .rcu_read_unlock_special.s = 0,
> > > > > > > .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
> > > > > > > .rcu_blocked_node = NULL,
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index f9572f416126..7368d4ccb857 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
> > > > > > > static inline void rcu_copy_process(struct task_struct *p)
> > > > > > > {
> > > > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > > > - p->rcu_read_lock_nesting = 0;
> > > > > > > + p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
> > > > > > > p->rcu_read_unlock_special.s = 0;
> > > > > > > p->rcu_blocked_node = NULL;
> > > > > > > INIT_LIST_HEAD(&p->rcu_node_entry);
> > > > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > > > index 1cc940fef17c..d2ecca49a1a4 100644
> > > > > > > --- a/kernel/rcu/Kconfig
> > > > > > > +++ b/kernel/rcu/Kconfig
> > > > > > > @@ -14,6 +14,9 @@ config TREE_RCU
> > > > > > > thousands of CPUs. It also scales down nicely to
> > > > > > > smaller systems.
> > > > > > > +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > + def_bool n
> > > > > > > +
> > > > > > > config PREEMPT_RCU
> > > > > > > bool
> > > > > > > default y if PREEMPTION
> > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > > > index dcb2124203cf..f919881832d4 100644
> > > > > > > --- a/kernel/rcu/tree_exp.h
> > > > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > > > @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > > > > }
> > > > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > > > +static inline void set_rcu_preempt_need_special(void);
> > > > > > > /*
> > > > > > > * Remote handler for smp_call_function_single(). If there is an
> > > > > > > @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
> > > > > > > if (rnp->expmask & rdp->grpmask) {
> > > > > > > rdp->exp_deferred_qs = true;
> > > > > > > t->rcu_read_unlock_special.b.exp_hint = true;
> > > > > > > + set_rcu_preempt_need_special();
> > > > > > > }
> > > > > > > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > > > > > return;
> > > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > > index e16c3867d2ff..e6774a7ab16b 100644
> > > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > > @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > > > #ifdef CONFIG_PREEMPT_RCU
> > > > > > > static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > > -static void rcu_read_unlock_special(struct task_struct *t);
> > > > > > > +void rcu_read_unlock_special(void);
> > > > > > > /*
> > > > > > > * Tell them what RCU they are running.
> > > > > > > @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
> > > > > > > t->rcu_read_unlock_special.b.need_qs = false;
> > > > > > > t->rcu_read_unlock_special.b.blocked = true;
> > > > > > > t->rcu_blocked_node = rnp;
> > > > > > > + set_rcu_preempt_need_special();
> > > > > > > /*
> > > > > > > * Verify the CPU's sanity, trace the preemption, and
> > > > > > > @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > > > > > /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > > > > > #define RCU_NEST_PMAX (INT_MAX / 2)
> > > > > > > +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > static inline void rcu_preempt_depth_inc(void)
> > > > > > > {
> > > > > > > current->rcu_read_lock_nesting++;
> > > > > > > @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
> > > > > > > static inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > > > {
> > > > > > > - return --current->rcu_read_lock_nesting == 0;
> > > > > > > + if (--current->rcu_read_lock_nesting == 0) {
> > > > > > > + /* check special after dec ->rcu_read_lock_nesting */
> > > > > > > + barrier();
> > > > > > > + return READ_ONCE(current->rcu_read_unlock_special.s);
> > > > > > > + }
> > > > > > > + return 0;
> > > > > > > }
> > > > > > > static inline void rcu_preempt_depth_set(int val)
> > > > > > > @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
> > > > > > > current->rcu_read_lock_nesting = val;
> > > > > > > }
> > > > > > > +static inline void clear_rcu_preempt_need_special(void) {}
> > > > > > > +static inline void set_rcu_preempt_need_special(void) {}
> > > > > > > +
> > > > > > > +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > > +
> > > > > > > +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
> > > > > > > /*
> > > > > > > * Preemptible RCU implementation for rcu_read_lock().
> > > > > > > * Just increment ->rcu_read_lock_nesting, shared state will be updated
> > > > > > > @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > > > > > > */
> > > > > > > void __rcu_read_unlock(void)
> > > > > > > {
> > > > > > > - struct task_struct *t = current;
> > > > > > > -
> > > > > > > if (rcu_preempt_depth_dec_and_test()) {
> > > > > > > - barrier(); /* critical section before exit code. */
> > > > > > > - if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > > > > - rcu_read_unlock_special(t);
> > > > > > > + rcu_read_unlock_special();
> > > > > > > }
> > > > > > > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > > > > > > - WARN_ON_ONCE(rcu_preempt_depth() < 0);
> > > > > > > + WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> > > > > > > + rcu_preempt_depth() > RCU_NEST_PMAX);
> > > > > > > }
> > > > > > > }
> > > > > > > EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > > > > > > +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
> > > > > > > /*
> > > > > > > * Advance a ->blkd_tasks-list pointer to the next entry, instead
> > > > > > > @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > > > > > return;
> > > > > > > }
> > > > > > > t->rcu_read_unlock_special.s = 0;
> > > > > > > + clear_rcu_preempt_need_special();
> > > > > > > if (special.b.need_qs)
> > > > > > > rcu_qs();
> > > > > > > @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > > > > * notify RCU core processing or task having blocked during the RCU
> > > > > > > * read-side critical section.
> > > > > > > */
> > > > > > > -static void rcu_read_unlock_special(struct task_struct *t)
> > > > > > > +void rcu_read_unlock_special(void)
> > > > > > > {
> > > > > > > + struct task_struct *t = current;
> > > > > > > unsigned long flags;
> > > > > > > bool preempt_bh_were_disabled =
> > > > > > > !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > > > @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > > > > }
> > > > > > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > > > > }
> > > > > > > +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
> > > > > > > /*
> > > > > > > * Check that the list of blocked tasks for the newly completed grace
> > > > > > > @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
> > > > > > > __this_cpu_read(rcu_data.core_needs_qs) &&
> > > > > > > __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > > > > > > !t->rcu_read_unlock_special.b.need_qs &&
> > > > > > > - time_after(jiffies, rcu_state.gp_start + HZ))
> > > > > > > + time_after(jiffies, rcu_state.gp_start + HZ)) {
> > > > > > > t->rcu_read_unlock_special.b.need_qs = true;
> > > > > > > + set_rcu_preempt_need_special();
> > > > > > > + }
> > > > > > > }
> > > > > > > /*
> > > > > > > @@ -714,6 +730,7 @@ void exit_rcu(void)
> > > > > > > rcu_preempt_depth_set(1);
> > > > > > > barrier();
> > > > > > > WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> > > > > > > + set_rcu_preempt_need_special();
> > > > > > > } else if (unlikely(rcu_preempt_depth())) {
> > > > > > > rcu_preempt_depth_set(1);
> > > > > > > } else {
> > > > > > > --
> > > > > > > 2.20.1
> > > > > > >

2020-02-17 23:26:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting

On Sat, Nov 02, 2019 at 12:45:56PM +0000, Lai Jiangshan wrote:
> Negative ->rcu_read_lock_nesting was introduced to prevent
> scheduler deadlock. But now with the help of deferred qs
> mechanism, we can defer qs rather than persevere in reporting qs
> and deadlock. So negative ->rcu_read_lock_nesting is useless now
> and rcu_read_unlock() can be simplified.
>
> Signed-off-by: Lai Jiangshan <[email protected]>

I have queued this for further review and testing, thank you!

There were a few adjustments, so please see the updated patch below.

Thanx, Paul

------------------------------------------------------------------------

commit 756b5aea6df6d769a346d4b55cc66707b2d607a9
Author: Lai Jiangshan <[email protected]>
Date: Sat Feb 15 15:23:26 2020 -0800

rcu: Don't use negative nesting depth in __rcu_read_unlock()

Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section. Consider the old vulnerability
involving rcu_read_unlock() being invoked within such a handler that
interrupted an __rcu_read_unlock_special(), in which a wakeup might be
invoked with a scheduler lock held. Because rcu_read_unlock_special()
no longer does wakeups in such situations, it is no longer necessary
for __rcu_read_unlock() to set the nesting level negative.

This commit therefore removes this recursion-protection code from
__rcu_read_unlock().

Signed-off-by: Lai Jiangshan <[email protected]>
[ paulmck: Let rcu_exp_handler() continue to call rcu_report_exp_rdp(). ]
[ paulmck: Adjust other checks given no more negative nesting. ]
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index c2b04da..72952ed 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -639,6 +639,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
*/
static void rcu_exp_handler(void *unused)
{
+ int depth = rcu_preempt_depth();
unsigned long flags;
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
struct rcu_node *rnp = rdp->mynode;
@@ -649,7 +650,7 @@ static void rcu_exp_handler(void *unused)
* critical section. If also enabled or idle, immediately
* report the quiescent state, otherwise defer.
*/
- if (!rcu_preempt_depth()) {
+ if (!depth) {
if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
rcu_dynticks_curr_cpu_in_eqs()) {
rcu_report_exp_rdp(rdp);
@@ -673,7 +674,7 @@ static void rcu_exp_handler(void *unused)
* can have caused this quiescent state to already have been
* reported, so we really do need to check ->expmask.
*/
- if (rcu_preempt_depth() > 0) {
+ if (depth > 0) {
raw_spin_lock_irqsave_rcu_node(rnp, flags);
if (rnp->expmask & rdp->grpmask) {
rdp->exp_deferred_qs = true;
@@ -683,30 +684,8 @@ static void rcu_exp_handler(void *unused)
return;
}

- /*
- * The final and least likely case is where the interrupted
- * code was just about to or just finished exiting the RCU-preempt
- * read-side critical section, and no, we can't tell which.
- * So either way, set ->deferred_qs to flag later code that
- * a quiescent state is required.
- *
- * If the CPU is fully enabled (or if some buggy RCU-preempt
- * read-side critical section is being used from idle), just
- * invoke rcu_preempt_deferred_qs() to immediately report the
- * quiescent state. We cannot use rcu_read_unlock_special()
- * because we are in an interrupt handler, which will cause that
- * function to take an early exit without doing anything.
- *
- * Otherwise, force a context switch after the CPU enables everything.
- */
- rdp->exp_deferred_qs = true;
- if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
- WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
- rcu_preempt_deferred_qs(t);
- } else {
- set_tsk_need_resched(t);
- set_preempt_need_resched();
- }
+ // Finally, negative nesting depth should not happen.
+ WARN_ON_ONCE(1);
}

/* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index be3d100..571b7c9 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -345,9 +345,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
return READ_ONCE(rnp->gp_tasks) != NULL;
}

-/* Bias and limit values for ->rcu_read_lock_nesting. */
-#define RCU_NEST_BIAS INT_MAX
-#define RCU_NEST_NMAX (-INT_MAX / 2)
+/* limit value for ->rcu_read_lock_nesting. */
#define RCU_NEST_PMAX (INT_MAX / 2)

static void rcu_preempt_read_enter(void)
@@ -355,9 +353,9 @@ static void rcu_preempt_read_enter(void)
current->rcu_read_lock_nesting++;
}

-static void rcu_preempt_read_exit(void)
+static int rcu_preempt_read_exit(void)
{
- current->rcu_read_lock_nesting--;
+ return --current->rcu_read_lock_nesting;
}

static void rcu_preempt_depth_set(int val)
@@ -390,21 +388,15 @@ void __rcu_read_unlock(void)
{
struct task_struct *t = current;

- if (rcu_preempt_depth() != 1) {
- rcu_preempt_read_exit();
- } else {
+ if (rcu_preempt_read_exit() == 0) {
barrier(); /* critical section before exit code. */
- rcu_preempt_depth_set(-RCU_NEST_BIAS);
- barrier(); /* assign before ->rcu_read_unlock_special load */
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
- barrier(); /* ->rcu_read_unlock_special load before assign */
- rcu_preempt_depth_set(0);
}
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
int rrln = rcu_preempt_depth();

- WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
+ WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -556,7 +548,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
{
return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
READ_ONCE(t->rcu_read_unlock_special.s)) &&
- rcu_preempt_depth() <= 0;
+ rcu_preempt_depth() == 0;
}

/*
@@ -692,7 +684,7 @@ static void rcu_flavor_sched_clock_irq(int user)
} else if (rcu_preempt_need_deferred_qs(t)) {
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
return;
- } else if (!rcu_preempt_depth()) {
+ } else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
rcu_qs(); /* Report immediate QS. */
return;
}
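
For reference, with the hunk above applied, the __rcu_read_unlock() fast
path reads as follows (reassembled from the diff context; indentation
reconstructed, so it may not match the tree byte for byte):

void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (rcu_preempt_read_exit() == 0) {
		barrier(); /* critical section before exit code. */
		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
			rcu_read_unlock_special(t);
	}
	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
		int rrln = rcu_preempt_depth();

		WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
	}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);

The single rcu_preempt_read_exit() == 0 test replaces the old
set-negative/restore-zero sequence, leaving only the PROVE_LOCKING
range check behind it.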

2020-02-19 03:34:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held

On Sat, Nov 02, 2019 at 12:45:53PM +0000, Lai Jiangshan wrote:
> Ever since preemption was introduced to the Linux kernel,
> irq disabled spinlocks have always been held with preemption
> disabled. One of the reasons is that sometimes we need
> to use spin_unlock(), which does preempt_enable(), to
> unlock an irq disabled spinlock while keeping irqs
> disabled. So preempt_count can be used to test whether
> scheduler locks could possibly be held.
>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> kernel/rcu/tree_plugin.h | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0982e9886103..aba5896d67e3 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
> tick_nohz_full_cpu(rdp->cpu);
> // Need to defer quiescent state until everything is enabled.
> if (irqs_were_disabled && use_softirq &&
> - (in_interrupt() ||
> - (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> + (in_interrupt() || (exp && !preempt_bh_were_disabled))) {
> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> + // in_interrupt(): raise_softirq_irqoff() is
> + // guaranteed not to do a wakeup
> + // !preempt_bh_were_disabled: scheduler locks cannot
> + // be held, since spinlocks are always held with
> + // preempt_disable(), so the wakeup will be safe.

This means if preemption is disabled for any reason (other than scheduler
locks), such as acquiring an unrelated lock that is not held by the
scheduler, then the softirq would not be raised even if it was safe to
do so. In that respect, it seems like a step back, no?
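
To make the scenario concrete, a hypothetical caller might look like the
following (my_driver_lock and driver_path() are invented for illustration
and are not part of the patch):

/* Hypothetical illustration only; not from the patch. */
static DEFINE_SPINLOCK(my_driver_lock);	/* unrelated lock, never taken by the scheduler */

static void driver_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&my_driver_lock, flags);	/* irqs and preemption now disabled */
	rcu_read_lock();
	/* ... read-side work, during which an expedited GP targets this CPU ... */
	rcu_read_unlock();	/* exp && irqs_were_disabled here, but
				 * preempt_bh_were_disabled is also true,
				 * so the softirq branch above is skipped
				 * even though no scheduler lock is held */
	spin_unlock_irqrestore(&my_driver_lock, flags);
}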

thanks,

- Joel


> raise_softirq_irqoff(RCU_SOFTIRQ);
> } else {
> // Enabling BH or preempt does reschedule, so...
> --
> 2.20.1
>

2020-02-19 04:00:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held

On Tue, Feb 18, 2020 at 10:31:47PM -0500, Joel Fernandes wrote:
> On Sat, Nov 02, 2019 at 12:45:53PM +0000, Lai Jiangshan wrote:
> > Ever since preemption was introduced to the Linux kernel,
> > irq disabled spinlocks have always been held with preemption
> > disabled. One of the reasons is that sometimes we need
> > to use spin_unlock(), which does preempt_enable(), to
> > unlock an irq disabled spinlock while keeping irqs
> > disabled. So preempt_count can be used to test whether
> > scheduler locks could possibly be held.
> >
> > CC: Peter Zijlstra <[email protected]>
> > Signed-off-by: Lai Jiangshan <[email protected]>
> > ---
> > kernel/rcu/tree_plugin.h | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 0982e9886103..aba5896d67e3 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > tick_nohz_full_cpu(rdp->cpu);
> > // Need to defer quiescent state until everything is enabled.
> > if (irqs_were_disabled && use_softirq &&
> > - (in_interrupt() ||
> > - (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> > + (in_interrupt() || (exp && !preempt_bh_were_disabled))) {
> > // Using softirq, safe to awaken, and we get
> > // no help from enabling irqs, unlike bh/preempt.
> > + // in_interrupt(): raise_softirq_irqoff() is
> > + // guaranteed not to do a wakeup
> > + // !preempt_bh_were_disabled: scheduler locks cannot
> > + // be held, since spinlocks are always held with
> > + // preempt_disable(), so the wakeup will be safe.
>
> This means if preemption is disabled for any reason (other than scheduler
> locks), such as acquiring an unrelated lock that is not held by the
> scheduler, then the softirq would not be raised even if it was safe to
> do so. In that respect, it seems like a step back, no?

This patch was one of the things motivating me to turn tick on for
nohz_full CPUs that spend too long in the kernel. Given that change,
this patch can be (and recently was) made more straightforward. Prior to
the nohz_full change, Lai was kind of between a rock and a hard place
on this one. ;-)

Thanx, Paul

> thanks,
>
> - Joel
>
>
> > raise_softirq_irqoff(RCU_SOFTIRQ);
> > } else {
> > // Enabling BH or preempt does reschedule, so...
> > --
> > 2.20.1
> >