2018-04-05 17:19:51

by Yury Norov

Subject: [PATCH v2 0/5] smp: don't kick CPUs running idle or nohz_full tasks

kick_all_cpus_sync() is used to broadcast IPIs to all online CPUs to force
them to synchronize caches, TLB, etc. It is called from only three places:
mm/slab, arm64 and powerpc code.

We can delay synchronization work for CPUs in extended quiescent state
(idle or nohz_full userspace).

As Paul E. McKenney wrote:

--

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usermode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

--
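To make the idea concrete, here is a minimal conceptual sketch of the
deferred-flush scheme described above (illustration only: the helper names
are invented, it assumes arm64's local_flush_tlb_all(), and this series
achieves the same effect by relying on the synchronization already done at
EQS exit rather than on per-CPU flags):

#include <linux/percpu.h>
#include <asm/tlbflush.h>

/* One pending-work flag per CPU; no IPI is sent to usermode CPUs. */
static DEFINE_PER_CPU(bool, deferred_kernel_sync);

/* Called by the CPU that changed kernel mappings, for each remote CPU. */
static void request_deferred_sync(int cpu)
{
	per_cpu(deferred_kernel_sync, cpu) = true;
}

/* Called early on kernel entry by the remote CPU itself. */
static void do_deferred_sync(void)
{
	if (this_cpu_read(deferred_kernel_sync)) {
		this_cpu_write(deferred_kernel_sync, false);
		local_flush_tlb_all();	/* one flush covers many requests */
	}
}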

v2 is a big rework to address the comments on v1:
- the rcu_eqs_special() declaration in the public header is dropped; it is
  not used in the new implementation. Though I hope Paul will pick it up in
  his tree;
- for arm64, a few isb()s are added to ensure kernel text synchronization
  (patches 1-4);
- rcu_get_eqs_cpus() is introduced and used to exclude EQS CPUs before
  generating broadcast IPIs;
- RCU_DYNTICK_CTRL_MASK is not touched because a memory barrier is
  implicitly issued in the EQS exit path;
- powerpc is no longer treated as an exception. I think it's safe to delay
  synchronization for it as well, and I didn't get comments from the ppc
  community.
v1:
https://lkml.org/lkml/2018/3/25/109

Based on next-20180405

Yury Norov (5):
arm64: entry: isb in el1_irq
arm64: entry: introduce restore_syscall_args macro
arm64: ISB early at exit from extended quiescent state
rcu: arm64: add rcu_dynticks_eqs_exit_sync()
smp: Lazy synchronization for EQS CPUs in kick_all_cpus_sync()

arch/arm64/kernel/Makefile | 2 ++
arch/arm64/kernel/entry.S | 52 +++++++++++++++++++++++++++++++--------------
arch/arm64/kernel/process.c | 7 ++++++
arch/arm64/kernel/rcu.c | 8 +++++++
include/linux/rcutiny.h | 2 ++
include/linux/rcutree.h | 1 +
kernel/rcu/tiny.c | 9 ++++++++
kernel/rcu/tree.c | 27 +++++++++++++++++++++++
kernel/smp.c | 21 +++++++++++-------
9 files changed, 105 insertions(+), 24 deletions(-)
create mode 100644 arch/arm64/kernel/rcu.c

--
2.14.1



2018-04-05 17:20:13

by Yury Norov

Subject: [PATCH 1/5] arm64: entry: isb in el1_irq

The kernel text patching framework relies on an IPI to ensure that other
SMP cores observe the change. The target core calls isb() in the IPI
handler path, but not at the beginning of the el1_irq entry. There's a
chance that a modified instruction will be executed before the isb() and
the change will not be observed.

This patch inserts an isb early in the el1_irq entry path to avoid that.
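
For reference, the pairing relied on here looks roughly like the following
sketch (illustration only; the function names are invented and this is not
the actual arm64 patching code):

#include <linux/smp.h>
#include <asm/barrier.h>

/* Runs on every other CPU in IPI (interrupt) context. */
static void text_sync_ipi(void *unused)
{
	isb();		/* discard instructions fetched before the patch */
}

/* Hypothetical caller, run after the new instruction has been written. */
static void kick_cpus_after_text_patch(void)
{
	/* wait = 1: return only after every CPU has run text_sync_ipi() */
	smp_call_function(text_sync_ipi, NULL, 1);
}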

Signed-off-by: Yury Norov <[email protected]>
---
arch/arm64/kernel/entry.S | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ec2ee720e33e..9c06b4b80060 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -593,6 +593,7 @@ ENDPROC(el1_sync)

.align 6
el1_irq:
+ isb // pairs with aarch64_insn_patch_text
kernel_entry 1
enable_da_f
#ifdef CONFIG_TRACE_IRQFLAGS
--
2.14.1


2018-04-05 17:20:33

by Yury Norov

Subject: [PATCH 2/5] arm64: entry: introduce restore_syscall_args macro

Syscall arguments are passed in registers x0..x7. If assembler
code has to call C functions before passing control to the syscall
handler, it should restore the original values of those registers
after the call, since x0..x7 are caller-saved and may be clobbered.

Currently, restoring of the syscall arguments is open-coded in
el0_svc_naked and __sys_trace. This patch introduces the
restore_syscall_args macro and uses it there.

Also, the 'syscall = 0' parameter is removed from ct_user_exit so that
el0_svc_naked calls restore_syscall_args explicitly. This is needed
because the following patch in the series adds another call to a C
function in el0_svc_naked, and restoring the syscall arguments is no
longer solely a concern of ct_user_exit.
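
For background (purely illustrative, not code from the patch): the
authoritative copy of the arguments is the pt_regs frame saved by
kernel_entry, so in C terms the macro amounts to something like:

#include <asm/ptrace.h>

/* Illustrative C view of restore_syscall_args; the real macro is asm. */
static inline void restore_syscall_args_c(struct pt_regs *regs,
					  unsigned long args[8])
{
	int i;

	for (i = 0; i < 8; i++)
		args[i] = regs->regs[i];	/* x0..x7 saved by kernel_entry */
}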

Signed-off-by: Yury Norov <[email protected]>
---
arch/arm64/kernel/entry.S | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 9c06b4b80060..c8d9ec363ddd 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -37,22 +37,29 @@
#include <asm/unistd.h>

/*
- * Context tracking subsystem. Used to instrument transitions
- * between user and kernel mode.
+ * Save/restore needed during syscalls. Restore syscall arguments from
+ * the values already saved on stack during kernel_entry.
*/
- .macro ct_user_exit, syscall = 0
-#ifdef CONFIG_CONTEXT_TRACKING
- bl context_tracking_user_exit
- .if \syscall == 1
- /*
- * Save/restore needed during syscalls. Restore syscall arguments from
- * the values already saved on stack during kernel_entry.
- */
+ .macro restore_syscall_args
ldp x0, x1, [sp]
ldp x2, x3, [sp, #S_X2]
ldp x4, x5, [sp, #S_X4]
ldp x6, x7, [sp, #S_X6]
- .endif
+ .endm
+
+ .macro el0_svc_restore_syscall_args
+#if defined(CONFIG_CONTEXT_TRACKING)
+ restore_syscall_args
+#endif
+ .endm
+
+/*
+ * Context tracking subsystem. Used to instrument transitions
+ * between user and kernel mode.
+ */
+ .macro ct_user_exit
+#ifdef CONFIG_CONTEXT_TRACKING
+ bl context_tracking_user_exit
#endif
.endm

@@ -943,7 +950,8 @@ alternative_else_nop_endif
el0_svc_naked: // compat entry point
stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
enable_daif
- ct_user_exit 1
+ ct_user_exit
+ el0_svc_restore_syscall_args

tst x16, #_TIF_SYSCALL_WORK // check for syscall hooks
b.ne __sys_trace
@@ -976,10 +984,7 @@ __sys_trace:
mov x1, sp // pointer to regs
cmp wscno, wsc_nr // check upper syscall limit
b.hs __ni_sys_trace
- ldp x0, x1, [sp] // restore the syscall args
- ldp x2, x3, [sp, #S_X2]
- ldp x4, x5, [sp, #S_X4]
- ldp x6, x7, [sp, #S_X6]
+ restore_syscall_args
ldr x16, [stbl, xscno, lsl #3] // address in the syscall table
blr x16 // call sys_* routine

--
2.14.1


2018-04-05 17:21:00

by Yury Norov

Subject: [PATCH 3/5] arm64: early ISB at exit from extended quiescent state

This series enables delaying of kernel memory synchronization for CPUs
running in an extended quiescent state (EQS) until the exit from that
state.

ARM64 uses the IPI mechanism to notify all cores in an SMP system that
kernel text has changed, and the IPI handler calls isb() to synchronize.

If we no longer deliver the IPI to EQS CPUs, we should add an ISB early
in the EQS exit path.

There are two such paths. One starts in the do_idle() loop, and the
other at el0_svc entry. For do_idle(), isb() is added in the
arch_cpu_idle_exit() hook. For the SVC handler, isb is called in
el0_svc_naked.

Suggested-by: Will Deacon <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
arch/arm64/kernel/entry.S | 16 +++++++++++++++-
arch/arm64/kernel/process.c | 7 +++++++
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c8d9ec363ddd..b1e1c19b4432 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -48,7 +48,7 @@
.endm

.macro el0_svc_restore_syscall_args
-#if defined(CONFIG_CONTEXT_TRACKING)
+#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_CONTEXT_TRACKING)
restore_syscall_args
#endif
.endm
@@ -483,6 +483,19 @@ __bad_stack:
ASM_BUG()
.endm

+/*
+ * If CPU is in extended quiescent state we need isb to ensure that
+ * possible change of kernel text is visible by the core.
+ */
+ .macro isb_if_eqs
+#ifndef CONFIG_TINY_RCU
+ bl rcu_is_watching
+ cbnz x0, 1f
+ isb // pairs with aarch64_insn_patch_text
+1:
+#endif
+ .endm
+
el0_sync_invalid:
inv_entry 0, BAD_SYNC
ENDPROC(el0_sync_invalid)
@@ -949,6 +962,7 @@ alternative_else_nop_endif

el0_svc_naked: // compat entry point
stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
+ isb_if_eqs
enable_daif
ct_user_exit
el0_svc_restore_syscall_args
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index f08a2ed9db0d..74cad496b07b 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -88,6 +88,13 @@ void arch_cpu_idle(void)
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
}

+void arch_cpu_idle_exit(void)
+{
+ /* Pairs with aarch64_insn_patch_text() for EQS CPUs. */
+ if (!rcu_is_watching())
+ isb();
+}
+
#ifdef CONFIG_HOTPLUG_CPU
void arch_cpu_idle_dead(void)
{
--
2.14.1


2018-04-05 17:21:27

by Yury Norov

Subject: [PATCH 5/5] smp: Lazy synchronization for EQS CPUs in kick_all_cpus_sync()

kick_all_cpus_sync() forces all CPUs to synchronize caches by sending a
broadcast IPI. If a CPU is in an extended quiescent state (idle task or
nohz_full userspace), this work may instead be done when that state is
exited. Delaying the synchronization saves power for idle CPUs and
decreases latency for real-time tasks.

This patch introduces rcu_get_eqs_cpus() and uses it in
kick_all_cpus_sync() to delay synchronization.
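
A hedged usage sketch for the new helper (hypothetical debug caller, not
part of the patch), showing the calling convention: the mask must start
out clear and preemption must be disabled across the call:

#include <linux/cpumask.h>
#include <linux/preempt.h>
#include <linux/printk.h>
#include <linux/slab.h>

static void print_eqs_cpus(void)
{
	cpumask_var_t eqs;

	if (!zalloc_cpumask_var(&eqs, GFP_KERNEL))	/* starts cleared */
		return;

	preempt_disable();
	rcu_get_eqs_cpus(eqs, 1);	/* 1: collect EQS CPUs */
	preempt_enable();

	pr_info("EQS CPUs: %*pbl\n", cpumask_pr_args(eqs));
	free_cpumask_var(eqs);
}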

For task isolation (https://lkml.org/lkml/2017/11/3/589), an IPI to the CPU
running an isolated task is fatal, as it breaks isolation. The lazy
synchronization approach helps to maintain the isolated state.

I've tested it with the test from the task isolation series on ThunderX2
for more than 10 hours (10k giga-ticks) without breaking isolation.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/rcutiny.h | 2 ++
include/linux/rcutree.h | 1 +
kernel/rcu/tiny.c | 9 +++++++++
kernel/rcu/tree.c | 23 +++++++++++++++++++++++
kernel/smp.c | 21 +++++++++++++--------
5 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index ce9beec35e34..dc7e2ea731fa 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -36,6 +36,8 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
/* Never flag non-existent other CPUs! */
static inline bool rcu_eqs_special_set(int cpu) { return false; }

+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs);
+
static inline unsigned long get_state_synchronize_rcu(void)
{
return 0;
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index fd996cdf1833..7a34eb8c0df3 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -74,6 +74,7 @@ static inline void synchronize_rcu_bh_expedited(void)
void rcu_barrier(void);
void rcu_barrier_bh(void);
void rcu_barrier_sched(void);
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs);
unsigned long get_state_synchronize_rcu(void);
void cond_synchronize_rcu(unsigned long oldstate);
unsigned long get_state_synchronize_sched(void);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a64eee0db39e..d4e94e1b0570 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -128,6 +128,15 @@ void rcu_check_callbacks(int user)
rcu_note_voluntary_context_switch(current);
}

+/*
+ * For tiny RCU, all CPUs are active (non-EQS).
+ */
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs)
+{
+ if (!choose_eqs)
+ cpumask_copy(cpus, cpu_online_mask);
+}
+
/*
* Invoke the RCU callbacks on the specified rcu_ctrlkblk structure
* whose grace period has elapsed.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 363f91776b66..cb0d3afe7ea8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -419,6 +419,29 @@ bool rcu_eqs_special_set(int cpu)
return true;
}

+/*
+ * Get EQS CPUs. If @choose_eqs is 0, set of active (non-EQS)
+ * CPUs is returned instead.
+ *
+ * Call with disabled preemption. Make sure @cpus is cleared.
+ */
+void rcu_get_eqs_cpus(struct cpumask *cpus, int choose_eqs)
+{
+ int cpu, in_eqs;
+ struct rcu_dynticks *rdtp;
+
+ for_each_online_cpu(cpu) {
+ rdtp = &per_cpu(rcu_dynticks, cpu);
+ in_eqs = rcu_dynticks_in_eqs(atomic_read(&rdtp->dynticks));
+
+ if (in_eqs && choose_eqs)
+ cpumask_set_cpu(cpu, cpus);
+
+ if (!in_eqs && !choose_eqs)
+ cpumask_set_cpu(cpu, cpus);
+ }
+}
+
/*
* Let the RCU core know that this CPU has gone through the scheduler,
* which is a quiescent state. This is called when the need for a
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..5e6cfb57da22 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -708,19 +708,24 @@ static void do_nothing(void *unused)
/**
* kick_all_cpus_sync - Force all cpus out of idle
*
- * Used to synchronize the update of pm_idle function pointer. It's
- * called after the pointer is updated and returns after the dummy
- * callback function has been executed on all cpus. The execution of
- * the function can only happen on the remote cpus after they have
- * left the idle function which had been called via pm_idle function
- * pointer. So it's guaranteed that nothing uses the previous pointer
- * anymore.
+ * - on current CPU call smp_mb() explicitly;
+ * - on CPUs in extended quiescent state (idle or nohz_full userspace), memory
+ * is synchronized at the exit of that mode, so do nothing (it's safe to delay
+ * synchronization because EQS CPUs don't run kernel code);
+ * - on other CPUs fire IPI for synchronization, which implies barrier.
*/
void kick_all_cpus_sync(void)
{
+ struct cpumask active_cpus;
+
/* Make sure the change is visible before we kick the cpus */
smp_mb();
- smp_call_function(do_nothing, NULL, 1);
+
+ cpumask_clear(&active_cpus);
+ preempt_disable();
+ rcu_get_eqs_cpus(&active_cpus, 0);
+ smp_call_function_many(&active_cpus, do_nothing, NULL, 1);
+ preempt_enable();
}
EXPORT_SYMBOL_GPL(kick_all_cpus_sync);

--
2.14.1


2018-04-05 17:22:28

by Yury Norov

Subject: [PATCH 4/5] rcu: arm64: add rcu_dynticks_eqs_exit_sync()

The following patch in the series enables delaying kernel memory
synchronization for CPUs running in an extended quiescent state (EQS)
until the exit from that state.

In the previous patch an ISB was added in the EQS exit path to ensure that
any change made by the kernel patching framework is visible. But after
that isb(), the CPU is still in EQS for a while, and there's a chance that
some other core will modify text in parallel; the EQS core will not be
notified about it, as the IPI is masked for EQS CPUs:

        CPU0                                CPU1

        ISB
                                            patch_some_text()
                                            kick_all_active_cpus_sync()
        exit EQS

        // not synchronized!
        use_of_patched_text()

This patch introduces the rcu_dynticks_eqs_exit_sync() hook and implements
it for arm64 to issue an isb() after the exit from the quiescent state.
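
The hook is added as a __weak stub in generic RCU code, so an architecture
overrides it simply by providing a strong definition. A hypothetical
example (not part of this series) for an architecture that wanted a full
memory barrier rather than an instruction barrier at EQS exit:

#include <asm/barrier.h>

/* Hypothetical override of the __weak stub added in kernel/rcu/tree.c. */
void rcu_dynticks_eqs_exit_sync(void)
{
	smp_mb();	/* example only; arm64 uses isb(), as in this patch */
}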

Suggested-by: Mark Rutland <[email protected]>
Signed-off-by: Yury Norov <[email protected]>
---
arch/arm64/kernel/Makefile | 2 ++
arch/arm64/kernel/rcu.c | 8 ++++++++
kernel/rcu/tree.c | 4 ++++
3 files changed, 14 insertions(+)
create mode 100644 arch/arm64/kernel/rcu.c

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 9b55a3f24be7..c87a203524ab 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -54,6 +54,8 @@ arm64-obj-$(CONFIG_ARM64_RELOC_TEST) += arm64-reloc-test.o
arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
arm64-obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
+arm64-obj-$(CONFIG_TREE_RCU) += rcu.o
+arm64-obj-$(CONFIG_PREEMPT_RCU) += rcu.o

arm64-obj-$(CONFIG_KVM_INDIRECT_VECTORS)+= bpi.o

diff --git a/arch/arm64/kernel/rcu.c b/arch/arm64/kernel/rcu.c
new file mode 100644
index 000000000000..67fe33c0ea03
--- /dev/null
+++ b/arch/arm64/kernel/rcu.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <asm/barrier.h>
+
+void rcu_dynticks_eqs_exit_sync(void)
+{
+ isb();
+};
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a734692a581..363f91776b66 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -264,6 +264,8 @@ void rcu_bh_qs(void)
#define rcu_eqs_special_exit() do { } while (0)
#endif

+void __weak rcu_dynticks_eqs_exit_sync(void) {};
+
static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
.dynticks_nesting = 1,
.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
@@ -308,6 +310,8 @@ static void rcu_dynticks_eqs_exit(void)
* critical section.
*/
seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
+ rcu_dynticks_eqs_exit_sync();
+
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
!(seq & RCU_DYNTICK_CTRL_CTR));
if (seq & RCU_DYNTICK_CTRL_MASK) {
--
2.14.1


2018-04-06 10:07:11

by James Morse

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

Hi Yury,

On 05/04/18 18:17, Yury Norov wrote:
> Kernel text patching framework relies on IPI to ensure that other
> SMP cores observe the change. Target core calls isb() in IPI handler

(Odd, if it's just to synchronize the CPU, taking the IPI should be enough).


> path, but not at the beginning of el1_irq entry. There's a chance
> that modified instruction will appear prior isb(), and so will not be
> observed.
>
> This patch inserts isb early at el1_irq entry to avoid that chance.


> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ec2ee720e33e..9c06b4b80060 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -593,6 +593,7 @@ ENDPROC(el1_sync)
>
> .align 6
> el1_irq:
> + isb // pairs with aarch64_insn_patch_text
> kernel_entry 1
> enable_da_f
> #ifdef CONFIG_TRACE_IRQFLAGS
>

An ISB at the beginning of the vectors? This is odd, taking an IRQ to get in
here would be a context-synchronization-event too, so the ISB is superfluous.

The ARM-ARM has a list of 'Context-Synchronization event's (Glossary on page
6480 of DDI0487B.b), paraphrasing:
* ISB
* Taking an exception
* ERET
* (...loads of debug stuff...)


Thanks,

James

2018-04-06 10:11:04

by James Morse

Subject: Re: [PATCH 3/5] arm64: early ISB at exit from extended quiescent state

Hi Yury,

On 05/04/18 18:17, Yury Norov wrote:
> This series enables delaying of kernel memory synchronization
> for CPUs running in extended quiescent state (EQS) till the exit
> of that state.
>
> ARM64 uses IPI mechanism to notify all cores in SMP system that
> kernel text is changed; and IPI handler calls isb() to synchronize.
>
> If we don't deliver IPI to EQS CPUs anymore, we should add ISB early
> in EQS exit path.
>
> There are 2 such paths. One starts in do_idle() loop, and other
> in el0_svc entry. For do_idle(), isb() is added in
> arch_cpu_idle_exit() hook. And for SVC handler, isb is called in
> el0_svc_naked.

(I know nothing about this EQS stuff, but) there is a third path that might be
relevant.
From include/linux/context_tracking.h:guest_enter_irqoff():
| * KVM does not hold any references to rcu protected data when it
| * switches CPU into a guest mode. In fact switching to a guest mode
| * is very similar to exiting to userspace from rcu point of view. In
| * addition CPU may stay in a guest mode for quite a long time (up to
| * one time slice). Lets treat guest mode as quiescent state, just like
| * we do with user-mode execution.

For non-VHE systems guest_enter_irqoff() is called just before we jump to EL2.
Coming back gives us an exception-return, so we have a context-synchronisation
event there, and I assume we will never patch the hyp-text on these systems.

But with VHE on the upcoming kernel version we still go on to run code at the
same EL. Do we need an ISB on the path back from the guest once we've told RCU
we've 'exited user-space'?
If this code can be patched, do we have a problem here?


> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index c8d9ec363ddd..b1e1c19b4432 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -48,7 +48,7 @@
> .endm
>
> .macro el0_svc_restore_syscall_args
> -#if defined(CONFIG_CONTEXT_TRACKING)
> +#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_CONTEXT_TRACKING)
> restore_syscall_args
> #endif
> .endm
> @@ -483,6 +483,19 @@ __bad_stack:
> ASM_BUG()
> .endm
>
> +/*
> + * If CPU is in extended quiescent state we need isb to ensure that
> + * possible change of kernel text is visible by the core.
> + */
> + .macro isb_if_eqs
> +#ifndef CONFIG_TINY_RCU
> + bl rcu_is_watching
> + cbnz x0, 1f
> + isb // pairs with aarch64_insn_patch_text
> +1:
> +#endif
> + .endm
> +
> el0_sync_invalid:
> inv_entry 0, BAD_SYNC
> ENDPROC(el0_sync_invalid)
> @@ -949,6 +962,7 @@ alternative_else_nop_endif
>
> el0_svc_naked: // compat entry point
> stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
> + isb_if_eqs
> enable_daif
> ct_user_exit
> el0_svc_restore_syscall_args

Shouldn't this be at the point that RCU knows we've exited user-space? Otherwise
there is a gap where RCU thinks we're in user-space, we're not, and we're about
to tell it. Code-patching occurring in this gap would be missed.

This gap only contains 'enable_daif', and any exception that occurs here is
safe, but it's going to give someone a nasty surprise...

Mark points out this ISB needs to be after RCU knows we're not quiescent:
https://lkml.org/lkml/2018/4/3/378

Can't this go in the rcu exit-quiescence code? Isn't this what your
rcu_dynticks_eqs_exit_sync() hook does?


Thanks,

James

2018-04-06 10:58:44

by Mark Rutland

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

On Thu, Apr 05, 2018 at 08:17:56PM +0300, Yury Norov wrote:
> Kernel text patching framework relies on IPI to ensure that other
> SMP cores observe the change. Target core calls isb() in IPI handler
> path, but not at the beginning of el1_irq entry. There's a chance
> that modified instruction will appear prior isb(), and so will not be
> observed.
>
> This patch inserts isb early at el1_irq entry to avoid that chance.

As James pointed out, taking an exception is context synchronizing, so
this looks unnecessary.

Also, it's important to realise that the exception entry is not tied to a
specific interrupt. We might take an EL1 IRQ because of a timer interrupt,
then an IPI could be taken before we get to gic_handle_irq().

This means that we can race:

        CPU0                                CPU1
        <take IRQ>
        ISB
                                            <patch text>
                                            <send IPI>
        <discover IPI pending>

... and thus the ISB is too early.

Only once we're in the interrupt handler can we pair an ISB with the IPI, and
any code executed before that is not guaranteed to be up-to-date.

Thanks,
Mark.

>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> arch/arm64/kernel/entry.S | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ec2ee720e33e..9c06b4b80060 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -593,6 +593,7 @@ ENDPROC(el1_sync)
>
> .align 6
> el1_irq:
> + isb // pairs with aarch64_insn_patch_text
> kernel_entry 1
> enable_da_f
> #ifdef CONFIG_TRACE_IRQFLAGS
> --
> 2.14.1
>

2018-04-06 11:09:08

by Mark Rutland

Subject: Re: [PATCH 3/5] arm64: early ISB at exit from extended quiescent state

On Thu, Apr 05, 2018 at 08:17:58PM +0300, Yury Norov wrote:
> This series enables delaying of kernel memory synchronization
> for CPUs running in extended quiescent state (EQS) till the exit
> of that state.
>
> ARM64 uses IPI mechanism to notify all cores in SMP system that
> kernel text is changed; and IPI handler calls isb() to synchronize.
>
> If we don't deliver IPI to EQS CPUs anymore, we should add ISB early
> in EQS exit path.
>
> There are 2 such paths. One starts in do_idle() loop, and other
> in el0_svc entry. For do_idle(), isb() is added in
> arch_cpu_idle_exit() hook. And for SVC handler, isb is called in
> el0_svc_naked.
>
> Suggested-by: Will Deacon <[email protected]>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> arch/arm64/kernel/entry.S | 16 +++++++++++++++-
> arch/arm64/kernel/process.c | 7 +++++++
> 2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index c8d9ec363ddd..b1e1c19b4432 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -48,7 +48,7 @@
> .endm
>
> .macro el0_svc_restore_syscall_args
> -#if defined(CONFIG_CONTEXT_TRACKING)
> +#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_CONTEXT_TRACKING)
> restore_syscall_args
> #endif
> .endm
> @@ -483,6 +483,19 @@ __bad_stack:
> ASM_BUG()
> .endm
>
> +/*
> + * If CPU is in extended quiescent state we need isb to ensure that
> + * possible change of kernel text is visible by the core.
> + */
> + .macro isb_if_eqs
> +#ifndef CONFIG_TINY_RCU
> + bl rcu_is_watching
> + cbnz x0, 1f
> + isb // pairs with aarch64_insn_patch_text
> +1:
> +#endif
> + .endm
> +
> el0_sync_invalid:
> inv_entry 0, BAD_SYNC
> ENDPROC(el0_sync_invalid)
> @@ -949,6 +962,7 @@ alternative_else_nop_endif
>
> el0_svc_naked: // compat entry point
> stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number
> + isb_if_eqs

As I mentioned before, this is too early.

If we only kick active CPUs, then until we exit a quiescent state, we
can race with concurrent modification, and cannot reliably ensure that
instructions are up-to-date. Practically speaking, that means that we
cannot patch any code used on the path to exit a quiescent state.

Also, if this were needed in the SVC path, it would be necessary for all
exceptions from EL0. Buggy userspace can always trigger a data abort,
even if it doesn't intend to.

> enable_daif
> ct_user_exit
> el0_svc_restore_syscall_args
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index f08a2ed9db0d..74cad496b07b 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -88,6 +88,13 @@ void arch_cpu_idle(void)
> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> }
>
> +void arch_cpu_idle_exit(void)
> +{
> + /* Pairs with aarch64_insn_patch_text() for EQS CPUs. */
> + if (!rcu_is_watching())
> + isb();
> +}

Likewise, this is too early as we haven't left the extended quiescent
state yet.

Thanks,
Mark.

2018-04-06 16:58:08

by Yury Norov

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

On Fri, Apr 06, 2018 at 11:02:56AM +0100, James Morse wrote:
> Hi Yury,
>
> An ISB at the beginning of the vectors? This is odd, taking an IRQ to get in
> here would be a context-synchronization-event too, so the ISB is superfluous.
>
> The ARM-ARM has a list of 'Context-Synchronization event's (Glossary on page
> 6480 of DDI0487B.b), paraphrasing:
> * ISB
> * Taking an exception
> * ERET
> * (...loads of debug stuff...)

Hi James, Mark,

I completely forgot that taking an exception is a context synchronization
event. Sorry for the time you spent reviewing this. It means that patches 1,
2 and 3 are not needed, except for the chunk that adds the ISB in the
do_idle() path.

Also, it means that on arm64 it is safe to mask IPI delivery to CPUs running
any userspace code, not only nohz_full.

In general, kick_all_cpus_sync() is needed to force a context
synchronization on other CPUs. But exit from userspace is itself a context
switch. And while in userspace, we cannot do anything wrong on the kernel
side. For me it means that we can safely drop the IPI for all userspace
modes - both normal and nohz_full.

If that's correct, for v3 I would suggest:
- in kick_all_cpus_sync(), mask out all is_idle_task() and user_mode() CPUs
  (rough sketch below);
- add isb() for arm64 in the do_idle() path only, since this path doesn't
  imply a context switch.
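
A rough sketch of that v3 direction (purely illustrative, nothing like this
has been posted): idle CPUs are skipped using idle_cpu(); detecting that a
remote CPU is currently running in userspace is the open question and is
not shown. do_nothing() is the existing empty callback in kernel/smp.c.

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/smp.h>

void kick_all_cpus_sync(void)
{
	struct cpumask kick_cpus;
	int cpu;

	/* Make sure the change is visible before we kick the cpus */
	smp_mb();

	cpumask_clear(&kick_cpus);
	preempt_disable();
	for_each_online_cpu(cpu) {
		if (!idle_cpu(cpu))	/* idle CPUs would sync at wake-up */
			cpumask_set_cpu(cpu, &kick_cpus);
	}
	smp_call_function_many(&kick_cpus, do_nothing, NULL, 1);
	preempt_enable();
}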

What do you think?

Yury

2018-04-06 17:26:03

by Mark Rutland

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

On Fri, Apr 06, 2018 at 07:54:02PM +0300, Yury Norov wrote:
> In general, kick_all_cpus_sync() is needed to switch contexts. But exit from
> userspace is anyway the switch of context. And while in userspace, we cannot
> do something wrong on kernel side. For me it means that we can safely drop
> IPI for all userspace modes - both normal and nohz_full.

This *may* be true, but only if we never have to patch text in the
windows:

* between exception entry and eqs_exit()

* between eqs_enter() and exception return

* between eqs_enter() and eqs_exit() in the idle loop.

If it's possible that we need to execute patched text in any of those
paths, we must IPI all CPUs in order to correctly serialize things.

Digging a bit, I also think that our ct_user_exit and ct_user_enter
usage is on dodgy ground today.

For example, in el0_dbg we call do_debug_exception() *before* calling
ct_user_exit. Which I believe means we'd use RCU while supposedly in an
extended quiescent period, which would be bad.

In other paths, we unmask all DAIF bits before calling ct_user_exit, so
we could similarly take an EL1 debug exception without having exited the
extended quiescent period.

I think similar applies to SDEI; we don't negotiate with RCU prior to
invoking handlers, which might need RCU.

> If it's correct, for v3 I would suggest:
> - in kick_all_cpus_sync() mask all is_idle_task() and user_mode() CPUs;
> - add isb() for arm64 in do_idle() path only - this path doesn't imply
> context switch.

As mentioned in my other reply, I don't think the ISB in do_idle()
makes sense, unless that occurs *after* we exit the extended quiescent
state.

Thanks,
Mark.

2018-04-06 17:37:22

by James Morse

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

Hi Mark,

On 06/04/18 18:22, Mark Rutland wrote:
> Digging a bit, I also think that our ct_user_exit and ct_user_enter
> usage is on dodgy ground today.

[...]

> I think similar applies to SDEI; we don't negotiate with RCU prior to
> invoking handlers, which might need RCU.

The arch code's __sdei_handler() calls nmi_enter() -> rcu_nmi_enter(), which
does a rcu_dynticks_eqs_exit().


Thanks,

James

2018-04-06 17:38:38

by Mark Rutland

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

On Fri, Apr 06, 2018 at 06:30:50PM +0100, James Morse wrote:
> Hi Mark,
>
> On 06/04/18 18:22, Mark Rutland wrote:
> > Digging a bit, I also think that our ct_user_exit and ct_user_enter
> > usage is on dodgy ground today.
>
> [...]
>
> > I think similar applies to SDEI; we don't negotiate with RCU prior to
> > invoking handlers, which might need RCU.
>
> The arch code's __sdei_handler() calls nmi_enter() -> rcu_nmi_enter(), which
> does a rcu_dynticks_eqs_exit().

Ah, sorry. I missed that!

Thanks,
Mark.

2018-04-06 17:53:52

by Mark Rutland

Subject: Re: [PATCH 1/5] arm64: entry: isb in el1_irq

On Fri, Apr 06, 2018 at 06:22:11PM +0100, Mark Rutland wrote:
> Digging a bit, I also think that our ct_user_exit and ct_user_enter
> usage is on dodgy ground today.
>
> For example, in el0_dbg we call do_debug_exception() *before* calling
> ct_user_exit. Which I believe means we'd use RCU while supposedly in an
> extended quiescent period, which would be bad.

It seems this is the case. I can trigger the following by having GDB
place a SW breakpoint:

[ 51.217947] =============================
[ 51.217953] WARNING: suspicious RCU usage
[ 51.217961] 4.16.0 #4 Not tainted
[ 51.217966] -----------------------------
[ 51.217974] ./include/linux/rcupdate.h:632 rcu_read_lock() used illegally while idle!
[ 51.217980]
[ 51.217980] other info that might help us debug this:
[ 51.217980]
[ 51.217987]
[ 51.217987] RCU used illegally from idle CPU!
[ 51.217987] rcu_scheduler_active = 2, debug_locks = 1
[ 51.217992] RCU used illegally from extended quiescent state!
[ 51.217999] 1 lock held by ls/2412:
[ 51.218004] #0: (rcu_read_lock){....}, at: [<0000000092efbdd5>] brk_handler+0x0/0x198
[ 51.218041]
[ 51.218041] stack backtrace:
[ 51.218049] CPU: 2 PID: 2412 Comm: ls Not tainted 4.16.0 #4
[ 51.218055] Hardware name: ARM Juno development board (r1) (DT)
[ 51.218061] Call trace:
[ 51.218070] dump_backtrace+0x0/0x1c8
[ 51.218078] show_stack+0x14/0x20
[ 51.218087] dump_stack+0xac/0xe4
[ 51.218096] lockdep_rcu_suspicious+0xcc/0x110
[ 51.218103] brk_handler+0x144/0x198
[ 51.218110] do_debug_exception+0x9c/0x190
[ 51.218116] el0_dbg+0x14/0x20

We will need to fix this before we can fiddle with kick_all_cpus_sync().

Thanks,
Mark.