2018-01-29 20:23:39

by Mathieu Desnoyers

Subject: [PATCH for 4.16 00/11] membarrier updates for 4.16

Hi Ingo, Peter, Thomas,

Here is the updated membarrier patch series following Acked-by from
Peter.

It would be appreciated if these could go through the scheduler tree for
the 4.16 merge window.

Highlights:

"powerpc: membarrier: Skip memory barrier in switch_mm()" takes care of
a TODO that was left in the private expedited implementation when merged
in 4.14: an extra memory barrier was added on context switch on powerpc.
Ensure that the barrier is only performed when scheduling between
different processes, only for threads belonging to processes that have
registered their intent to use the private expedited command.

"membarrier: provide GLOBAL_EXPEDITED command" adds new commands to
membarrier for registration and use of membarrier across processes
communicating through shared memory mappings. The non-expedited command
has proven to be really too slow (taking 10ms and more to complete) for
real-world use. The expedited version completes in a matter of
microseconds. This patch renames the pre-existing MEMBARRIER_CMD_SHARED
to MEMBARRIER_CMD_GLOBAL for consistency, keeping the old enum label
as an alias for backward compatibility.
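
For reference, here is a minimal user-space sketch of the intended usage
(illustrative only, assuming the updated <linux/membarrier.h> UAPI header
from this series is installed; error handling kept to a minimum):

  #define _GNU_SOURCE
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/membarrier.h>
  #include <stdio.h>

  static int membarrier(int cmd, int flags)
  {
          return syscall(__NR_membarrier, cmd, flags);
  }

  int main(void)
  {
          /* Receiver side: register intent to receive the expedited
           * barriers (IPIs). */
          if (membarrier(MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, 0))
                  perror("membarrier register global expedited");

          /* Sender side (may be a different, even non-registered,
           * process): interrupt all CPUs currently running threads of
           * registered processes. */
          if (membarrier(MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0))
                  perror("membarrier global expedited");
          return 0;
  }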

"membarrier: Provide core serializing command" provides core
serialization for JIT reclaim. We received positive feedback from
Android developers that the proposed ABI fits their use-case.
Only x86 32/64 and arm64 implement this command so far; support is
opt-in per architecture.
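
To illustrate the expected JIT usage, here is a sketch (the jit_* helper
names are made up for this example; only the membarrier commands come
from this series):

  #define _GNU_SOURCE
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/membarrier.h>

  void jit_init(void)
  {
          /* Register once per process. Returns -1 with errno set to
           * EINVAL on architectures that do not implement the command. */
          syscall(__NR_membarrier,
                  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
  }

  void jit_reclaim_code_page(void *page)
  {
          /* Ensure all sibling threads execute a core serializing
           * instruction before they next run user code, so that no
           * stale copy of the old instructions in @page can still be
           * executed once the page is rewritten. */
          syscall(__NR_membarrier,
                  MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
          /* ... now safe to overwrite @page with new instructions ... */
          (void)page;
  }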

The other patches add selftests and documentation.

Thanks,

Mathieu


Mathieu Desnoyers (11):
membarrier: selftest: Test private expedited cmd (v2)
powerpc: membarrier: Skip memory barrier in switch_mm() (v7)
membarrier: Document scheduler barrier requirements (v5)
membarrier: provide GLOBAL_EXPEDITED command (v3)
membarrier: selftest: Test global expedited cmd (v2)
Introduce sync_core_before_usermode (v2)
x86: Implement sync_core_before_usermode (v3)
membarrier: Provide core serializing command (v3)
membarrier: x86: Provide core serializing command (v4)
membarrier: arm64: Provide core serializing command
membarrier: selftest: Test private expedited sync core cmd

MAINTAINERS | 1 +
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/entry.S | 4 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/membarrier.h | 27 +++
arch/powerpc/mm/mmu_context.c | 7 +
arch/x86/Kconfig | 2 +
arch/x86/entry/entry_32.S | 5 +
arch/x86/entry/entry_64.S | 4 +
arch/x86/include/asm/sync_core.h | 28 +++
arch/x86/mm/tlb.c | 6 +
include/linux/sched/mm.h | 40 +++-
include/linux/sync_core.h | 21 ++
include/uapi/linux/membarrier.h | 74 ++++++-
init/Kconfig | 9 +
kernel/sched/core.c | 57 +++--
kernel/sched/membarrier.c | 177 +++++++++++++--
.../testing/selftests/membarrier/membarrier_test.c | 237 +++++++++++++++++++--
18 files changed, 633 insertions(+), 68 deletions(-)
create mode 100644 arch/powerpc/include/asm/membarrier.h
create mode 100644 arch/x86/include/asm/sync_core.h
create mode 100644 include/linux/sync_core.h

--
2.11.0



2018-01-29 20:23:21

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v3 07/11] x86: Implement sync_core_before_usermode

Ensure that a core serializing instruction is issued before returning to
user-mode. x86 implements return to user-space through sysexit, sysretl,
and sysretq, which are not core serializing.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: Russell King <[email protected]>
CC: Greg Hackmann <[email protected]>
CC: Will Deacon <[email protected]>
CC: David Sehr <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Arnd Bergmann <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Fix prototype of sync_core_before_usermode in generic code (missing
return type).
- Add linux/processor.h include to sched/core.c.
- Add ARCH_HAS_SYNC_CORE_BEFORE_USERMODE to init/Kconfig.
- Fix linux/processor.h ifdef to target
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE rather than
ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.
- Move empty static inline in processor.h to generic patch.

Changes since v2:
- Introduce arch/x86/include/asm/sync_core.h
- Don't sync_core when KPTI is enabled, nor when invoked from irq or nmi
context.
- Note: v2 was reviewed by Thomas Gleixner, but changes were introduced
since.
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/sync_core.h | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
create mode 100644 arch/x86/include/asm/sync_core.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 20da391b5f32..0b44c8dd0e95 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
+ select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
new file mode 100644
index 000000000000..c67caafd3381
--- /dev/null
+++ b/arch/x86/include/asm/sync_core.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SYNC_CORE_H
+#define _ASM_X86_SYNC_CORE_H
+
+#include <linux/preempt.h>
+#include <asm/processor.h>
+#include <asm/cpufeature.h>
+
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode. x86 implements return to user-space through sysexit,
+ * sysretl, and sysretq, which are not core serializing.
+ */
+static inline void sync_core_before_usermode(void)
+{
+ /* With PTI, we unconditionally serialize before running user code. */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ return;
+ /*
+ * Return from interrupt and NMI is done through iret, which is core
+ * serializing.
+ */
+ if (in_irq() || in_nmi())
+ return;
+ sync_core();
+}
+
+#endif /* _ASM_X86_SYNC_CORE_H */
--
2.11.0


2018-01-29 20:23:31

by Mathieu Desnoyers

Subject: [PATCH for 4.16 11/11] membarrier: selftest: Test private expedited sync core cmd

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Alan Stern <[email protected]>
CC: Will Deacon <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Alice Ferrazzi <[email protected]>
CC: Paul Elder <[email protected]>
CC: [email protected]
CC: [email protected]
---
.../testing/selftests/membarrier/membarrier_test.c | 73 ++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index eec39a0e1f77..22bffd55a523 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -132,6 +132,63 @@ static int test_membarrier_private_expedited_success(void)
return 0;
}

+static int test_membarrier_private_expedited_sync_core_fail(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE not registered failure";
+
+ if (sys_membarrier(cmd, flags) != -1) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EPERM) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EPERM, strerror(EPERM),
+ errno, strerror(errno));
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ return 0;
+}
+
+static int test_membarrier_register_private_expedited_sync_core_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_sync_core_success(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
static int test_membarrier_register_global_expedited_success(void)
{
int cmd = MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, flags = 0;
@@ -188,6 +245,22 @@ static int test_membarrier(void)
status = test_membarrier_private_expedited_success();
if (status)
return status;
+ status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
+ if (status < 0) {
+ ksft_test_result_fail("sys_membarrier() failed\n");
+ return status;
+ }
+ if (status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) {
+ status = test_membarrier_private_expedited_sync_core_fail();
+ if (status)
+ return status;
+ status = test_membarrier_register_private_expedited_sync_core_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_sync_core_success();
+ if (status)
+ return status;
+ }
/*
* It is valid to send a global membarrier from a non-registered
* process.
--
2.11.0


2018-01-29 20:24:08

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v4 09/11] membarrier: x86: Provide core serializing command

There are two places where core serialization is needed by membarrier:

1) When returning from the membarrier IPI,
2) After scheduler updates curr to a thread with a different mm, before
going back to user-space, since the curr->mm is used by membarrier to
check whether it needs to send an IPI to that CPU.

x86-32 uses iret as return from interrupt, and both iret and sysexit to go
back to user-space. The iret instruction is core serializing, but sysexit
is not.

x86-64 uses iret as return from interrupt, which takes care of the IPI.
However, it can return to user-space through either sysretl (compat
code), sysretq, or iret. Given that sysret{l,q} are not core serializing,
we rely instead on write_cr3() performed by switch_mm() to provide core
serialization after changing the current mm, and deal with the special
case of kthread -> uthread (temporarily keeping current mm into
active_mm) by adding a sync_core() in that specific case.

Use the new sync_core_before_usermode() to guarantee this.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: Russell King <[email protected]>
CC: Greg Hackmann <[email protected]>
CC: Will Deacon <[email protected]>
CC: David Sehr <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Use the newly introduced sync_core_before_usermode(). Move all state
handling to generic code.
- Add linux/processor.h include to include/linux/sched/mm.h.
Changes since v2:
- Fix use-after-free in membarrier_mm_sync_core_before_usermode.
Changes since v3:
- Move generic code into separate patch.
---
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_32.S | 5 +++++
arch/x86/entry/entry_64.S | 4 ++++
arch/x86/mm/tlb.c | 7 ++++---
4 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0b44c8dd0e95..b5324f2e3162 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV if X86_64
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 60c4c342316c..267d747a867f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -565,6 +565,11 @@ restore_all:
.Lrestore_nocheck:
RESTORE_REGS 4 # skip orig_eax/error_code
.Lirq_return:
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on iret core serialization
+ * when returning from IPI handler and when returning from
+ * scheduler to user-space.
+ */
INTERRUPT_RETURN

.section .fixup, "ax"
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ff6f8022612c..52e20d37cdd3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -803,6 +803,10 @@ GLOBAL(restore_regs_and_return_to_kernel)
POP_EXTRA_REGS
POP_C_REGS
addq $8, %rsp /* skip regs->orig_ax */
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on iret core serialization
+ * when returning from IPI handler.
+ */
INTERRUPT_RETURN

ENTRY(native_iret)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9fa7d2e0e15e..9b34121c8f05 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -229,9 +229,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.is_lazy, false);

/*
- * The membarrier system call requires a full memory barrier
- * before returning to user-space, after storing to rq->curr.
- * Writing to CR3 provides that full memory barrier.
+ * The membarrier system call requires a full memory barrier and
+ * core serialization before returning to user-space, after
+ * storing to rq->curr. Writing to CR3 provides that full
+ * memory barrier and core serializing instruction.
*/
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
--
2.11.0


2018-01-29 20:24:15

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v3 08/11] membarrier: Provide core serializing command

Provide a core serializing membarrier command to support memory reclaim
by JITs.

Each architecture needs to explicitly opt into that support by
documenting in their architecture code how they provide the core
serializing instructions required when returning from the membarrier
IPI, and after the scheduler has updated the curr->mm pointer (before
going back to user-space). They should then select
ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
their architecture.

Architectures selecting this feature need to either document that
they issue core serializing instructions when returning to user-space,
or implement their architecture-specific sync_core_before_usermode().

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: Russell King <[email protected]>
CC: Greg Hackmann <[email protected]>
CC: Will Deacon <[email protected]>
CC: David Sehr <[email protected]>
CC: [email protected]
---
Changes since v1:
- Include linux/sync_core.h

Changes since v2:
- Rework a comment and changelog based on Peter Zijlstra's feedback.
---
include/linux/sched/mm.h | 18 ++++++++++++++
include/uapi/linux/membarrier.h | 32 ++++++++++++++++++++++++-
init/Kconfig | 3 +++
kernel/sched/core.c | 18 ++++++++++----
kernel/sched/membarrier.c | 53 +++++++++++++++++++++++++++++++----------
5 files changed, 106 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c7ad7a70cfef..a7840f0f8832 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>
+#include <linux/sync_core.h>

/*
* Routines for handling mm_structs
@@ -223,12 +224,26 @@ enum {
MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY = (1U << 2),
MEMBARRIER_STATE_GLOBAL_EXPEDITED = (1U << 3),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY = (1U << 4),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE = (1U << 5),
+};
+
+enum {
+ MEMBARRIER_FLAG_SYNC_CORE = (1U << 0),
};

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
#include <asm/membarrier.h>
#endif

+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+ if (likely(!(atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+ return;
+ sync_core_before_usermode();
+}
+
static inline void membarrier_execve(struct task_struct *t)
{
atomic_set(&t->mm->membarrier_state, 0);
@@ -244,6 +259,9 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
static inline void membarrier_execve(struct task_struct *t)
{
}
+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+}
#endif

#endif /* _LINUX_SCHED_MM_H */
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index d252506e1b5e..5891d7614c8c 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -73,7 +73,7 @@
* to and return from the system call
* (non-running threads are de facto in such a
* state). This only covers threads from the
- * same processes as the caller thread. This
+ * same process as the caller thread. This
* command returns 0 on success. The
* "expedited" commands complete faster than
* the non-expedited ones, they never block,
@@ -86,6 +86,34 @@
* Register the process intent to use
* MEMBARRIER_CMD_PRIVATE_EXPEDITED. Always
* returns 0.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+ * In addition to providing the memory ordering
+ * guarantees described in
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon
+ * return from system call the caller thread
+ * is ensured that all its running thread
+ * siblings have executed a core serializing
+ * instruction. (architectures are required to
+ * guarantee that non-running threads issue
+ * core serializing instructions before they
+ * resume user-space execution). This only
+ * covers threads from the same process as the
+ * caller thread. This command returns 0 on
+ * success. The "expedited" commands complete
+ * faster than the non-expedited ones, they
+ * never block, but have the downside of
+ * causing extra overhead. If this command is
+ * not implemented by an architecture, -EINVAL
+ * is returned. A process needs to register its
+ * intent to use the private expedited sync
+ * core command prior to using it, otherwise
+ * this command returns -EPERM.
+ * @MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+ * Register the process intent to use
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.
+ * If this command is not implemented by an
+ * architecture, -EINVAL is returned.
+ * Returns 0 on success.
* @MEMBARRIER_CMD_SHARED:
* Alias to MEMBARRIER_CMD_GLOBAL. Provided for
* header backward compatibility.
@@ -101,6 +129,8 @@ enum membarrier_cmd {
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = (1 << 2),
MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 5),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 6),

/* Alias for header backward compatibility. */
MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
diff --git a/init/Kconfig b/init/Kconfig
index 30208da2221f..30b65febeb23 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1415,6 +1415,9 @@ config USERFAULTFD
config ARCH_HAS_MEMBARRIER_HOOKS
bool

+config ARCH_HAS_MEMBARRIER_SYNC_CORE
+ bool
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f38c4c7e256a..48c3c1b83354 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2658,13 +2658,21 @@ static struct rq *finish_task_switch(struct task_struct *prev)

fire_sched_in_preempt_notifiers(current);
/*
- * When transitioning from a kernel thread to a userspace
- * thread, mmdrop()'s implicit full barrier is required by the
- * membarrier system call, because the current active_mm can
- * become the current mm without going through switch_mm().
+ * When switching through a kernel thread, the loop in
+ * membarrier_{private,global}_expedited() may have observed that
+ * kernel thread and not issued an IPI. It is therefore possible to
+ * schedule between user->kernel->user threads without passing through
+ * switch_mm(). Membarrier requires a barrier after storing to
+ * rq->curr, before returning to userspace, so provide them here:
+ *
+ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
+ * provided by mmdrop(),
+ * - a sync_core for SYNC_CORE.
*/
- if (mm)
+ if (mm) {
+ membarrier_mm_sync_core_before_usermode(mm);
mmdrop(mm);
+ }
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index d2087d5f9837..5d0762633639 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -26,11 +26,20 @@
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK \
+ (MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE \
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE)
+#else
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK 0
+#endif
+
#define MEMBARRIER_CMD_BITMASK \
(MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_PRIVATE_EXPEDITED \
- | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED \
+ | MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK)

static void ipi_mb(void *info)
{
@@ -104,15 +113,23 @@ static int membarrier_global_expedited(void)
return 0;
}

-static int membarrier_private_expedited(void)
+static int membarrier_private_expedited(int flags)
{
int cpu;
bool fallback = false;
cpumask_var_t tmpmask;

- if (!(atomic_read(&current->mm->membarrier_state)
- & MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
- return -EPERM;
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+ return -EINVAL;
+ if (!(atomic_read(&current->mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
+ return -EPERM;
+ } else {
+ if (!(atomic_read(&current->mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
+ return -EPERM;
+ }

if (num_online_cpus() == 1)
return 0;
@@ -205,20 +222,29 @@ static int membarrier_register_global_expedited(void)
return 0;
}

-static int membarrier_register_private_expedited(void)
+static int membarrier_register_private_expedited(int flags)
{
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
+
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+ return -EINVAL;
+ state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+ }

/*
* We need to consider threads belonging to different thread
* groups, which use the same mm. (CLONE_VM but not
* CLONE_THREAD).
*/
- if (atomic_read(&mm->membarrier_state)
- & MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
+ if (atomic_read(&mm->membarrier_state) & state)
return 0;
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE)
+ atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
+ &mm->membarrier_state);
if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
/*
* Ensure all future scheduler executions will observe the
@@ -226,8 +252,7 @@ static int membarrier_register_private_expedited(void)
*/
synchronize_sched();
}
- atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
- &mm->membarrier_state);
+ atomic_or(state, &mm->membarrier_state);
return 0;
}

@@ -283,9 +308,13 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
case MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
return membarrier_register_global_expedited();
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
- return membarrier_private_expedited();
+ return membarrier_private_expedited(0);
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
- return membarrier_register_private_expedited();
+ return membarrier_register_private_expedited(0);
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+ return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+ case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+ return membarrier_register_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
default:
return -EINVAL;
}
--
2.11.0


2018-01-29 20:24:30

by Mathieu Desnoyers

Subject: [PATCH for 4.16 10/11] membarrier: arm64: Provide core serializing command

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: Russell King <[email protected]>
CC: Greg Hackmann <[email protected]>
CC: Will Deacon <[email protected]>
CC: David Sehr <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/entry.S | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e1414f..5b0c06d8dbbe 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -16,6 +16,7 @@ config ARM64
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d14b8f29b5f..5edde1c2e93e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -302,6 +302,10 @@ alternative_else_nop_endif
ldp x28, x29, [sp, #16 * 14]
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on eret context synchronization
+ * when returning from IPI handler, and when returning to user-space.
+ */
eret // return to kernel
.endm

--
2.11.0


2018-01-29 20:24:48

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v5 03/11] membarrier: Document scheduler barrier requirements

Document the membarrier requirement on having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
It is provided by smp_mb__after_spinlock() in __schedule().

Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.

The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: [email protected]
---
Changes since v1:
- Update comments to match reality for code paths which are after
storing to rq->curr, before returning to user-space, based on feedback
from Andrea Parri.
Changes since v2:
- Update changelog (smp_mb__before_spinlock -> smp_mb__after_spinlock).
Based on feedback from Andrea Parri.
Changes since v3:
- Clarify comments following feeback from Peter Zijlstra.
Changes since v4:
- Update comment regarding powerpc barrier.
---
arch/x86/mm/tlb.c | 5 +++++
include/linux/sched/mm.h | 5 +++++
kernel/sched/core.c | 37 ++++++++++++++++++++++++++-----------
3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5bfe61a5e8e3..9fa7d2e0e15e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -228,6 +228,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
#endif
this_cpu_write(cpu_tlbstate.is_lazy, false);

+ /*
+ * The membarrier system call requires a full memory barrier
+ * before returning to user-space, after storing to rq->curr.
+ * Writing to CR3 provides that full memory barrier.
+ */
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1754396795f6..28aef7051d73 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -39,6 +39,11 @@ static inline void mmgrab(struct mm_struct *mm)
extern void __mmdrop(struct mm_struct *);
static inline void mmdrop(struct mm_struct *mm)
{
+ /*
+ * The implicit full barrier implied by atomic_dec_and_test is
+ * required by the membarrier system call before returning to
+ * user-space, after storing to rq->curr.
+ */
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7e06dfa804b..f38c4c7e256a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2657,6 +2657,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
finish_arch_post_lock_switch();

fire_sched_in_preempt_notifiers(current);
+ /*
+ * When transitioning from a kernel thread to a userspace
+ * thread, mmdrop()'s implicit full barrier is required by the
+ * membarrier system call, because the current active_mm can
+ * become the current mm without going through switch_mm().
+ */
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
@@ -2762,6 +2768,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
arch_start_context_switch(prev);

+ /*
+ * If mm is non-NULL, we pass through switch_mm(). If mm is
+ * NULL, we will pass through mmdrop() in finish_task_switch().
+ * Both of these contain the full memory barrier required by
+ * membarrier after storing to rq->curr, before returning to
+ * user-space.
+ */
if (!mm) {
next->active_mm = oldmm;
mmgrab(oldmm);
@@ -3298,6 +3311,9 @@ static void __sched notrace __schedule(bool preempt)
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up().
+ *
+ * The membarrier system call requires a full memory barrier
+ * after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
@@ -3345,17 +3361,16 @@ static void __sched notrace __schedule(bool preempt)
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
- * rq->curr, before returning to user-space. For TSO
- * (e.g. x86), the architecture must provide its own
- * barrier in switch_mm(). For weakly ordered machines
- * for which spin_unlock() acts as a full memory
- * barrier, finish_lock_switch() in common code takes
- * care of this barrier. For weakly ordered machines for
- * which spin_unlock() acts as a RELEASE barrier (only
- * arm64 and PowerPC), arm64 has a full barrier in
- * switch_to(), and PowerPC has
- * smp_mb__after_unlock_lock() before
- * finish_lock_switch().
+ * rq->curr, before returning to user-space.
+ *
+ * Here are the schemes providing that barrier on the
+ * various architectures:
+ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+ * switch_mm() relies on membarrier_arch_switch_mm() on PowerPC.
+ * - finish_lock_switch() for weakly-ordered
+ * architectures where spin_unlock is a full barrier,
+ * - switch_to() for arm64 (weakly-ordered, spin_unlock
+ * is a RELEASE barrier),
*/
++*switch_count;

--
2.11.0


2018-01-29 20:25:09

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread groups
are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Alan Stern <[email protected]>
CC: Will Deacon <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Alexander Viro <[email protected]>
CC: Nicholas Piggin <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
powerpc membarrier_arch_sched_in(), given that we want to specifically
check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
*_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
field on fork.
- powerpc: test thread flag instead of reading
membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
code, thus eliminating a race where concurrent commands could succeed
when they should fail if issued concurrently with process
registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
membarrier_private_expedited. Matches WRITE_ONCE() performed in
process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.
---
MAINTAINERS | 1 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/membarrier.h | 26 ++++++++++++++++++++++++++
arch/powerpc/mm/mmu_context.c | 7 +++++++
include/linux/sched/mm.h | 13 ++++++++++++-
init/Kconfig | 3 +++
kernel/sched/core.c | 10 ----------
kernel/sched/membarrier.c | 8 ++++++++
8 files changed, 58 insertions(+), 11 deletions(-)
create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 845fc25812f1..34c1ecd5a5d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8929,6 +8929,7 @@ L: [email protected]
S: Supported
F: kernel/sched/membarrier.c
F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h

MEMORY MANAGEMENT
L: [email protected]
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a44734..09b02180b8a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API if PPC64
+ select ARCH_HAS_MEMBARRIER_HOOKS
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 000000000000..98ff4f1fcf2b
--- /dev/null
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -0,0 +1,26 @@
+#ifndef _ASM_POWERPC_MEMBARRIER_H
+#define _ASM_POWERPC_MEMBARRIER_H
+
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ /*
+ * Only need the full barrier when switching between processes.
+ * Barrier when switching from kernel to userspace is not
+ * required here, given that it is implied by mmdrop(). Barrier
+ * when switching from userspace to kernel is not needed after
+ * store to rq->curr.
+ */
+ if (likely(!(atomic_read(&next->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+ return;
+
+ /*
+ * The membarrier system call requires a full memory barrier
+ * after storing to rq->curr, before going back to user-space.
+ */
+ smp_mb();
+}
+
+#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index d60a62bf4fc7..0ab297c4cfad 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@

#include <linux/mm.h>
#include <linux/cpu.h>
+#include <linux/sched/mm.h>

#include <asm/mmu_context.h>

@@ -58,6 +59,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
*
* On the read side the barrier is in pte_xchg(), which orders
* the store to the PTE vs the load of mm_cpumask.
+ *
+ * This full barrier is needed by membarrier when switching
+ * between processes after store to rq->curr, before user-space
+ * memory accesses.
*/
smp_mb();

@@ -80,6 +85,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

if (new_on_cpu)
radix_kvm_prefetch_workaround(next);
+ else
+ membarrier_arch_switch_mm(prev, next, tsk);

/*
* The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3d49b91b674d..1754396795f6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,14 +215,25 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
#ifdef CONFIG_MEMBARRIER
enum {
MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
- MEMBARRIER_STATE_SWITCH_MM = (1U << 1),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
};

+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#include <asm/membarrier.h>
+#endif
+
static inline void membarrier_execve(struct task_struct *t)
{
atomic_set(&t->mm->membarrier_state, 0);
}
#else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+}
+#endif
static inline void membarrier_execve(struct task_struct *t)
{
}
diff --git a/init/Kconfig b/init/Kconfig
index a9a2e2c86671..2d118b6adee2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1412,6 +1412,9 @@ config USERFAULTFD
Enable the userfaultfd() system call that allows to intercept and
handle page faults in userland.

+config ARCH_HAS_MEMBARRIER_HOOKS
+ bool
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..c7e06dfa804b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2653,16 +2653,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
- /*
- * The membarrier system call requires a full memory barrier
- * after storing to rq->curr, before going back to user-space.
- *
- * TODO: This smp_mb__after_unlock_lock can go away if PPC end
- * up adding a full barrier to switch_mm(), or we should figure
- * out if a smp_mb__after_unlock_lock is really the proper API
- * to use.
- */
- smp_mb__after_unlock_lock();
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 9bcbacba82a8..678577267a9a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -118,6 +118,14 @@ static void membarrier_register_private_expedited(void)
if (atomic_read(&mm->membarrier_state)
& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
return;
+ atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+ if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+ /*
+ * Ensure all future scheduler executions will observe the
+ * new thread flag state for this process.
+ */
+ synchronize_sched();
+ }
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
&mm->membarrier_state);
}
--
2.11.0


2018-01-29 20:25:19

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v3 04/11] membarrier: provide GLOBAL_EXPEDITED command

Allow expedited membarrier to be used for data shared between processes
through shared memory.

Processes wishing to receive the membarriers register with
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.

This allows for an extremely simple kernel-level implementation: we have
almost everything we need from the PRIVATE_EXPEDITED barrier code. All we
need to add is a flag in the mm_struct that is used to check whether we
need to send the IPI to the current thread of each CPU.

There is a slight downside to this approach compared to targeting
specific shared memory users: when performing a membarrier operation,
all registered "global" receivers will get the barrier, even if they
don't share a memory mapping with the sender issuing
MEMBARRIER_CMD_GLOBAL_EXPEDITED.

This registration approach seems to fit the requirement of not
disturbing processes that deeply care about real-time latency: they
simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.

In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
compatibility.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: [email protected]
---
Changes since v1:
- Add missing preempt disable around smp_call_function_many().
Changes since v2:
- UAPI documentation fix based on Thomas Gleixner's feedback.
- Rename SHARED to GLOBAL.
---
arch/powerpc/include/asm/membarrier.h | 3 +-
include/linux/sched/mm.h | 6 +-
include/uapi/linux/membarrier.h | 42 ++++++++++--
kernel/sched/membarrier.c | 120 +++++++++++++++++++++++++++++++---
4 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index 98ff4f1fcf2b..6e20bb5c74ea 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -13,7 +13,8 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
* store to rq->curr.
*/
if (likely(!(atomic_read(&next->membarrier_state) &
- MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+ (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
return;

/*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 28aef7051d73..c7ad7a70cfef 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -219,8 +219,10 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)

#ifdef CONFIG_MEMBARRIER
enum {
- MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
- MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY = (1U << 2),
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED = (1U << 3),
};

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 4e01ad7ffe98..d252506e1b5e 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -31,7 +31,7 @@
* enum membarrier_cmd - membarrier system call command
* @MEMBARRIER_CMD_QUERY: Query the set of supported commands. It returns
* a bitmask of valid commands.
- * @MEMBARRIER_CMD_SHARED: Execute a memory barrier on all running threads.
+ * @MEMBARRIER_CMD_GLOBAL: Execute a memory barrier on all running threads.
* Upon return from system call, the caller thread
* is ensured that all running threads have passed
* through a state where all memory accesses to
@@ -40,6 +40,28 @@
* (non-running threads are de facto in such a
* state). This covers threads from all processes
* running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_GLOBAL_EXPEDITED:
+ * Execute a memory barrier on all running threads
+ * of all processes which previously registered
+ * with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
+ * Upon return from system call, the caller thread
+ * is ensured that all running threads have passed
+ * through a state where all memory accesses to
+ * user-space addresses match program order between
+ * entry to and return from the system call
+ * (non-running threads are de facto in such a
+ * state). This only covers threads from processes
+ * which registered with
+ * MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
+ * This command returns 0. Given that
+ * registration is about the intent to receive
+ * the barriers, it is valid to invoke
+ * MEMBARRIER_CMD_GLOBAL_EXPEDITED from a
+ * non-registered process.
+ * @MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
+ * Register the process intent to receive
+ * MEMBARRIER_CMD_GLOBAL_EXPEDITED memory
+ * barriers. Always returns 0.
* @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
* Execute a memory barrier on each running
* thread belonging to the same process as the current
@@ -64,18 +86,24 @@
* Register the process intent to use
* MEMBARRIER_CMD_PRIVATE_EXPEDITED. Always
* returns 0.
+ * @MEMBARRIER_CMD_SHARED:
+ * Alias to MEMBARRIER_CMD_GLOBAL. Provided for
+ * header backward compatibility.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
* the value 0.
*/
enum membarrier_cmd {
- MEMBARRIER_CMD_QUERY = 0,
- MEMBARRIER_CMD_SHARED = (1 << 0),
- /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
- /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
- MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
- MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+ MEMBARRIER_CMD_QUERY = 0,
+ MEMBARRIER_CMD_GLOBAL = (1 << 0),
+ MEMBARRIER_CMD_GLOBAL_EXPEDITED = (1 << 1),
+ MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = (1 << 2),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+
+ /* Alias for header backward compatibility. */
+ MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
};

#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 678577267a9a..d2087d5f9837 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -27,7 +27,9 @@
* except MEMBARRIER_CMD_QUERY.
*/
#define MEMBARRIER_CMD_BITMASK \
- (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED \
+ (MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
+ | MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
+ | MEMBARRIER_CMD_PRIVATE_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)

static void ipi_mb(void *info)
@@ -35,6 +37,73 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}

+static int membarrier_global_expedited(void)
+{
+ int cpu;
+ bool fallback = false;
+ cpumask_var_t tmpmask;
+
+ if (num_online_cpus() == 1)
+ return 0;
+
+ /*
+ * Matches memory barriers around rq->curr modification in
+ * scheduler.
+ */
+ smp_mb(); /* system call entry is not a mb. */
+
+ /*
+ * Expedited membarrier commands guarantee that they won't
+ * block, hence the GFP_NOWAIT allocation flag and fallback
+ * implementation.
+ */
+ if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+ /* Fallback for OOM. */
+ fallback = true;
+ }
+
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ /*
+ * Skipping the current CPU is OK even though we can be
+ * migrated at any point. The current CPU, at the point
+ * where we read raw_smp_processor_id(), is ensured to
+ * be in program order with respect to the caller
+ * thread. Therefore, we can skip this CPU from the
+ * iteration.
+ */
+ if (cpu == raw_smp_processor_id())
+ continue;
+ rcu_read_lock();
+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
+ if (!fallback)
+ __cpumask_set_cpu(cpu, tmpmask);
+ else
+ smp_call_function_single(cpu, ipi_mb, NULL, 1);
+ }
+ rcu_read_unlock();
+ }
+ if (!fallback) {
+ preempt_disable();
+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ preempt_enable();
+ free_cpumask_var(tmpmask);
+ }
+ cpus_read_unlock();
+
+ /*
+ * Memory barrier on the caller thread _after_ we finished
+ * waiting for the last IPI. Matches memory barriers around
+ * rq->curr modification in scheduler.
+ */
+ smp_mb(); /* exit from system call is not a mb */
+ return 0;
+}
+
static int membarrier_private_expedited(void)
{
int cpu;
@@ -105,7 +174,38 @@ static int membarrier_private_expedited(void)
return 0;
}

-static void membarrier_register_private_expedited(void)
+static int membarrier_register_global_expedited(void)
+{
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ if (atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
+ return 0;
+ atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
+ if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) {
+ /*
+ * For single mm user, single threaded process, we can
+ * simply issue a memory barrier after setting
+ * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that
+ * no memory access following registration is reordered
+ * before registration.
+ */
+ smp_mb();
+ } else {
+ /*
+ * For multi-mm user threads, we need to ensure all
+ * future scheduler executions will observe the new
+ * thread flag state for this mm.
+ */
+ synchronize_sched();
+ }
+ atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY,
+ &mm->membarrier_state);
+ return 0;
+}
+
+static int membarrier_register_private_expedited(void)
{
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
@@ -117,7 +217,7 @@ static void membarrier_register_private_expedited(void)
*/
if (atomic_read(&mm->membarrier_state)
& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
- return;
+ return 0;
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
/*
@@ -128,6 +228,7 @@ static void membarrier_register_private_expedited(void)
}
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
&mm->membarrier_state);
+ return 0;
}

/**
@@ -167,21 +268,24 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
int cmd_mask = MEMBARRIER_CMD_BITMASK;

if (tick_nohz_full_enabled())
- cmd_mask &= ~MEMBARRIER_CMD_SHARED;
+ cmd_mask &= ~MEMBARRIER_CMD_GLOBAL;
return cmd_mask;
}
- case MEMBARRIER_CMD_SHARED:
- /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
+ case MEMBARRIER_CMD_GLOBAL:
+ /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
if (tick_nohz_full_enabled())
return -EINVAL;
if (num_online_cpus() > 1)
synchronize_sched();
return 0;
+ case MEMBARRIER_CMD_GLOBAL_EXPEDITED:
+ return membarrier_global_expedited();
+ case MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
+ return membarrier_register_global_expedited();
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
return membarrier_private_expedited();
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
- membarrier_register_private_expedited();
- return 0;
+ return membarrier_register_private_expedited();
default:
return -EINVAL;
}
--
2.11.0


2018-01-29 20:25:28

by Mathieu Desnoyers

Subject: [PATCH for 4.16 v2 05/11] membarrier: selftest: Test global expedited cmd

Test the new MEMBARRIER_CMD_GLOBAL_EXPEDITED and
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED commands.

Adapt to the MEMBARRIER_CMD_SHARED -> MEMBARRIER_CMD_GLOBAL rename.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Alan Stern <[email protected]>
CC: Will Deacon <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Alice Ferrazzi <[email protected]>
CC: Paul Elder <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Rename SHARED to GLOBAL.
---
.../testing/selftests/membarrier/membarrier_test.c | 59 ++++++++++++++++++++--
1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index e6ee73d01fa1..eec39a0e1f77 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -59,10 +59,10 @@ static int test_membarrier_flags_fail(void)
return 0;
}

-static int test_membarrier_shared_success(void)
+static int test_membarrier_global_success(void)
{
- int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
- const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED";
+ int cmd = MEMBARRIER_CMD_GLOBAL, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_GLOBAL";

if (sys_membarrier(cmd, flags) != 0) {
ksft_exit_fail_msg(
@@ -132,6 +132,40 @@ static int test_membarrier_private_expedited_success(void)
return 0;
}

+static int test_membarrier_register_global_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_global_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_GLOBAL_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_GLOBAL_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
static int test_membarrier(void)
{
int status;
@@ -142,7 +176,7 @@ static int test_membarrier(void)
status = test_membarrier_flags_fail();
if (status)
return status;
- status = test_membarrier_shared_success();
+ status = test_membarrier_global_success();
if (status)
return status;
status = test_membarrier_private_expedited_fail();
@@ -154,6 +188,19 @@ static int test_membarrier(void)
status = test_membarrier_private_expedited_success();
if (status)
return status;
+ /*
+ * It is valid to send a global membarrier from a non-registered
+ * process.
+ */
+ status = test_membarrier_global_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_register_global_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_global_expedited_success();
+ if (status)
+ return status;
return 0;
}

@@ -173,8 +220,10 @@ static int test_membarrier_query(void)
}
ksft_exit_fail_msg("sys_membarrier() failed\n");
}
- if (!(ret & MEMBARRIER_CMD_SHARED))
+ if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
+ ksft_test_result_fail("sys_membarrier() CMD_GLOBAL query failed\n");
ksft_exit_fail_msg("sys_membarrier is not supported.\n");
+ }

ksft_test_result_pass("sys_membarrier available\n");
return 0;
--
2.11.0


2018-01-29 20:25:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH for 4.16 v2 01/11] membarrier: selftest: Test private expedited cmd

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED commands.

Add checks expecting specific error values on system calls expected to
fail.
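
As an illustration only (not part of this patch), the caller-visible
behavior these checks exercise can be summarized as follows; the wrapper
mirrors the selftest's syscall helper, and the header is assumed to
already define the new commands:

  #include <assert.h>
  #include <errno.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/membarrier.h>

  static int sys_membarrier(int cmd, int flags)
  {
          return syscall(__NR_membarrier, cmd, flags);
  }

  static void check_private_expedited_registration(void)
  {
          /* Before registration, the expedited command fails with EPERM. */
          assert(sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == -1);
          assert(errno == EPERM);

          /* Registration is expected to succeed. */
          assert(sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0);

          /* After registration, the expedited command succeeds. */
          assert(sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == 0);
  }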

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Alan Stern <[email protected]>
CC: Will Deacon <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Alice Ferrazzi <[email protected]>
CC: Paul Elder <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- return result of ksft_exit_pass from main(), silencing compiler
warning about missing return value.
---
.../testing/selftests/membarrier/membarrier_test.c | 111 ++++++++++++++++++---
1 file changed, 95 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index 9e674d9514d1..e6ee73d01fa1 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -16,49 +16,119 @@ static int sys_membarrier(int cmd, int flags)
static int test_membarrier_cmd_fail(void)
{
int cmd = -1, flags = 0;
+ const char *test_name = "sys membarrier invalid command";

if (sys_membarrier(cmd, flags) != -1) {
ksft_exit_fail_msg(
- "sys membarrier invalid command test: command = %d, flags = %d. Should fail, but passed\n",
- cmd, flags);
+ "%s test: command = %d, flags = %d. Should fail, but passed\n",
+ test_name, cmd, flags);
+ }
+ if (errno != EINVAL) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EINVAL, strerror(EINVAL),
+ errno, strerror(errno));
}

ksft_test_result_pass(
- "sys membarrier invalid command test: command = %d, flags = %d. Failed as expected\n",
- cmd, flags);
+ "%s test: command = %d, flags = %d, errno = %d. Failed as expected\n",
+ test_name, cmd, flags, errno);
return 0;
}

static int test_membarrier_flags_fail(void)
{
int cmd = MEMBARRIER_CMD_QUERY, flags = 1;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_QUERY invalid flags";

if (sys_membarrier(cmd, flags) != -1) {
ksft_exit_fail_msg(
- "sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Should fail, but passed\n",
- flags);
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EINVAL) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EINVAL, strerror(EINVAL),
+ errno, strerror(errno));
}

ksft_test_result_pass(
- "sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Failed as expected\n",
- flags);
+ "%s test: flags = %d, errno = %d. Failed as expected\n",
+ test_name, flags, errno);
return 0;
}

-static int test_membarrier_success(void)
+static int test_membarrier_shared_success(void)
{
int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
- const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED\n";
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n", test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_fail(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED not registered failure";
+
+ if (sys_membarrier(cmd, flags) != -1) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EPERM) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EPERM, strerror(EPERM),
+ errno, strerror(errno));
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ return 0;
+}
+
+static int test_membarrier_register_private_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED";

if (sys_membarrier(cmd, flags) != 0) {
ksft_exit_fail_msg(
- "sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
- flags);
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
}

ksft_test_result_pass(
- "sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
- flags);
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
return 0;
}

@@ -72,7 +142,16 @@ static int test_membarrier(void)
status = test_membarrier_flags_fail();
if (status)
return status;
- status = test_membarrier_success();
+ status = test_membarrier_shared_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_fail();
+ if (status)
+ return status;
+ status = test_membarrier_register_private_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_success();
if (status)
return status;
return 0;
@@ -108,5 +187,5 @@ int main(int argc, char **argv)
test_membarrier_query();
test_membarrier();

- ksft_exit_pass();
+ return ksft_exit_pass();
}
--
2.11.0


2018-01-29 20:25:51

by Mathieu Desnoyers

[permalink] [raw]
Subject: [PATCH for 4.16 v2 06/11] Introduce sync_core_before_usermode

Introduce an architecture function that ensures the current CPU
issues a core serializing instruction before returning to usermode.

This is needed for the membarrier "sync_core" command.

Architectures defining the sync_core_before_usermode() static inline
need to select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.
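
As a sketch only (not part of this patch), an architecture whose return
to user-space is not core serializing would opt in along these lines;
"foo" and foo_core_serialize() are placeholders, not real kernel
symbols:

  /* arch/foo/Kconfig: select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE */

  /* arch/foo/include/asm/sync_core.h */
  #ifndef _ASM_FOO_SYNC_CORE_H
  #define _ASM_FOO_SYNC_CORE_H

  static inline void sync_core_before_usermode(void)
  {
          /*
           * Issue whatever instruction sequence the architecture
           * defines as core serializing before returning to user-mode.
           */
          foo_core_serialize();
  }

  #endif /* _ASM_FOO_SYNC_CORE_H */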

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Dave Watson <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andrea Parri <[email protected]>
CC: Russell King <[email protected]>
CC: Greg Hackmann <[email protected]>
CC: Will Deacon <[email protected]>
CC: David Sehr <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Arnd Bergmann <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Introduce include/linux/sync_core.h
---
include/linux/sync_core.h | 21 +++++++++++++++++++++
init/Kconfig | 3 +++
2 files changed, 24 insertions(+)
create mode 100644 include/linux/sync_core.h

diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
new file mode 100644
index 000000000000..013da4b8b327
--- /dev/null
+++ b/include/linux/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SYNC_CORE_H
+#define _LINUX_SYNC_CORE_H
+
+#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+#include <asm/sync_core.h>
+#else
+/*
+ * This is a dummy sync_core_before_usermode() implementation that can be used
+ * on all architectures which return to user-space through core serializing
+ * instructions.
+ * If your architecture returns to user-space through non-core-serializing
+ * instructions, you need to write your own functions.
+ */
+static inline void sync_core_before_usermode(void)
+{
+}
+#endif
+
+#endif /* _LINUX_SYNC_CORE_H */
+
diff --git a/init/Kconfig b/init/Kconfig
index 2d118b6adee2..30208da2221f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1918,3 +1918,6 @@ config ASN1
functions to call on what tags.

source "kernel/Kconfig.locks"
+
+config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+ bool
--
2.11.0


2018-02-05 20:23:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()


* Mathieu Desnoyers <[email protected]> wrote:

>
> +config ARCH_HAS_MEMBARRIER_HOOKS
> + bool

Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it
through the rest of the patches. "Callback" is the canonical name, and I also
cringe every time I see 'hook'.

Please let me know if there are any big objections against this minor cleanup.

Thanks,

Ingo

2018-02-05 20:33:54

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

----- On Feb 5, 2018, at 3:22 PM, Ingo Molnar [email protected] wrote:

> * Mathieu Desnoyers <[email protected]> wrote:
>
>>
>> +config ARCH_HAS_MEMBARRIER_HOOKS
>> + bool
>
> Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it
> through the rest of the patches. "Callback" is the canonical name, and I also
> cringe every time I see 'hook'.
>
> Please let me know if there are any big objections against this minor cleanup.

No objection at all,

Thanks!

Mathieu

>
> Thanks,
>
> Ingo

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-02-05 21:24:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH for 4.16 00/11] membarrier updates for 4.16


Hi,

Find below the interdiff of the (minor) edits I've done to the series, before
applying them to tip:sched/urgent.

I've also tidied up some of the changelogs.

Nothing earth-shattering.

Thanks,

Ingo

=========>

arch/powerpc/Kconfig | 2 +-
arch/x86/entry/entry_32.S | 2 +-
arch/x86/entry/entry_64.S | 2 +-
include/linux/sched/mm.h | 6 +++---
init/Kconfig | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 09b02180b8a0..a2380de50878 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,7 +140,7 @@ config PPC
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API if PPC64
- select ARCH_HAS_MEMBARRIER_HOOKS
+ select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d8ba23d0d77a..abee6d2b9311 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -567,7 +567,7 @@ ENTRY(entry_INT80_32)
RESTORE_REGS 4 # skip orig_eax/error_code
.Lirq_return:
/*
- * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on iret core serialization
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
* when returning from IPI handler and when returning from
* scheduler to user-space.
*/
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49852d9920da..5816858c8820 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -805,7 +805,7 @@ GLOBAL(restore_regs_and_return_to_kernel)
POP_C_REGS
addq $8, %rsp /* skip regs->orig_ax */
/*
- * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on iret core serialization
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
* when returning from IPI handler.
*/
INTERRUPT_RETURN
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index a7840f0f8832..03a169087a18 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -41,7 +41,7 @@ extern void __mmdrop(struct mm_struct *);
static inline void mmdrop(struct mm_struct *mm)
{
/*
- * The implicit full barrier implied by atomic_dec_and_test is
+ * The implicit full barrier implied by atomic_dec_and_test() is
* required by the membarrier system call before returning to
* user-space, after storing to rq->curr.
*/
@@ -232,7 +232,7 @@ enum {
MEMBARRIER_FLAG_SYNC_CORE = (1U << 0),
};

-#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
#include <asm/membarrier.h>
#endif

@@ -249,7 +249,7 @@ static inline void membarrier_execve(struct task_struct *t)
atomic_set(&t->mm->membarrier_state, 0);
}
#else
-#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
struct mm_struct *next,
struct task_struct *tsk)
diff --git a/init/Kconfig b/init/Kconfig
index 30b65febeb23..e37f4b2a6445 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1412,7 +1412,7 @@ config USERFAULTFD
Enable the userfaultfd() system call that allows to intercept and
handle page faults in userland.

-config ARCH_HAS_MEMBARRIER_HOOKS
+config ARCH_HAS_MEMBARRIER_CALLBACKS
bool

config ARCH_HAS_MEMBARRIER_SYNC_CORE

Subject: [tip:sched/urgent] membarrier/selftest: Test private expedited command

Commit-ID: 667ca1ec7c9eb7ac3b80590b6597151b4c2a750b
Gitweb: https://git.kernel.org/tip/667ca1ec7c9eb7ac3b80590b6597151b4c2a750b
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:10 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:33:29 +0100

membarrier/selftest: Test private expedited command

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED commands.

Add checks expecting specific error values on system calls expected to
fail.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Alice Ferrazzi <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Elder <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
.../testing/selftests/membarrier/membarrier_test.c | 111 ++++++++++++++++++---
1 file changed, 95 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index 9e674d9..e6ee73d 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -16,49 +16,119 @@ static int sys_membarrier(int cmd, int flags)
static int test_membarrier_cmd_fail(void)
{
int cmd = -1, flags = 0;
+ const char *test_name = "sys membarrier invalid command";

if (sys_membarrier(cmd, flags) != -1) {
ksft_exit_fail_msg(
- "sys membarrier invalid command test: command = %d, flags = %d. Should fail, but passed\n",
- cmd, flags);
+ "%s test: command = %d, flags = %d. Should fail, but passed\n",
+ test_name, cmd, flags);
+ }
+ if (errno != EINVAL) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EINVAL, strerror(EINVAL),
+ errno, strerror(errno));
}

ksft_test_result_pass(
- "sys membarrier invalid command test: command = %d, flags = %d. Failed as expected\n",
- cmd, flags);
+ "%s test: command = %d, flags = %d, errno = %d. Failed as expected\n",
+ test_name, cmd, flags, errno);
return 0;
}

static int test_membarrier_flags_fail(void)
{
int cmd = MEMBARRIER_CMD_QUERY, flags = 1;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_QUERY invalid flags";

if (sys_membarrier(cmd, flags) != -1) {
ksft_exit_fail_msg(
- "sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Should fail, but passed\n",
- flags);
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EINVAL) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EINVAL, strerror(EINVAL),
+ errno, strerror(errno));
}

ksft_test_result_pass(
- "sys membarrier MEMBARRIER_CMD_QUERY invalid flags test: flags = %d. Failed as expected\n",
- flags);
+ "%s test: flags = %d, errno = %d. Failed as expected\n",
+ test_name, flags, errno);
return 0;
}

-static int test_membarrier_success(void)
+static int test_membarrier_shared_success(void)
{
int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
- const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED\n";
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n", test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_fail(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED not registered failure";
+
+ if (sys_membarrier(cmd, flags) != -1) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EPERM) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EPERM, strerror(EPERM),
+ errno, strerror(errno));
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ return 0;
+}
+
+static int test_membarrier_register_private_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED";

if (sys_membarrier(cmd, flags) != 0) {
ksft_exit_fail_msg(
- "sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
- flags);
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
}

ksft_test_result_pass(
- "sys membarrier MEMBARRIER_CMD_SHARED test: flags = %d\n",
- flags);
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
return 0;
}

@@ -72,7 +142,16 @@ static int test_membarrier(void)
status = test_membarrier_flags_fail();
if (status)
return status;
- status = test_membarrier_success();
+ status = test_membarrier_shared_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_fail();
+ if (status)
+ return status;
+ status = test_membarrier_register_private_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_success();
if (status)
return status;
return 0;
@@ -108,5 +187,5 @@ int main(int argc, char **argv)
test_membarrier_query();
test_membarrier();

- ksft_exit_pass();
+ return ksft_exit_pass();
}

Subject: [tip:sched/urgent] powerpc, membarrier: Skip memory barrier in switch_mm()

Commit-ID: 3ccfebedd8cf54e291c809c838d8ad5cc00f5688
Gitweb: https://git.kernel.org/tip/3ccfebedd8cf54e291c809c838d8ad5cc00f5688
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:11 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:02 +0100

powerpc, membarrier: Skip memory barrier in switch_mm()

Allow PowerPC to skip the full memory barrier in switch_mm(), and only
issue the barrier when scheduling into a task belonging to a process
that has registered its intent to use the private expedited command.

The case of threads targeting the same VM but belonging to different
thread groups is tricky. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
MAINTAINERS | 1 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/membarrier.h | 26 ++++++++++++++++++++++++++
arch/powerpc/mm/mmu_context.c | 7 +++++++
include/linux/sched/mm.h | 13 ++++++++++++-
init/Kconfig | 3 +++
kernel/sched/core.c | 10 ----------
kernel/sched/membarrier.c | 8 ++++++++
8 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 217a875..8e96d4e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8944,6 +8944,7 @@ L: [email protected]
S: Supported
F: kernel/sched/membarrier.c
F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h

MEMORY MANAGEMENT
L: [email protected]
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a..a2380de 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API if PPC64
+ select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 0000000..98ff4f1
--- /dev/null
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -0,0 +1,26 @@
+#ifndef _ASM_POWERPC_MEMBARRIER_H
+#define _ASM_POWERPC_MEMBARRIER_H
+
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ /*
+ * Only need the full barrier when switching between processes.
+ * Barrier when switching from kernel to userspace is not
+ * required here, given that it is implied by mmdrop(). Barrier
+ * when switching from userspace to kernel is not needed after
+ * store to rq->curr.
+ */
+ if (likely(!(atomic_read(&next->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+ return;
+
+ /*
+ * The membarrier system call requires a full memory barrier
+ * after storing to rq->curr, before going back to user-space.
+ */
+ smp_mb();
+}
+
+#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index d60a62b..0ab297c 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@

#include <linux/mm.h>
#include <linux/cpu.h>
+#include <linux/sched/mm.h>

#include <asm/mmu_context.h>

@@ -58,6 +59,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
*
* On the read side the barrier is in pte_xchg(), which orders
* the store to the PTE vs the load of mm_cpumask.
+ *
+ * This full barrier is needed by membarrier when switching
+ * between processes after store to rq->curr, before user-space
+ * memory accesses.
*/
smp_mb();

@@ -80,6 +85,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

if (new_on_cpu)
radix_kvm_prefetch_workaround(next);
+ else
+ membarrier_arch_switch_mm(prev, next, tsk);

/*
* The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3d49b91..26307cd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,14 +215,25 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
#ifdef CONFIG_MEMBARRIER
enum {
MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
- MEMBARRIER_STATE_SWITCH_MM = (1U << 1),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
};

+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
+#include <asm/membarrier.h>
+#endif
+
static inline void membarrier_execve(struct task_struct *t)
{
atomic_set(&t->mm->membarrier_state, 0);
}
#else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+}
+#endif
static inline void membarrier_execve(struct task_struct *t)
{
}
diff --git a/init/Kconfig b/init/Kconfig
index a9a2e2c..837adcf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1412,6 +1412,9 @@ config USERFAULTFD
Enable the userfaultfd() system call that allows to intercept and
handle page faults in userland.

+config ARCH_HAS_MEMBARRIER_CALLBACKS
+ bool
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3da7a24..ead0c21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2698,16 +2698,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
- /*
- * The membarrier system call requires a full memory barrier
- * after storing to rq->curr, before going back to user-space.
- *
- * TODO: This smp_mb__after_unlock_lock can go away if PPC end
- * up adding a full barrier to switch_mm(), or we should figure
- * out if a smp_mb__after_unlock_lock is really the proper API
- * to use.
- */
- smp_mb__after_unlock_lock();
finish_task(prev);
finish_lock_switch(rq);
finish_arch_post_lock_switch();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 9bcbacb..6785772 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -118,6 +118,14 @@ static void membarrier_register_private_expedited(void)
if (atomic_read(&mm->membarrier_state)
& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
return;
+ atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+ if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+ /*
+ * Ensure all future scheduler executions will observe the
+ * new thread flag state for this process.
+ */
+ synchronize_sched();
+ }
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
&mm->membarrier_state);
}

Subject: [tip:sched/urgent] membarrier/selftest: Test global expedited command

Commit-ID: 92485487ba834f6665ec13dfb2c69e80cd0c7c37
Gitweb: https://git.kernel.org/tip/92485487ba834f6665ec13dfb2c69e80cd0c7c37
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:14 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:40 +0100

membarrier/selftest: Test global expedited command

Test the new MEMBARRIER_CMD_GLOBAL_EXPEDITED and
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED commands.

Adapt to the MEMBARRIER_CMD_SHARED -> MEMBARRIER_CMD_GLOBAL rename.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Alice Ferrazzi <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Elder <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
.../testing/selftests/membarrier/membarrier_test.c | 59 ++++++++++++++++++++--
1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index e6ee73d..eec39a0 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -59,10 +59,10 @@ static int test_membarrier_flags_fail(void)
return 0;
}

-static int test_membarrier_shared_success(void)
+static int test_membarrier_global_success(void)
{
- int cmd = MEMBARRIER_CMD_SHARED, flags = 0;
- const char *test_name = "sys membarrier MEMBARRIER_CMD_SHARED";
+ int cmd = MEMBARRIER_CMD_GLOBAL, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_GLOBAL";

if (sys_membarrier(cmd, flags) != 0) {
ksft_exit_fail_msg(
@@ -132,6 +132,40 @@ static int test_membarrier_private_expedited_success(void)
return 0;
}

+static int test_membarrier_register_global_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_global_expedited_success(void)
+{
+ int cmd = MEMBARRIER_CMD_GLOBAL_EXPEDITED, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_GLOBAL_EXPEDITED";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
static int test_membarrier(void)
{
int status;
@@ -142,7 +176,7 @@ static int test_membarrier(void)
status = test_membarrier_flags_fail();
if (status)
return status;
- status = test_membarrier_shared_success();
+ status = test_membarrier_global_success();
if (status)
return status;
status = test_membarrier_private_expedited_fail();
@@ -154,6 +188,19 @@ static int test_membarrier(void)
status = test_membarrier_private_expedited_success();
if (status)
return status;
+ /*
+ * It is valid to send a global membarrier from a non-registered
+ * process.
+ */
+ status = test_membarrier_global_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_register_global_expedited_success();
+ if (status)
+ return status;
+ status = test_membarrier_global_expedited_success();
+ if (status)
+ return status;
return 0;
}

@@ -173,8 +220,10 @@ static int test_membarrier_query(void)
}
ksft_exit_fail_msg("sys_membarrier() failed\n");
}
- if (!(ret & MEMBARRIER_CMD_SHARED))
+ if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
+ ksft_test_result_fail("sys_membarrier() CMD_GLOBAL query failed\n");
ksft_exit_fail_msg("sys_membarrier is not supported.\n");
+ }

ksft_test_result_pass("sys_membarrier available\n");
return 0;

Subject: [tip:sched/urgent] locking: Introduce sync_core_before_usermode()

Commit-ID: e61938a921a46347d7725badc40ec436ebfff977
Gitweb: https://git.kernel.org/tip/e61938a921a46347d7725badc40ec436ebfff977
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:15 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:50 +0100

locking: Introduce sync_core_before_usermode()

Introduce an architecture function that ensures the current CPU
issues a core serializing instruction before returning to usermode.

This is needed for the membarrier "sync_core" command.

Architectures defining the sync_core_before_usermode() static inline
need to select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sync_core.h | 21 +++++++++++++++++++++
init/Kconfig | 3 +++
2 files changed, 24 insertions(+)

diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
new file mode 100644
index 0000000..013da4b
--- /dev/null
+++ b/include/linux/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SYNC_CORE_H
+#define _LINUX_SYNC_CORE_H
+
+#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+#include <asm/sync_core.h>
+#else
+/*
+ * This is a dummy sync_core_before_usermode() implementation that can be used
+ * on all architectures which return to user-space through core serializing
+ * instructions.
+ * If your architecture returns to user-space through non-core-serializing
+ * instructions, you need to write your own functions.
+ */
+static inline void sync_core_before_usermode(void)
+{
+}
+#endif
+
+#endif /* _LINUX_SYNC_CORE_H */
+
diff --git a/init/Kconfig b/init/Kconfig
index 837adcf..535421f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1918,3 +1918,6 @@ config ASN1
functions to call on what tags.

source "kernel/Kconfig.locks"
+
+config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
+ bool

Subject: [tip:sched/urgent] membarrier: Document scheduler barrier requirements

Commit-ID: 306e060435d7a3aef8f6f033e43b0f581638adce
Gitweb: https://git.kernel.org/tip/306e060435d7a3aef8f6f033e43b0f581638adce
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:12 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:21 +0100

membarrier: Document scheduler barrier requirements

Document the membarrier requirement on having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
It is provided by smp_mb__after_spinlock() in __schedule().

Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.

The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/tlb.c | 5 +++++
include/linux/sched/mm.h | 5 +++++
kernel/sched/core.c | 37 ++++++++++++++++++++++++++-----------
3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5bfe61a..9fa7d2e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -228,6 +228,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
#endif
this_cpu_write(cpu_tlbstate.is_lazy, false);

+ /*
+ * The membarrier system call requires a full memory barrier
+ * before returning to user-space, after storing to rq->curr.
+ * Writing to CR3 provides that full memory barrier.
+ */
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 26307cd..b84e0fd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -39,6 +39,11 @@ static inline void mmgrab(struct mm_struct *mm)
extern void __mmdrop(struct mm_struct *);
static inline void mmdrop(struct mm_struct *mm)
{
+ /*
+ * The implicit full barrier implied by atomic_dec_and_test() is
+ * required by the membarrier system call before returning to
+ * user-space, after storing to rq->curr.
+ */
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ead0c21..11bf4d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2703,6 +2703,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
finish_arch_post_lock_switch();

fire_sched_in_preempt_notifiers(current);
+ /*
+ * When transitioning from a kernel thread to a userspace
+ * thread, mmdrop()'s implicit full barrier is required by the
+ * membarrier system call, because the current ->active_mm can
+ * become the current mm without going through switch_mm().
+ */
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
@@ -2808,6 +2814,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
arch_start_context_switch(prev);

+ /*
+ * If mm is non-NULL, we pass through switch_mm(). If mm is
+ * NULL, we will pass through mmdrop() in finish_task_switch().
+ * Both of these contain the full memory barrier required by
+ * membarrier after storing to rq->curr, before returning to
+ * user-space.
+ */
if (!mm) {
next->active_mm = oldmm;
mmgrab(oldmm);
@@ -3344,6 +3357,9 @@ static void __sched notrace __schedule(bool preempt)
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up().
+ *
+ * The membarrier system call requires a full memory barrier
+ * after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
@@ -3391,17 +3407,16 @@ static void __sched notrace __schedule(bool preempt)
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
- * rq->curr, before returning to user-space. For TSO
- * (e.g. x86), the architecture must provide its own
- * barrier in switch_mm(). For weakly ordered machines
- * for which spin_unlock() acts as a full memory
- * barrier, finish_lock_switch() in common code takes
- * care of this barrier. For weakly ordered machines for
- * which spin_unlock() acts as a RELEASE barrier (only
- * arm64 and PowerPC), arm64 has a full barrier in
- * switch_to(), and PowerPC has
- * smp_mb__after_unlock_lock() before
- * finish_lock_switch().
+ * rq->curr, before returning to user-space.
+ *
+ * Here are the schemes providing that barrier on the
+ * various architectures:
+ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+ * switch_mm() relies on membarrier_arch_switch_mm() on PowerPC.
+ * - finish_lock_switch() for weakly-ordered
+ * architectures where spin_unlock is a full barrier,
+ * - switch_to() for arm64 (weakly-ordered, spin_unlock
+ * is a RELEASE barrier),
*/
++*switch_count;


Subject: [tip:sched/urgent] locking/x86: Implement sync_core_before_usermode()

Commit-ID: ac1ab12a3e6e878274e7107c8c6f326694a1c1f3
Gitweb: https://git.kernel.org/tip/ac1ab12a3e6e878274e7107c8c6f326694a1c1f3
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:16 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:57 +0100

locking/x86: Implement sync_core_before_usermode()

Ensure that a core serializing instruction is issued before returning to
user-mode. x86 implements return to user-space through sysexit, sysretl,
and sysretq, which are not core serializing.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/sync_core.h | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b6..31030ad 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
+ select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
new file mode 100644
index 0000000..c67caaf
--- /dev/null
+++ b/arch/x86/include/asm/sync_core.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SYNC_CORE_H
+#define _ASM_X86_SYNC_CORE_H
+
+#include <linux/preempt.h>
+#include <asm/processor.h>
+#include <asm/cpufeature.h>
+
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode. x86 implements return to user-space through sysexit,
+ * sysretl, and sysretq, which are not core serializing.
+ */
+static inline void sync_core_before_usermode(void)
+{
+ /* With PTI, we unconditionally serialize before running user code. */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ return;
+ /*
+ * Return from interrupt and NMI is done through iret, which is core
+ * serializing.
+ */
+ if (in_irq() || in_nmi())
+ return;
+ sync_core();
+}
+
+#endif /* _ASM_X86_SYNC_CORE_H */

Subject: [tip:sched/urgent] membarrier: Provide GLOBAL_EXPEDITED command

Commit-ID: c5f58bd58f432be5d92df33c5458e0bcbee3aadf
Gitweb: https://git.kernel.org/tip/c5f58bd58f432be5d92df33c5458e0bcbee3aadf
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:13 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:34:31 +0100

membarrier: Provide GLOBAL_EXPEDITED command

Allow expedited membarrier to be used for data shared between processes
through shared memory.

Processes wishing to receive the membarriers register with
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue a
membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
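
As a usage illustration (not part of this patch), the receiver side
registers once and the sender side then issues the expedited barrier;
both are plain membarrier(2) invocations through syscall(2):

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/membarrier.h>

  /* Receiver: opt in to receiving expedited global membarriers. */
  static void register_global_receiver(void)
  {
          syscall(__NR_membarrier, MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, 0);
  }

  /* Sender: may be a different, even non-registered, process. */
  static void issue_global_expedited(void)
  {
          syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0);
  }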

This allows an extremely simple kernel-level implementation: we have
almost everything we need with the PRIVATE_EXPEDITED barrier code. All
we need to do is add a flag in the mm_struct that will be used to check
whether we need to send the IPI to the current thread of each CPU.

There is a slight downside to this approach compared to targeting
specific shared memory users: when performing a membarrier operation,
all registered "global" receivers will get the barrier, even if they
don't share a memory mapping with the sender issuing
MEMBARRIER_CMD_GLOBAL_EXPEDITED.

This registration approach seems to fit the requirement of not
disturbing processes that deeply care about real-time: they simply
should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.

In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
compatibility.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/powerpc/include/asm/membarrier.h | 3 +-
include/linux/sched/mm.h | 6 +-
include/uapi/linux/membarrier.h | 42 ++++++++++--
kernel/sched/membarrier.c | 120 +++++++++++++++++++++++++++++++---
4 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
index 98ff4f1..6e20bb5 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -13,7 +13,8 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
* store to rq->curr.
*/
if (likely(!(atomic_read(&next->membarrier_state) &
- MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+ (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
return;

/*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b84e0fd..1c4e40c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -219,8 +219,10 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)

#ifdef CONFIG_MEMBARRIER
enum {
- MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
- MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY = (1U << 2),
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED = (1U << 3),
};

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 4e01ad7..d252506 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -31,7 +31,7 @@
* enum membarrier_cmd - membarrier system call command
* @MEMBARRIER_CMD_QUERY: Query the set of supported commands. It returns
* a bitmask of valid commands.
- * @MEMBARRIER_CMD_SHARED: Execute a memory barrier on all running threads.
+ * @MEMBARRIER_CMD_GLOBAL: Execute a memory barrier on all running threads.
* Upon return from system call, the caller thread
* is ensured that all running threads have passed
* through a state where all memory accesses to
@@ -40,6 +40,28 @@
* (non-running threads are de facto in such a
* state). This covers threads from all processes
* running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_GLOBAL_EXPEDITED:
+ * Execute a memory barrier on all running threads
+ * of all processes which previously registered
+ * with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
+ * Upon return from system call, the caller thread
+ * is ensured that all running threads have passed
+ * through a state where all memory accesses to
+ * user-space addresses match program order between
+ * entry to and return from the system call
+ * (non-running threads are de facto in such a
+ * state). This only covers threads from processes
+ * which registered with
+ * MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
+ * This command returns 0. Given that
+ * registration is about the intent to receive
+ * the barriers, it is valid to invoke
+ * MEMBARRIER_CMD_GLOBAL_EXPEDITED from a
+ * non-registered process.
+ * @MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
+ * Register the process intent to receive
+ * MEMBARRIER_CMD_GLOBAL_EXPEDITED memory
+ * barriers. Always returns 0.
* @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
* Execute a memory barrier on each running
* thread belonging to the same process as the current
@@ -64,18 +86,24 @@
* Register the process intent to use
* MEMBARRIER_CMD_PRIVATE_EXPEDITED. Always
* returns 0.
+ * @MEMBARRIER_CMD_SHARED:
+ * Alias to MEMBARRIER_CMD_GLOBAL. Provided for
+ * header backward compatibility.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
* the value 0.
*/
enum membarrier_cmd {
- MEMBARRIER_CMD_QUERY = 0,
- MEMBARRIER_CMD_SHARED = (1 << 0),
- /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
- /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
- MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
- MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+ MEMBARRIER_CMD_QUERY = 0,
+ MEMBARRIER_CMD_GLOBAL = (1 << 0),
+ MEMBARRIER_CMD_GLOBAL_EXPEDITED = (1 << 1),
+ MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = (1 << 2),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+
+ /* Alias for header backward compatibility. */
+ MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
};

#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 6785772..d2087d5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -27,7 +27,9 @@
* except MEMBARRIER_CMD_QUERY.
*/
#define MEMBARRIER_CMD_BITMASK \
- (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED \
+ (MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
+ | MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
+ | MEMBARRIER_CMD_PRIVATE_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)

static void ipi_mb(void *info)
@@ -35,6 +37,73 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}

+static int membarrier_global_expedited(void)
+{
+ int cpu;
+ bool fallback = false;
+ cpumask_var_t tmpmask;
+
+ if (num_online_cpus() == 1)
+ return 0;
+
+ /*
+ * Matches memory barriers around rq->curr modification in
+ * scheduler.
+ */
+ smp_mb(); /* system call entry is not a mb. */
+
+ /*
+ * Expedited membarrier commands guarantee that they won't
+ * block, hence the GFP_NOWAIT allocation flag and fallback
+ * implementation.
+ */
+ if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+ /* Fallback for OOM. */
+ fallback = true;
+ }
+
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ /*
+ * Skipping the current CPU is OK even though we can be
+ * migrated at any point. The current CPU, at the point
+ * where we read raw_smp_processor_id(), is ensured to
+ * be in program order with respect to the caller
+ * thread. Therefore, we can skip this CPU from the
+ * iteration.
+ */
+ if (cpu == raw_smp_processor_id())
+ continue;
+ rcu_read_lock();
+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
+ if (!fallback)
+ __cpumask_set_cpu(cpu, tmpmask);
+ else
+ smp_call_function_single(cpu, ipi_mb, NULL, 1);
+ }
+ rcu_read_unlock();
+ }
+ if (!fallback) {
+ preempt_disable();
+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ preempt_enable();
+ free_cpumask_var(tmpmask);
+ }
+ cpus_read_unlock();
+
+ /*
+ * Memory barrier on the caller thread _after_ we finished
+ * waiting for the last IPI. Matches memory barriers around
+ * rq->curr modification in scheduler.
+ */
+ smp_mb(); /* exit from system call is not a mb */
+ return 0;
+}
+
static int membarrier_private_expedited(void)
{
int cpu;
@@ -105,7 +174,38 @@ static int membarrier_private_expedited(void)
return 0;
}

-static void membarrier_register_private_expedited(void)
+static int membarrier_register_global_expedited(void)
+{
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ if (atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
+ return 0;
+ atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
+ if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) {
+ /*
+ * For single mm user, single threaded process, we can
+ * simply issue a memory barrier after setting
+ * MEMBARRIER_STATE_GLOBAL_EXPEDITED to guarantee that
+ * no memory access following registration is reordered
+ * before registration.
+ */
+ smp_mb();
+ } else {
+ /*
+ * For multi-mm user threads, we need to ensure all
+ * future scheduler executions will observe the new
+ * thread flag state for this mm.
+ */
+ synchronize_sched();
+ }
+ atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY,
+ &mm->membarrier_state);
+ return 0;
+}
+
+static int membarrier_register_private_expedited(void)
{
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
@@ -117,7 +217,7 @@ static void membarrier_register_private_expedited(void)
*/
if (atomic_read(&mm->membarrier_state)
& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
- return;
+ return 0;
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
/*
@@ -128,6 +228,7 @@ static void membarrier_register_private_expedited(void)
}
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
&mm->membarrier_state);
+ return 0;
}

/**
@@ -167,21 +268,24 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
int cmd_mask = MEMBARRIER_CMD_BITMASK;

if (tick_nohz_full_enabled())
- cmd_mask &= ~MEMBARRIER_CMD_SHARED;
+ cmd_mask &= ~MEMBARRIER_CMD_GLOBAL;
return cmd_mask;
}
- case MEMBARRIER_CMD_SHARED:
- /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
+ case MEMBARRIER_CMD_GLOBAL:
+ /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
if (tick_nohz_full_enabled())
return -EINVAL;
if (num_online_cpus() > 1)
synchronize_sched();
return 0;
+ case MEMBARRIER_CMD_GLOBAL_EXPEDITED:
+ return membarrier_global_expedited();
+ case MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
+ return membarrier_register_global_expedited();
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
return membarrier_private_expedited();
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
- membarrier_register_private_expedited();
- return 0;
+ return membarrier_register_private_expedited();
default:
return -EINVAL;
}
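
As a usage illustration (not part of the patch; it assumes a 4.16 uapi header and a libc
without a membarrier wrapper, hence the raw syscall, and the membarrier() wrapper below is
just a local helper), a process sharing memory with other registered processes could use
the new commands roughly as follows:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        /* Declare the intent to receive GLOBAL_EXPEDITED barriers. */
        if (membarrier(MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, 0)) {
                perror("membarrier register global expedited");
                return 1;
        }

        /*
         * Later, e.g. on the writer side of a shared-memory protocol:
         * when this returns, all running threads of every registered
         * process have passed through a full memory barrier.
         */
        if (membarrier(MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0)) {
                perror("membarrier global expedited");
                return 1;
        }
        return 0;
}

Note that, as documented above, issuing MEMBARRIER_CMD_GLOBAL_EXPEDITED from a
non-registered process is valid; only registered processes are guaranteed to receive the
barrier.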

Subject: [tip:sched/urgent] membarrier: Provide core serializing command, *_SYNC_CORE

Commit-ID: 70216e18e519a54a2f13569e8caff99a092a92d6
Gitweb: https://git.kernel.org/tip/70216e18e519a54a2f13569e8caff99a092a92d6
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:17 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:35:03 +0100

membarrier: Provide core serializing command, *_SYNC_CORE

Provide core serializing membarrier command to support memory reclaim
by JIT.

Each architecture needs to explicitly opt into that support by
documenting in their architecture code how they provide the core
serializing instructions required when returning from the membarrier
IPI, and after the scheduler has updated the curr->mm pointer (before
going back to user-space). They should then select
ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
their architecture.

Architectures selecting this feature need to either document that
they issue core serializing instructions when returning to user-space,
or implement their architecture-specific sync_core_before_usermode().

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/mm.h | 18 ++++++++++++++
include/uapi/linux/membarrier.h | 32 ++++++++++++++++++++++++-
init/Kconfig | 3 +++
kernel/sched/core.c | 18 ++++++++++----
kernel/sched/membarrier.c | 53 +++++++++++++++++++++++++++++++----------
5 files changed, 106 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1c4e40c..03a1690 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>
+#include <linux/sync_core.h>

/*
* Routines for handling mm_structs
@@ -223,12 +224,26 @@ enum {
MEMBARRIER_STATE_PRIVATE_EXPEDITED = (1U << 1),
MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY = (1U << 2),
MEMBARRIER_STATE_GLOBAL_EXPEDITED = (1U << 3),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY = (1U << 4),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE = (1U << 5),
+};
+
+enum {
+ MEMBARRIER_FLAG_SYNC_CORE = (1U << 0),
};

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
#include <asm/membarrier.h>
#endif

+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+ if (likely(!(atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+ return;
+ sync_core_before_usermode();
+}
+
static inline void membarrier_execve(struct task_struct *t)
{
atomic_set(&t->mm->membarrier_state, 0);
@@ -244,6 +259,9 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
static inline void membarrier_execve(struct task_struct *t)
{
}
+static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+{
+}
#endif

#endif /* _LINUX_SCHED_MM_H */
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index d252506..5891d76 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -73,7 +73,7 @@
* to and return from the system call
* (non-running threads are de facto in such a
* state). This only covers threads from the
- * same processes as the caller thread. This
+ * same process as the caller thread. This
* command returns 0 on success. The
* "expedited" commands complete faster than
* the non-expedited ones, they never block,
@@ -86,6 +86,34 @@
* Register the process intent to use
* MEMBARRIER_CMD_PRIVATE_EXPEDITED. Always
* returns 0.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+ * In addition to providing the memory
+ * ordering guarantees described in
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED, ensure
+ * that, upon return from the system call,
+ * all running sibling threads of the caller
+ * have executed a core serializing
+ * instruction. (architectures are required to
+ * guarantee that non-running threads issue
+ * core serializing instructions before they
+ * resume user-space execution). This only
+ * covers threads from the same process as the
+ * caller thread. This command returns 0 on
+ * success. The "expedited" commands complete
+ * faster than the non-expedited ones, they
+ * never block, but have the downside of
+ * causing extra overhead. If this command is
+ * not implemented by an architecture, -EINVAL
+ * is returned. A process needs to register its
+ * intent to use the private expedited sync
+ * core command prior to using it, otherwise
+ * this command returns -EPERM.
+ * @MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+ * Register the process intent to use
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.
+ * If this command is not implemented by an
+ * architecture, -EINVAL is returned.
+ * Returns 0 on success.
* @MEMBARRIER_CMD_SHARED:
* Alias to MEMBARRIER_CMD_GLOBAL. Provided for
* header backward compatibility.
@@ -101,6 +129,8 @@ enum membarrier_cmd {
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = (1 << 2),
MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 5),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 6),

/* Alias for header backward compatibility. */
MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
diff --git a/init/Kconfig b/init/Kconfig
index 535421f..e37f4b2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1415,6 +1415,9 @@ config USERFAULTFD
config ARCH_HAS_MEMBARRIER_CALLBACKS
bool

+config ARCH_HAS_MEMBARRIER_SYNC_CORE
+ bool
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11bf4d4..ee420d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2704,13 +2704,21 @@ static struct rq *finish_task_switch(struct task_struct *prev)

fire_sched_in_preempt_notifiers(current);
/*
- * When transitioning from a kernel thread to a userspace
- * thread, mmdrop()'s implicit full barrier is required by the
- * membarrier system call, because the current ->active_mm can
- * become the current mm without going through switch_mm().
+ * When switching through a kernel thread, the loop in
+ * membarrier_{private,global}_expedited() may have observed that
+ * kernel thread and not issued an IPI. It is therefore possible to
+ * schedule between user->kernel->user threads without passing through
+ * switch_mm(). Membarrier requires a barrier after storing to
+ * rq->curr, before returning to userspace, so provide them here:
+ *
+ * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
+ * provided by mmdrop(),
+ * - a sync_core for SYNC_CORE.
*/
- if (mm)
+ if (mm) {
+ membarrier_mm_sync_core_before_usermode(mm);
mmdrop(mm);
+ }
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index d2087d5..5d07626 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -26,11 +26,20 @@
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK \
+ (MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE \
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE)
+#else
+#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK 0
+#endif
+
#define MEMBARRIER_CMD_BITMASK \
(MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_PRIVATE_EXPEDITED \
- | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED \
+ | MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK)

static void ipi_mb(void *info)
{
@@ -104,15 +113,23 @@ static int membarrier_global_expedited(void)
return 0;
}

-static int membarrier_private_expedited(void)
+static int membarrier_private_expedited(int flags)
{
int cpu;
bool fallback = false;
cpumask_var_t tmpmask;

- if (!(atomic_read(&current->mm->membarrier_state)
- & MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
- return -EPERM;
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+ return -EINVAL;
+ if (!(atomic_read(&current->mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
+ return -EPERM;
+ } else {
+ if (!(atomic_read(&current->mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
+ return -EPERM;
+ }

if (num_online_cpus() == 1)
return 0;
@@ -205,20 +222,29 @@ static int membarrier_register_global_expedited(void)
return 0;
}

-static int membarrier_register_private_expedited(void)
+static int membarrier_register_private_expedited(int flags)
{
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ int state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
+
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+ return -EINVAL;
+ state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+ }

/*
* We need to consider threads belonging to different thread
* groups, which use the same mm. (CLONE_VM but not
* CLONE_THREAD).
*/
- if (atomic_read(&mm->membarrier_state)
- & MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
+ if (atomic_read(&mm->membarrier_state) & state)
return 0;
atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+ if (flags & MEMBARRIER_FLAG_SYNC_CORE)
+ atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
+ &mm->membarrier_state);
if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
/*
* Ensure all future scheduler executions will observe the
@@ -226,8 +252,7 @@ static int membarrier_register_private_expedited(void)
*/
synchronize_sched();
}
- atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
- &mm->membarrier_state);
+ atomic_or(state, &mm->membarrier_state);
return 0;
}

@@ -283,9 +308,13 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
case MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
return membarrier_register_global_expedited();
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
- return membarrier_private_expedited();
+ return membarrier_private_expedited(0);
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
- return membarrier_register_private_expedited();
+ return membarrier_register_private_expedited(0);
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
+ return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+ case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
+ return membarrier_register_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
default:
return -EINVAL;
}
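
To make the JIT-reclaim use-case concrete, here is a minimal user-space sketch (not taken
from the series; it assumes a 4.16 uapi header, uses the raw syscall, and the jit_* helper
names are made up for the example) of how a JIT could gate code-cache reuse on the new
command:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

/* Call once at startup, before the first code-cache reclaim. */
int jit_membarrier_init(void)
{
        int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

        if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE))
                return -1;      /* not supported: fall back to another scheme */
        return membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/* Call after unpublishing a code region, before reusing its memory. */
void jit_quiesce_code(void)
{
        /*
         * On return, every running sibling thread has executed a core
         * serializing instruction, so none of them can still be
         * executing stale instructions from the reclaimed region.
         */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

The selftest added next in this series exercises exactly this register-then-use ordering.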

Subject: [tip:sched/urgent] membarrier/selftest: Test private expedited sync core command

Commit-ID: 460e8c3340a265d1d70fa1ee05c42afd68b2efa0
Gitweb: https://git.kernel.org/tip/460e8c3340a265d1d70fa1ee05c42afd68b2efa0
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:20 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:35:38 +0100

membarrier/selftest: Test private expedited sync core command

Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Alice Ferrazzi <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Elder <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
.../testing/selftests/membarrier/membarrier_test.c | 73 ++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
index eec39a0..22bffd5 100644
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -132,6 +132,63 @@ static int test_membarrier_private_expedited_success(void)
return 0;
}

+static int test_membarrier_private_expedited_sync_core_fail(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE not registered failure";
+
+ if (sys_membarrier(cmd, flags) != -1) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should fail, but passed\n",
+ test_name, flags);
+ }
+ if (errno != EPERM) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
+ test_name, flags, EPERM, strerror(EPERM),
+ errno, strerror(errno));
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ return 0;
+}
+
+static int test_membarrier_register_private_expedited_sync_core_success(void)
+{
+ int cmd = MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
+static int test_membarrier_private_expedited_sync_core_success(void)
+{
+ int cmd = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, flags = 0;
+ const char *test_name = "sys membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE";
+
+ if (sys_membarrier(cmd, flags) != 0) {
+ ksft_exit_fail_msg(
+ "%s test: flags = %d, errno = %d\n",
+ test_name, flags, errno);
+ }
+
+ ksft_test_result_pass(
+ "%s test: flags = %d\n",
+ test_name, flags);
+ return 0;
+}
+
static int test_membarrier_register_global_expedited_success(void)
{
int cmd = MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, flags = 0;
@@ -188,6 +245,22 @@ static int test_membarrier(void)
status = test_membarrier_private_expedited_success();
if (status)
return status;
+ status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
+ if (status < 0) {
+ ksft_test_result_fail("sys_membarrier() failed\n");
+ return status;
+ }
+ if (status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) {
+ status = test_membarrier_private_expedited_sync_core_fail();
+ if (status)
+ return status;
+ status = test_membarrier_register_private_expedited_sync_core_success();
+ if (status)
+ return status;
+ status = test_membarrier_private_expedited_sync_core_success();
+ if (status)
+ return status;
+ }
/*
* It is valid to send a global membarrier from a non-registered
* process.

Subject: [tip:sched/urgent] membarrier/arm64: Provide core serializing command

Commit-ID: f1e3a12b6543a73cfd47aa7da16d3d73d297795e
Gitweb: https://git.kernel.org/tip/f1e3a12b6543a73cfd47aa7da16d3d73d297795e
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:19 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:35:17 +0100

membarrier/arm64: Provide core serializing command

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/entry.S | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e..5b0c06d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -16,6 +16,7 @@ config ARM64
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d14b8f..5edde1c 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -302,6 +302,10 @@ alternative_else_nop_endif
ldp x28, x29, [sp, #16 * 14]
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on eret context synchronization
+ * when returning from IPI handler, and when returning to user-space.
+ */
eret // return to kernel
.endm


Subject: [tip:sched/urgent] membarrier/x86: Provide core serializing command

Commit-ID: 10bcc80e9dbced128e3b4aa86e4737e5486a45d0
Gitweb: https://git.kernel.org/tip/10bcc80e9dbced128e3b4aa86e4737e5486a45d0
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Mon, 29 Jan 2018 15:20:18 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 5 Feb 2018 21:35:11 +0100

membarrier/x86: Provide core serializing command

There are two places where core serialization is needed by membarrier:

1) When returning from the membarrier IPI,
2) After scheduler updates curr to a thread with a different mm, before
going back to user-space, since the curr->mm is used by membarrier to
check whether it needs to send an IPI to that CPU.

x86-32 uses IRET as return from interrupt, and both IRET and SYSEXIT to go
back to user-space. The IRET instruction is core serializing, but not
SYSEXIT.

x86-64 uses IRET as return from interrupt, which takes care of the IPI.
However, it can return to user-space through either SYSRETL (compat
code), SYSRETQ, or IRET. Given that SYSRET{L,Q} is not core serializing,
we rely instead on write_cr3() performed by switch_mm() to provide core
serialization after changing the current mm, and deal with the special
case of kthread -> uthread (temporarily keeping current mm into
active_mm) by adding a sync_core() in that specific case.

Use the new sync_core_before_usermode() to guarantee this.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_32.S | 5 +++++
arch/x86/entry/entry_64.S | 4 ++++
arch/x86/mm/tlb.c | 7 ++++---
4 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 31030ad..e095bdb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV if X86_64
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2a35b1e..abee6d2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -566,6 +566,11 @@ restore_all:
.Lrestore_nocheck:
RESTORE_REGS 4 # skip orig_eax/error_code
.Lirq_return:
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on IRET core serialization
+ * when returning from IPI handler and when returning from
+ * scheduler to user-space.
+ */
INTERRUPT_RETURN

.section .fixup, "ax"
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a835704..5816858 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -804,6 +804,10 @@ GLOBAL(restore_regs_and_return_to_kernel)
POP_EXTRA_REGS
POP_C_REGS
addq $8, %rsp /* skip regs->orig_ax */
+ /*
+ * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on IRET core serialization
+ * when returning from IPI handler.
+ */
INTERRUPT_RETURN

ENTRY(native_iret)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9fa7d2e..9b34121 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -229,9 +229,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.is_lazy, false);

/*
- * The membarrier system call requires a full memory barrier
- * before returning to user-space, after storing to rq->curr.
- * Writing to CR3 provides that full memory barrier.
+ * The membarrier system call requires a full memory barrier and
+ * core serialization before returning to user-space, after
+ * storing to rq->curr. Writing to CR3 provides that full
+ * memory barrier and core serializing instruction.
*/
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=

2021-06-18 19:25:46

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()



On 29/01/2018 at 21:20, Mathieu Desnoyers wrote:
> Allow PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use expedited private.
>
> Threads targeting the same VM but which belong to different thread
> groups is a tricky case. It has a few consequences:
>
> It turns out that we cannot rely on get_nr_threads(p) to count the
> number of threads using a VM. We can use
> (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
> instead to skip the synchronize_sched() for cases where the VM only has
> a single user, and that user only has a single thread.
>
> It also turns out that we cannot use for_each_thread() to set
> thread flags in all threads using a VM, as it only iterates on the
> thread group.
>
> Therefore, test the membarrier state variable directly rather than
> relying on thread flags. This means
> membarrier_register_private_expedited() needs to set the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
> only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
> private expedited membarrier commands to succeed.
> membarrier_arch_switch_mm() now tests for the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Looking at switch_mm_irqs_off(), I found it more complex than expected, and this patch is the reason
for that complexity.

Before the patch (ie in kernel 4.14), we have:

00000000 <switch_mm_irqs_off>:
0: 81 24 01 c8 lwz r9,456(r4)
4: 71 29 00 01 andi. r9,r9,1
8: 40 82 00 1c bne 24 <switch_mm_irqs_off+0x24>
c: 39 24 01 c8 addi r9,r4,456
10: 39 40 00 01 li r10,1
14: 7d 00 48 28 lwarx r8,0,r9
18: 7d 08 53 78 or r8,r8,r10
1c: 7d 00 49 2d stwcx. r8,0,r9
20: 40 c2 ff f4 bne- 14 <switch_mm_irqs_off+0x14>
24: 7c 04 18 40 cmplw r4,r3
28: 81 24 00 24 lwz r9,36(r4)
2c: 91 25 04 4c stw r9,1100(r5)
30: 4d 82 00 20 beqlr
34: 48 00 00 00 b 34 <switch_mm_irqs_off+0x34>
34: R_PPC_REL24 switch_mmu_context


After the patch (ie in 5.13-rc6), that now is:

00000000 <switch_mm_irqs_off>:
0: 81 24 02 18 lwz r9,536(r4)
4: 71 29 00 01 andi. r9,r9,1
8: 41 82 00 24 beq 2c <switch_mm_irqs_off+0x2c>
c: 7c 04 18 40 cmplw r4,r3
10: 81 24 00 24 lwz r9,36(r4)
14: 91 25 04 d0 stw r9,1232(r5)
18: 4d 82 00 20 beqlr
1c: 81 24 00 28 lwz r9,40(r4)
20: 71 29 00 0a andi. r9,r9,10
24: 40 82 00 34 bne 58 <switch_mm_irqs_off+0x58>
28: 48 00 00 00 b 28 <switch_mm_irqs_off+0x28>
28: R_PPC_REL24 switch_mmu_context
2c: 39 24 02 18 addi r9,r4,536
30: 39 40 00 01 li r10,1
34: 7d 00 48 28 lwarx r8,0,r9
38: 7d 08 53 78 or r8,r8,r10
3c: 7d 00 49 2d stwcx. r8,0,r9
40: 40 a2 ff f4 bne 34 <switch_mm_irqs_off+0x34>
44: 7c 04 18 40 cmplw r4,r3
48: 81 24 00 24 lwz r9,36(r4)
4c: 91 25 04 d0 stw r9,1232(r5)
50: 4d 82 00 20 beqlr
54: 48 00 00 00 b 54 <switch_mm_irqs_off+0x54>
54: R_PPC_REL24 switch_mmu_context
58: 2c 03 00 00 cmpwi r3,0
5c: 41 82 ff cc beq 28 <switch_mm_irqs_off+0x28>
60: 48 00 00 00 b 60 <switch_mm_irqs_off+0x60>
60: R_PPC_REL24 switch_mmu_context


In particular, the comparison of 'prev' to 0 is pointless, as both cases end up just branching to
'switch_mmu_context'.

I don't understand all that complexity to just replace a simple 'smp_mb__after_unlock_lock()'.

#define smp_mb__after_unlock_lock() smp_mb()
#define smp_mb() barrier()
# define barrier() __asm__ __volatile__("": : :"memory")


Am I missing some subtlety?

Thanks
Christophe

2021-06-19 02:17:39

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy [email protected] wrote:
[...]
>
> I don't understand all that complexity to just replace a simple
> 'smp_mb__after_unlock_lock()'.
>
> #define smp_mb__after_unlock_lock() smp_mb()
> #define smp_mb() barrier()
> # define barrier() __asm__ __volatile__("": : :"memory")
>
>
> Am I missing some subtlety?

On powerpc CONFIG_SMP, smp_mb() is actually defined as:

#define smp_mb() __smp_mb()
#define __smp_mb() mb()
#define mb() __asm__ __volatile__ ("sync" : : : "memory")

So the original motivation here was to skip a "sync" instruction whenever
switching between threads which are part of the same process. But based on
recent discussions, I suspect my implementation may be inaccurately doing
so though.

Thanks,

Mathieu


>
> Thanks
> Christophe

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
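
For context, the check the series adds on the powerpc switch_mm() path has roughly the
following shape (a simplified paraphrase, not a verbatim quote of
arch/powerpc/include/asm/membarrier.h); it also shows where the comparison of 'prev'
against NULL seen in the disassembly above comes from:

static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
                                             struct mm_struct *next,
                                             struct task_struct *tsk)
{
        /*
         * Skip the expensive "sync" unless switching to a process that
         * registered for private expedited membarrier. The !prev test
         * covers switching away from a kernel thread, where the barrier
         * is already implied by mmdrop().
         */
        if (likely(!(atomic_read(&next->membarrier_state) &
                     MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
                return;

        /*
         * The membarrier system call requires a full memory barrier
         * after storing to rq->curr, before returning to user-space.
         */
        smp_mb();
}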

2021-06-19 10:26:34

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()



On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy [email protected] wrote:
> [...]
>>
>> I don't understand all that complexity to just replace a simple
>> 'smp_mb__after_unlock_lock()'.
>>
>> #define smp_mb__after_unlock_lock() smp_mb()
>> #define smp_mb() barrier()
>> # define barrier() __asm__ __volatile__("": : :"memory")
>>
>>
>> Am I missing some subtlety?
>
> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
>
> #define smp_mb() __smp_mb()
> #define __smp_mb() mb()
> #define mb() __asm__ __volatile__ ("sync" : : : "memory")
>
> So the original motivation here was to skip a "sync" instruction whenever
> switching between threads which are part of the same process. But based on
> recent discussions, I suspect my implementation may be inaccurately doing
> so though.
>

I see.

Then, if you think a 'sync' is a concern, shouldn't we try and remove the forest of 'sync' in the
I/O accessors ?

I can't really understand why we need all those 'sync' and 'isync' and 'twi' around the accesses
whereas I/O memory is usually mapped as 'Guarded' so memory access ordering is already guaranteed.

I'm sure we'll save a lot with that.

Christophe

2021-06-19 20:43:58

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
>
>
> On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
> >----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
> >[email protected] wrote:
> >[...]
> >>
> >>I don't understand all that complexity to just replace a simple
> >>'smp_mb__after_unlock_lock()'.
> >>
> >>#define smp_mb__after_unlock_lock() smp_mb()
> >>#define smp_mb() barrier()
> >># define barrier() __asm__ __volatile__("": : :"memory")
> >>
> >>
> >>Am I missing some subtlety?
> >
> >On powerpc CONFIG_SMP, smp_mb() is actually defined as:
> >
> >#define smp_mb() __smp_mb()
> >#define __smp_mb() mb()
> >#define mb() __asm__ __volatile__ ("sync" : : : "memory")
> >
> >So the original motivation here was to skip a "sync" instruction whenever
> >switching between threads which are part of the same process. But based on
> >recent discussions, I suspect my implementation may be inaccurately doing
> >so though.
> >
>
> I see.
>
> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
> forest of 'sync' in the I/O accessors ?
>
> I can't really understand why we need all those 'sync' and 'isync' and
> 'twi' around the accesses whereas I/O memory is usually mapped as 'Guarded'
> so memory access ordering is already guaranteed.
>
> I'm sure we'll save a lot with that.

The point of the twi in the I/O accessors was to make things easier to
debug if the accesses fail: for the twi insn to complete the load will
have to have completed as well. On a correctly working system you never
should need this (until something fails ;-) )

Without the twi you might need to enforce ordering in some cases still.
The twi is a very heavy hammer, but some of what that gives us is no
doubt actually needed.


Segher

2021-06-21 14:13:34

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()



On 19/06/2021 at 17:02, Segher Boessenkool wrote:
> On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
>>
>>
>> On 18/06/2021 at 19:26, Mathieu Desnoyers wrote:
>>> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
>>> [email protected] wrote:
>>> [...]
>>>>
>>>> I don't understand all that complexity to just replace a simple
>>>> 'smp_mb__after_unlock_lock()'.
>>>>
>>>> #define smp_mb__after_unlock_lock() smp_mb()
>>>> #define smp_mb() barrier()
>>>> # define barrier() __asm__ __volatile__("": : :"memory")
>>>>
>>>>
>>>> Am I missing some subtlety?
>>>
>>> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
>>>
>>> #define smp_mb() __smp_mb()
>>> #define __smp_mb() mb()
>>> #define mb() __asm__ __volatile__ ("sync" : : : "memory")
>>>
>>> So the original motivation here was to skip a "sync" instruction whenever
>>> switching between threads which are part of the same process. But based on
>>> recent discussions, I suspect my implementation may be inaccurately doing
>>> so though.
>>>
>>
>> I see.
>>
>> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
>> forest of 'sync' in the I/O accessors ?
>>
>> I can't really understand why we need all those 'sync' and 'isync' and
>> 'twi' around the accesses whereas I/O memory is usually mapped as 'Guarded'
>> so memory access ordering is already guaranteed.
>>
>> I'm sure we'll save a lot with that.
>
> The point of the twi in the I/O accessors was to make things easier to
> debug if the accesses fail: for the twi insn to complete the load will
> have to have completed as well. On a correctly working system you never
> should need this (until something fails ;-) )
>
> Without the twi you might need to enforce ordering in some cases still.
> The twi is a very heavy hammer, but some of what that gives us is no
> doubt actually needed.
>

Well, I've always been quite perplexed about that. According to the documentation of the 8xx, if a bus
error or something happens on an I/O access, the exception will be accounted on the instruction
which does the access. But based on the following function, I understand that some versions of
powerpc generate the trap on the instruction which was being executed at the time the I/O access
failed, not the instruction that does the access itself?

/*
 * I/O accesses can cause machine checks on powermacs.
 * Check if the NIP corresponds to the address of a sync
 * instruction for which there is an entry in the exception
 * table.
 *  -- paulus.
 */
static inline int check_io_access(struct pt_regs *regs)
{
#ifdef CONFIG_PPC32
        unsigned long msr = regs->msr;
        const struct exception_table_entry *entry;
        unsigned int *nip = (unsigned int *)regs->nip;

        if (((msr & 0xffff0000) == 0 || (msr & (0x80000 | 0x40000)))
            && (entry = search_exception_tables(regs->nip)) != NULL) {
                /*
                 * Check that it's a sync instruction, or somewhere
                 * in the twi; isync; nop sequence that inb/inw/inl uses.
                 * As the address is in the exception table
                 * we should be able to read the instr there.
                 * For the debug message, we look at the preceding
                 * load or store.
                 */
                if (*nip == PPC_INST_NOP)
                        nip -= 2;
                else if (*nip == PPC_INST_ISYNC)
                        --nip;
                if (*nip == PPC_INST_SYNC || (*nip >> 26) == OP_TRAP) {
                        unsigned int rb;

                        --nip;
                        rb = (*nip >> 11) & 0x1f;
                        printk(KERN_DEBUG "%s bad port %lx at %p\n",
                               (*nip & 0x100)? "OUT to": "IN from",
                               regs->gpr[rb] - _IO_BASE, nip);
                        regs->msr |= MSR_RI;
                        regs->nip = extable_fixup(entry);
                        return 1;
                }
        }
#endif /* CONFIG_PPC32 */
        return 0;
}

Am I right ?

It is not only the twi which bothers me in the I/O accessors but also the sync/isync and stuff.

A write typically is

sync
stw

A read is

sync
lwz
twi
isync

Taking into account that HW ordering is guaranteed by the fact that __iomem is guarded, isn't the
'memory' clobber enough as a barrier ?

Thanks
Christophe

2021-06-22 00:32:08

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

Hi!

On Mon, Jun 21, 2021 at 04:11:04PM +0200, Christophe Leroy wrote:
> On 19/06/2021 at 17:02, Segher Boessenkool wrote:
> >The point of the twi in the I/O accessors was to make things easier to
> >debug if the accesses fail: for the twi insn to complete the load will
> >have to have completed as well. On a correctly working system you never
> >should need this (until something fails ;-) )
> >
> >Without the twi you might need to enforce ordering in some cases still.
> >The twi is a very heavy hammer, but some of what that gives us is no
> >doubt actually needed.
>
> Well, I've always been quite perplexed about that. According to the
> documentation of the 8xx, if a bus error or something happens on an I/O
> access, the exception will be accounted on the instruction which does the
> access. But based on the following function, I understand that some versions
> of powerpc generate the trap on the instruction which was being executed
> at the time the I/O access failed, not the instruction that does the access
> itself?

Trap instructions are never speculated (this may not be architectural,
but it is true on all existing implementations). So the instructions
after the twi;isync will not execute until the twi itself has finished,
and that cannot happen before the preceding load has (because it uses
the loaded register).

Now, some I/O accesses can cause machine checks. Machine checks are
asynchronous and can be hard to correlate to specific load insns, and
worse, you may not even have the address loaded from in architected
registers anymore. Since I/O accesses often take *long*, tens or even
hundreds of cycles is not unusual, this can be a challenge.

To recover from machine checks you typically need special debug hardware
and/or software. For the Apple machines those are not so easy to come
by. This "twi after loads" thing made it pretty easy to figure out
where your code was going wrong.

And it isn't as slow as it may sound: typically you really need to have
the result of the load before you can go on do useful work anyway, and
loads from I/O are slow non-posted things.

> /*
> * I/O accesses can cause machine checks on powermacs.
> * Check if the NIP corresponds to the address of a sync
> * instruction for which there is an entry in the exception
> * table.
> * -- paulus.
> */

I suspect this is from before the twi thing was added?

> It is not only the twi which bothers me in the I/O accessors but also the
> sync/isync and stuff.
>
> A write typically is
>
> sync
> stw
>
> A read is
>
> sync
> lwz
> twi
> isync
>
> Taking into account that HW ordering is guaranteed by the fact that __iomem
> is guarded,

Yes. But machine checks are asynchronous :-)

> isn't the 'memory' clobber enough as a barrier ?

A "memory" clobber isn't a barrier of any kind. "Compiler barriers" do
not exist.

The only thing such a clobber does is it tells the compiler that this
inline asm can access some memory, and we do not say at what address.
So the compiler cannot reorder this asm with other memory accesses. It
has no other effects, no magical effects, and it is not comparable to
actual barrier instructions (that actually tell the hardware that some
certain ordering is required).

"Compiler barrier" is a harmful misnomer: language shapes thoughts,
using misleading names causes misguided thoughts.
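
A tiny sketch of the distinction (not taken from the kernel sources):

/* Constrains only the compiler: it may not cache memory contents in
 * registers across this statement nor reorder memory accesses around
 * it. The CPU's ordering is unaffected. */
asm volatile("" : : : "memory");

/* Constrains the compiler in the same way, and additionally emits a
 * powerpc "sync", which orders the accesses as observed by other
 * agents (other CPUs, DMA-capable devices). */
asm volatile("sync" : : : "memory");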

Anyway :-)

The isync is simply to make sure the code after it does not start before
the code before it has completed. As for the sync before it, I am not sure.


Segher