2015-05-04 21:00:39

by Mathieu Desnoyers

Subject: [PATCH v17 for 4.2 0/2] sys_membarrier()

Hi Andrew,

I have taken care of all feedback on this patchset at this point. Can you pick
up these patches for 4.2?

Thanks!

Mathieu

Mathieu Desnoyers (1):
sys_membarrier(): system-wide memory barrier (generic, x86)

Pranith Kumar (1):
selftests: Add membarrier syscall test

MAINTAINERS | 8 ++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/membarrier.h | 53 +++++++++++++++
init/Kconfig | 12 ++++
kernel/Makefile | 1 +
kernel/membarrier.c | 66 +++++++++++++++++++
kernel/sys_ni.c | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/membarrier/.gitignore | 1 +
tools/testing/selftests/membarrier/Makefile | 13 ++++
.../testing/selftests/membarrier/membarrier_test.c | 69 ++++++++++++++++++++
15 files changed, 235 insertions(+), 1 deletions(-)
create mode 100644 include/uapi/linux/membarrier.h
create mode 100644 kernel/membarrier.c
create mode 100644 tools/testing/selftests/membarrier/.gitignore
create mode 100644 tools/testing/selftests/membarrier/Makefile
create mode 100644 tools/testing/selftests/membarrier/membarrier_test.c

--
1.7.7.3


2015-05-04 21:01:05

by Mathieu Desnoyers

Subject: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to distribute
the cost of user-space memory barriers asymmetrically by transforming
pairs of memory barriers into pairs consisting of sys_membarrier() and a
compiler barrier. For synchronization primitives that distinguish
between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
read-side can be accelerated significantly by moving the bulk of the
memory barrier overhead to the write-side.

It is based on kernel v4.1-rc2.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

        Thread A                    Thread B
    previous mem accesses       previous mem accesses
    smp_mb()                    smp_mb()
    following mem accesses      following mem accesses

After the change, these pairs become:

        Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

        Thread A                    Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                                prev mem accesses
                                barrier()
                                follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

        Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
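
As a rough userspace illustration of the pairing above, here is a minimal
sketch of a store/barrier/load step for each thread. Everything named
sketch_* or SKETCH_*, the variables x and y, and the two step functions are
hypothetical and not part of this patch; the command value is assumed to
match MEMBARRIER_CMD_SHARED, and the C11 atomic_signal_fence() stands in for
barrier().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same value as MEMBARRIER_CMD_SHARED in the proposed uapi header. */
#define SKETCH_CMD_SHARED (1 << 0)

static atomic_int x, y;	/* shared variables, both start at 0 */

/* Hypothetical wrapper around the raw system call. */
static long sketch_membarrier(int cmd, int flags)
{
	assert(flags == 0);	/* the patch requires flags to be 0 */
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	errno = ENOSYS;		/* headers too old to know the syscall number */
	return -1;
#endif
}

/* Thread A (infrequent): store x, heavy barrier, load y. */
static int thread_a_step(void)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	if (sketch_membarrier(SKETCH_CMD_SHARED, 0) != 0) {
		/*
		 * No system call: this fence orders Thread A only, so
		 * Thread B would have to go back to a full barrier too.
		 */
		atomic_thread_fence(memory_order_seq_cst);
	}
	return atomic_load_explicit(&y, memory_order_relaxed);
}

/* Thread B (frequent): store y, compiler barrier, load x. */
static int thread_b_step(void)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);	/* plays the role of barrier() */
	return atomic_load_explicit(&x, memory_order_relaxed);
}
```

When the two steps run concurrently and the system call is available, the
intent of the pairing is that at least one of the two loads observes the
other thread's store, as it would with smp_mb() on both sides.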

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such a barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader: 1701557485 reads, 3129842 writes
signal-based scheme: 9825306874 reads, 5386 writes
sys_membarrier: 7992076602 reads, 220 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
with the expedited scheme, we can see that we are close to the read-side
performance of the signal-based scheme. However, this non-expedited
sys_membarrier implementation has a much slower grace period than signal
and memory barrier schemes.
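
To make the "dynamic availability check" concrete, here is a minimal sketch
of how a library could probe for the system call once and then pick the
barrier flavour on each side. It is not liburcu's actual code; all sketch_*
and SKETCH_* names are hypothetical, and the command values are assumed to
match enum membarrier_cmd from this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same values as enum membarrier_cmd in the proposed uapi header. */
#define SKETCH_CMD_QUERY  0
#define SKETCH_CMD_SHARED (1 << 0)

static int sketch_has_membarrier;	/* set once by sketch_init() */

/* Hypothetical wrapper around the raw system call. */
static long sketch_membarrier(int cmd, int flags)
{
	assert(flags == 0);	/* the patch requires flags to be 0 */
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	errno = ENOSYS;
	return -1;
#endif
}

/* Probe once at startup; the answer is documented as stable until reboot. */
static void sketch_init(void)
{
	long mask = sketch_membarrier(SKETCH_CMD_QUERY, 0);

	sketch_has_membarrier = mask > 0 && (mask & SKETCH_CMD_SHARED);
}

/* Read side: the branch on the flag is the extra read-side cost. */
static void sketch_reader_barrier(void)
{
	if (sketch_has_membarrier)
		atomic_signal_fence(memory_order_seq_cst);	/* compiler barrier */
	else
		atomic_thread_fence(memory_order_seq_cst);	/* full barrier */
}

/* Write side: returns 0 on success, -1 if the system call failed. */
static int sketch_writer_barrier(void)
{
	if (sketch_has_membarrier)
		return sketch_membarrier(SKETCH_CMD_SHARED, 0) ? -1 : 0;
	atomic_thread_fence(memory_order_seq_cst);
	return 0;
}
```

Both sides must consult the same flag: readers may only drop to a compiler
barrier when writers really do issue the system call.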

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

membarrier(2) man page:
--------------- snip -------------------
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

NAME
membarrier - issue memory barriers on a set of threads

SYNOPSIS
#include <linux/membarrier.h>

int membarrier(int cmd, int flags);

DESCRIPTION
The cmd argument is one of the following:

MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.

MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all
processes running on the system. This command returns 0.

The flags argument needs to be 0. For future extensions.

All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():

The pair ordering is detailed as (O: ordered, X: not ordered):

                       barrier()   smp_mb()   sys_membarrier()
    barrier()             X           X            O
    smp_mb()              X           O            O
    sys_membarrier()      O           O            O

RETURN VALUE
On success, MEMBARRIER_CMD_QUERY returns a bitmask of the supported
commands and MEMBARRIER_CMD_SHARED returns zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.

ERRORS
ENOSYS System call is not implemented.

EINVAL Invalid arguments.

Linux 2015-04-15 MEMBARRIER(2)
--------------- snip -------------------
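
For readers who want to try the interface described in the man page, here is
a minimal sketch of a support check built on MEMBARRIER_CMD_QUERY. The
sketch_* helpers and SKETCH_* constants are hypothetical names, not part of
this patch, and the constants are assumed to match enum membarrier_cmd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same values as enum membarrier_cmd in the proposed uapi header. */
#define SKETCH_CMD_QUERY  0
#define SKETCH_CMD_SHARED (1 << 0)

/* Hypothetical wrapper around the raw system call. */
static long sketch_membarrier(int cmd, int flags)
{
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, flags);
#else
	(void)cmd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Pure helper: does a MEMBARRIER_CMD_QUERY result advertise cmd? */
static int sketch_mask_has_cmd(long mask, int cmd)
{
	assert(cmd != SKETCH_CMD_QUERY);	/* QUERY is 0 and has no bit of its own */
	return mask >= 0 && (mask & cmd) != 0;
}

/*
 * Returns 1 if cmd can be used, 0 if the kernel does not list it, and
 * -1 with errno set (e.g. ENOSYS or EINVAL) if the query itself failed.
 */
static int sketch_cmd_supported(int cmd)
{
	long mask = sketch_membarrier(SKETCH_CMD_QUERY, 0);

	if (mask < 0)
		return -1;
	return sketch_mask_has_cmd(mask, cmd);
}
```

A caller would typically run sketch_cmd_supported(SKETCH_CMD_SHARED) once at
startup and only then rely on the command.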

[1] http://urcu.so

Changes since v16:
- Update documentation.
- Add man page to changelog.
- Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
to not care about the number of processors on the system. Based on
recommendations from Stephen Hemminger and Steven Rostedt.
- Check that flags argument is 0, update documentation to require it.

Changes since v15:
- Add flags argument in addition to cmd.
- Update documentation.

Changes since v14:
- Take care of Thomas Gleixner's comments.

Changes since v13:
- Move to kernel/membarrier.c.
- Remove MEMBARRIER_PRIVATE flag.
- Add MAINTAINERS file entry.

Changes since v12:
- Remove _FLAG suffix from uapi flags.
- Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
- Remove EXPEDITED mode. Only implement non-expedited for now, until
reading the cpu_curr()->mm can be done without holding the CPU's rq
lock.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process
barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
memory barriers to the scheduler. It implies a potential RoS
(reduction of service) if sys_membarrier() is executed in a busy-loop
by a user, but nothing more than what is already possible with other
existing system calls, but saves memory barriers in the scheduler fast
path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
other architectures.
- Update documentation of the memory barriers in sys_membarrier and
switch_mm().
- Append execution scenarios to the changelog showing the purpose of
each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
to incrementally reserve syscall IDs on other architectures as these
are tested.

Changes since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
and finish_lock_switch(), where they clearly document that all data
protected by the rq lock is guaranteed to have memory barriers issued
between the scheduler update and the task execution. Replacing the
spin lock acquire/release barriers with these memory barriers implies
either no overhead (x86 spinlock atomic instruction already implies a
full mb) or some hopefully small overhead caused by the upgrade of the
spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
standard spinlocks and full memory barriers. Each architecture can
specialize this header following their own need and declare
CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
implementations on a wide range of architectures would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
to the "flags" system call parameter. Past experience with accept4(),
signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
inotify_init1() indicates that this is the kind of thing we want to
plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
non-expedited case. Thanks to Lai Jiangshan for making us consider
seriously using synchronize_sched() to provide the low-overhead
membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
sending an IPI. Ensures that we do not disturb RT tasks in the
presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- simply send-to-many to the mm_cpumask. It contains the list of
processors we have to IPI to (which use the mm), and this mask is
updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
CC: Josh Triplett <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Nicholas Miell <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Alan Cox <[email protected]>
CC: Lai Jiangshan <[email protected]>
CC: Stephen Hemminger <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: David Howells <[email protected]>
CC: "Pranith Kumar" <[email protected]>
CC: Michael Kerrisk <[email protected]>
---
MAINTAINERS | 8 ++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 ++-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/membarrier.h | 53 +++++++++++++++++++++++++++++
init/Kconfig | 12 +++++++
kernel/Makefile | 1 +
kernel/membarrier.c | 66 +++++++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 ++
11 files changed, 151 insertions(+), 1 deletions(-)
create mode 100644 include/uapi/linux/membarrier.h
create mode 100644 kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 781e099..fcb63d4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6370,6 +6370,14 @@ W: http://www.mellanox.com
Q: http://patchwork.ozlabs.org/project/netdev/list/
F: drivers/net/ethernet/mellanox/mlx4/en_*

+MEMBARRIER SUPPORT
+M: Mathieu Desnoyers <[email protected]>
+M: "Paul E. McKenney" <[email protected]>
+L: [email protected]
+S: Supported
+F: kernel/membarrier.c
+F: include/uapi/linux/membarrier.h
+
MEMORY MANAGEMENT
L: [email protected]
W: http://www.linux-mm.org
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index ef8187f..e63ad61 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 membarrier sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 9ef32d5..87f3cd6 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common membarrier sys_membarrier

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..51a9054 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);

+asmlinkage long sys_membarrier(int cmd, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1a0006a..7bcc827 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -250,6 +250,7 @@ header-y += mdio.h
header-y += media.h
header-y += media-bus-format.h
header-y += mei.h
+header-y += membarrier.h
header-y += memfd.h
header-y += mempolicy.h
header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..e0b108b
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,53 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/**
+ * enum membarrier_cmd - membarrier system call command
+ * @MEMBARRIER_CMD_QUERY: Query the set of supported commands. It returns
+ * a bitmask of valid commands.
+ * @MEMBARRIER_CMD_SHARED: Execute a memory barrier on all running threads.
+ * Upon return from system call, the caller thread
+ * is ensured that all running threads have passed
+ * through a state where all memory accesses to
+ * user-space addresses match program order between
+ * entry to and return from the system call
+ * (non-running threads are de facto in such a
+ * state). This covers threads from all processes
+ * running on the system. This command returns 0.
+ *
+ * Command to be passed to the membarrier system call. The commands need to
+ * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
+ * the value 0.
+ */
+enum membarrier_cmd {
+	MEMBARRIER_CMD_QUERY = 0,
+	MEMBARRIER_CMD_SHARED = (1 << 0),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..307e406 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1583,6 +1583,18 @@ config PCI_QUIRKS
bugs/quirks. Disable this only if your target machine is
unaffected by PCI quirks.

+config MEMBARRIER
+	bool "Enable membarrier() system call" if EXPERT
+	default y
+	help
+	  Enable the membarrier() system call that allows issuing memory
+	  barriers across all running threads, which can be used to distribute
+	  the cost of user-space memory barriers asymmetrically by transforming
+	  pairs of memory barriers into pairs consisting of membarrier() and a
+	  compiler barrier.
+
+	  If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 60c302c..05191fd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/membarrier.c b/kernel/membarrier.c
new file mode 100644
index 0000000..a20b279
--- /dev/null
+++ b/kernel/membarrier.c
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2010, 2015 Mathieu Desnoyers <[email protected]>
+ *
+ * membarrier system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>
+
+/*
+ * Bitmask made from a "or" of all commands within enum membarrier_cmd,
+ * except MEMBARRIER_CMD_QUERY.
+ */
+#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+
+/**
+ * sys_membarrier - issue memory barriers on a set of threads
+ * @cmd: Takes command values defined in enum membarrier_cmd.
+ * @flags: Currently needs to be 0. For future extensions.
+ *
+ * If this system call is not implemented, -ENOSYS is returned. If the
+ * command specified does not exist, or if the command argument is invalid,
+ * this system call returns -EINVAL. For a given command, with flags argument
+ * set to 0, this system call is guaranteed to always return the same value
+ * until reboot.
+ *
+ * All memory accesses performed in program order from each targeted thread
+ * is guaranteed to be ordered with respect to sys_membarrier(). If we use
+ * the semantic "barrier()" to represent a compiler barrier forcing memory
+ * accesses to be performed in program order across the barrier, and
+ * smp_mb() to represent explicit memory barriers forcing full memory
+ * ordering across the barrier, we have the following ordering table for
+ * each pair of barrier(), sys_membarrier() and smp_mb():
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                        barrier()   smp_mb()   sys_membarrier()
+ *     barrier()             X           X            O
+ *     smp_mb()              X           O            O
+ *     sys_membarrier()      O           O            O
+ */
+SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
+{
+	if (flags)
+		return -EINVAL;
+	switch (cmd) {
+	case MEMBARRIER_CMD_QUERY:
+		return MEMBARRIER_CMD_BITMASK;
+	case MEMBARRIER_CMD_SHARED:
+		if (num_online_cpus() > 1)
+			synchronize_sched();
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..eb4fde0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -243,3 +243,6 @@ cond_syscall(sys_bpf);

/* execveat */
cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
--
1.7.7.3

2015-05-04 21:00:56

by Mathieu Desnoyers

Subject: [PATCH 2/2] selftests: Add membarrier syscall test

From: Pranith Kumar <[email protected]>

This commit adds a selftest for the membarrier system call.

Signed-off-by: Pranith Kumar <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/membarrier/.gitignore | 1 +
tools/testing/selftests/membarrier/Makefile | 13 ++++
.../testing/selftests/membarrier/membarrier_test.c | 69 ++++++++++++++++++++
4 files changed, 84 insertions(+), 0 deletions(-)
create mode 100644 tools/testing/selftests/membarrier/.gitignore
create mode 100644 tools/testing/selftests/membarrier/Makefile
create mode 100644 tools/testing/selftests/membarrier/membarrier_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 95abddc..73824b1 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -5,6 +5,7 @@ TARGETS += exec
TARGETS += firmware
TARGETS += ftrace
TARGETS += kcmp
+TARGETS += membarrier
TARGETS += memfd
TARGETS += memory-hotplug
TARGETS += mount
diff --git a/tools/testing/selftests/membarrier/.gitignore b/tools/testing/selftests/membarrier/.gitignore
new file mode 100644
index 0000000..020c44f4
--- /dev/null
+++ b/tools/testing/selftests/membarrier/.gitignore
@@ -0,0 +1 @@
+membarrier_test
diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
new file mode 100644
index 0000000..752b719
--- /dev/null
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -g -D_FILE_OFFSET_BITS=64
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+	gcc $(CFLAGS) membarrier_test.c -o membarrier_test
+
+run_tests: all
+	gcc $(CFLAGS) membarrier_test.c -o membarrier_test
+	@./membarrier_test || echo "membarrier_test: [FAIL]"
+
+clean:
+	$(RM) membarrier_test
diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
new file mode 100644
index 0000000..ea285b0
--- /dev/null
+++ b/tools/testing/selftests/membarrier/membarrier_test.c
@@ -0,0 +1,69 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <linux/membarrier.h>
+#include <asm-generic/unistd.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+
+#include "../kselftest.h"
+
+static int sys_membarrier(int cmd, int flags)
+{
+	return syscall(__NR_membarrier, cmd, flags);
+}
+
+static void test_membarrier_fail(void)
+{
+	int cmd = -1, flags = 0;
+
+	if (sys_membarrier(cmd, flags) != -1) {
+		printf("membarrier: Should fail but passed\n");
+		ksft_exit_fail();
+	}
+}
+
+static void test_membarrier_success(void)
+{
+	int flags = 0;
+
+	if (sys_membarrier(MEMBARRIER_CMD_SHARED, flags) != 0) {
+		printf("membarrier: Executing MEMBARRIER failed, %s\n",
+		       strerror(errno));
+		ksft_exit_fail();
+	}
+
+	printf("membarrier: MEMBARRIER_CMD_SHARED success\n");
+}
+
+static void test_membarrier(void)
+{
+	test_membarrier_fail();
+	test_membarrier_success();
+}
+
+static int test_membarrier_exists(void)
+{
+	int flags = 0;
+
+	if (sys_membarrier(MEMBARRIER_CMD_QUERY, flags) < 0)
+		return ksft_exit_fail();
+
+	return 1;
+}
+
+int main(int argc, char **argv)
+{
+	printf("membarrier: MEMBARRIER_CMD_QUERY ");
+	if (test_membarrier_exists()) {
+		printf("syscall implemented\n");
+		test_membarrier();
+	} else
+		printf("syscall not implemented!\n");
+
+	printf("membarrier: tests done!\n");
+
+	return ksft_exit_pass();
+}
--
1.7.7.3

2015-05-04 21:31:07

by Josh Triplett

Subject: Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> * Benchmarks
>
> On Intel Xeon E5405 (8 cores)
> (one thread is calling sys_membarrier, the other 7 threads are busy
> looping)
>
> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>
> * User-space user of this system call: Userspace RCU library
>
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invocation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
>
> Results in liburcu:
>
> Operations in 10s, 6 readers, 2 writers:
>
> memory barriers in reader: 1701557485 reads, 3129842 writes
> signal-based scheme: 9825306874 reads, 5386 writes
> sys_membarrier: 7992076602 reads, 220 writes
>
> The dynamic sys_membarrier availability check adds some overhead to
> the read-side compared to the signal-based scheme, but besides that,
> with the expedited scheme, we can see that we are close to the read-side
> performance of the signal-based scheme. However, this non-expedited
> sys_membarrier implementation has a much slower grace period than signal
> and memory barrier schemes.
>
> An expedited version of this system call can be added later on to speed
> up the grace period. Its implementation will likely depend on reading
> the cpu_curr()->mm without holding each CPU's rq lock.

So, I realize that there's a lot of history tied up in the previous 16
versions and associated mail threads. However, can you please summarize
in the commit message what the benefit of merging this version is?
Because from the text above, from liburcu's perspective, it appears to
be strictly worse in performance than the signal-based scheme.

There are other non-performance reasons why it might make sense to
include this; for instance, signals don't play nice with libraries, with
other processes you might inject yourself into for tracing purposes, or
with general sanity. However, the explanation for those use cases and
how membarrier() improves them needs to go in the commit message, rather
than only in the collective memory and mail archives of people who have
discussed this patch series.

(My apologies if the explanation is in the commit message and
I've just missed it.)

- Josh Triplett

2015-05-05 00:36:51

by Michael Ellerman

Subject: Re: [PATCH 2/2] selftests: Add membarrier syscall test

On Mon, 2015-05-04 at 17:00 -0400, Mathieu Desnoyers wrote:
> From: Pranith Kumar <[email protected]>
>
> This commit adds a selftest for the membarrier system call.
>
> diff --git a/tools/testing/selftests/membarrier/.gitignore b/tools/testing/selftests/membarrier/.gitignore
> new file mode 100644
> index 0000000..020c44f4
> --- /dev/null
> +++ b/tools/testing/selftests/membarrier/.gitignore
> @@ -0,0 +1 @@
> +membarrier_test
> diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
> new file mode 100644
> index 0000000..752b719
> --- /dev/null
> +++ b/tools/testing/selftests/membarrier/Makefile
> @@ -0,0 +1,13 @@
> +CFLAGS += -g -D_FILE_OFFSET_BITS=64
> +CFLAGS += -I../../../../include/uapi/
> +CFLAGS += -I../../../../include/

Don't include the kernel headers, this is userspace.

If you want to include the exported headers that would be good, they are in
../../../../usr/include by default.

> +all:
> + gcc $(CFLAGS) membarrier_test.c -o membarrier_test
> +
> +run_tests: all
> + gcc $(CFLAGS) membarrier_test.c -o membarrier_test
> + @./membarrier_test || echo "membarrier_test: [FAIL]"
> +
> +clean:
> + $(RM) membarrier_test

Can you please use lib.mk, it will do most of this for you, will support cross
compilation, and install support as well.

cheers

Subject: Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

[CC += [email protected]]

Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to [email protected], so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael


On 4 May 2015 at 23:00, Mathieu Desnoyers
<[email protected]> wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads running on the system. It is
> implemented by calling synchronize_sched(). It can be used to distribute
> the cost of user-space memory barriers asymmetrically by transforming
> pairs of memory barriers into pairs consisting of sys_membarrier() and a
> compiler barrier. For synchronization primitives that distinguish
> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
> read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.
>
> It is based on kernel v4.1-rc2.
>
> To explain the benefit of this scheme, let's introduce two example threads:
>
> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> Thread B (frequent, e.g. executing liburcu
> rcu_read_lock()/rcu_read_unlock())
>
> In a scheme where all smp_mb() in thread A are ordering memory accesses
> with respect to smp_mb() present in Thread B, we can change each
> smp_mb() within Thread A into calls to sys_membarrier() and each
> smp_mb() within Thread B into compiler barriers "barrier()".
>
> Before the change, we had, for each smp_mb() pairs:
>
> Thread A Thread B
> previous mem accesses previous mem accesses
> smp_mb() smp_mb()
> following mem accesses following mem accesses
>
> After the change, these pairs become:
>
> Thread A Thread B
> prev mem accesses prev mem accesses
> sys_membarrier() barrier()
> follow mem accesses follow mem accesses
>
> As we can see, there are two possible scenarios: either Thread B memory
> accesses do not happen concurrently with Thread A accesses (1), or they
> do (2).
>
> 1) Non-concurrent Thread A vs Thread B accesses:
>
> Thread A Thread B
> prev mem accesses
> sys_membarrier()
> follow mem accesses
> prev mem accesses
> barrier()
> follow mem accesses
>
> In this case, thread B accesses will be weakly ordered. This is OK,
> because at that point, thread A is not particularly interested in
> ordering them with respect to its own accesses.
>
> 2) Concurrent Thread A vs Thread B accesses
>
> Thread A Thread B
> prev mem accesses prev mem accesses
> sys_membarrier() barrier()
> follow mem accesses follow mem accesses
>
> In this case, thread B accesses, which are ensured to be in program
> order thanks to the compiler barrier, will be "upgraded" to full
> smp_mb() by synchronize_sched().
>
> * Benchmarks
>
> On Intel Xeon E5405 (8 cores)
> (one thread is calling sys_membarrier, the other 7 threads are busy
> looping)
>
> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>
> * User-space user of this system call: Userspace RCU library
>
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invocation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every process
> threads (as we currently do), we diminish the number of unnecessary wake
> ups and only issue the memory barriers on active threads. Non-running
> threads do not need to execute such barrier anyway, because these are
> implied by the scheduler context switches.
>
> Results in liburcu:
>
> Operations in 10s, 6 readers, 2 writers:
>
> memory barriers in reader: 1701557485 reads, 3129842 writes
> signal-based scheme: 9825306874 reads, 5386 writes
> sys_membarrier: 7992076602 reads, 220 writes
>
> The dynamic sys_membarrier availability check adds some overhead to
> the read-side compared to the signal-based scheme, but besides that,
> with the expedited scheme, we can see that we are close to the read-side
> performance of the signal-based scheme. However, this non-expedited
> sys_membarrier implementation has a much slower grace period than signal
> and memory barrier schemes.
>
> An expedited version of this system call can be added later on to speed
> up the grace period. Its implementation will likely depend on reading
> the cpu_curr()->mm without holding each CPU's rq lock.
>
> This patch adds the system call to x86 and to asm-generic.
>
> membarrier(2) man page:
> --------------- snip -------------------
> MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
>
> NAME
> membarrier - issue memory barriers on a set of threads
>
> SYNOPSIS
> #include <linux/membarrier.h>
>
> int membarrier(int cmd, int flags);
>
> DESCRIPTION
> The cmd argument is one of the following:
>
> MEMBARRIER_CMD_QUERY
> Query the set of supported commands. It returns a bitmask of
> supported commands.
>
> MEMBARRIER_CMD_SHARED
> Execute a memory barrier on all threads running on the system.
> Upon return from system call, the caller thread is ensured that
> all running threads have passed through a state where all memory
> accesses to user-space addresses match program order between
> entry to and return from the system call (non-running threads
> are de facto in such a state). This covers threads from all pro‐
> cesses running on the system. This command returns 0.
>
> The flags argument needs to be 0. For future extensions.
>
> All memory accesses performed in program order from each targeted
> thread is guaranteed to be ordered with respect to sys_membarrier(). If
> we use the semantic "barrier()" to represent a compiler barrier forcing
> memory accesses to be performed in program order across the barrier,
> and smp_mb() to represent explicit memory barriers forcing full memory
> ordering across the barrier, we have the following ordering table for
> each pair of barrier(), sys_membarrier() and smp_mb():
>
> The pair ordering is detailed as (O: ordered, X: not ordered):
>
> barrier() smp_mb() sys_membarrier()
> barrier() X X O
> smp_mb() X O O
> sys_membarrier() O O O
>
> RETURN VALUE
> On success, these system calls return zero. On error, -1 is returned,
> and errno is set appropriately. For a given command, with flags
> argument set to 0, this system call is guaranteed to always return the
> same value until reboot.
>
> ERRORS
> ENOSYS System call is not implemented.
>
> EINVAL Invalid arguments.
>
> Linux 2015-04-15 MEMBARRIER(2)
> --------------- snip -------------------
>
> [1] http://urcu.so
>
> Changes since v16:
> - Update documentation.
> - Add man page to changelog.
> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
> to not care about the number of processors on the system. Based on
> recommendations from Stephen Hemminger and Steven Rostedt.
> - Check that flags argument is 0, update documentation to require it.
>
> Changes since v15:
> - Add flags argument in addition to cmd.
> - Update documentation.
>
> Changes since v14:
> - Take care of Thomas Gleixner's comments.
>
> Changes since v13:
> - Move to kernel/membarrier.c.
> - Remove MEMBARRIER_PRIVATE flag.
> - Add MAINTAINERS file entry.
>
> Changes since v12:
> - Remove _FLAG suffix from uapi flags.
> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
> - Remove EXPEDITED mode. Only implement non-expedited for now, until
> reading the cpu_curr()->mm can be done without holding the CPU's rq
> lock.
>
> Changes since v11:
> - 5 years have passed.
> - Rebase on v3.19 kernel.
> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
> barriers, non-private for memory mappings shared between processes.
> - Simplify user API.
> - Code refactoring.
>
> Changes since v10:
> - Apply Randy's comments.
> - Rebase on 2.6.34-rc4 -tip.
>
> Changes since v9:
> - Clean up #ifdef CONFIG_SMP.
>
> Changes since v8:
> - Go back to rq spin locks taken by sys_membarrier() rather than adding
> memory barriers to the scheduler. It implies a potential RoS
> (reduction of service) if sys_membarrier() is executed in a busy-loop
> by a user, but nothing more than what is already possible with other
> existing system calls, but saves memory barriers in the scheduler fast
> path.
> - re-add the memory barrier comments to x86 switch_mm() as an example to
> other architectures.
> - Update documentation of the memory barriers in sys_membarrier and
> switch_mm().
> - Append execution scenarios to the changelog showing the purpose of
> each memory barrier.
>
> Changes since v7:
> - Move spinlock-mb and scheduler related changes to separate patches.
> - Add support for sys_membarrier on x86_32.
> - Only x86 32/64 system calls are reserved in this patch. It is planned
> to incrementally reserve syscall IDs on other architectures as these
> are tested.
>
> Changes since v6:
> - Remove some unlikely() not so unlikely.
> - Add the proper scheduler memory barriers needed to only use the RCU
> read lock in sys_membarrier rather than take each runqueue spinlock:
> - Move memory barriers from per-architecture switch_mm() to schedule()
> and finish_lock_switch(), where they clearly document that all data
> protected by the rq lock is guaranteed to have memory barriers issued
> between the scheduler update and the task execution. Replacing the
> spin lock acquire/release barriers with these memory barriers imply
> either no overhead (x86 spinlock atomic instruction already implies a
> full mb) or some hopefully small overhead caused by the upgrade of the
> spinlock acquire/release barriers to more heavyweight smp_mb().
> - The "generic" version of spinlock-mb.h declares both a mapping to
> standard spinlocks and full memory barriers. Each architecture can
> specialize this header following their own need and declare
> CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
> implementations on a wide range of architecture would be welcome.
>
> Changes since v5:
> - Plan ahead for extensibility by introducing mandatory/optional masks
> to the "flags" system call parameter. Past experience with accept4(),
> signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
> inotify_init1() indicates that this is the kind of thing we want to
> plan for. Return -EINVAL if the mandatory flags received are unknown.
> - Create include/linux/membarrier.h to define these flags.
> - Add MEMBARRIER_QUERY optional flag.
>
> Changes since v4:
> - Add "int expedited" parameter, use synchronize_sched() in the
> non-expedited case. Thanks to Lai Jiangshan for making us consider
> seriously using synchronize_sched() to provide the low-overhead
> membarrier scheme.
> - Check num_online_cpus() == 1, quickly return without doing nothing.
>
> Changes since v3a:
> - Confirm that each CPU indeed runs the current task's ->mm before
> sending an IPI. Ensures that we do not disturb RT tasks in the
> presence of lazy TLB shootdown.
> - Document memory barriers needed in switch_mm().
> - Surround helper functions with #ifdef CONFIG_SMP.
>
> Changes since v2:
> - simply send-to-many to the mm_cpumask. It contains the list of
> processors we have to IPI to (which use the mm), and this mask is
> updated atomically.
>
> Changes since v1:
> - Only perform the IPI in CONFIG_SMP.
> - Only perform the IPI if the process has more than one thread.
> - Only send IPIs to CPUs involved with threads belonging to our process.
> - Adaptative IPI scheme (single vs many IPI with threshold).
> - Issue smp_mb() at the beginning and end of the system call.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Reviewed-by: Paul E. McKenney <[email protected]>
> CC: Josh Triplett <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Nicholas Miell <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Alan Cox <[email protected]>
> CC: Lai Jiangshan <[email protected]>
> CC: Stephen Hemminger <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: David Howells <[email protected]>
> CC: "Pranith Kumar" <[email protected]>
> CC: Michael Kerrisk <[email protected]>
> ---
> MAINTAINERS | 8 ++++
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 2 +
> include/uapi/asm-generic/unistd.h | 4 ++-
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/membarrier.h | 53 +++++++++++++++++++++++++++++
> init/Kconfig | 12 +++++++
> kernel/Makefile | 1 +
> kernel/membarrier.c | 66 +++++++++++++++++++++++++++++++++++++
> kernel/sys_ni.c | 3 ++
> 11 files changed, 151 insertions(+), 1 deletions(-)
> create mode 100644 include/uapi/linux/membarrier.h
> create mode 100644 kernel/membarrier.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 781e099..fcb63d4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6370,6 +6370,14 @@ W: http://www.mellanox.com
> Q: http://patchwork.ozlabs.org/project/netdev/list/
> F: drivers/net/ethernet/mellanox/mlx4/en_*
>
> +MEMBARRIER SUPPORT
> +M: Mathieu Desnoyers <[email protected]>
> +M: "Paul E. McKenney" <[email protected]>
> +L: [email protected]
> +S: Supported
> +F: kernel/membarrier.c
> +F: include/uapi/linux/membarrier.h
> +
> MEMORY MANAGEMENT
> L: [email protected]
> W: http://www.linux-mm.org
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index ef8187f..e63ad61 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -365,3 +365,4 @@
> 356 i386 memfd_create sys_memfd_create
> 357 i386 bpf sys_bpf
> 358 i386 execveat sys_execveat stub32_execveat
> +359 i386 membarrier sys_membarrier
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 9ef32d5..87f3cd6 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
> 320 common kexec_file_load sys_kexec_file_load
> 321 common bpf sys_bpf
> 322 64 execveat stub_execveat
> +323 common membarrier sys_membarrier
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 76d1e38..51a9054 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
> const char __user *const __user *argv,
> const char __user *const __user *envp, int flags);
>
> +asmlinkage long sys_membarrier(int cmd, int flags);
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index e016bd9..8da542a 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
> __SYSCALL(__NR_bpf, sys_bpf)
> #define __NR_execveat 281
> __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> +#define __NR_membarrier 282
> +__SYSCALL(__NR_membarrier, sys_membarrier)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 282
> +#define __NR_syscalls 283
>
> /*
> * All syscalls below here should go away really,
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 1a0006a..7bcc827 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -250,6 +250,7 @@ header-y += mdio.h
> header-y += media.h
> header-y += media-bus-format.h
> header-y += mei.h
> +header-y += membarrier.h
> header-y += memfd.h
> header-y += mempolicy.h
> header-y += meye.h
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> new file mode 100644
> index 0000000..e0b108b
> --- /dev/null
> +++ b/include/uapi/linux/membarrier.h
> @@ -0,0 +1,53 @@
> +#ifndef _UAPI_LINUX_MEMBARRIER_H
> +#define _UAPI_LINUX_MEMBARRIER_H
> +
> +/*
> + * linux/membarrier.h
> + *
> + * membarrier system call API
> + *
> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <[email protected]>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +/**
> + * enum membarrier_cmd - membarrier system call command
> + * @MEMBARRIER_CMD_QUERY: Query the set of supported commands. It returns
> + * a bitmask of valid commands.
> + * @MEMBARRIER_CMD_SHARED: Execute a memory barrier on all running threads.
> + * Upon return from system call, the caller thread
> + * is ensured that all running threads have passed
> + * through a state where all memory accesses to
> + * user-space addresses match program order between
> + * entry to and return from the system call
> + * (non-running threads are de facto in such a
> + * state). This covers threads from all processes
> + * running on the system. This command returns 0.
> + *
> + * Command to be passed to the membarrier system call. The commands need to
> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> + * the value 0.
> + */
> +enum membarrier_cmd {
> + MEMBARRIER_CMD_QUERY = 0,
> + MEMBARRIER_CMD_SHARED = (1 << 0),
> +};
> +
> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index dc24dec..307e406 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1583,6 +1583,18 @@ config PCI_QUIRKS
> bugs/quirks. Disable this only if your target machine is
> unaffected by PCI quirks.
>
> +config MEMBARRIER
> + bool "Enable membarrier() system call" if EXPERT
> + default y
> + help
> + Enable the membarrier() system call that allows issuing memory
> + barriers across all running threads, which can be used to distribute
> + the cost of user-space memory barriers asymmetrically by transforming
> + pairs of memory barriers into pairs consisting of membarrier() and a
> + compiler barrier.
> +
> + If unsure, say Y.
> +
> config EMBEDDED
> bool "Embedded system"
> option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 60c302c..05191fd 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
> obj-$(CONFIG_TORTURE_TEST) += torture.o
> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>
> $(obj)/configs.o: $(obj)/config_data.h
>
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> new file mode 100644
> index 0000000..a20b279
> --- /dev/null
> +++ b/kernel/membarrier.c
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <[email protected]>
> + *
> + * membarrier system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/membarrier.h>
> +
> +/*
> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> + * except MEMBARRIER_CMD_QUERY.
> + */
> +#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
> +
> +/**
> + * sys_membarrier - issue memory barriers on a set of threads
> + * @cmd: Takes command values defined in enum membarrier_cmd.
> + * @flags: Currently needs to be 0. For future extensions.
> + *
> + * If this system call is not implemented, -ENOSYS is returned. If the
> + * command specified does not exist, or if the command argument is invalid,
> + * this system call returns -EINVAL. For a given command, with flags argument
> + * set to 0, this system call is guaranteed to always return the same value
> + * until reboot.
> + *
> + * All memory accesses performed in program order from each targeted thread
> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
> + * the semantic "barrier()" to represent a compiler barrier forcing memory
> + * accesses to be performed in program order across the barrier, and
> + * smp_mb() to represent explicit memory barriers forcing full memory
> + * ordering across the barrier, we have the following ordering table for
> + * each pair of barrier(), sys_membarrier() and smp_mb():
> + *
> + * The pair ordering is detailed as (O: ordered, X: not ordered):
> + *
> + * barrier() smp_mb() sys_membarrier()
> + * barrier() X X O
> + * smp_mb() X O O
> + * sys_membarrier() O O O
> + */
> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> +{
> + if (flags)
> + return -EINVAL;
> + switch (cmd) {
> + case MEMBARRIER_CMD_QUERY:
> + return MEMBARRIER_CMD_BITMASK;
> + case MEMBARRIER_CMD_SHARED:
> + if (num_online_cpus() > 1)
> + synchronize_sched();
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 7995ef5..eb4fde0 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>
> /* execveat */
> cond_syscall(sys_execveat);
> +
> +/* membarrier */
> +cond_syscall(sys_membarrier);
> --
> 1.7.7.3
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2015-05-05 18:25:14

by Mathieu Desnoyers

Subject: Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

----- Original Message -----
> On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> > * Benchmarks
> >
> > On Intel Xeon E5405 (8 cores)
> > (one thread is calling sys_membarrier, the other 7 threads are busy
> > looping)
> >
> > 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
> >
> > * User-space user of this system call: Userspace RCU library
> >
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invocation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > threads (as we currently do), we diminish the number of unnecessary wake
> > ups and only issue the memory barriers on active threads. Non-running
> > threads do not need to execute such barrier anyway, because these are
> > implied by the scheduler context switches.
> >
> > Results in liburcu:
> >
> > Operations in 10s, 6 readers, 2 writers:
> >
> > memory barriers in reader: 1701557485 reads, 3129842 writes
> > signal-based scheme: 9825306874 reads, 5386 writes
> > sys_membarrier: 7992076602 reads, 220 writes
> >
> > The dynamic sys_membarrier availability check adds some overhead to
> > the read-side compared to the signal-based scheme, but besides that,
> > with the expedited scheme, we can see that we are close to the read-side
> > performance of the signal-based scheme. However, this non-expedited
> > sys_membarrier implementation has a much slower grace period than signal
> > and memory barrier schemes.
> >
> > An expedited version of this system call can be added later on to speed
> > up the grace period. Its implementation will likely depend on reading
> > the cpu_curr()->mm without holding each CPU's rq lock.
>
> So, I realize that there's a lot of history tied up in the previous 16
> versions and associated mail threads. However, can you please summarize
> in the commit message what the benefit of merging this version is?
> Because from the text above, from liburcu's perspective, it appears to
> be strictly worse in performance than the signal-based scheme.
>
> There are other non-performance reasons why it might make sense to
> include this; for instance, signals don't play nice with libraries, with
> other processes you might inject yourself into for tracing purposes, or
> with general sanity. However, the explanation for those use cases and
> how membarrier() improves them needs to go in the commit message, rather
> than only in the collective memory and mail archives of people who have
> discussed this patch series.
>
> (My apologies if the explanation is in the commit message and
> I've just missed it.)

I will add info about signals vs libraries, which appears to be missing
from the commit message:

"Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application."

The commit message already points out that sys_membarrier diminishes the
number of unnecessary wake-ups sent to other threads compared to the
signal-based approach.

I re-ran those tests on urcu master branch with a slightly modified
version of the sys_membarrier scheme too: a version which assumes that
sys_membarrier is always available. We can then compare apples to
apples performance-wise between signal and membarrier approaches:

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader: 1701557485 reads, 3129842 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes

It shows that the sys_membarrier read-side actually performs slightly
better than the signal-based scheme (about 1.2% more reads) in the
absence of the dynamic check for syscall availability, while the dynamic
check costs roughly 20% of the read-side throughput. This could be
enhanced in userspace eventually if we decide to implement
self-modifying code upon feature detection in liburcu. I'll update the
commit message with this new table.
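
To make the "dyn. check" row concrete, here is a minimal sketch of what
such an availability probe could look like in userspace. It is not the
liburcu code. The helper names (membarrier_call, membarrier_available,
has_membarrier) and the SKETCH_* constants are hypothetical; the command
values are copied from the enum in this patch, and the probe assumes that
MEMBARRIER_CMD_QUERY returns a bitmask of supported commands as the man
page above describes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Command values mirrored from enum membarrier_cmd in this patch. */
enum {
	SKETCH_CMD_QUERY  = 0,
	SKETCH_CMD_SHARED = (1 << 0),
};

/*
 * Hypothetical thin wrapper around the raw system call. Returns -1
 * with errno set on failure. When the installed headers do not know
 * the syscall number, behave as if the kernel returned ENOSYS.
 */
static int membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
	return (int) syscall(__NR_membarrier, cmd, flags);
#else
	(void) cmd;
	(void) flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Cached probe result: -1 = not probed yet, 0 = unavailable, 1 = available. */
static int has_membarrier = -1;

/*
 * Probe once with the query command, then reuse the cached answer.
 * Caching is safe because the syscall is documented to return the
 * same value for a given command until reboot.
 */
static int membarrier_available(void)
{
	if (has_membarrier < 0) {
		int mask = membarrier_call(SKETCH_CMD_QUERY, 0);

		has_membarrier = (mask >= 0 && (mask & SKETCH_CMD_SHARED)) ? 1 : 0;
	}
	assert(has_membarrier == 0 || has_membarrier == 1);
	return has_membarrier;
}
```

A read-side primitive that has to test a flag like has_membarrier on
every invocation pays for that load and branch each time, which is the
kind of overhead the last row of the table reflects. This sketch is not
thread-safe as written; a real library would presumably run the probe
once at initialization.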

Thanks!

Mathieu

>
> - Josh Triplett
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2015-05-05 23:11:47

by Josh Triplett

Subject: Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

On Tue, May 05, 2015 at 06:25:12PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> > > * Benchmarks
> > >
> > > On Intel Xeon E5405 (8 cores)
> > > (one thread is calling sys_membarrier, the other 7 threads are busy
> > > looping)
> > >
> > > 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
> > >
> > > * User-space user of this system call: Userspace RCU library
> > >
> > > Both the signal-based and the sys_membarrier userspace RCU schemes
> > > permit us to remove the memory barrier from the userspace RCU
> > > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > > accelerating them. These memory barriers are replaced by compiler
> > > barriers on the read-side, and all matching memory barriers on the
> > > write-side are turned into an invocation of a memory barrier on all
> > > active threads in the process. By letting the kernel perform this
> > > synchronization rather than dumbly sending a signal to every process
> > > threads (as we currently do), we diminish the number of unnecessary wake
> > > ups and only issue the memory barriers on active threads. Non-running
> > > threads do not need to execute such barrier anyway, because these are
> > > implied by the scheduler context switches.
> > >
> > > Results in liburcu:
> > >
> > > Operations in 10s, 6 readers, 2 writers:
> > >
> > > memory barriers in reader: 1701557485 reads, 3129842 writes
> > > signal-based scheme: 9825306874 reads, 5386 writes
> > > sys_membarrier: 7992076602 reads, 220 writes
> > >
> > > The dynamic sys_membarrier availability check adds some overhead to
> > > the read-side compared to the signal-based scheme, but besides that,
> > > with the expedited scheme, we can see that we are close to the read-side
> > > performance of the signal-based scheme. However, this non-expedited
> > > sys_membarrier implementation has a much slower grace period than signal
> > > and memory barrier schemes.
> > >
> > > An expedited version of this system call can be added later on to speed
> > > up the grace period. Its implementation will likely depend on reading
> > > the cpu_curr()->mm without holding each CPU's rq lock.
> >
> > So, I realize that there's a lot of history tied up in the previous 16
> > versions and associated mail threads. However, can you please summarize
> > in the commit message what the benefit of merging this version is?
> > Because from the text above, from liburcu's perspective, it appears to
> > be strictly worse in performance than the signal-based scheme.
> >
> > There are other non-performance reasons why it might make sense to
> > include this; for instance, signals don't play nice with libraries, with
> > other processes you might inject yourself into for tracing purposes, or
> > with general sanity. However, the explanation for those use cases and
> > how membarrier() improves them needs to go in the commit message, rather
> > than only in the collective memory and mail archives of people who have
> > discussed this patch series.
> >
> > (My apologies if the explanation is in the commit message and
> > I've just missed it.)
>
> I will add info about signals vs libraries, which appears to be missing
> from the commit message:
>
> "Besides diminishing the number of wake-ups, one major advantage of the
> membarrier system call over the signal-based scheme is that it does not
> need to reserve a signal. This plays much more nicely with libraries,
> and with processes injected into for tracing purposes, for which we
> cannot expect that signals will be unused by the application."
>
> The commit message already point out that sys_membarrier diminishes the
> number of unnecessary wake-ups sent to other threads compared to the
> signal-based approach.
>
> I re-ran those tests on urcu master branch with a slightly modified
> version of the sys_membarrier scheme too: a version which assumes that
> sys_membarrier is always available. We can then compare apples to
> apples performance-wise between signal and membarrier approaches:
>
> Results in liburcu:
>
> Operations in 10s, 6 readers, 2 writers:
>
> memory barriers in reader: 1701557485 reads, 3129842 writes
> signal-based scheme: 9830061167 reads, 6700 writes
> sys_membarrier: 9952759104 reads, 425 writes
> sys_membarrier (dyn. check): 7970328887 reads, 425 writes
>
> It shows that sys_membarrier read-side actually performs slightly
> better than the signal-based scheme, in the absence of dynamic
> check for syscall availability. This could be enhanced in userspace
> eventually if we decide to implement self-modifying code upon
> feature detection in liburcu. I'll update the commit message with
> this new table.

That's *much* better, thank you.

- Josh Triplett

2015-05-06 19:08:07

by Mathieu Desnoyers

Subject: Re: [PATCH 2/2] selftests: Add membarrier syscall test

----- Original Message -----
> On Mon, 2015-05-04 at 17:00 -0400, Mathieu Desnoyers wrote:
> > From: Pranith Kumar <[email protected]>
> >
> > This commit adds a selftest for the membarrier system call.
> >
> > diff --git a/tools/testing/selftests/membarrier/.gitignore
> > b/tools/testing/selftests/membarrier/.gitignore
> > new file mode 100644
> > index 0000000..020c44f4
> > --- /dev/null
> > +++ b/tools/testing/selftests/membarrier/.gitignore
> > @@ -0,0 +1 @@
> > +membarrier_test
> > diff --git a/tools/testing/selftests/membarrier/Makefile
> > b/tools/testing/selftests/membarrier/Makefile
> > new file mode 100644
> > index 0000000..752b719
> > --- /dev/null
> > +++ b/tools/testing/selftests/membarrier/Makefile
> > @@ -0,0 +1,13 @@
> > +CFLAGS += -g -D_FILE_OFFSET_BITS=64
> > +CFLAGS += -I../../../../include/uapi/
> > +CFLAGS += -I../../../../include/
>
> Don't include the kernel headers, this is userspace.
>
> If you want to include the exported headers that would be good, they are in
> ../../../../usr/include by default.
>
> > +all:
> > + gcc $(CFLAGS) membarrier_test.c -o membarrier_test
> > +
> > +run_tests: all
> > + gcc $(CFLAGS) membarrier_test.c -o membarrier_test
> > + @./membarrier_test || echo "membarrier_test: [FAIL]"
> > +
> > +clean:
> > + $(RM) membarrier_test
>
> Can you please use lib.mk, it will do most of this for you, will support
> cross
> compilation, and install support as well.

All right, Pranith sent me an updated patch handling those issues,
and I queued one extra patch updating the selftest on top of his
work. I'll CC you on the next round for those 2 patches.

Thanks!

Mathieu

>
> cheers
>
>
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com