2017-07-27 21:13:32

by Mathieu Desnoyers

Subject: [RFC PATCH v2] membarrier: expedited private command

Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
built from all runqueues whose current thread's mm is the same as the
mm of the thread calling sys_membarrier.

Scheduler-wise, it requires that we add a memory barrier after context
switching between processes (which have different mm). Interestingly,
there is already a memory barrier in mmdrop(), so we only need to add
a barrier when switching from a kernel thread to a userspace thread.
We also don't need to add the barrier when switching to a kernel thread,
because it has no userspace memory mapping, which makes ordering of
user-space memory accesses pretty much useless.
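
For reference, a user-space caller is expected to invoke the new command
roughly as follows (minimal sketch, not part of this patch; error
handling trimmed):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (cmds < 0 || !(cmds & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
		fprintf(stderr, "private expedited membarrier unsupported\n");
		return 1;
	}
	/* Memory barrier on every running thread of this process. */
	return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
}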

* Benchmark

A stress-test benchmark of sched pipe shows that it does not add
significant overhead to the scheduler switching between processes:

100 runs of:

taskset 01 ./perf bench sched pipe

Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes

Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

A) With 4.13.0-rc2+
at commit a97fb594bc7d ("virtio-net: fix module unloading")

avg.: 2.923 usecs/op
std.dev: 0.057 usecs/op

B) With this commit:

avg.: 2.916 usecs/op
std.dev: 0.043 usecs/op

Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
where we switch from a user-space thread is already handled by
the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
memory barrier.

CC: Peter Zijlstra <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Maged Michael <[email protected]>
CC: [email protected]
CC: Avi Kivity <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
---
MAINTAINERS | 2 +-
include/linux/sched/mm.h | 5 +++
include/uapi/linux/membarrier.h | 23 +++++++++++--
kernel/Makefile | 1 -
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 27 ++++++++++++++++
kernel/{ => sched}/membarrier.c | 72 ++++++++++++++++++++++++++++++++++++++++-
7 files changed, 126 insertions(+), 5 deletions(-)
rename kernel/{ => sched}/membarrier.c (59%)

diff --git a/MAINTAINERS b/MAINTAINERS
index f66488dfdbc9..3b035584272f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8621,7 +8621,7 @@ M: Mathieu Desnoyers <[email protected]>
M: "Paul E. McKenney" <[email protected]>
L: [email protected]
S: Supported
-F: kernel/membarrier.c
+F: kernel/sched/membarrier.c
F: include/uapi/linux/membarrier.h

MEMORY MANAGEMENT
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2b24a6974847..5c5384d9ae0f 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -38,6 +38,11 @@ static inline void mmgrab(struct mm_struct *mm)
extern void __mmdrop(struct mm_struct *);
static inline void mmdrop(struct mm_struct *mm)
{
+ /*
+ * Implicit full memory barrier provided by
+ * atomic_dec_and_test() is required by membarrier. See comments
+ * around membarrier_expedited_mb_after_set_current().
+ */
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..6d47b3249d8a 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,14 +40,33 @@
* (non-running threads are de facto in such a
* state). This covers threads from all processes
* running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+ * Execute a memory barrier on each running
+ * thread belonging to the same process as the current
+ * thread. Upon return from system call, the
+ * caller thread is ensured that all its running
+ * thread siblings have passed through a state
+ * where all memory accesses to user-space
+ * addresses match program order between entry
+ * to and return from the system call
+ * (non-running threads are de facto in such a
+ * state). This only covers threads from the
+ * same process as the caller thread. This
+ * command returns 0. The "expedited" commands
+ * complete faster than the non-expedited ones,
+ * never block, but have the downside of
+ * causing extra overhead.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
* the value 0.
*/
enum membarrier_cmd {
- MEMBARRIER_CMD_QUERY = 0,
- MEMBARRIER_CMD_SHARED = (1 << 0),
+ MEMBARRIER_CMD_QUERY = 0,
+ MEMBARRIER_CMD_SHARED = (1 << 0),
+ /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
+ /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
};

#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..9c323a6daa46 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
-obj-$(CONFIG_MEMBARRIER) += membarrier.o

obj-$(CONFIG_HAS_IOMEM) += memremap.o

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 53f0164ed362..78f54932ea1d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
+obj-$(CONFIG_MEMBARRIER) += membarrier.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b427b4..01e3b881ab3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2724,6 +2724,32 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
put_user(task_pid_vnr(current), current->set_child_tid);
}

+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
+ struct mm_struct *oldmm)
+{
+ if (!IS_ENABLED(CONFIG_MEMBARRIER))
+ return;
+ /*
+ * __schedule()->
+ * finish_task_switch()->
+ * if (mm)
+ * mmdrop(mm) ->
+ * atomic_dec_and_test()
+ * takes care of issuing a memory barrier when oldmm is
+ * non-NULL. We also don't need the barrier when switching to a
+ * kernel thread, nor when we switch between threads belonging
+ * to the same process.
+ */
+ if (likely(oldmm || !mm || mm == oldmm))
+ return;
+ /*
+ * When switching between processes, membarrier expedited
+ * private requires a memory barrier after we set the current
+ * task.
+ */
+ smp_mb();
+}
+
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
@@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

mm = next->mm;
oldmm = prev->active_mm;
+ membarrier_expedited_mb_after_set_current(mm, oldmm);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
diff --git a/kernel/membarrier.c b/kernel/sched/membarrier.c
similarity index 59%
rename from kernel/membarrier.c
rename to kernel/sched/membarrier.c
index 9f9284f37f8d..f80828b0b607 100644
--- a/kernel/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -17,12 +17,79 @@
#include <linux/syscalls.h>
#include <linux/membarrier.h>
#include <linux/tick.h>
+#include <linux/cpumask.h>
+
+#include "sched.h" /* for cpu_rq(). */

/*
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK \
+ (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+
+static void ipi_mb(void *info)
+{
+ smp_mb(); /* IPIs should be serializing but paranoid. */
+}
+
+static void membarrier_private_expedited(void)
+{
+ int cpu, this_cpu;
+ bool fallback = false;
+ cpumask_var_t tmpmask;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ /*
+ * Matches memory barriers around rq->curr modification in
+ * scheduler.
+ */
+ smp_mb(); /* system call entry is not a mb. */
+
+ if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+ /* Fallback for OOM. */
+ fallback = true;
+ }
+
+ /*
+ * Skipping the current CPU is OK even though we can be
+ * migrated at any point. The current CPU, at the point where we
+ * read raw_smp_processor_id(), is ensured to be in program
+ * order with respect to the caller thread. Therefore, we can
+ * skip this CPU from the iteration.
+ */
+ this_cpu = raw_smp_processor_id();
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ if (cpu == this_cpu)
+ continue;
+ rcu_read_lock();
+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ if (p && p->mm == current->mm) {
+ if (!fallback)
+ __cpumask_set_cpu(cpu, tmpmask);
+ else
+ smp_call_function_single(cpu, ipi_mb, NULL, 1);
+ }
+ rcu_read_unlock();
+ }
+ cpus_read_unlock();
+ if (!fallback) {
+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ free_cpumask_var(tmpmask);
+ }
+
+ /*
+ * Memory barrier on the caller thread _after_ we finished
+ * waiting for the last IPI. Matches memory barriers around
+ * rq->curr modification in scheduler.
+ */
+ smp_mb(); /* exit from system call is not a mb */
+}

/**
* sys_membarrier - issue memory barriers on a set of threads
@@ -64,6 +131,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
if (num_online_cpus() > 1)
synchronize_sched();
return 0;
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+ membarrier_private_expedited();
+ return 0;
default:
return -EINVAL;
}
--
2.11.0


2017-07-27 22:14:03

by Paul E. McKenney

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
> built from all runqueues whose current thread's mm is the same as the
> mm of the thread calling sys_membarrier.
>
> Scheduler-wise, it requires that we add a memory barrier after context
> switching between processes (which have different mm). Interestingly,
> there is already a memory barrier in mmdrop(), so we only need to add
> a barrier when switching from a kernel thread to a userspace thread.
> We also don't need to add the barrier when switching to a kernel thread,
> because it has no userspace memory mapping, which makes ordering of
> user-space memory accesses pretty much useless.
>
> * Benchmark
>
> A stress-test benchmark of sched pipe shows that it does not add
> significant overhead to the scheduler switching between processes:
>
> 100 runs of:
>
> taskset 01 ./perf bench sched pipe
>
> Running 'sched/pipe' benchmark:
> Executed 1000000 pipe operations between two processes
>
> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>
> A) With 4.13.0-rc2+
> at commit a97fb594bc7d ("virtio-net: fix module unloading")
>
> avg.: 2.923 usecs/op
> std.dev: 0.057 usecs/op
>
> B) With this commit:
>
> avg.: 2.916 usecs/op
> std.dev: 0.043 usecs/op
>
> Changes since v1:
> - move membarrier code under kernel/sched/ because it uses the
> scheduler runqueue,
> - only add the barrier when we switch from a kernel thread. The case
> where we switch from a user-space thread is already handled by
> the atomic_dec_and_test() in mmdrop().
> - add a comment to mmdrop() documenting the requirement on the implicit
> memory barrier.
>
> CC: Peter Zijlstra <[email protected]>
> CC: Paul E. McKenney <[email protected]>
> CC: Boqun Feng <[email protected]>
> CC: Andrew Hunter <[email protected]>
> CC: Maged Michael <[email protected]>
> CC: [email protected]
> CC: Avi Kivity <[email protected]>
> Signed-off-by: Mathieu Desnoyers <[email protected]>

Looks much better, thank you!

I have queued this in place of my earlier patch for the moment. If there
are no objections, I will push this into the upcoming v4.14 merge window.
If someone else wants to push it into v4.14, I am of course fine with
that, and am happy to give it a reviewed-by along the way. But if there
are objections to your patch (suitably modified based on additional
review and testing, of course) going into v4.14, I can always fall back
to pushing my earlier simpler but less housebroken patch. ;-)

Thanx, Paul


2017-07-27 22:39:20

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 27, 2017, at 6:13 PM, Paul E. McKenney [email protected] wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
>> built from all runqueues whose current thread's mm is the same as the
>> mm of the thread calling sys_membarrier.
>>
>> Scheduler-wise, it requires that we add a memory barrier after context
>> switching between processes (which have different mm). Interestingly,
>> there is already a memory barrier in mmdrop(), so we only need to add
>> a barrier when switching from a kernel thread to a userspace thread.
>> We also don't need to add the barrier when switching to a kernel thread,
>> because it has no userspace memory mapping, which makes ordering of
>> user-space memory accesses pretty much useless.
>>
>> * Benchmark
>>
>> A stress-test benchmark of sched pipe shows that it does not add
>> significant overhead to the scheduler switching between processes:
>>
>> 100 runs of:
>>
>> taskset 01 ./perf bench sched pipe
>>
>> Running 'sched/pipe' benchmark:
>> Executed 1000000 pipe operations between two processes
>>
>> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>>
>> A) With 4.13.0-rc2+
>> at commit a97fb594bc7d ("virtio-net: fix module unloading")
>>
>> avg.: 2.923 usecs/op
>> std.dev: 0.057 usecs/op
>>
>> B) With this commit:
>>
>> avg.: 2.916 usecs/op
>> std.dev: 0.043 usecs/op
>>
>> Changes since v1:
>> - move membarrier code under kernel/sched/ because it uses the
>> scheduler runqueue,
>> - only add the barrier when we switch from a kernel thread. The case
>> where we switch from a user-space thread is already handled by
>> the atomic_dec_and_test() in mmdrop().
>> - add a comment to mmdrop() documenting the requirement on the implicit
>> memory barrier.
>>
>> CC: Peter Zijlstra <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> CC: Boqun Feng <[email protected]>
>> CC: Andrew Hunter <[email protected]>
>> CC: Maged Michael <[email protected]>
>> CC: [email protected]
>> CC: Avi Kivity <[email protected]>
>> Signed-off-by: Mathieu Desnoyers <[email protected]>
>
> Looks much better, thank you!
>
> I have queued this in place of my earlier patch for the moment. If there
> are no objections, I will push this into the upcoming v4.14 merge window.
> If someone else wants to push it into v4.14, I am of course fine with
> that, and am happy to give it a reviewed-by along the way. But if there
> are objections to your patch (suitably modified based on additional
> review and testing, of course) going into v4.14, I can always fall back
> to pushing my earlier simpler but less housebroken patch. ;-)

I'm fine with you picking up this patch, even though it's tagged "RFC".
I'm sure concerns will have plenty of time to be voiced by others before
it reaches mainline anyway, at which point I'll address them and resubmit
new versions.

Thanks!

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-27 22:57:33

by Paul E. McKenney

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Thu, Jul 27, 2017 at 10:41:25PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 27, 2017, at 6:13 PM, Paul E. McKenney [email protected] wrote:
>
> > On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> >> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
> >> built from all runqueues whose current thread's mm is the same as the
> >> mm of the thread calling sys_membarrier.
> >>
> >> Scheduler-wise, it requires that we add a memory barrier after context
> >> switching between processes (which have different mm). Interestingly,
> >> there is already a memory barrier in mmdrop(), so we only need to add
> >> a barrier when switching from a kernel thread to a userspace thread.
> >> We also don't need to add the barrier when switching to a kernel thread,
> >> because it has no userspace memory mapping, which makes ordering of
> >> user-space memory accesses pretty much useless.
> >>
> >> * Benchmark
> >>
> >> A stress-test benchmark of sched pipe shows that it does not add
> >> significant overhead to the scheduler switching between processes:
> >>
> >> 100 runs of:
> >>
> >> taskset 01 ./perf bench sched pipe
> >>
> >> Running 'sched/pipe' benchmark:
> >> Executed 1000000 pipe operations between two processes
> >>
> >> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> >>
> >> A) With 4.13.0-rc2+
> >> at commit a97fb594bc7d ("virtio-net: fix module unloading")
> >>
> >> avg.: 2.923 usecs/op
> >> std.dev: 0.057 usecs/op
> >>
> >> B) With this commit:
> >>
> >> avg.: 2.916 usecs/op
> >> std.dev: 0.043 usecs/op
> >>
> >> Changes since v1:
> >> - move membarrier code under kernel/sched/ because it uses the
> >> scheduler runqueue,
> >> - only add the barrier when we switch from a kernel thread. The case
> >> where we switch from a user-space thread is already handled by
> >> the atomic_dec_and_test() in mmdrop().
> >> - add a comment to mmdrop() documenting the requirement on the implicit
> >> memory barrier.
> >>
> >> CC: Peter Zijlstra <[email protected]>
> >> CC: Paul E. McKenney <[email protected]>
> >> CC: Boqun Feng <[email protected]>
> >> CC: Andrew Hunter <[email protected]>
> >> CC: Maged Michael <[email protected]>
> >> CC: [email protected]
> >> CC: Avi Kivity <[email protected]>
> >> Signed-off-by: Mathieu Desnoyers <[email protected]>
> >
> > Looks much better, thank you!
> >
> > I have queued this in place of my earlier patch for the moment. If there
> > are no objections, I will push this into the upcoming v4.14 merge window.
> > If someone else wants to push it into v4.14, I am of course fine with
> > that, and am happy to give it a reviewed-by along the way. But if there
> > are objections to your patch (suitably modified based on additional
> > review and testing, of course) going into v4.14, I can always fall back
> > to pushing my earlier simpler but less housebroken patch. ;-)
>
> I'm fine with you picking up this patch, even though it's tagged "RFC".
> I'm sure concerns will have plenty of time to be voiced by others before
> it reaches mainline anyway, at which point I'll address them and resubmit
> new versions.

Works for me! My -rcu tree is subject to rebasing, so I can easily
replace the current patch with an updated one.

Thanx, Paul


2017-07-28 08:55:47

by Peter Zijlstra

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> + struct mm_struct *oldmm)

That is a bit of a mouthful...

> +{
> + if (!IS_ENABLED(CONFIG_MEMBARRIER))
> + return;
> + /*
> + * __schedule()
> + * finish_task_switch()
> + * if (mm)
> + * mmdrop(mm)
> + * atomic_dec_and_test()
*
> + * takes care of issuing a memory barrier when oldmm is
> + * non-NULL. We also don't need the barrier when switching to a
> + * kernel thread, nor when we switch between threads belonging
> + * to the same process.
> + */
> + if (likely(oldmm || !mm || mm == oldmm))
> + return;
> + /*
> + * When switching between processes, membarrier expedited
> + * private requires a memory barrier after we set the current
> + * task.
> + */
> + smp_mb();
> +}

And because of what it complements, I would have expected the callsite:

> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>
> mm = next->mm;
> oldmm = prev->active_mm;
> + membarrier_expedited_mb_after_set_current(mm, oldmm);
> /*
> * For paravirt, this is coupled with an exit in switch_to to
> * combine the page table reload and the switch backend into

to be in finish_task_switch(), something like:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9785f7aed75..33f34a201255 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
finish_arch_post_lock_switch();

fire_sched_in_preempt_notifiers(current);
+
+ /*
+ * For CONFIG_MEMBARRIER we need a full memory barrier after the
+ * rq->curr assignment. Not all architectures have one in either
+ * switch_to() or switch_mm() so we use (and complement) the one
+ * implied by mmdrop()'s atomic_dec_and_test().
+ */
if (mm)
mmdrop(mm);
+ else if (IS_ENABLED(CONFIG_MEMBARRIER))
+ smp_mb();
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);


I realize this is sub-optimal if we're switching to a kernel thread, so
it might want some work, then again, a whole bunch of architectures
don't in fact need this extra barrier at all.

> +static void membarrier_private_expedited(void)
> +{
> + int cpu, this_cpu;
> + bool fallback = false;
> + cpumask_var_t tmpmask;
> +
> + if (num_online_cpus() == 1)
> + return;
> +
> + /*
> + * Matches memory barriers around rq->curr modification in
> + * scheduler.
> + */
> + smp_mb(); /* system call entry is not a mb. */
> +

Weren't you going to put in a comment on that GFP_NOWAIT thing?

> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {

You really want: zalloc_cpumask_var().
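
(zalloc_cpumask_var() hands back a zeroed mask; only the matching CPUs
get their bit set in the loop below, so every other bit must already
read as clear.) A sketch of the intended allocation:

	/*
	 * zalloc_cpumask_var() zeroes the mask, so any CPU that is not
	 * explicitly set below is guaranteed not to be sent an IPI.
	 */
	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
		/* Fallback for OOM. */
		fallback = true;
	}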

> + /* Fallback for OOM. */
> + fallback = true;
> + }
> +
> + /*
> + * Skipping the current CPU is OK even though we can be
> + * migrated at any point. The current CPU, at the point where we
> + * read raw_smp_processor_id(), is ensured to be in program
> + * order with respect to the caller thread. Therefore, we can
> + * skip this CPU from the iteration.
> + */
> + this_cpu = raw_smp_processor_id();

So if instead you do the below, that is still true, but you have the
opportunity to skip moar CPUs, then again, if you migrate the wrong way
you'll end up not skipping yourself.. ah well.

> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + struct task_struct *p;
> +
if (cpu == raw_smp_processor_id())
continue;

> + rcu_read_lock();
> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> + if (p && p->mm == current->mm) {
> + if (!fallback)
> + __cpumask_set_cpu(cpu, tmpmask);
> + else
> + smp_call_function_single(cpu, ipi_mb, NULL, 1);
> + }
> + rcu_read_unlock();
> + }
> + cpus_read_unlock();

This ^, wants to go after that v

> + if (!fallback) {
> + smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> + free_cpumask_var(tmpmask);
> + }

Because otherwise the bits in your tmpmask might no longer match the
online state.

> +
> + /*
> + * Memory barrier on the caller thread _after_ we finished
> + * waiting for the last IPI. Matches memory barriers around
> + * rq->curr modification in scheduler.
> + */
> + smp_mb(); /* exit from system call is not a mb */
> +}

2017-07-28 11:10:27

by Peter Zijlstra

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> finish_arch_post_lock_switch();
>
> fire_sched_in_preempt_notifiers(current);
> +
> + /*
> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> + * rq->curr assignment. Not all architectures have one in either
> + * switch_to() or switch_mm() so we use (and complement) the one
> + * implied by mmdrop()'s atomic_dec_and_test().
> + */
> if (mm)
> mmdrop(mm);
> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> + smp_mb();
> +
> if (unlikely(prev_state == TASK_DEAD)) {
> if (prev->sched_class->task_dead)
> prev->sched_class->task_dead(prev);

Hurm.. so going over this again, I'm not sure this is as good as we
thought it was..


context_switch()

	mm = next->mm
	old_mm = prev->active_mm;

	if (!mm) {
		next->active_mm = old_mm;
		mmgrab(oldmm);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	if (!prev->mm) {
		prev->active_mm = NULL;
		rq->prev_mm = old_mm;
	}

	/* ... */

finish_task_switch()

	mm = rq->prev_mm; // oldmm when !prev->mm
	rq->prev_mm = NULL;

	if (mm)
		mmdrop(mm);



That mmdrop() is to balance the mmgrab() for when we switch between
kthreads. Also, it looks to me that if we do kthread->kthread switches,
we do a superfluous mmgrab() / mmdrop().

Something like the below perhaps would optimize and clarify things.

---
kernel/sched/core.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9785f7aed75..7924b4cc2bff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2733,12 +2733,8 @@ static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
- struct mm_struct *mm, *oldmm;
-
prepare_task_switch(rq, prev, next);

- mm = next->mm;
- oldmm = prev->active_mm;
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
@@ -2746,16 +2742,29 @@ context_switch(struct rq *rq, struct task_struct *prev,
*/
arch_start_context_switch(prev);

- if (!mm) {
- next->active_mm = oldmm;
- mmgrab(oldmm);
- enter_lazy_tlb(oldmm, next);
- } else
- switch_mm_irqs_off(oldmm, mm, next);
+ /*
+ * kernel -> kernel transfer active
+ * user -> kernel mmgrab()
+ *
+ * kernel -> user mmdrop()
+ * user -> user switch_mm()
+ */
+ if (!next->mm) { // to kernel
+ next->active_mm = prev->active_mm;
+ enter_lazy_tlb(prev->active_mm, next);
+
+ if (prev->mm) // from user
+ mmgrab(prev->active_mm);
+
+ } else { // to user
+ switch_mm_irqs_off(prev->active_mm, next->mm, next);

- if (!prev->mm) {
- prev->active_mm = NULL;
- rq->prev_mm = oldmm;
+ if (!prev->mm) { // from kernel
+ rq->prev_mm = prev->active_mm;
+ prev->active_mm = NULL;
+
+ /* will mmdrop() in finish_task_switch(). */
+ }
}

rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

2017-07-28 11:57:20

by Peter Zijlstra

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> finish_arch_post_lock_switch();
>
> fire_sched_in_preempt_notifiers(current);
> +
> + /*
> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> + * rq->curr assignment. Not all architectures have one in either
> + * switch_to() or switch_mm() so we use (and complement) the one
> + * implied by mmdrop()'s atomic_dec_and_test().
> + */
> if (mm)
> mmdrop(mm);
> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> + smp_mb();
> +
> if (unlikely(prev_state == TASK_DEAD)) {
> if (prev->sched_class->task_dead)
> prev->sched_class->task_dead(prev);
>
>

> a whole bunch of architectures don't in fact need this extra barrier at all.

In fact, I'm fairly sure it's only PPC.

Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
anything other than smp_mb() (for now, RISC-V is in this same boat and
MIPS could be if they ever sort out their fancy barriers).

TSO archs use a regular STORE for RELEASE, but all their atomics imply a
smp_mb() and there are enough around to make one happen (typically
mm_cpumask updates).

Everybody else, aside from ARM64 and PPC must use smp_mb() for
ACQUIRE/RELEASE.

ARM64 has a super duper barrier in switch_to().

Which only leaves PPC stranded.. but the 'good' news is that mpe says
they'll probably need a barrier in switch_mm() in any case.


2017-07-28 15:34:05

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra [email protected] wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> + struct mm_struct *oldmm)
>
> That is a bit of a mouthful...
>
>> +{
>> + if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> + return;
>> + /*
>> + * __schedule()
>> + * finish_task_switch()
>> + * if (mm)
>> + * mmdrop(mm)
>> + * atomic_dec_and_test()
> *
>> + * takes care of issuing a memory barrier when oldmm is
>> + * non-NULL. We also don't need the barrier when switching to a
>> + * kernel thread, nor when we switch between threads belonging
>> + * to the same process.
>> + */
>> + if (likely(oldmm || !mm || mm == oldmm))
>> + return;
>> + /*
>> + * When switching between processes, membarrier expedited
>> + * private requires a memory barrier after we set the current
>> + * task.
>> + */
>> + smp_mb();
>> +}
>
> And because of what it complements, I would have expected the callsite:
>
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>
>> mm = next->mm;
>> oldmm = prev->active_mm;
>> + membarrier_expedited_mb_after_set_current(mm, oldmm);
>> /*
>> * For paravirt, this is coupled with an exit in switch_to to
>> * combine the page table reload and the switch backend into
>
> to be in finish_task_switch(), something like:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> finish_arch_post_lock_switch();
>
> fire_sched_in_preempt_notifiers(current);
> +
> + /*
> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> + * rq->curr assignment. Not all architectures have one in either
> + * switch_to() or switch_mm() so we use (and complement) the one
> + * implied by mmdrop()'s atomic_dec_and_test().
> + */
> if (mm)
> mmdrop(mm);
> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> + smp_mb();
> +
> if (unlikely(prev_state == TASK_DEAD)) {
> if (prev->sched_class->task_dead)
> prev->sched_class->task_dead(prev);
>
>
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
finish_lock_switch(rq, prev);
+ /*
+ * The membarrier system call requires a full memory barrier
+ * after storing to rq->curr, before going back to user-space.
+ */
+ smp_mb__after_unlock_lock();
finish_arch_post_lock_switch();

fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.
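
Conceptually (a sketch of the idea only, not the exact kernel
definition):

#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE		/* e.g. powerpc */
#define smp_mb__after_unlock_lock()	smp_mb()
#else
/* UNLOCK+LOCK is already a full barrier on these architectures. */
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif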

>
>> +static void membarrier_private_expedited(void)
>> +{
>> + int cpu, this_cpu;
>> + bool fallback = false;
>> + cpumask_var_t tmpmask;
>> +
>> + if (num_online_cpus() == 1)
>> + return;
>> +
>> + /*
>> + * Matches memory barriers around rq->curr modification in
>> + * scheduler.
>> + */
>> + smp_mb(); /* system call entry is not a mb. */
>> +
>
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation
too:

+ /*
+ * Expedited membarrier commands guarantee that they won't
+ * block, hence the GFP_NOWAIT allocation flag and fallback
+ * implementation.
+ */



>
>> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>
> You really want: zalloc_cpumask_var().

ok

>
>> + /* Fallback for OOM. */
>> + fallback = true;
>> + }
>> +
>> + /*
>> + * Skipping the current CPU is OK even though we can be
>> + * migrated at any point. The current CPU, at the point where we
>> + * read raw_smp_processor_id(), is ensured to be in program
>> + * order with respect to the caller thread. Therefore, we can
>> + * skip this CPU from the iteration.
>> + */
>> + this_cpu = raw_smp_processor_id();
>
> So if instead you do the below, that is still true, but you have the
> opportunity to skip more CPUs; then again, if you migrate the wrong way
> you'll end up not skipping yourself.. ah well.

Chances are better to skip more CPUs in the face of migration if we do it
in the loop as you suggest. Will do.

>
>> + cpus_read_lock();
>> + for_each_online_cpu(cpu) {
>> + struct task_struct *p;
>> +
> if (cpu == raw_smp_processor_id())
> continue;
>
>> + rcu_read_lock();
>> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> + if (p && p->mm == current->mm) {
>> + if (!fallback)
>> + __cpumask_set_cpu(cpu, tmpmask);
>> + else
>> + smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> + }
>> + rcu_read_unlock();
>> + }
>> + cpus_read_unlock();
>
> This ^, wants to go after that v
>
>> + if (!fallback) {
>> + smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> + free_cpumask_var(tmpmask);
>> + }
>
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
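
For illustration, a sketch of the loop with those changes folded in
(zalloc_cpumask_var(), skipping the current CPU inside the iteration, and
releasing cpus_read_lock() only once the IPIs have been sent); not the
final patch:

        if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
                /* Fallback for OOM. */
                fallback = true;
        }

        cpus_read_lock();
        for_each_online_cpu(cpu) {
                struct task_struct *p;

                /*
                 * The current CPU is in program order with respect to the
                 * caller thread, so it can be skipped even if we migrate.
                 */
                if (cpu == raw_smp_processor_id())
                        continue;
                rcu_read_lock();
                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
                if (p && p->mm == current->mm) {
                        if (!fallback)
                                __cpumask_set_cpu(cpu, tmpmask);
                        else
                                smp_call_function_single(cpu, ipi_mb, NULL, 1);
                }
                rcu_read_unlock();
        }
        if (!fallback) {
                smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
                free_cpumask_var(tmpmask);
        }
        cpus_read_unlock();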

Mathieu

>
>> +
>> + /*
>> + * Memory barrier on the caller thread _after_ we finished
>> + * waiting for the last IPI. Matches memory barriers around
>> + * rq->curr modification in scheduler.
>> + */
>> + smp_mb(); /* exit from system call is not a mb */
>> +}

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-28 15:36:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 28, 2017, at 7:57 AM, Peter Zijlstra [email protected] wrote:

> On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e9785f7aed75..33f34a201255 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
>> *prev)
>> finish_arch_post_lock_switch();
>>
>> fire_sched_in_preempt_notifiers(current);
>> +
>> + /*
>> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
>> + * rq->curr assignment. Not all architectures have one in either
>> + * switch_to() or switch_mm() so we use (and complement) the one
>> + * implied by mmdrop()'s atomic_dec_and_test().
>> + */
>> if (mm)
>> mmdrop(mm);
>> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
>> + smp_mb();
>> +
>> if (unlikely(prev_state == TASK_DEAD)) {
>> if (prev->sched_class->task_dead)
>> prev->sched_class->task_dead(prev);
>>
>>
>
>> a whole bunch of architectures don't in fact need this extra barrier at all.
>
> In fact, I'm fairly sure its only PPC.
>
> Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> anything other than smp_mb() (for now, Risc-V is in this same boat and
> MIPS could be if they ever sort out their fancy barriers).
>
> TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> smp_mb() and there are enough around to make one happen (typically
> mm_cpumask updates).
>
> Everybody else, aside from ARM64 and PPC must use smp_mb() for
> ACQUIRE/RELEASE.
>
> ARM64 has a super duper barrier in switch_to().
>
> Which only leaves PPC stranded.. but the 'good' news is that mpe says
> they'll probably need a barrier in switch_mm() in any case.

As I pointed out in my other email, I plan to do this:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
finish_lock_switch(rq, prev);
+ /*
+ * The membarrier system call requires a full memory barrier
+ * after storing to rq->curr, before going back to user-space.
+ */
+ smp_mb__after_unlock_lock();
finish_arch_post_lock_switch();

fire_sched_in_preempt_notifiers(current);

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-28 16:46:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Fri, Jul 28, 2017 at 03:38:15PM +0000, Mathieu Desnoyers wrote:
> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> > they'll probably need a barrier in switch_mm() in any case.
>
> As I pointed out in my other email, I plan to do this:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> vtime_task_switch(prev);
> perf_event_task_sched_in(prev, current);

Here would place it _inside_ the rq->lock, which seems to make more
sense given the purpose of the barrier, but either way works given its
definition.

> finish_lock_switch(rq, prev);

You could put the whole thing inside IS_ENABLED(CONFIG_SYSMEMBARRIER) or
something.

> + /*
> + * The membarrier system call requires a full memory barrier
> + * after storing to rq->curr, before going back to user-space.
> + */
> + smp_mb__after_unlock_lock();
> finish_arch_post_lock_switch();
>
> fire_sched_in_preempt_notifiers(current);

2017-07-28 17:04:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 28, 2017, at 12:46 PM, Peter Zijlstra [email protected] wrote:

> On Fri, Jul 28, 2017 at 03:38:15PM +0000, Mathieu Desnoyers wrote:
>> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
>> > they'll probably need a barrier in switch_mm() in any case.
>>
>> As I pointed out in my other email, I plan to do this:
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct
>> *prev)
>> vtime_task_switch(prev);
>> perf_event_task_sched_in(prev, current);
>
> Here would place it _inside_ the rq->lock, which seems to make more
> sense given the purpose of the barrier, but either way works given its
> definition.

Given its naming "...after_unlock_lock", I thought it would be clearer to put
it after the unlock. Anyway, this barrier does not seem to be used to ensure
the release barrier per se (unlock already has release semantics), but rather
ensures a full memory barrier wrt memory accesses that are synchronized by
means other than this lock.

>
>> finish_lock_switch(rq, prev);
>
> You could put the whole thing inside IS_ENABLED(CONFIG_SYSMEMBARRIER) or
> something.

I'm tempted to wait until we hear from powerpc maintainers, so we learn
whether they deeply care about this extra barrier in finish_task_switch()
before making it conditional on CONFIG_MEMBARRIER.

Having a guaranteed barrier after context switch on all architectures may
have other uses.

Thanks,

Mathieu

>
>> + /*
>> + * The membarrier system call requires a full memory barrier
>> + * after storing to rq->curr, before going back to user-space.
>> + */
>> + smp_mb__after_unlock_lock();
>> finish_arch_post_lock_switch();
>>
>> fire_sched_in_preempt_notifiers(current);

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-29 01:58:56

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Fri, 28 Jul 2017 17:06:53 +0000 (UTC)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Jul 28, 2017, at 12:46 PM, Peter Zijlstra [email protected] wrote:
>
> > On Fri, Jul 28, 2017 at 03:38:15PM +0000, Mathieu Desnoyers wrote:
> >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> >> > they'll probably need a barrier in switch_mm() in any case.
> >>
> >> As I pointed out in my other email, I plan to do this:
> >>
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct
> >> *prev)
> >> vtime_task_switch(prev);
> >> perf_event_task_sched_in(prev, current);
> >
> > Here would place it _inside_ the rq->lock, which seems to make more
> > sense given the purpose of the barrier, but either way works given its
> > definition.
>
> Given its naming "...after_unlock_lock", I thought it would be clearer to put
> it after the unlock. Anyway, this barrier does not seem to be used to ensure
> the release barrier per se (unlock already has release semantics), but rather
> ensures a full memory barrier wrt memory accesses that are synchronized by
> means other than this lock.
>
> >
> >> finish_lock_switch(rq, prev);
> >
> > You could put the whole thing inside IS_ENABLED(CONFIG_SYSMEMBARRIER) or
> > something.
>
> I'm tempted to wait until we hear from powerpc maintainers, so we learn
> whether they deeply care about this extra barrier in finish_task_switch()
> before making it conditional on CONFIG_MEMBARRIER.
>
> Having a guaranteed barrier after context switch on all architectures may
> have other uses.

I haven't had time to read the thread and understand exactly why you need
this extra barrier, I'll do it next week. Thanks for cc'ing us on it.

A smp_mb is pretty expensive on powerpc CPUs. Removing the sync from
switch_to increased thread switch performance by 2-3%. Putting it in
switch_mm may be a little less painful, but still we have to weigh it
against the benefit of this new functionality. Would that be a net win
for the average end-user? Seems unlikely.

But we also don't want to lose sys_membarrier completely. Would it be too
painful to make MEMBARRIER_CMD_PRIVATE_EXPEDITED return error, or make it
fall back to a slower case if we decide not to implement it?

Thanks,
Nick

2017-07-29 09:24:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Sat, Jul 29, 2017 at 11:58:40AM +1000, Nicholas Piggin wrote:
> I haven't had time to read the thread and understand exactly why you need
> this extra barrier, I'll do it next week. Thanks for cc'ing us on it.

Bottom of here:

https://lkml.kernel.org/r/[email protected]

is probably the fastest way towards understanding the need for a barrier
after rq->curr assignment.

Any barrier after that assignment is good for us, but so far it looks
like PPC doesn't (and PPC only afaict) provide any smp_mb() after that
point.

> A smp_mb is pretty expensive on powerpc CPUs. Removing the sync from
> switch_to increased thread switch performance by 2-3%. Putting it in
> switch_mm may be a little less painful, but still we have to weigh it
> against the benefit of this new functionality. Would that be a net win
> for the average end-user? Seems unlikely.
>
> But we also don't want to lose sys_membarrier completely. Would it be too
> painful to make MEMBARRIER_CMD_PRIVATE_EXPEDITED return error, or make it
> fall back to a slower case if we decide not to implement it?

One ugly thing we've thought of is tagging each mm that has used
sys_membarrier() and only issue the smp_mb() for those. That way only
those tasks that actually rely on the syscall get to pay the price.
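
A rough sketch of that idea; the mm_struct field name and the exact hook
points are hypothetical, purely to show the shape:

        /* Hypothetical flag, set the first time a process uses
         * sys_membarrier(): */
        WRITE_ONCE(current->mm->membarrier_expedited, 1);

        /* Context-switch side: only pay for the extra full barrier after
         * the rq->curr store when switching into a tagged mm. */
        if (next->mm && READ_ONCE(next->mm->membarrier_expedited))
                smp_mb();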

2017-07-29 09:46:01

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Sat, 29 Jul 2017 11:23:33 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sat, Jul 29, 2017 at 11:58:40AM +1000, Nicholas Piggin wrote:
> > I haven't had time to read the thread and understand exactly why you need
> > this extra barrier, I'll do it next week. Thanks for cc'ing us on it.
>
> Bottom of here:
>
> https://lkml.kernel.org/r/[email protected]
>
> is probably the fastest way towards understanding the need for a barrier
> after rq->curr assignment.
>
> Any barrier after that assignment is good for us, but so far it looks
> like PPC doesn't (and PPC only afaict) provide any smp_mb() after that
> point.

Thanks, yeah that's relatively straightforward.

> > A smp_mb is pretty expensive on powerpc CPUs. Removing the sync from
> > switch_to increased thread switch performance by 2-3%. Putting it in
> > switch_mm may be a little less painful, but still we have to weigh it
> > against the benefit of this new functionality. Would that be a net win
> > for the average end-user? Seems unlikely.
> >
> > But we also don't want to lose sys_membarrier completely. Would it be too
> > painful to make MEMBARRIER_CMD_PRIVATE_EXPEDITED return error, or make it
> > fall back to a slower case if we decide not to implement it?
>
> One ugly thing we've thought of is tagging each mm that has used
> sys_membarrier() and only issue the smp_mb() for those. That way only
> those tasks that actually rely on the syscall get to pay the price.

The biggest hammer that puts everything on the syscall side, I think, would
be to lock each runqueue while iterating over them, right?

That could be pretty expensive but it would be interesting to know how
bad that is for real apps... hmm, we might be able to restrict iteration
to mm_cpumask(current->mm), no?
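
As a sketch, on an architecture that does maintain mm_cpumask() (not all do,
as comes up below), the restricted iteration would look roughly like:

        for_each_cpu(cpu, mm_cpumask(current->mm)) {
                struct task_struct *p;

                rcu_read_lock();
                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
                if (p && p->mm == current->mm)
                        smp_call_function_single(cpu, ipi_mb, NULL, 1);
                rcu_read_unlock();
        }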

2017-07-29 09:49:12

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Sat, 29 Jul 2017 19:45:43 +1000
Nicholas Piggin <[email protected]> wrote:

> hmm, we might be able to restrict iteration
> to mm_cpumask(current->mm), no?

Oh that's been discussed too. I'll read back over it too.

2017-07-29 10:52:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Sat, Jul 29, 2017 at 07:48:56PM +1000, Nicholas Piggin wrote:
> On Sat, 29 Jul 2017 19:45:43 +1000
> Nicholas Piggin <[email protected]> wrote:
>
> > hmm, we might be able to restrict iteration
> > to mm_cpumask(current->mm), no?
>
> Oh that's been discussed too. I'll read back over it too.

Right, the main problem is that some architectures (arm64 for instance,
although Will is looking at it) don't use mm_cpumask() at all.

Other architectures (PPC for instance) only ever set bits.

2017-07-31 13:21:02

by Michael Ellerman

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

Peter Zijlstra <[email protected]> writes:

> On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e9785f7aed75..33f34a201255 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>> finish_arch_post_lock_switch();
>>
>> fire_sched_in_preempt_notifiers(current);
>> +
>> + /*
>> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
>> + * rq->curr assignment. Not all architectures have one in either
>> + * switch_to() or switch_mm() so we use (and complement) the one
>> + * implied by mmdrop()'s atomic_dec_and_test().
>> + */
>> if (mm)
>> mmdrop(mm);
>> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
>> + smp_mb();
>> +
>> if (unlikely(prev_state == TASK_DEAD)) {
>> if (prev->sched_class->task_dead)
>> prev->sched_class->task_dead(prev);
>>
>>
>
>> a whole bunch of architectures don't in fact need this extra barrier at all.
>
> In fact, I'm fairly sure its only PPC.
>
> Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> anything other than smp_mb() (for now, Risc-V is in this same boat and
> MIPS could be if they ever sort out their fancy barriers).
>
> TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> smp_mb() and there are enough around to make one happen (typically
> mm_cpumask updates).
>
> Everybody else, aside from ARM64 and PPC must use smp_mb() for
> ACQUIRE/RELEASE.
>
> ARM64 has a super duper barrier in switch_to().
>
> Which only leaves PPC stranded.. but the 'good' news is that mpe says
> they'll probably need a barrier in switch_mm() in any case.

I may have been sleep deprived. We have a patch, probably soon to be
merged, which will add a smp_mb() in switch_mm() but *only* when we add
a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.

I'm not across membarrier enough to know if that's sufficient, but it
seems unlikely?

cheers

2017-07-31 13:36:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Mon, Jul 31, 2017 at 11:20:59PM +1000, Michael Ellerman wrote:
> Peter Zijlstra <[email protected]> writes:

> > In fact, I'm fairly sure its only PPC.
> >
> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> > anything other than smp_mb() (for now, Risc-V is in this same boat and
> > MIPS could be if they ever sort out their fancy barriers).
> >
> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> > smp_mb() and there are enough around to make one happen (typically
> > mm_cpumask updates).
> >
> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
> > ACQUIRE/RELEASE.
> >
> > ARM64 has a super duper barrier in switch_to().
> >
> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> > they'll probably need a barrier in switch_mm() in any case.
>
> I may have been sleep deprived. We have a patch, probably soon to be
> merged, which will add a smp_mb() in switch_mm() but *only* when we add
> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
>
> I'm not across membarrier enough to know if that's sufficient, but it
> seems unlikely?

Correct, that would be insufficient. We'd need it every time switch_mm()
does indeed change the effective mm.

Now you also spoke of looking at clearing bits in mm_cpumask(), and I
suspect that if you do that, you end up having to do a barrier every
time.

2017-07-31 19:31:29

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 28, 2017, at 9:58 PM, Nicholas Piggin [email protected] wrote:

> On Fri, 28 Jul 2017 17:06:53 +0000 (UTC)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> ----- On Jul 28, 2017, at 12:46 PM, Peter Zijlstra [email protected] wrote:
>>
>> > On Fri, Jul 28, 2017 at 03:38:15PM +0000, Mathieu Desnoyers wrote:
>> >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
>> >> > they'll probably need a barrier in switch_mm() in any case.
>> >>
>> >> As I pointed out in my other email, I plan to do this:
>> >>
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct
>> >> *prev)
>> >> vtime_task_switch(prev);
>> >> perf_event_task_sched_in(prev, current);
>> >
>> > Here would place it _inside_ the rq->lock, which seems to make more
>> > sense given the purpose of the barrier, but either way works given its
>> > definition.
>>
>> Given its naming "...after_unlock_lock", I thought it would be clearer to put
>> it after the unlock. Anyway, this barrier does not seem to be used to ensure
>> the release barrier per se (unlock already has release semantics), but rather
>> ensures a full memory barrier wrt memory accesses that are synchronized by
>> means other than this lock.
>>
>> >
>> >> finish_lock_switch(rq, prev);
>> >
>> > You could put the whole thing inside IS_ENABLED(CONFIG_SYSMEMBARRIER) or
>> > something.
>>
>> I'm tempted to wait until we hear from powerpc maintainers, so we learn
>> whether they deeply care about this extra barrier in finish_task_switch()
>> before making it conditional on CONFIG_MEMBARRIER.
>>
>> Having a guaranteed barrier after context switch on all architectures may
>> have other uses.
>
> I haven't had time to read the thread and understand exactly why you need
> this extra barrier, I'll do it next week. Thanks for cc'ing us on it.
>
> A smp_mb is pretty expensive on powerpc CPUs. Removing the sync from
> switch_to increased thread switch performance by 2-3%. Putting it in
> switch_mm may be a little less painful, but still we have to weigh it
> against the benefit of this new functionality. Would that be a net win
> for the average end-user? Seems unlikely.
>
> But we also don't want to lose sys_membarrier completely. Would it be too
> painful to make MEMBARRIER_CMD_PRIVATE_EXPEDITED return error, or make it
> fall back to a slower case if we decide not to implement it?

The need for an expedited membarrier comes from the need to implement
synchronization schemes like hazard pointers, RCU, and garbage collectors in
user-space. Take hazard pointers as an example: if the memory is freed by the
same thread that retires it, the slowdown introduced by a non-expedited
membarrier is not acceptable at all; only an expedited membarrier brings the
slowdown down to an acceptable level. The current alternative for users is to
rely on undocumented side-effects of mprotect() to achieve the same result,
which happens to work on some architectures and may break in the future.
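
To make the usage concrete, here is a minimal user-space sketch, assuming a
kernel and <linux/membarrier.h> that expose MEMBARRIER_CMD_PRIVATE_EXPEDITED
as proposed in this patch; the updater/retire side issues the expedited
barrier so the read-side fast paths can get away with compiler barriers only:

#include <linux/membarrier.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        int cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);

        if (cmds < 0 || !(cmds & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
                /* Fall back to full barriers on the read side. */
                fprintf(stderr, "private expedited membarrier unavailable\n");
                return 1;
        }
        /*
         * "Heavy" fence on the updater/retire side: when this returns,
         * every other running thread of this process has executed a full
         * memory barrier, pairing with the compiler-only barriers on the
         * reader fast paths.
         */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        return 0;
}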

If users do not have membarrier expedited on a given architecture, and are
told that mprotect() does not provide the barrier guarantees they are looking
for, then they would have to add heavy-weight memory barriers on many
user-space fast-paths on those specific architectures, assuming they are
willing to go through that trouble.

I understand that the 2-3% overhead when switching between threads is a big
deal. Do you have numbers on the overhead added by a memory barrier in
switch_mm? I suspect that switching between processes (including the
cost of the ensuing cache-line and TLB misses) will be quite a bit heavier
in the first place.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-08-01 00:35:58

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Mon, 31 Jul 2017 23:20:59 +1000
Michael Ellerman <[email protected]> wrote:

> Peter Zijlstra <[email protected]> writes:
>
> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index e9785f7aed75..33f34a201255 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >> finish_arch_post_lock_switch();
> >>
> >> fire_sched_in_preempt_notifiers(current);
> >> +
> >> + /*
> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> >> + * rq->curr assignment. Not all architectures have one in either
> >> + * switch_to() or switch_mm() so we use (and complement) the one
> >> + * implied by mmdrop()'s atomic_dec_and_test().
> >> + */
> >> if (mm)
> >> mmdrop(mm);
> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> >> + smp_mb();
> >> +
> >> if (unlikely(prev_state == TASK_DEAD)) {
> >> if (prev->sched_class->task_dead)
> >> prev->sched_class->task_dead(prev);
> >>
> >>
> >
> >> a whole bunch of architectures don't in fact need this extra barrier at all.
> >
> > In fact, I'm fairly sure its only PPC.
> >
> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> > anything other than smp_mb() (for now, Risc-V is in this same boat and
> > MIPS could be if they ever sort out their fancy barriers).
> >
> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> > smp_mb() and there are enough around to make one happen (typically
> > mm_cpumask updates).
> >
> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
> > ACQUIRE/RELEASE.
> >
> > ARM64 has a super duper barrier in switch_to().
> >
> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> > they'll probably need a barrier in switch_mm() in any case.
>
> I may have been sleep deprived. We have a patch, probably soon to be
> merged, which will add a smp_mb() in switch_mm() but *only* when we add
> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
>
> I'm not across membarrier enough to know if that's sufficient, but it
> seems unlikely?

Won't be sufficient, they need a barrier after assigning rq->curr.
It can be avoided when switching between threads with the same mm.

I would like to see how bad membarrier performance is if we made
that side heavy enough to avoid the barrier in context switch (e.g.,
by taking the rq locks, or using synchronize_sched_expedited -- on
a per-arch basis of course).

Is there some (realistic-ish) benchmark using membarrier we can
experiment with?

Thanks,
Nick

2017-08-01 01:33:18

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin [email protected] wrote:

> On Mon, 31 Jul 2017 23:20:59 +1000
> Michael Ellerman <[email protected]> wrote:
>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index e9785f7aed75..33f34a201255 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
>> >> *prev)
>> >> finish_arch_post_lock_switch();
>> >>
>> >> fire_sched_in_preempt_notifiers(current);
>> >> +
>> >> + /*
>> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
>> >> + * rq->curr assignment. Not all architectures have one in either
>> >> + * switch_to() or switch_mm() so we use (and complement) the one
>> >> + * implied by mmdrop()'s atomic_dec_and_test().
>> >> + */
>> >> if (mm)
>> >> mmdrop(mm);
>> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
>> >> + smp_mb();
>> >> +
>> >> if (unlikely(prev_state == TASK_DEAD)) {
>> >> if (prev->sched_class->task_dead)
>> >> prev->sched_class->task_dead(prev);
>> >>
>> >>
>> >
>> >> a whole bunch of architectures don't in fact need this extra barrier at all.
>> >
>> > In fact, I'm fairly sure its only PPC.
>> >
>> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
>> > anything other than smp_mb() (for now, Risc-V is in this same boat and
>> > MIPS could be if they ever sort out their fancy barriers).
>> >
>> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
>> > smp_mb() and there are enough around to make one happen (typically
>> > mm_cpumask updates).
>> >
>> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
>> > ACQUIRE/RELEASE.
>> >
>> > ARM64 has a super duper barrier in switch_to().
>> >
>> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
>> > they'll probably need a barrier in switch_mm() in any case.
>>
>> I may have been sleep deprived. We have a patch, probably soon to be
>> merged, which will add a smp_mb() in switch_mm() but *only* when we add
>> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
>>
>> I'm not across membarrier enough to know if that's sufficient, but it
>> seems unlikely?
>
> Won't be sufficient, they need a barrier after assigning rq->curr.
> It can be avoided when switching between threads with the same mm.
>
> I would like to see how bad membarrier performance is if we made
> that side heavy enough to avoid the barrier in context switch (e.g.,
> by taking the rq locks, or using synchronize_sched_expedited -- on
> a per-arch basis of course).
>
> Is there some (realistic-ish) benchmark using membarrier we can
> experiment with?

Hi Nick,

I have benchmark programs using membarrier in liburcu [1]. They
represent very heavy usage of membarrier (perhaps more extreme than
reality).

The master branch is at:

git://git.liburcu.org/userspace-rcu.git

Below is a test branch I quickly hacked to make urcu use membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED:

https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private

An example benchmark is under tests/benchmark

Usage: test_urcu nr_readers nr_writers duration (s) <OPTIONS>

Here is an example on my 16-core Intel machine (with hyperthreading, so don't
expect this benchmark to be very reliable) with 4 reader threads, 1 writer
thread. The writer thread in this benchmark issues a synchronize_rcu() at
each iteration, so typically 2 sys_membarrier calls. I'm compiling with
the configure flag "--disable-sys-membarrier-fallback" to have the full
read-side speed.

Using MEMBARRIER_CMD_SHARED:

$ ./test_urcu 4 1 10
SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249 nr_ops 17522992101

438'074'796 reads / reader thread / s
25 updates / s

Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:

$ ./test_urcu 4 1 10
SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668 nr_ops 12710452958

317'741'332 reads / reader thread / s
79'966 updates / s

As a comparison, refer to the test_urcu_mb benchmark, which uses
a full memory barrier in rcu_read_lock and rcu_read_unlock, and
can therefore do without any sys_membarrier. This is our fallback
when sys_membarrier returns -ENOSYS.

$ ./test_urcu_mb 4 1 10
SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184 nr_ops 145914818

3'432'265 reads / reader thread / s
862'418 updates / s

You could try adapting my membarrier private-expedited patch to grab the
rq lock for each CPU to see the impact on the benchmark for your specific
architecture. I suspect that the overhead that will matter in that case
is not only the speed of the membarrier operation, but also its disruption
of the scheduler rq lock cachelines, which will have an effect on other unrelated
workloads. You may need to benchmark another scheduler-intensive workload
executed in parallel with heavy use of sys_membarrier adapted to grab rq locks
to figure this one out.

Let me know if you have questions,

Perhaps Dave Watson will have benchmarks on hazard pointers.

Thanks,

Mathieu


[1] http://liburcu.org


>
> Thanks,
> Nick

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-08-01 02:01:05

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, 1 Aug 2017 01:33:09 +0000 (UTC)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin [email protected] wrote:
>
> > On Mon, 31 Jul 2017 23:20:59 +1000
> > Michael Ellerman <[email protected]> wrote:
> >
> >> Peter Zijlstra <[email protected]> writes:
> >>
> >> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> >> index e9785f7aed75..33f34a201255 100644
> >> >> --- a/kernel/sched/core.c
> >> >> +++ b/kernel/sched/core.c
> >> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
> >> >> *prev)
> >> >> finish_arch_post_lock_switch();
> >> >>
> >> >> fire_sched_in_preempt_notifiers(current);
> >> >> +
> >> >> + /*
> >> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> >> >> + * rq->curr assignment. Not all architectures have one in either
> >> >> + * switch_to() or switch_mm() so we use (and complement) the one
> >> >> + * implied by mmdrop()'s atomic_dec_and_test().
> >> >> + */
> >> >> if (mm)
> >> >> mmdrop(mm);
> >> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> >> >> + smp_mb();
> >> >> +
> >> >> if (unlikely(prev_state == TASK_DEAD)) {
> >> >> if (prev->sched_class->task_dead)
> >> >> prev->sched_class->task_dead(prev);
> >> >>
> >> >>
> >> >
> >> >> a whole bunch of architectures don't in fact need this extra barrier at all.
> >> >
> >> > In fact, I'm fairly sure its only PPC.
> >> >
> >> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> >> > anything other than smp_mb() (for now, Risc-V is in this same boat and
> >> > MIPS could be if they ever sort out their fancy barriers).
> >> >
> >> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> >> > smp_mb() and there are enough around to make one happen (typically
> >> > mm_cpumask updates).
> >> >
> >> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
> >> > ACQUIRE/RELEASE.
> >> >
> >> > ARM64 has a super duper barrier in switch_to().
> >> >
> >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> >> > they'll probably need a barrier in switch_mm() in any case.
> >>
> >> I may have been sleep deprived. We have a patch, probably soon to be
> >> merged, which will add a smp_mb() in switch_mm() but *only* when we add
> >> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
> >>
> >> I'm not across membarrier enough to know if that's sufficient, but it
> >> seems unlikely?
> >
> > Won't be sufficient, they need a barrier after assigning rq->curr.
> > It can be avoided when switching between threads with the same mm.
> >
> > I would like to see how bad membarrier performance is if we made
> > that side heavy enough to avoid the barrier in context switch (e.g.,
> > by taking the rq locks, or using synchronize_sched_expedited -- on
> > a per-arch basis of course).
> >
> > Is there some (realistic-ish) benchmark using membarrier we can
> > experiment with?
>
> Hi Nick,
>
> I have benchmark programs using membarrier in liburcu [1]. They
> represent very heavy usage of membarrier (perhaps more extreme than
> reality).
>
> The master branch is at:
>
> git://git.liburcu.org/userspace-rcu.git
>
> Below is a test branch I quickly hacked to make urcu use membarrier
> MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>
> https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private
>
> An example benchmark is under tests/benchmark
>
> Usage: test_urcu nr_readers nr_writers duration (s) <OPTIONS>
>
> Here is an example on my 16-core Intel machine (with hyperthreading, so don't
> expect this benchmark to be very reliable) with 4 reader threads, 1 writer
> thread. The writer thread in this benchmark issues a synchronize_rcu() at
> each iteration, so typically 2 sys_membarrier calls. I'm compiling with
> the configure flag "--disable-sys-membarrier-fallback" to have the full
> read-side speed.
>
> Using MEMBARRIER_CMD_SHARED:
>
> $ ./test_urcu 4 1 10
> SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249 nr_ops 17522992101
>
> 438'074'796 reads / reader thread / s
> 25 updates / s
>
> Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>
> $ ./test_urcu 4 1 10
> SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668 nr_ops 12710452958
>
> 317'741'332 reads / reader thread / s
> 79'966 updates / s
>
> As a comparison, refer to the test_urcu_mb benchmark, which uses
> a full memory barrier in rcu_read_lock and rcu_read_unlock, and
> can therefore do without any sys_membarrier. This is our fallback
> when sys_membarrier returns -ENOSYS.
>
> $ ./test_urcu_mb 4 1 10
> SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184 nr_ops 145914818
>
> 3'432'265 reads / reader thread / s
> 862'418 updates / s
>
> You could try adapting my membarrier private-expedited patch to grab the
> rq lock for each CPU to see the impact on the benchmark for your specific
> architecture. I suspect that the overhead that will matter in that case
> is not only the speed of the membarrier operation, but also its disruption
> of the scheduler rq lock cachelines, which will have effect on other unrelated
> workloads. You may need to benchmark another scheduler-intensive workload
> executed in parallel with heavy use of sys_membarrier adapted to grab rq locks
> to figure this one out.
>
> Let me know if you have questions,

Thanks for this, I'll take a look. This should be a good start as a stress
test, but I'd also be interested in some application. The reason being that
for example using runqueue locks may give reasonable maximum throughput
numbers, but could cause some latency or slowdown when it's used in a more
realistic scenario.

We don't have numbers for smp_mb in switch_mm. As you say that should be a
lower overhead, but probably still undesirable. But that's one of the
things I'm looking at.

Will report back after I've tried a few more things.

Thanks,
Nick

2017-08-01 08:13:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> Thanks for this, I'll take a look. This should be a good start as a stress
> test, but I'd also be interested in some application. The reason being that
> for example using runqueue locks may give reasonable maximum throughput
> numbers, but could cause some latency or slowdown when it's used in more
> realistic scenario.

Given this is an unprivileged interface we have to consider DoS and
other such lovely things. And since we cannot use mm_cpumask() we're
stuck with for_each_online_cpu().

Combined, that means that using rq->lock is completely out of the
question; some numbnut doing 'for (;;) sys_membarrier();' can
completely wreck the system.

Yes, it might work for 'normal' workloads, but the interference
potential is just too big.

I have the same problem with Paul's synchronize_rcu_expedited() patch,
that is a machine wide IPI spray and will interfere with unrelated work.

2017-08-01 09:57:42

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, 1 Aug 2017 10:12:30 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> > Thanks for this, I'll take a look. This should be a good start as a stress
> > test, but I'd also be interested in some application. The reason being that
> > for example using runqueue locks may give reasonable maximum throughput
> > numbers, but could cause some latency or slowdown when it's used in more
> > realistic scenario.
>
> Given this is an unprivileged interface we have to consider DoS and
> other such lovely things. And since we cannot use mm_cpumask() we're
> stuck with for_each_online_cpu().

I think we *can* make that part of it per-arch, as well as whether
or not to use runqueue locks. It's kind of crazy not to use it when
it's available. Avoiding CPUs you aren't allowed to run on is also
nice for compartmentalization.

> Combined that means that using rq->lock is completely out of the
> question, some numbnut doing 'for (;;) sys_membarrier();' can
> completely wreck the system.

In what way would it wreck the system? It's not holding the lock over
the IPI, only to inspect the rq->curr->mm briefly.

> Yes, it might work for 'normal' workloads, but the interference
> potential is just too big.

Well it's good to be concerned about it. I do see your point. Although
I don't know if it's all that complicated to use unprivileged ops to
badly hurt QoS on most systems already :)

If mm cpumask is used, I think it's okay. You can cause quite similar
kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
stage where we're controlling those kinds of ops in terms of impact
to the rest of the system?

> I have the same problem with Paul's synchronize_rcu_expedited() patch,
> that is a machine wide IPI spray and will interfere with unrelated work.

Possibly global IPI would be a more serious concern.

Thanks,
Nick

2017-08-01 10:22:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 07:57:17PM +1000, Nicholas Piggin wrote:
> On Tue, 1 Aug 2017 10:12:30 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> > > Thanks for this, I'll take a look. This should be a good start as a stress
> > > test, but I'd also be interested in some application. The reason being that
> > > for example using runqueue locks may give reasonable maximum throughput
> > > numbers, but could cause some latency or slowdown when it's used in more
> > > realistic scenario.
> >
> > Given this is an unprivileged interface we have to consider DoS and
> > other such lovely things. And since we cannot use mm_cpumask() we're
> > stuck with for_each_online_cpu().
>
> I think we *can* make that part of it per-arch, as well as whether
> or not to use runqueue locks. It's kind of crazy not to use it when
> it's available. Avoiding CPUs you aren't allowed to run on is also
> nice for compartmentalization.

Right, but a wee bit hard to do since most of the affinity stuff is per
task, not per process. And that would of course only help with work that
is hard partitioned.

> > Combined that means that using rq->lock is completely out of the
> > question, some numbnut doing 'for (;;) sys_membarrier();' can
> > completely wreck the system.
>
> In what way would it wreck the system? It's not holding the lock over
> the IPI, only to inspect the rq->curr->mm briefly.

But you're bouncing the rq->lock around the system at fairly high rates.
For big enough systems this is enough to severely hurt things.

> > Yes, it might work for 'normal' workloads, but the interference
> > potential is just too big.
>
> Well it's good to be concerned about it. I do see your point. Although
> I don't know if it's all that complicated to use unprivileged ops to
> badly hurt QoS on most systems already :)

If so, we should look at fixing that, not make it worse.

> If mm cpumask is used, I think it's okay. You can cause quite similar
> kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
> munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
> stage where we're controlling those kinds of ops in terms of impact
> to the rest of the system?

So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
to those CPUs actually running threads of our process (or very
recently). So while there can be the sporadic stray IPI for a CPU that
recently ran a thread of the target process, it will not get another one
until it switches back into the process.

On machines that need manual TLB broadcasts and don't keep a tight mask,
yes you can interfere at will, but if they care they can fix by
tightening the mask.

In either case, the mm_cpumask() will be bounded by the set of CPUs the
threads are allowed to run on and will not interfere with the rest of
the system.

As to scheduler IPIs, those are limited to the CPUs the user is limited
to and are rate limited by the wakeup-latency of the tasks. After all,
all the time a task is runnable but not running, wakeups are no-ops.

Trouble is of course, that not everybody even sets a single bit in
mm_cpumask() and those that never clear bits will end up with a fairly
wide mask, still interfering with work that isn't hard partitioned.

2017-08-01 10:32:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command



On 08/01/2017 01:22 PM, Peter Zijlstra wrote:
>
>> If mm cpumask is used, I think it's okay. You can cause quite similar
>> kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
>> munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
>> stage where we're controlling those kinds of ops in terms of impact
>> to the rest of the system?
> So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
> to those CPUs actually running threads of our process (or very
> recently). So while there can be the sporadic stray IPI for a CPU that
> recently ran a thread of the target process, it will not get another one
> until it switches back into the process.
>
> On machines that need manual TLB broadcasts and don't keep a tight mask,
> yes you can interfere at will, but if they care they can fix by
> tightening the mask.
>
> In either case, the mm_cpumask() will be bounded by the set of CPUs the
> threads are allowed to run on and will not interfere with the rest of
> the system.
>
> As to scheduler IPIs, those are limited to the CPUs the user is limited
> to and are rate limited by the wakeup-latency of the tasks. After all,
> all the time a task is runnable but not running, wakeups are no-ops.
>
> Trouble is of course, that not everybody even sets a single bit in
> mm_cpumask() and those that never clear bits will end up with a fairly
> wide mask, still interfering with work that isn't hard partitioned.

I hate to propose a way to make this more complicated, but this could be
fixed by a process first declaring its intent to use expedited
process-wide membarrier; if it does, then every context switch updates a
process-wide cpumask indicating which cpus are currently running threads
of that process:

if (prev->mm != next->mm)
        if (prev->mm->running_cpumask)
                cpumask_clear(...);
        else if (next->mm->running_cpumask)
                cpumask_set(...);

now only processes that want expedited process-wide membarrier pay for
it (beyond a couple of predictable branches). You can even have threads
opt in, so unrelated threads that don't participate in the party don't
cause those bits to be set.
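
Spelled out slightly more (the field and exact hook remain hypothetical),
the switch-time maintenance could look like the following; both halves are
independent tests so that switching between two opted-in processes clears
one bit and sets the other:

        int cpu = smp_processor_id();

        if (prev->mm != next->mm) {
                if (prev->mm && prev->mm->running_cpumask)
                        cpumask_clear_cpu(cpu, prev->mm->running_cpumask);
                if (next->mm && next->mm->running_cpumask)
                        cpumask_set_cpu(cpu, next->mm->running_cpumask);
        }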

2017-08-01 10:39:50

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, 1 Aug 2017 12:22:03 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 07:57:17PM +1000, Nicholas Piggin wrote:
> > On Tue, 1 Aug 2017 10:12:30 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> > > > Thanks for this, I'll take a look. This should be a good start as a stress
> > > > test, but I'd also be interested in some application. The reason being that
> > > > for example using runqueue locks may give reasonable maximum throughput
> > > > numbers, but could cause some latency or slowdown when it's used in more
> > > > realistic scenario.
> > >
> > > Given this is an unprivileged interface we have to consider DoS and
> > > other such lovely things. And since we cannot use mm_cpumask() we're
> > > stuck with for_each_online_cpu().
> >
> > I think we *can* make that part of it per-arch, as well as whether
> > or not to use runqueue locks. It's kind of crazy not to use it when
> > it's available. Avoiding CPUs you aren't allowed to run on is also
> > nice for compartmentalization.
>
> Right, but a wee bit hard to do since most of the affinity stuff is per
> task, not per process. And that would of course only help with work that
> is hard partitioned.
>
> > > Combined that means that using rq->lock is completely out of the
> > > question, some numbnut doing 'for (;;) sys_membarrier();' can
> > > completely wreck the system.
> >
> > In what way would it wreck the system? It's not holding the lock over
> > the IPI, only to inspect the rq->curr->mm briefly.
>
> But you're bouncing the rq->lock around the system at fairly high rates.
> For big enough systems this is enough to severely hurt things.

If you already have scheduling access on those CPUs to set mm cpumask,
then you can do worse I guess.

>
> > > Yes, it might work for 'normal' workloads, but the interference
> > > potential is just too big.
> >
> > Well it's good to be concerned about it. I do see your point. Although
> > I don't know if it's all that complicated to use unprivileged ops to
> > badly hurt QoS on most systems already :)
>
> If so, we should look at fixing that, not make it worse.

Point is, I don't think this *is* worse. It's not too difficult to grab
per-cpu locks and cachelines and to send IPIs to other CPUs if you are
mounting a malicious DoS. Particularly not when you can schedule tasks
around those CPUs.

> > If mm cpumask is used, I think it's okay. You can cause quite similar
> > kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
> > munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
> > stage where we're controlling those kinds of ops in terms of impact
> > to the rest of the system?
>
> So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
> to those CPUs actually running threads of our process (or very
> recently). So while there can be the sporadic stray IPI for a CPU that
> recently ran a thread of the target process, it will not get another one
> until it switches back into the process.
>
> On machines that need manual TLB broadcasts and don't keep a tight mask,
> yes you can interfere at will, but if they care they can fix by
> tightening the mask.
>
> In either case, the mm_cpumask() will be bounded by the set of CPUs the
> threads are allowed to run on and will not interfere with the rest of
> the system.

Yep. Powerpc at least will use its mm_cpumask for iterating here
(regardless of how exactly we get the ordering right). Even though
at the moment it is a lazy mask, it's better than a global walk.

> As to scheduler IPIs, those are limited to the CPUs the user is limited
> to and are rate limited by the wakeup-latency of the tasks. After all,
> all the time a task is runnable but not running, wakeups are no-ops.

That's on the order of millions per second per core though, isn't it?

>
> Trouble is of course, that not everybody even sets a single bit in
> mm_cpumask() and those that never clear bits will end up with a fairly
> wide mask, still interfering with work that isn't hard partitioned.

Right, I just don't see what real problem this opens up that you don't
already have when you are not hard partitioned, therefore it doesn't
make sense to add a slowdown to the context switch fastpath to close
one hole in the sieve.

Completely recognizing that other architectures can do it without
taking rq lock at all and will not be forced to do so.

2017-08-01 10:46:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 01:32:43PM +0300, Avi Kivity wrote:
> I hate to propose a way to make this more complicated, but this could be
> fixed by a process first declaring its intent to use expedited process-wide
> membarrier; if it does, then every context switch updates a process-wide
> cpumask indicating which cpus are currently running threads of that process:
>
> if (prev->mm != next->mm)
>         if (prev->mm->running_cpumask)
>                 cpumask_clear(...);
>         else if (next->mm->running_cpumask)
>                 cpumask_set(...);
>
> now only processes that want expedited process-wide membarrier pay for it
> (in other than some predictable branches). You can even have threads opt-in,
> so unrelated threads that don't participate in the party don't cause those
> bits to be set.

Either that, or conditionally put an smp_mb() in switch_mm() for that
process. But yes, once we advertise intent (either explicitly or implicitly
on first use), there are various things that can be done.


2017-08-01 11:02:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 08:39:28PM +1000, Nicholas Piggin wrote:
> On Tue, 1 Aug 2017 12:22:03 +0200
> Peter Zijlstra <[email protected]> wrote:

> > But you're bouncing the rq->lock around the system at fairly high rates.
> > For big enough systems this is enough to severely hurt things.
>
> If you already have scheduling access on those CPUs to set mm cpumask,
> then you can do worse I guess.

Ah, so currently, because of ARM64, we have to use for_each_online_cpu()
and then we're not limited to any affinity mask and will wreck across
partitioning.

If ARM64 starts setting bits in mm_cpumask().. then yes.

> > > > Yes, it might work for 'normal' workloads, but the interference
> > > > potential is just too big.
> > >
> > > Well it's good to be concerned about it. I do see your point. Although
> > > I don't know if it's all that complicated to use unprivileged ops to
> > > badly hurt QoS on most systems already :)
> >
> > If so, we should look at fixing that, not make it worse.
>
> Point is I don't think this *is* worse. It's not too difficult to acquire
> per-cpu locks and cachelines and IPIs to other CPUs if you are a malicious
> DoS. Particularly not when you can scheudle tasks around those CPUs.

So we try and not take remote rq->locks for wakeups when those locks are
outside of the cache domain.

> > In either case, the mm_cpumask() will be bounded by the set of CPUs the
> > threads are allowed to run on and will not interfere with the rest of
> > the system.
>
> Yep. Powerpc at least will use its mm_cpumask for iterating here
> (regardless of how exactly we get the ordering right). Even though
> at the moment it is a lazy mask, it's better than a global walk.

Agreed.

> > As to scheduler IPIs, those are limited to the CPUs the user is limited
> > to and are rate limited by the wakeup-latency of the tasks. After all,
> > all the time a task is runnable but not running, wakeups are no-ops.
>
> That's on the order of millions per second per core though, isn't it?

Thousands, assuming another process is actually running. We'll not
always win the wakeup-preempt race. And once we're queued, no more
wakeups until we get to run.

> > Trouble is of course, that not everybody even sets a single bit in
> > mm_cpumask() and those that never clear bits will end up with a fairly
> > wide mask, still interfering with work that isn't hard partitioned.
>
> Right, I just don't see what real problem this opens up that you don't
> already have when you are not hard partitioned, therefore it doesn't
> make sense to add a slowdown to the context switch fastpath to close
> one hole in the sieve.
>
> Completely recognizing that other architectures can do it without
> taking rq lock at all and will not be forced to do so.

If we can limit this to hard partitioned, that would be good indeed.

I'm just trying to avoid having two implementations of this thing. At the
same time I very much understand your reluctance to add this barrier.

In any case, supposing we can do that intent thing. How horrible would
something like:


context_switch()
        if (unlikely(mm->needs_barrier))
                smp_mb__after_unlock_lock();


be? We only need the extra barrier when we switch _into_ mm's that care
about sys_membarrier() in the first place. At which point they pay the
price.

2017-08-01 11:54:32

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, 1 Aug 2017 13:00:23 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 08:39:28PM +1000, Nicholas Piggin wrote:

> > Right, I just don't see what real problem this opens up that you don't
> > already have when you are not hard partitioned, therefore it doesn't
> > make sense to add a slowdown to the context switch fastpath to close
> > one hole in the sieve.
> >
> > Completely recognizing that other architectures can do it without
> > taking rq lock at all and will not be forced to do so.
>
> If we can limit this to hard partitioned, that would be good indeed.
>
> I'm just trying to avoid having two implementation of this thing. At the
> same time I very much understand your reluctance to add this barrier.

Well, I think we could have some kind of
for_each_cpu_where_this_process_is_running() macro to abstract the arch
details.

Presumably we're already going to get two implementations of that one --
I can't imagine x86 would be happy with doing a for_all_cpus iteration
just because arm does not have the cpumask. powerpc will only make that
3 :)

>
> In any case, supposing we can do that intent thing. How horrible would
> something like:
>
>
> context_switch()
> if (unlikely(mm->needs_barrier))
> smp_mb__after_unlock_lock();
>
>
> be? We only need the extra barrier when we switch _into_ mm's that care
> about sys_membarrier() in the first place. At which point they pay the
> price.

Not beautiful :) and it would also have to have an arch-specific bit on
the other side. Although yes, it gives a different way to reduce the cost
without touching the rq lock.

So Paul and some googling filled me in on the importance of this syscall. Also,
I do appreciate the concern about taking the rq lock. I just think maybe we
(powerpc) pay a few more cycles in the new syscall rather than in the context
switch. It will take a little while to get a good idea of performance and
behaviour on bigger systems where this will matter most.
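
For concreteness, the syscall-side cost being traded against the
context-switch barrier is roughly the loop below (a simplified sketch, not
the exact patch; error handling and the fallback path are omitted, and
cpu_rq() is scheduler-internal):

static void ipi_mb(void *info)
{
        smp_mb();       /* IPIs are expected to be serializing; be explicit */
}

static void membarrier_private_expedited_sketch(void)
{
        int cpu;
        cpumask_var_t tmpmask;

        if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT))
                return;         /* a real implementation would fall back here */

        cpus_read_lock();
        rcu_read_lock();
        for_each_online_cpu(cpu) {
                struct task_struct *p;

                /* only IPI CPUs currently running a thread of this mm */
                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
                if (p && p->mm == current->mm)
                        __cpumask_set_cpu(cpu, tmpmask);
        }
        rcu_read_unlock();

        preempt_disable();
        smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        preempt_enable();

        cpus_read_unlock();
        free_cpumask_var(tmpmask);
}

The per-call cost is a scan of the online CPUs plus one IPI per CPU actually
running the mm, which is the price we'd pay in the syscall instead of in the
context switch.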

2017-08-01 13:24:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 12:22:03PM +0200, Peter Zijlstra wrote:

[ . . . ]

> As to scheduler IPIs, those are limited to the CPUs the user is limited
> to and are rate limited by the wakeup-latency of the tasks. After all,
> all the time a task is runnable but not running, wakeups are no-ops.

Can't that wakeup-latency limitation be overcome by a normal user simply
by having lots of tasks to wake up, which then go back to sleep almost
immediately? Coupled with a very low-priority CPU-bound task on each CPU?

Thanx, Paul

2017-08-01 14:17:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 06:23:09AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 01, 2017 at 12:22:03PM +0200, Peter Zijlstra wrote:
>
> [ . . . ]
>
> > As to scheduler IPIs, those are limited to the CPUs the user is limited
> > to and are rate limited by the wakeup-latency of the tasks. After all,
> > all the time a task is runnable but not running, wakeups are no-ops.
>
> Can't that wakeup-latency limitation be overcome by a normal user simply
> by having lots of tasks to wake up, which then go back to sleep almost
> immediately? Coupled with a very low-priority CPU-bound task on each CPU?

Let me put it like this: there is no way to cause more interference
using IPIs than there is by simply running while(1) loops ;-)

2017-08-01 23:32:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, Aug 01, 2017 at 04:16:54PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 06:23:09AM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 01, 2017 at 12:22:03PM +0200, Peter Zijlstra wrote:
> >
> > [ . . . ]
> >
> > > As to scheduler IPIs, those are limited to the CPUs the user is limited
> > > to and are rate limited by the wakeup-latency of the tasks. After all,
> > > all the time a task is runnable but not running, wakeups are no-ops.
> >
> > Can't that wakeup-latency limitation be overcome by a normal user simply
> > by having lots of tasks to wake up, which then go back to sleep almost
> > immediately? Coupled with a very low-priority CPU-bound task on each CPU?
>
> Let me put it like this: there is no way to cause more interference
> using IPIs than there is by simply running while(1) loops ;-)

Very good, that does give us some guidance, give or take context switches
happening during the IPI latency window. ;-)

Thanx, Paul

2017-08-02 00:45:37

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

On Tue, 1 Aug 2017 16:32:03 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 04:16:54PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 01, 2017 at 06:23:09AM -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 01, 2017 at 12:22:03PM +0200, Peter Zijlstra wrote:
> > >
> > > [ . . . ]
> > >
> > > > As to scheduler IPIs, those are limited to the CPUs the user is limited
> > > > to and are rate limited by the wakeup-latency of the tasks. After all,
> > > > all the time a task is runnable but not running, wakeups are no-ops.
> > >
> > > Can't that wakeup-latency limitation be overcome by a normal user simply
> > > by having lots of tasks to wake up, which then go back to sleep almost
> > > immediately? Coupled with a very low-priority CPU-bound task on each CPU?
> >
> > Let me put it like this: there is no way to cause more interference
> > using IPIs than there is by simply running while(1) loops ;-)
>
> Very good, that does give us some guidance, give or take context switches
> happening during the IPI latency window. ;-)

I think we do have to be a bit careful. Peter's right when you're thinking
of just running arbitrary tasks on a single CPU, but across multiple CPUs
the IPI sender will not necessarily be charged for the cost it incurs on
the target CPU.

So we do need to be careful about allowing a large number of unprivileged
IPIs to arbitrary CPUs. Fortunately, in this case the IPIs are restricted
to CPUs where our process is currently running, which is about the ideal
case: we're only disturbing our own job.

Thanks,
Nick