2017-07-27 18:12:56

by Paul E. McKenney

[permalink] [raw]
Subject: Udpated sys_membarrier() speedup patch, FYI

Hello!

Please see below for a prototype sys_membarrier() speedup patch.
Please note that there is some controversy on this subject, so the final
version will probably be quite a bit different than this prototype.

But my main question is whether the throttling shown below is acceptable
for your use cases, namely only one expedited sys_membarrier() permitted
per scheduling-clock period (1 millisecond on many platforms), with any
excess being silently converted to non-expedited form. The reason for
the throttling is concerns about DoS attacks based on user code with a
tight loop invoking this system call.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd
Author: Paul E. McKenney <[email protected]>
Date: Tue Jul 18 13:53:32 2017 -0700

sys_membarrier: Add expedited option

The sys_membarrier() system call has proven too slow for some use cases,
which has prompted users to instead rely on TLB shootdown. Although TLB
shootdown is much faster, it has the slight disadvantage of not working
at all on arm and arm64 and also of being vulnerable to reasonable
optimizations that might skip some IPIs. However, the Linux kernel
does not currrently provide a reasonable alternative, so it is hard to
criticize these users from doing what works for them on a given piece
of hardware at a given time.

This commit therefore adds an expedited option to the sys_membarrier()
system call, thus providing a faster mechanism that is portable and
is not subject to death by optimization. Note that if more than one
MEMBARRIER_CMD_SHARED_EXPEDITED sys_membarrier() call happens within
the same jiffy, all but the first will use synchronize_sched() instead
of synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <[email protected]>
[ paulmck: Fix code style issue pointed out by Boqun Feng. ]
Tested-by: Avi Kivity <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Geoffrey Romer <[email protected]>

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..5720386d0904 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,6 +40,16 @@
* (non-running threads are de facto in such a
* state). This covers threads from all processes
* running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_SHARED_EXPEDITED: Execute a memory barrier on all
+ * running threads, but in an expedited fashion.
+ * Upon return from system call, the caller thread
+ * is ensured that all running threads have passed
+ * through a state where all memory accesses to
+ * user-space addresses match program order between
+ * entry to and return from the system call
+ * (non-running threads are de facto in such a
+ * state). This covers threads from all processes
+ * running on the system. This command returns 0.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
@@ -48,6 +58,7 @@
enum membarrier_cmd {
MEMBARRIER_CMD_QUERY = 0,
MEMBARRIER_CMD_SHARED = (1 << 0),
+ MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1),
};

#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..587e3bbfae7e 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -22,7 +22,8 @@
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED | \
+ MEMBARRIER_CMD_SHARED_EXPEDITED)

/**
* sys_membarrier - issue memory barriers on a set of threads
@@ -64,6 +65,20 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
if (num_online_cpus() > 1)
synchronize_sched();
return 0;
+ case MEMBARRIER_CMD_SHARED_EXPEDITED:
+ if (num_online_cpus() > 1) {
+ static unsigned long lastexp;
+ unsigned long j;
+
+ j = jiffies;
+ if (READ_ONCE(lastexp) == j) {
+ synchronize_sched();
+ WRITE_ONCE(lastexp, j);
+ } else {
+ synchronize_sched_expedited();
+ }
+ }
+ return 0;
default:
return -EINVAL;
}


2017-07-27 18:36:40

by Andrew Hunter

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney
<[email protected]> wrote:
> Hello!
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form.

Google doesn't use sys_membarrier (that I know of...), but we do use
RSEQ fences, which implements membarrier + a little extra to interrupt
RSEQ critical sections (via IPI--smp_call_function_many.) One
important optimization here is that we only throw IPIs to cpus running
the same mm as current (or a subset if requested by userspace), as
this is sufficient for the API guarantees we provide. I suspect a
similar optimization would largely mitigate DOS concerns, no? I don't
know if there are use cases not covered. To answer your question:
throttling these (or our equivalents) would be fine in terms of
userspace throughput. We haven't noticed performance problems
requiring such an intervention, however.

Furthermore: I wince a bit at the silent downgrade; I'd almost prefer
-EAGAIN or -EBUSY. In particular, again for RSEQ fence, the downgrade
simply wouldn't work; rcu_sched_qs() gets called at many points that
aren't sufficiently quiescent for RSEQ (in particular, when userspace
code is running!) This is solvable, but worth thinking about.

2017-07-27 19:06:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 11:36:38AM -0700, Andrew Hunter wrote:
> On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney
> <[email protected]> wrote:
> > Hello!
> > But my main question is whether the throttling shown below is acceptable
> > for your use cases, namely only one expedited sys_membarrier() permitted
> > per scheduling-clock period (1 millisecond on many platforms), with any
> > excess being silently converted to non-expedited form.
>
> Google doesn't use sys_membarrier (that I know of...), but we do use
> RSEQ fences, which implements membarrier + a little extra to interrupt
> RSEQ critical sections (via IPI--smp_call_function_many.) One
> important optimization here is that we only throw IPIs to cpus running
> the same mm as current (or a subset if requested by userspace), as
> this is sufficient for the API guarantees we provide. I suspect a
> similar optimization would largely mitigate DOS concerns, no? I don't
> know if there are use cases not covered. To answer your question:
> throttling these (or our equivalents) would be fine in terms of
> userspace throughput. We haven't noticed performance problems
> requiring such an intervention, however.

IPIin only those CPUs running threads in the same process as the
thread invoking membarrier() would be very nice! There is some LKML
discussion on this topic, which is currently circling around making this
determination reliable on all CPU families. ARM and x86 are thought
to be OK, PowerPC is thought to require a smallish patch, MIPS is
a big question mark, and so on.

Good to hear that the throttling would be OK for your workloads,
thank you!

> Furthermore: I wince a bit at the silent downgrade; I'd almost prefer
> -EAGAIN or -EBUSY. In particular, again for RSEQ fence, the downgrade
> simply wouldn't work; rcu_sched_qs() gets called at many points that
> aren't sufficiently quiescent for RSEQ (in particular, when userspace
> code is running!) This is solvable, but worth thinking about.

Good point! One approach would be to unconditionally return -EAGAIN/-EBUSY
and another would be to have a separate cmd or flag to say what to do
if expedited wasn't currently available. My thought would be to
add a separate expedited command, so that one did the fallback and
the other returned the error.

But I am surprised when you say that the downgrade would not work, at
least if you are not running with nohz_full CPUs. The rcu_sched_qs()
function simply sets a per-CPU quiescent-state flag. The needed strong
ordering is instead supplied by the combination of the code starting
the grace period, reporting the setting of the quiescent-state flag
to core RCU, and the code completing the grace period. Each non-idle
CPU will execute full memory barriers either in RCU_SOFTIRQ context,
on entry to idle, on exit from idle, or within the grace-period kthread.
In particular, a CPU running the same usermode thread for the entire
grace period will execute the needed memory barriers in RCU_SOFTIRQ
context shortly after taking a scheduling-clock interrupt.

So are you running nohz_full CPUs? Or is there something else that I
am missing?

Thanx, Paul

2017-07-27 19:20:19

by Avi Kivity

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> Hello!
>
> Please see below for a prototype sys_membarrier() speedup patch.
> Please note that there is some controversy on this subject, so the final
> version will probably be quite a bit different than this prototype.
>
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form. The reason for
> the throttling is concerns about DoS attacks based on user code with a
> tight loop invoking this system call.
>
> Thoughts?

Silent throttling would render it useless for me. -EAGAIN is a little
better, but I'd be forced to spin until either I get kicked out of my
loop, or it succeeds.

IPIing only running threads of my process would be perfect. In fact I
might even be able to make use of "membarrier these threads please" to
reduce IPIs, when I change the topology from fully connected to
something more sparse, on larger machines.

My previous implementations were a signal (but that's horrible on large
machines) and trylock + mprotect (but that doesn't work on ARM).


> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Jul 18 13:53:32 2017 -0700
>
> sys_membarrier: Add expedited option
>
> The sys_membarrier() system call has proven too slow for some use cases,
> which has prompted users to instead rely on TLB shootdown. Although TLB
> shootdown is much faster, it has the slight disadvantage of not working
> at all on arm and arm64 and also of being vulnerable to reasonable
> optimizations that might skip some IPIs. However, the Linux kernel
> does not currrently provide a reasonable alternative, so it is hard to
> criticize these users from doing what works for them on a given piece
> of hardware at a given time.
>
> This commit therefore adds an expedited option to the sys_membarrier()
> system call, thus providing a faster mechanism that is portable and
> is not subject to death by optimization. Note that if more than one
> MEMBARRIER_CMD_SHARED_EXPEDITED sys_membarrier() call happens within
> the same jiffy, all but the first will use synchronize_sched() instead
> of synchronize_sched_expedited().
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> [ paulmck: Fix code style issue pointed out by Boqun Feng. ]
> Tested-by: Avi Kivity <[email protected]>
> Cc: Maged Michael <[email protected]>
> Cc: Andrew Hunter <[email protected]>
> Cc: Geoffrey Romer <[email protected]>
>
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..5720386d0904 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,6 +40,16 @@
> * (non-running threads are de facto in such a
> * state). This covers threads from all processes
> * running on the system. This command returns 0.
> + * @MEMBARRIER_CMD_SHARED_EXPEDITED: Execute a memory barrier on all
> + * running threads, but in an expedited fashion.
> + * Upon return from system call, the caller thread
> + * is ensured that all running threads have passed
> + * through a state where all memory accesses to
> + * user-space addresses match program order between
> + * entry to and return from the system call
> + * (non-running threads are de facto in such a
> + * state). This covers threads from all processes
> + * running on the system. This command returns 0.
> *
> * Command to be passed to the membarrier system call. The commands need to
> * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> @@ -48,6 +58,7 @@
> enum membarrier_cmd {
> MEMBARRIER_CMD_QUERY = 0,
> MEMBARRIER_CMD_SHARED = (1 << 0),
> + MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1),
> };
>
> #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> index 9f9284f37f8d..587e3bbfae7e 100644
> --- a/kernel/membarrier.c
> +++ b/kernel/membarrier.c
> @@ -22,7 +22,8 @@
> * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> * except MEMBARRIER_CMD_QUERY.
> */
> -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
> +#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED | \
> + MEMBARRIER_CMD_SHARED_EXPEDITED)
>
> /**
> * sys_membarrier - issue memory barriers on a set of threads
> @@ -64,6 +65,20 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> if (num_online_cpus() > 1)
> synchronize_sched();
> return 0;
> + case MEMBARRIER_CMD_SHARED_EXPEDITED:
> + if (num_online_cpus() > 1) {
> + static unsigned long lastexp;
> + unsigned long j;
> +
> + j = jiffies;
> + if (READ_ONCE(lastexp) == j) {
> + synchronize_sched();
> + WRITE_ONCE(lastexp, j);
> + } else {
> + synchronize_sched_expedited();
> + }
> + }
> + return 0;
> default:
> return -EINVAL;
> }
>

2017-07-27 19:43:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> >Hello!
> >
> >Please see below for a prototype sys_membarrier() speedup patch.
> >Please note that there is some controversy on this subject, so the final
> >version will probably be quite a bit different than this prototype.
> >
> >But my main question is whether the throttling shown below is acceptable
> >for your use cases, namely only one expedited sys_membarrier() permitted
> >per scheduling-clock period (1 millisecond on many platforms), with any
> >excess being silently converted to non-expedited form. The reason for
> >the throttling is concerns about DoS attacks based on user code with a
> >tight loop invoking this system call.
> >
> >Thoughts?
>
> Silent throttling would render it useless for me. -EAGAIN is a
> little better, but I'd be forced to spin until either I get kicked
> out of my loop, or it succeeds.
>
> IPIing only running threads of my process would be perfect. In fact
> I might even be able to make use of "membarrier these threads
> please" to reduce IPIs, when I change the topology from fully
> connected to something more sparse, on larger machines.
>
> My previous implementations were a signal (but that's horrible on
> large machines) and trylock + mprotect (but that doesn't work on
> ARM).

OK, how about the following patch, which IPIs only the running
threads of the process doing the sys_membarrier()?

Thanx, Paul

------------------------------------------------------------------------

From: Mathieu Desnoyers <[email protected]>
To: Peter Zijlstra <[email protected]>
Cc: [email protected], Mathieu Desnoyers
<[email protected]>,
"Paul E . McKenney" <[email protected]>, Boqun Feng <[email protected]>
Subject: [RFC PATCH] membarrier: expedited private command
Date: Thu, 27 Jul 2017 14:59:43 -0400
Message-Id: <[email protected]>

Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as our own.

Scheduler-wise, it requires that we add a memory barrier after context
switching between processes (which have different mm).

It would be interesting to benchmark the overhead of this added barrier
on the performance of context switching between processes. If the
preexisting overhead of switching between mm is high enough, the
overhead of adding this extra barrier may be insignificant.

[ Compile-tested only! ]

CC: Peter Zijlstra <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Boqun Feng <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
---
include/uapi/linux/membarrier.h | 8 +++--
kernel/membarrier.c | 76 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 21 ++++++++++++
3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..6a33c5852f6b 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,14 +40,18 @@
* (non-running threads are de facto in such a
* state). This covers threads from all processes
* running on the system. This command returns 0.
+ * TODO: documentation.
*
* Command to be passed to the membarrier system call. The commands need to
* be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
* the value 0.
*/
enum membarrier_cmd {
- MEMBARRIER_CMD_QUERY = 0,
- MEMBARRIER_CMD_SHARED = (1 << 0),
+ MEMBARRIER_CMD_QUERY = 0,
+ MEMBARRIER_CMD_SHARED = (1 << 0),
+ /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
+ /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
};

#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..8c6c0f96f617 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -19,10 +19,81 @@
#include <linux/tick.h>

/*
+ * XXX For cpu_rq(). Should we rather move
+ * membarrier_private_expedited() to sched/core.c or create
+ * sched/membarrier.c ?
+ */
+#include "sched/sched.h"
+
+/*
* Bitmask made from a "or" of all commands within enum membarrier_cmd,
* except MEMBARRIER_CMD_QUERY.
*/
-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK \
+ (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+
+static void ipi_mb(void *info)
+{
+ smp_mb(); /* IPIs should be serializing but paranoid. */
+}
+
+static void membarrier_private_expedited_ipi_each(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ rcu_read_lock();
+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ if (p && p->mm == current->mm)
+ smp_call_function_single(cpu, ipi_mb, NULL, 1);
+ rcu_read_unlock();
+ }
+}
+
+static void membarrier_private_expedited(void)
+{
+ int cpu, this_cpu;
+ cpumask_var_t tmpmask;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ /*
+ * Matches memory barriers around rq->curr modification in
+ * scheduler.
+ */
+ smp_mb(); /* system call entry is not a mb. */
+
+ if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+ /* Fallback for OOM. */
+ membarrier_private_expedited_ipi_each();
+ goto end;
+ }
+
+ this_cpu = raw_smp_processor_id();
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ if (cpu == this_cpu)
+ continue;
+ rcu_read_lock();
+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ if (p && p->mm == current->mm)
+ __cpumask_set_cpu(cpu, tmpmask);
+ rcu_read_unlock();
+ }
+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ free_cpumask_var(tmpmask);
+end:
+ /*
+ * Memory barrier on the caller thread _after_ we finished
+ * waiting for the last IPI. Matches memory barriers around
+ * rq->curr modification in scheduler.
+ */
+ smp_mb(); /* exit from system call is not a mb */
+}

/**
* sys_membarrier - issue memory barriers on a set of threads
@@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
if (num_online_cpus() > 1)
synchronize_sched();
return 0;
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+ membarrier_private_expedited();
+ return 0;
default:
return -EINVAL;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b427b4..f171d2aaaf82 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
put_user(task_pid_vnr(current), current->set_child_tid);
}

+#ifdef CONFIG_MEMBARRIER
+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
+ struct mm_struct *oldmm)
+{
+ if (likely(mm == oldmm))
+ return; /* Thread context switch, same mm. */
+ /*
+ * When switching between processes, membarrier expedited
+ * private requires a memory barrier after we set the current
+ * task.
+ */
+ smp_mb();
+}
+#else /* #ifdef CONFIG_MEMBARRIER */
+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
+ struct mm_struct *oldmm)
+{
+}
+#endif /* #else #ifdef CONFIG_MEMBARRIER */
+
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
@@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

mm = next->mm;
oldmm = prev->active_mm;
+ membarrier_expedited_mb_after_set_current(mm, oldmm);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
--
2.11.0


2017-07-27 20:04:27

by Avi Kivity

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
>>> Hello!
>>>
>>> Please see below for a prototype sys_membarrier() speedup patch.
>>> Please note that there is some controversy on this subject, so the final
>>> version will probably be quite a bit different than this prototype.
>>>
>>> But my main question is whether the throttling shown below is acceptable
>>> for your use cases, namely only one expedited sys_membarrier() permitted
>>> per scheduling-clock period (1 millisecond on many platforms), with any
>>> excess being silently converted to non-expedited form. The reason for
>>> the throttling is concerns about DoS attacks based on user code with a
>>> tight loop invoking this system call.
>>>
>>> Thoughts?
>> Silent throttling would render it useless for me. -EAGAIN is a
>> little better, but I'd be forced to spin until either I get kicked
>> out of my loop, or it succeeds.
>>
>> IPIing only running threads of my process would be perfect. In fact
>> I might even be able to make use of "membarrier these threads
>> please" to reduce IPIs, when I change the topology from fully
>> connected to something more sparse, on larger machines.
>>
>> My previous implementations were a signal (but that's horrible on
>> large machines) and trylock + mprotect (but that doesn't work on
>> ARM).
> OK, how about the following patch, which IPIs only the running
> threads of the process doing the sys_membarrier()?

Works for me.

>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> From: Mathieu Desnoyers <[email protected]>
> To: Peter Zijlstra <[email protected]>
> Cc: [email protected], Mathieu Desnoyers
> <[email protected]>,
> "Paul E . McKenney" <[email protected]>, Boqun Feng <[email protected]>
> Subject: [RFC PATCH] membarrier: expedited private command
> Date: Thu, 27 Jul 2017 14:59:43 -0400
> Message-Id: <[email protected]>
>
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> from all runqueues for which current thread's mm is the same as our own.
>
> Scheduler-wise, it requires that we add a memory barrier after context
> switching between processes (which have different mm).
>
> It would be interesting to benchmark the overhead of this added barrier
> on the performance of context switching between processes. If the
> preexisting overhead of switching between mm is high enough, the
> overhead of adding this extra barrier may be insignificant.
>
> [ Compile-tested only! ]
>
> CC: Peter Zijlstra <[email protected]>
> CC: Paul E. McKenney <[email protected]>
> CC: Boqun Feng <[email protected]>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> ---
> include/uapi/linux/membarrier.h | 8 +++--
> kernel/membarrier.c | 76 ++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/core.c | 21 ++++++++++++
> 3 files changed, 102 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..6a33c5852f6b 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,14 +40,18 @@
> * (non-running threads are de facto in such a
> * state). This covers threads from all processes
> * running on the system. This command returns 0.
> + * TODO: documentation.
> *
> * Command to be passed to the membarrier system call. The commands need to
> * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> * the value 0.
> */
> enum membarrier_cmd {
> - MEMBARRIER_CMD_QUERY = 0,
> - MEMBARRIER_CMD_SHARED = (1 << 0),
> + MEMBARRIER_CMD_QUERY = 0,
> + MEMBARRIER_CMD_SHARED = (1 << 0),
> + /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> + /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> + MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
> };
>
> #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> index 9f9284f37f8d..8c6c0f96f617 100644
> --- a/kernel/membarrier.c
> +++ b/kernel/membarrier.c
> @@ -19,10 +19,81 @@
> #include <linux/tick.h>
>
> /*
> + * XXX For cpu_rq(). Should we rather move
> + * membarrier_private_expedited() to sched/core.c or create
> + * sched/membarrier.c ?
> + */
> +#include "sched/sched.h"
> +
> +/*
> * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> * except MEMBARRIER_CMD_QUERY.
> */
> -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
> +#define MEMBARRIER_CMD_BITMASK \
> + (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +

> rcu_read_unlock();
> + }
> +}
> +
> +static void membarrier_private_expedited(void)
> +{
> + int cpu, this_cpu;
> + cpumask_var_t tmpmask;
> +
> + if (num_online_cpus() == 1)
> + return;
> +
> + /*
> + * Matches memory barriers around rq->curr modification in
> + * scheduler.
> + */
> + smp_mb(); /* system call entry is not a mb. */
> +
> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> + /* Fallback for OOM. */
> + membarrier_private_expedited_ipi_each();
> + goto end;
> + }
> +
> + this_cpu = raw_smp_processor_id();
> + for_each_online_cpu(cpu) {
> + struct task_struct *p;
> +
> + if (cpu == this_cpu)
> + continue;
> + rcu_read_lock();
> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> + if (p && p->mm == current->mm)
> + __cpumask_set_cpu(cpu, tmpmask);

This gets you some false positives, if the CPU idled then mm will not
have changed.

> + rcu_read_unlock();
> + }
> + smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> + free_cpumask_var(tmpmask);
> +end:
> + /*
> + * Memory barrier on the caller thread _after_ we finished
> + * waiting for the last IPI. Matches memory barriers around
> + * rq->curr modification in scheduler.
> + */
> + smp_mb(); /* exit from system call is not a mb */
> +}
>
> /**
> * sys_membarrier - issue memory barriers on a set of threads
> @@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> if (num_online_cpus() > 1)
> synchronize_sched();
> return 0;
> + case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> + membarrier_private_expedited();
> + return 0;
> default:
> return -EINVAL;
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 17c667b427b4..f171d2aaaf82 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> put_user(task_pid_vnr(current), current->set_child_tid);
> }
>
> +#ifdef CONFIG_MEMBARRIER
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> + struct mm_struct *oldmm)
> +{
> + if (likely(mm == oldmm))
> + return; /* Thread context switch, same mm. */
> + /*
> + * When switching between processes, membarrier expedited
> + * private requires a memory barrier after we set the current
> + * task.
> + */
> + smp_mb();
> +}

Won't the actual page table switch generate a barrier, at least on many
archs? It sure will on x86.

It's also unneeded if kernel entry or exit involve a barrier (not true
for x86, so probably not for anything else either).

> +#else /* #ifdef CONFIG_MEMBARRIER */
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> + struct mm_struct *oldmm)
> +{
> +}
> +#endif /* #else #ifdef CONFIG_MEMBARRIER */
> +
> /*
> * context_switch - switch to the new MM and the new thread's register state.
> */
> @@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>
> mm = next->mm;
> oldmm = prev->active_mm;
> + membarrier_expedited_mb_after_set_current(mm, oldmm);
> /*
> * For paravirt, this is coupled with an exit in switch_to to
> * combine the page table reload and the switch backend into


2017-07-27 20:37:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> >>>Hello!
> >>>
> >>>Please see below for a prototype sys_membarrier() speedup patch.
> >>>Please note that there is some controversy on this subject, so the final
> >>>version will probably be quite a bit different than this prototype.
> >>>
> >>>But my main question is whether the throttling shown below is acceptable
> >>>for your use cases, namely only one expedited sys_membarrier() permitted
> >>>per scheduling-clock period (1 millisecond on many platforms), with any
> >>>excess being silently converted to non-expedited form. The reason for
> >>>the throttling is concerns about DoS attacks based on user code with a
> >>>tight loop invoking this system call.
> >>>
> >>>Thoughts?
> >>Silent throttling would render it useless for me. -EAGAIN is a
> >>little better, but I'd be forced to spin until either I get kicked
> >>out of my loop, or it succeeds.
> >>
> >>IPIing only running threads of my process would be perfect. In fact
> >>I might even be able to make use of "membarrier these threads
> >>please" to reduce IPIs, when I change the topology from fully
> >>connected to something more sparse, on larger machines.
> >>
> >>My previous implementations were a signal (but that's horrible on
> >>large machines) and trylock + mprotect (but that doesn't work on
> >>ARM).
> >OK, how about the following patch, which IPIs only the running
> >threads of the process doing the sys_membarrier()?
>
> Works for me.

Thank you for testing! I expect that Mathieu will have a v2 soon,
hopefully CCing you guys. (If not, I will forward it.)

Mathieu, please note Avi's feedback below.

Thanx, Paul

> >------------------------------------------------------------------------
> >
> >From: Mathieu Desnoyers <[email protected]>
> >To: Peter Zijlstra <[email protected]>
> >Cc: [email protected], Mathieu Desnoyers
> > <[email protected]>,
> > "Paul E . McKenney" <[email protected]>, Boqun Feng <[email protected]>
> >Subject: [RFC PATCH] membarrier: expedited private command
> >Date: Thu, 27 Jul 2017 14:59:43 -0400
> >Message-Id: <[email protected]>
> >
> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> >from all runqueues for which current thread's mm is the same as our own.
> >
> >Scheduler-wise, it requires that we add a memory barrier after context
> >switching between processes (which have different mm).
> >
> >It would be interesting to benchmark the overhead of this added barrier
> >on the performance of context switching between processes. If the
> >preexisting overhead of switching between mm is high enough, the
> >overhead of adding this extra barrier may be insignificant.
> >
> >[ Compile-tested only! ]
> >
> >CC: Peter Zijlstra <[email protected]>
> >CC: Paul E. McKenney <[email protected]>
> >CC: Boqun Feng <[email protected]>
> >Signed-off-by: Mathieu Desnoyers <[email protected]>
> >---
> > include/uapi/linux/membarrier.h | 8 +++--
> > kernel/membarrier.c | 76 ++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/core.c | 21 ++++++++++++
> > 3 files changed, 102 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> >index e0b108bd2624..6a33c5852f6b 100644
> >--- a/include/uapi/linux/membarrier.h
> >+++ b/include/uapi/linux/membarrier.h
> >@@ -40,14 +40,18 @@
> > * (non-running threads are de facto in such a
> > * state). This covers threads from all processes
> > * running on the system. This command returns 0.
> >+ * TODO: documentation.
> > *
> > * Command to be passed to the membarrier system call. The commands need to
> > * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> > * the value 0.
> > */
> > enum membarrier_cmd {
> >- MEMBARRIER_CMD_QUERY = 0,
> >- MEMBARRIER_CMD_SHARED = (1 << 0),
> >+ MEMBARRIER_CMD_QUERY = 0,
> >+ MEMBARRIER_CMD_SHARED = (1 << 0),
> >+ /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> >+ /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> >+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
> > };
> >
> > #endif /* _UAPI_LINUX_MEMBARRIER_H */
> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> >index 9f9284f37f8d..8c6c0f96f617 100644
> >--- a/kernel/membarrier.c
> >+++ b/kernel/membarrier.c
> >@@ -19,10 +19,81 @@
> > #include <linux/tick.h>
> >
> > /*
> >+ * XXX For cpu_rq(). Should we rather move
> >+ * membarrier_private_expedited() to sched/core.c or create
> >+ * sched/membarrier.c ?
> >+ */
> >+#include "sched/sched.h"
> >+
> >+/*
> > * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> > * except MEMBARRIER_CMD_QUERY.
> > */
> >-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
> >+#define MEMBARRIER_CMD_BITMASK \
> >+ (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> >+
>
> > rcu_read_unlock();
> >+ }
> >+}
> >+
> >+static void membarrier_private_expedited(void)
> >+{
> >+ int cpu, this_cpu;
> >+ cpumask_var_t tmpmask;
> >+
> >+ if (num_online_cpus() == 1)
> >+ return;
> >+
> >+ /*
> >+ * Matches memory barriers around rq->curr modification in
> >+ * scheduler.
> >+ */
> >+ smp_mb(); /* system call entry is not a mb. */
> >+
> >+ if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >+ /* Fallback for OOM. */
> >+ membarrier_private_expedited_ipi_each();
> >+ goto end;
> >+ }
> >+
> >+ this_cpu = raw_smp_processor_id();
> >+ for_each_online_cpu(cpu) {
> >+ struct task_struct *p;
> >+
> >+ if (cpu == this_cpu)
> >+ continue;
> >+ rcu_read_lock();
> >+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> >+ if (p && p->mm == current->mm)
> >+ __cpumask_set_cpu(cpu, tmpmask);
>
> This gets you some false positives, if the CPU idled then mm will
> not have changed.

Good point! The battery-powered embedded guys would probably prefer
we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
state in nohz_full cases. Not sure if is_idle_task() can be used
safely, given things like play_idle().

> >+ rcu_read_unlock();
> >+ }
> >+ smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> >+ free_cpumask_var(tmpmask);
> >+end:
> >+ /*
> >+ * Memory barrier on the caller thread _after_ we finished
> >+ * waiting for the last IPI. Matches memory barriers around
> >+ * rq->curr modification in scheduler.
> >+ */
> >+ smp_mb(); /* exit from system call is not a mb */
> >+}
> >
> > /**
> > * sys_membarrier - issue memory barriers on a set of threads
> >@@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> > if (num_online_cpus() > 1)
> > synchronize_sched();
> > return 0;
> >+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> >+ membarrier_private_expedited();
> >+ return 0;
> > default:
> > return -EINVAL;
> > }
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index 17c667b427b4..f171d2aaaf82 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> > put_user(task_pid_vnr(current), current->set_child_tid);
> > }
> >
> >+#ifdef CONFIG_MEMBARRIER
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+ struct mm_struct *oldmm)
> >+{
> >+ if (likely(mm == oldmm))
> >+ return; /* Thread context switch, same mm. */
> >+ /*
> >+ * When switching between processes, membarrier expedited
> >+ * private requires a memory barrier after we set the current
> >+ * task.
> >+ */
> >+ smp_mb();
> >+}
>
> Won't the actual page table switch generate a barrier, at least on
> many archs? It sure will on x86.

There are apparently at least a few architectures that don't.

> It's also unneeded if kernel entry or exit involve a barrier (not
> true for x86, so probably not for anything else either).
>
> >+#else /* #ifdef CONFIG_MEMBARRIER */
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+ struct mm_struct *oldmm)
> >+{
> >+}
> >+#endif /* #else #ifdef CONFIG_MEMBARRIER */
> >+
> > /*
> > * context_switch - switch to the new MM and the new thread's register state.
> > */
> >@@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >
> > mm = next->mm;
> > oldmm = prev->active_mm;
> >+ membarrier_expedited_mb_after_set_current(mm, oldmm);
> > /*
> > * For paravirt, this is coupled with an exit in switch_to to
> > * combine the page table reload and the switch backend into
>
>

2017-07-27 20:56:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

----- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney [email protected] wrote:

> On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
>> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
>> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
>> >>>Hello!
>> >>>
>> >>>Please see below for a prototype sys_membarrier() speedup patch.
>> >>>Please note that there is some controversy on this subject, so the final
>> >>>version will probably be quite a bit different than this prototype.
>> >>>
>> >>>But my main question is whether the throttling shown below is acceptable
>> >>>for your use cases, namely only one expedited sys_membarrier() permitted
>> >>>per scheduling-clock period (1 millisecond on many platforms), with any
>> >>>excess being silently converted to non-expedited form. The reason for
>> >>>the throttling is concerns about DoS attacks based on user code with a
>> >>>tight loop invoking this system call.
>> >>>
>> >>>Thoughts?
>> >>Silent throttling would render it useless for me. -EAGAIN is a
>> >>little better, but I'd be forced to spin until either I get kicked
>> >>out of my loop, or it succeeds.
>> >>
>> >>IPIing only running threads of my process would be perfect. In fact
>> >>I might even be able to make use of "membarrier these threads
>> >>please" to reduce IPIs, when I change the topology from fully
>> >>connected to something more sparse, on larger machines.
>> >>
>> >>My previous implementations were a signal (but that's horrible on
>> >>large machines) and trylock + mprotect (but that doesn't work on
>> >>ARM).
>> >OK, how about the following patch, which IPIs only the running
>> >threads of the process doing the sys_membarrier()?
>>
>> Works for me.
>
> Thank you for testing! I expect that Mathieu will have a v2 soon,
> hopefully CCing you guys. (If not, I will forward it.)
>

Will do!

> Mathieu, please note Avi's feedback below.

More below,

>
> Thanx, Paul
>
>> >------------------------------------------------------------------------
>> >
>> >From: Mathieu Desnoyers <[email protected]>
>> >To: Peter Zijlstra <[email protected]>
>> >Cc: [email protected], Mathieu Desnoyers
>> > <[email protected]>,
>> > "Paul E . McKenney" <[email protected]>, Boqun Feng
>> > <[email protected]>
>> >Subject: [RFC PATCH] membarrier: expedited private command
>> >Date: Thu, 27 Jul 2017 14:59:43 -0400
>> >Message-Id: <[email protected]>
>> >
>> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>> >from all runqueues for which current thread's mm is the same as our own.
>> >
>> >Scheduler-wise, it requires that we add a memory barrier after context
>> >switching between processes (which have different mm).
>> >
>> >It would be interesting to benchmark the overhead of this added barrier
>> >on the performance of context switching between processes. If the
>> >preexisting overhead of switching between mm is high enough, the
>> >overhead of adding this extra barrier may be insignificant.
>> >
>> >[ Compile-tested only! ]
>> >
>> >CC: Peter Zijlstra <[email protected]>
>> >CC: Paul E. McKenney <[email protected]>
>> >CC: Boqun Feng <[email protected]>
>> >Signed-off-by: Mathieu Desnoyers <[email protected]>
>> >---
>> > include/uapi/linux/membarrier.h | 8 +++--
>> > kernel/membarrier.c | 76 ++++++++++++++++++++++++++++++++++++++++-
>> > kernel/sched/core.c | 21 ++++++++++++
>> > 3 files changed, 102 insertions(+), 3 deletions(-)
>> >
>> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> >index e0b108bd2624..6a33c5852f6b 100644
>> >--- a/include/uapi/linux/membarrier.h
>> >+++ b/include/uapi/linux/membarrier.h
>> >@@ -40,14 +40,18 @@
>> > * (non-running threads are de facto in such a
>> > * state). This covers threads from all processes
>> > * running on the system. This command returns 0.
>> >+ * TODO: documentation.
>> > *
>> > * Command to be passed to the membarrier system call. The commands need to
>> > * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> > * the value 0.
>> > */
>> > enum membarrier_cmd {
>> >- MEMBARRIER_CMD_QUERY = 0,
>> >- MEMBARRIER_CMD_SHARED = (1 << 0),
>> >+ MEMBARRIER_CMD_QUERY = 0,
>> >+ MEMBARRIER_CMD_SHARED = (1 << 0),
>> >+ /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>> >+ /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>> >+ MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
>> > };
>> >
>> > #endif /* _UAPI_LINUX_MEMBARRIER_H */
>> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>> >index 9f9284f37f8d..8c6c0f96f617 100644
>> >--- a/kernel/membarrier.c
>> >+++ b/kernel/membarrier.c
>> >@@ -19,10 +19,81 @@
>> > #include <linux/tick.h>
>> >
>> > /*
>> >+ * XXX For cpu_rq(). Should we rather move
>> >+ * membarrier_private_expedited() to sched/core.c or create
>> >+ * sched/membarrier.c ?
>> >+ */
>> >+#include "sched/sched.h"
>> >+
>> >+/*
>> > * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>> > * except MEMBARRIER_CMD_QUERY.
>> > */
>> >-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
>> >+#define MEMBARRIER_CMD_BITMASK \
>> >+ (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
>> >+
>>
>> > rcu_read_unlock();
>> >+ }
>> >+}
>> >+
>> >+static void membarrier_private_expedited(void)
>> >+{
>> >+ int cpu, this_cpu;
>> >+ cpumask_var_t tmpmask;
>> >+
>> >+ if (num_online_cpus() == 1)
>> >+ return;
>> >+
>> >+ /*
>> >+ * Matches memory barriers around rq->curr modification in
>> >+ * scheduler.
>> >+ */
>> >+ smp_mb(); /* system call entry is not a mb. */
>> >+
>> >+ if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> >+ /* Fallback for OOM. */
>> >+ membarrier_private_expedited_ipi_each();
>> >+ goto end;
>> >+ }
>> >+
>> >+ this_cpu = raw_smp_processor_id();
>> >+ for_each_online_cpu(cpu) {
>> >+ struct task_struct *p;
>> >+
>> >+ if (cpu == this_cpu)
>> >+ continue;
>> >+ rcu_read_lock();
>> >+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> >+ if (p && p->mm == current->mm)
>> >+ __cpumask_set_cpu(cpu, tmpmask);
>>
>> This gets you some false positives, if the CPU idled then mm will
>> not have changed.
>
> Good point! The battery-powered embedded guys would probably prefer
> we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
> state in nohz_full cases. Not sure if is_idle_task() can be used
> safely, given things like play_idle().

Would changing the check in this loop to:

if (p && !is_idle_task(p) && p->mm == current->mm) {

work for you ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-27 21:00:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

----- On Jul 27, 2017, at 4:58 PM, Mathieu Desnoyers [email protected] wrote:

> ----- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney [email protected]
> wrote:
>
>> On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
[...]
>>> >+
>>> >+ this_cpu = raw_smp_processor_id();
>>> >+ for_each_online_cpu(cpu) {
>>> >+ struct task_struct *p;
>>> >+
>>> >+ if (cpu == this_cpu)
>>> >+ continue;
>>> >+ rcu_read_lock();
>>> >+ p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>>> >+ if (p && p->mm == current->mm)
>>> >+ __cpumask_set_cpu(cpu, tmpmask);
>>>
>>> This gets you some false positives, if the CPU idled then mm will
>>> not have changed.
>>
>> Good point! The battery-powered embedded guys would probably prefer
>> we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
>> state in nohz_full cases. Not sure if is_idle_task() can be used
>> safely, given things like play_idle().
>
> Would changing the check in this loop to:
>
> if (p && !is_idle_task(p) && p->mm == current->mm) {
>
> work for you ?

Avi, is there an optimization that allows current->mm to be non-null
when the idle task is scheduled that I am missing ?

I would have expected current->mm to be always NULL for idle
tasks.

Thanks,

Mathieu

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-28 17:15:54

by Andrew Hunter

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney
<[email protected]> wrote:
> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> IPIing only running threads of my process would be perfect. In fact
>> I might even be able to make use of "membarrier these threads
>> please" to reduce IPIs, when I change the topology from fully
>> connected to something more sparse, on larger machines.
>>

We do this as well--sometimes we only need RSEQ fences against
specific CPU(s), and thus pass a subset.

> +static void membarrier_private_expedited_ipi_each(void)
> +{
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + struct task_struct *p;
> +
> + rcu_read_lock();
> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> + if (p && p->mm == current->mm)
> + smp_call_function_single(cpu, ipi_mb, NULL, 1);
> + rcu_read_unlock();
> + }
> +}
> +

We have the (simpler imho)

const struct cpumask *mask = mm_cpumask(mm);
/* possibly AND it with a user requested mask */
smp_call_function_many(mask, ipi_func, ....);

which I think will be faster on some archs (that support broadcast)
and have fewer problems with out of sync values (though we do have to
check in our IPI function that we haven't context switched out.

Am I missing why this won't work?

2017-07-28 17:23:07

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

----- On Jul 28, 2017, at 1:15 PM, Andrew Hunter [email protected] wrote:

> On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney
> <[email protected]> wrote:
>> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>>> IPIing only running threads of my process would be perfect. In fact
>>> I might even be able to make use of "membarrier these threads
>>> please" to reduce IPIs, when I change the topology from fully
>>> connected to something more sparse, on larger machines.
>>>
>
> We do this as well--sometimes we only need RSEQ fences against
> specific CPU(s), and thus pass a subset.
>
>> +static void membarrier_private_expedited_ipi_each(void)
>> +{
>> + int cpu;
>> +
>> + for_each_online_cpu(cpu) {
>> + struct task_struct *p;
>> +
>> + rcu_read_lock();
>> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> + if (p && p->mm == current->mm)
>> + smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> + rcu_read_unlock();
>> + }
>> +}
>> +
>
> We have the (simpler imho)
>
> const struct cpumask *mask = mm_cpumask(mm);
> /* possibly AND it with a user requested mask */
> smp_call_function_many(mask, ipi_func, ....);
>
> which I think will be faster on some archs (that support broadcast)
> and have fewer problems with out of sync values (though we do have to
> check in our IPI function that we haven't context switched out.
>
> Am I missing why this won't work?

The mm cpumask is not populated on all architectures, unfortunately, so
we need to do the generic implementation without it. Moreover, I recall
that using this in addition to the rq->curr checks adds extra complexity
wrt memory barriers vs updates of the mm_cpumask.

The ipi_each loop you refer to here is only for the fallback case.
The common case allocates a cpumask, populates it by looking at
each rq->curr, and uses smp_call_function_many on that cpumask.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-28 17:31:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Fri, Jul 28, 2017 at 10:15:49AM -0700, Andrew Hunter wrote:
> On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> >> IPIing only running threads of my process would be perfect. In fact
> >> I might even be able to make use of "membarrier these threads
> >> please" to reduce IPIs, when I change the topology from fully
> >> connected to something more sparse, on larger machines.
>
> We do this as well--sometimes we only need RSEQ fences against
> specific CPU(s), and thus pass a subset.

Sounds like a good future enhancement, probably requiring a new syscall
to accommodate the cpumask.

> > +static void membarrier_private_expedited_ipi_each(void)
> > +{
> > + int cpu;
> > +
> > + for_each_online_cpu(cpu) {
> > + struct task_struct *p;
> > +
> > + rcu_read_lock();
> > + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> > + if (p && p->mm == current->mm)
> > + smp_call_function_single(cpu, ipi_mb, NULL, 1);
> > + rcu_read_unlock();
> > + }
> > +}
> > +
>
> We have the (simpler imho)
>
> const struct cpumask *mask = mm_cpumask(mm);
> /* possibly AND it with a user requested mask */
> smp_call_function_many(mask, ipi_func, ....);
>
> which I think will be faster on some archs (that support broadcast)
> and have fewer problems with out of sync values (though we do have to
> check in our IPI function that we haven't context switched out.
>
> Am I missing why this won't work?

My impression is that some architectures don't provide the needed
ordering in this case, and also that some architectures support ASIDs
and would thus IPI CPUs that weren't actually running threads in the
process at the current time.

Mathieu, anything I am missing?

Thanx, Paul

2017-07-28 17:37:28

by Andrew Hunter

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Thu, Jul 27, 2017 at 12:06 PM, Paul E. McKenney
<[email protected]> wrote:
> IPIin only those CPUs running threads in the same process as the
> thread invoking membarrier() would be very nice! There is some LKML
> discussion on this topic, which is currently circling around making this
> determination reliable on all CPU families. ARM and x86 are thought
> to be OK, PowerPC is thought to require a smallish patch, MIPS is
> a big question mark, and so on.
>

I'm not sure what you mean by the determination or how this is arch specific?

> But I am surprised when you say that the downgrade would not work, at
> least if you are not running with nohz_full CPUs. The rcu_sched_qs()
> function simply sets a per-CPU quiescent-state flag. The needed strong
> ordering is instead supplied by the combination of the code starting
> the grace period, reporting the setting of the quiescent-state flag
> to core RCU, and the code completing the grace period. Each non-idle
> CPU will execute full memory barriers either in RCU_SOFTIRQ context,
> on entry to idle, on exit from idle, or within the grace-period kthread.
> In particular, a CPU running the same usermode thread for the entire
> grace period will execute the needed memory barriers in RCU_SOFTIRQ
> context shortly after taking a scheduling-clock interrupt.
>

Recall that I need more than just a memory barrier--also to interrupt
RSEQ critical sections in progress on those CPUs. I know this isn't
general purpose, I'm just saying a trivial downgrade wouldn't work for
me. :) It would probably be sufficient to set NOTIFY_RESUME on all
cpus running my code (which is what my IPI function does anyway...)

2017-07-28 17:46:25

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

----- On Jul 28, 2017, at 1:31 PM, Paul E. McKenney [email protected] wrote:

> On Fri, Jul 28, 2017 at 10:15:49AM -0700, Andrew Hunter wrote:
>> On Thu, Jul 27, 2017 at 12:43 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> >> IPIing only running threads of my process would be perfect. In fact
>> >> I might even be able to make use of "membarrier these threads
>> >> please" to reduce IPIs, when I change the topology from fully
>> >> connected to something more sparse, on larger machines.
>>
>> We do this as well--sometimes we only need RSEQ fences against
>> specific CPU(s), and thus pass a subset.
>
> Sounds like a good future enhancement, probably requiring a new syscall
> to accommodate the cpumask.
>
>> > +static void membarrier_private_expedited_ipi_each(void)
>> > +{
>> > + int cpu;
>> > +
>> > + for_each_online_cpu(cpu) {
>> > + struct task_struct *p;
>> > +
>> > + rcu_read_lock();
>> > + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> > + if (p && p->mm == current->mm)
>> > + smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> > + rcu_read_unlock();
>> > + }
>> > +}
>> > +
>>
>> We have the (simpler imho)
>>
>> const struct cpumask *mask = mm_cpumask(mm);
>> /* possibly AND it with a user requested mask */
>> smp_call_function_many(mask, ipi_func, ....);
>>
>> which I think will be faster on some archs (that support broadcast)
>> and have fewer problems with out of sync values (though we do have to
>> check in our IPI function that we haven't context switched out.
>>
>> Am I missing why this won't work?
>
> My impression is that some architectures don't provide the needed
> ordering in this case, and also that some architectures support ASIDs
> and would thus IPI CPUs that weren't actually running threads in the
> process at the current time.
>
> Mathieu, anything I am missing?

As per my other email, it's pretty much it, yes.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2017-07-28 18:14:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Fri, Jul 28, 2017 at 10:37:25AM -0700, Andrew Hunter wrote:
> On Thu, Jul 27, 2017 at 12:06 PM, Paul E. McKenney
> <[email protected]> wrote:
> > IPIin only those CPUs running threads in the same process as the
> > thread invoking membarrier() would be very nice! There is some LKML
> > discussion on this topic, which is currently circling around making this
> > determination reliable on all CPU families. ARM and x86 are thought
> > to be OK, PowerPC is thought to require a smallish patch, MIPS is
> > a big question mark, and so on.
>
> I'm not sure what you mean by the determination or how this is arch specific?

It looks like Peter and Mathieu are well on the way to solving this,
see his latest patch.

> > But I am surprised when you say that the downgrade would not work, at
> > least if you are not running with nohz_full CPUs. The rcu_sched_qs()
> > function simply sets a per-CPU quiescent-state flag. The needed strong
> > ordering is instead supplied by the combination of the code starting
> > the grace period, reporting the setting of the quiescent-state flag
> > to core RCU, and the code completing the grace period. Each non-idle
> > CPU will execute full memory barriers either in RCU_SOFTIRQ context,
> > on entry to idle, on exit from idle, or within the grace-period kthread.
> > In particular, a CPU running the same usermode thread for the entire
> > grace period will execute the needed memory barriers in RCU_SOFTIRQ
> > context shortly after taking a scheduling-clock interrupt.
>
> Recall that I need more than just a memory barrier--also to interrupt
> RSEQ critical sections in progress on those CPUs. I know this isn't
> general purpose, I'm just saying a trivial downgrade wouldn't work for
> me. :) It would probably be sufficient to set NOTIFY_RESUME on all
> cpus running my code (which is what my IPI function does anyway...)

OK, yes, one major goal of the slowboat sys_membarrier is to -avoid-
IPIing other CPUs, and if you need the CPUs to be IPIed, then a
non-expedited grace period isn't going to do it for you.

And yes, once sys_membarrier() settles a bit, hopefully early next
week, it would be good to work out some way for RSEQ to share the
sys_membarrier() code. Maybe RSEQ adds a bit to the flags argument or
some such?

Thanx, Paul

2017-07-31 06:03:15

by Avi Kivity

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI



On 07/28/2017 12:02 AM, Mathieu Desnoyers wrote:
> ----- On Jul 27, 2017, at 4:58 PM, Mathieu Desnoyers [email protected] wrote:
>
>> ----- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney [email protected]
>> wrote:
>>
>>> On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
> [...]
>>>>> +
>>>>> + this_cpu = raw_smp_processor_id();
>>>>> + for_each_online_cpu(cpu) {
>>>>> + struct task_struct *p;
>>>>> +
>>>>> + if (cpu == this_cpu)
>>>>> + continue;
>>>>> + rcu_read_lock();
>>>>> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>>>>> + if (p && p->mm == current->mm)
>>>>> + __cpumask_set_cpu(cpu, tmpmask);
>>>> This gets you some false positives, if the CPU idled then mm will
>>>> not have changed.
>>> Good point! The battery-powered embedded guys would probably prefer
>>> we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
>>> state in nohz_full cases. Not sure if is_idle_task() can be used
>>> safely, given things like play_idle().
>> Would changing the check in this loop to:
>>
>> if (p && !is_idle_task(p) && p->mm == current->mm) {
>>
>> work for you ?
> Avi, is there an optimization that allows current->mm to be non-null
> when the idle task is scheduled that I am missing ?
>
> I would have expected current->mm to be always NULL for idle
> tasks.
>
>

I remembered that current->mm does not change when switching to a kernel
task, but my Kernlish is very rusty, or maybe it has changed.

2017-07-31 08:38:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Mon, Jul 31, 2017 at 09:03:09AM +0300, Avi Kivity wrote:
> I remembered that current->mm does not change when switching to a kernel
> task, but my Kernlish is very rusty, or maybe it has changed.

kernel threads do indeed preserve the mm of the old userspace task, but
we track that in ->active_mm. Their ->mm will be NULL.

2017-07-31 08:54:05

by Avi Kivity

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On 07/31/2017 11:37 AM, Peter Zijlstra wrote:
> On Mon, Jul 31, 2017 at 09:03:09AM +0300, Avi Kivity wrote:
>> I remembered that current->mm does not change when switching to a kernel
>> task, but my Kernlish is very rusty, or maybe it has changed.
> kernel threads do indeed preserve the mm of the old userspace task, but
> we track that in ->active_mm. Their ->mm will be NULL.

Gah, I'm so rusty, if I were any rustier I'd be doing bitcoin.

2017-07-31 18:01:21

by Dave Watson

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

Hi Paul,

Thanks for looking at this again!

On 07/27/17 11:12 AM, Paul E. McKenney wrote:
> Hello!
>
> But my main question is whether the throttling shown below is acceptable
> for your use cases, namely only one expedited sys_membarrier() permitted
> per scheduling-clock period (1 millisecond on many platforms), with any
> excess being silently converted to non-expedited form. The reason for
> the throttling is concerns about DoS attacks based on user code with a
> tight loop invoking this system call.

We've been using sys_membarrier for the last year or so in a handful
of places with no issues. Recently we made it an option in our hazard
pointers implementation, giving us something with performance between
hazard pointers and RCU:

https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555

Currently hazard pointers tries to free retired memory the same thread
that did the retire(), so the latency spiked for workloads that were
retire() heavy. For the moment we dropped back to using mprotect
hacks.

I've tested Mathieu's v4 patch, it works great. We currently don't
have any cases where we need SHARED.

I also tested the rate-limited version, while better than the current
non-EXPEDITED SHARED version, we still hit the slow path pretty often.
I agree with other commenters that returning an error code instead of
silently slowing down might be better.

> + case MEMBARRIER_CMD_SHARED_EXPEDITED:
> + if (num_online_cpus() > 1) {
> + static unsigned long lastexp;
> + unsigned long j;
> +
> + j = jiffies;
> + if (READ_ONCE(lastexp) == j) {
> + synchronize_sched();
> + WRITE_ONCE(lastexp, j);

It looks like this update of lastexp should be in the other branch?

> + } else {
> + synchronize_sched_expedited();
> + }
> + }
> + return 0;

2017-07-31 18:27:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Udpated sys_membarrier() speedup patch, FYI

On Mon, Jul 31, 2017 at 11:00:19AM -0700, Dave Watson wrote:
> Hi Paul,
>
> Thanks for looking at this again!
>
> On 07/27/17 11:12 AM, Paul E. McKenney wrote:
> > Hello!
> >
> > But my main question is whether the throttling shown below is acceptable
> > for your use cases, namely only one expedited sys_membarrier() permitted
> > per scheduling-clock period (1 millisecond on many platforms), with any
> > excess being silently converted to non-expedited form. The reason for
> > the throttling is concerns about DoS attacks based on user code with a
> > tight loop invoking this system call.
>
> We've been using sys_membarrier for the last year or so in a handful
> of places with no issues. Recently we made it an option in our hazard
> pointers implementation, giving us something with performance between
> hazard pointers and RCU:
>
> https://github.com/facebook/folly/blob/master/folly/experimental/hazptr/hazptr-impl.h#L555
>
> Currently hazard pointers tries to free retired memory the same thread
> that did the retire(), so the latency spiked for workloads that were
> retire() heavy. For the moment we dropped back to using mprotect
> hacks.
>
> I've tested Mathieu's v4 patch, it works great. We currently don't
> have any cases where we need SHARED.

Very good!!! May I have your Tested-by? (Or the Tested-by of whoever
did the testing, as the case may be?)

> I also tested the rate-limited version, while better than the current
> non-EXPEDITED SHARED version, we still hit the slow path pretty often.
> I agree with other commenters that returning an error code instead of
> silently slowing down might be better.

If I need to fall back to the rate-limited version, I will add some sort
of error code capability. For the moment, I am hoping that Mathieu's
patch proves acceptable, but will fall back to the rate-limited version
if some fatal problem arises.

> > + case MEMBARRIER_CMD_SHARED_EXPEDITED:
> > + if (num_online_cpus() > 1) {
> > + static unsigned long lastexp;
> > + unsigned long j;
> > +
> > + j = jiffies;
> > + if (READ_ONCE(lastexp) == j) {
> > + synchronize_sched();
> > + WRITE_ONCE(lastexp, j);
>
> It looks like this update of lastexp should be in the other branch?

Good catch, fixed. It is on branch paulmck.2017.08.01a, and will
hopefully not be needed.

Thanx, Paul

> > + } else {
> > + synchronize_sched_expedited();
> > + }
> > + }
> > + return 0;
>