Subject: Re: Updated sys_membarrier() speedup patch, FYI
To: paulmck@linux.vnet.ibm.com
Cc: maged.michael@gmail.com, ahh@google.com, gromer@google.com,
 linux-kernel@vger.kernel.org, mathieu.desnoyers@efficios.com
References: <20170727181250.GA20183@linux.vnet.ibm.com>
 <5c8c6946-ce3a-6183-76a2-027823a9948a@scylladb.com>
 <20170727194322.GL3730@linux.vnet.ibm.com>
From: Avi Kivity
Message-ID: <5fe39d32-5fc1-3a59-23fc-9bdb1d90edf9@scylladb.com>
Date: Thu, 27 Jul 2017 23:04:13 +0300
In-Reply-To: <20170727194322.GL3730@linux.vnet.ibm.com>

On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
> On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
>>> Hello!
>>>
>>> Please see below for a prototype sys_membarrier() speedup patch.
>>> Please note that there is some controversy on this subject, so the
>>> final version will probably be quite a bit different from this
>>> prototype.
>>>
>>> But my main question is whether the throttling shown below is
>>> acceptable for your use cases, namely only one expedited
>>> sys_membarrier() permitted per scheduling-clock period (1 millisecond
>>> on many platforms), with any excess being silently converted to
>>> non-expedited form. The reason for the throttling is concern about
>>> DoS attacks based on user code invoking this system call in a tight
>>> loop.
>>>
>>> Thoughts?
>>
>> Silent throttling would render it useless for me. -EAGAIN is a
>> little better, but I'd be forced to spin until either I get kicked
>> out of my loop, or it succeeds.
>>
>> IPIing only the running threads of my process would be perfect. In
>> fact, I might even be able to make use of "membarrier these threads,
>> please" to reduce IPIs, when I change the topology from fully
>> connected to something more sparse, on larger machines.
>>
>> My previous implementations were a signal (but that's horrible on
>> large machines) and trylock + mprotect (but that doesn't work on
>> ARM).
>
> OK, how about the following patch, which IPIs only the running
> threads of the process doing the sys_membarrier()?

Works for me.

>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> From: Mathieu Desnoyers
> To: Peter Zijlstra
> Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers,
>  "Paul E. McKenney", Boqun Feng
> Subject: [RFC PATCH] membarrier: expedited private command
> Date: Thu, 27 Jul 2017 14:59:43 -0400
> Message-Id: <20170727185943.11570-1-mathieu.desnoyers@efficios.com>
>
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
> built from all runqueues for which the current thread's mm is the same
> as our own.
>
> Scheduler-wise, it requires that we add a memory barrier after context
> switching between processes (which have different mm).
>
> It would be interesting to benchmark the overhead of this added barrier
> on the performance of context switching between processes. If the
> preexisting overhead of switching between mms is high enough, the
> overhead of adding this extra barrier may be insignificant.
>
> [ Compile-tested only! ]
>
> CC: Peter Zijlstra
> CC: Paul E. McKenney
> CC: Boqun Feng
> Signed-off-by: Mathieu Desnoyers
> ---
>  include/uapi/linux/membarrier.h |  8 +++--
>  kernel/membarrier.c             | 76 ++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/core.c             | 21 ++++++++++++
>  3 files changed, 102 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..6a33c5852f6b 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,14 +40,18 @@
>   *                          (non-running threads are de facto in such a
>   *                          state). This covers threads from all processes
>   *                          running on the system. This command returns 0.
> + * TODO: documentation.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>   * the value 0.
>   */
>  enum membarrier_cmd {
> -	MEMBARRIER_CMD_QUERY = 0,
> -	MEMBARRIER_CMD_SHARED = (1 << 0),
> +	MEMBARRIER_CMD_QUERY			= 0,
> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>  };
>
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> index 9f9284f37f8d..8c6c0f96f617 100644
> --- a/kernel/membarrier.c
> +++ b/kernel/membarrier.c
> @@ -19,10 +19,81 @@
>  #include <linux/tick.h>
>
>  /*
> + * XXX For cpu_rq(). Should we rather move
> + * membarrier_private_expedited() to sched/core.c or create
> + * sched/membarrier.c ?
> + */
> +#include "sched/sched.h"
> +
> +/*
>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>   * except MEMBARRIER_CMD_QUERY.
>   */
> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> +#define MEMBARRIER_CMD_BITMASK	\
> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +
> +static void ipi_mb(void *info)
> +{
> +	smp_mb();	/* IPIs should be serializing but paranoid. */
> +}
> +
> +static void membarrier_private_expedited_ipi_each(void)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm)
> +			smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +		rcu_read_unlock();
> +	}
> +}
> +
> +static void membarrier_private_expedited(void)
> +{
> +	int cpu, this_cpu;
> +	cpumask_var_t tmpmask;
> +
> +	if (num_online_cpus() == 1)
> +		return;
> +
> +	/*
> +	 * Matches memory barriers around rq->curr modification in
> +	 * scheduler.
> +	 */
> +	smp_mb();	/* system call entry is not a mb. */
> +
> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		/* Fallback for OOM. */
> +		membarrier_private_expedited_ipi_each();
> +		goto end;
> +	}
> +
> +	this_cpu = raw_smp_processor_id();
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		if (cpu == this_cpu)
> +			continue;
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm)
> +			__cpumask_set_cpu(cpu, tmpmask);

This gets you some false positives: if the CPU idled, then its mm will
not have changed.

> +		rcu_read_unlock();
> +	}
> +	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> +	free_cpumask_var(tmpmask);
> +end:
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around
> +	 * rq->curr modification in scheduler.
> +	 */
> +	smp_mb();	/* exit from system call is not a mb */
> +}
>
>  /**
>   * sys_membarrier - issue memory barriers on a set of threads
> @@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>  		if (num_online_cpus() > 1)
>  			synchronize_sched();
>  		return 0;
> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> +		membarrier_private_expedited();
> +		return 0;
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 17c667b427b4..f171d2aaaf82 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  		put_user(task_pid_vnr(current), current->set_child_tid);
>  }
>
> +#ifdef CONFIG_MEMBARRIER
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> +		struct mm_struct *oldmm)
> +{
> +	if (likely(mm == oldmm))
> +		return; /* Thread context switch, same mm. */
> +	/*
> +	 * When switching between processes, membarrier expedited
> +	 * private requires a memory barrier after we set the current
> +	 * task.
> +	 */
> +	smp_mb();
> +}

Won't the actual page table switch generate a barrier, at least on many
archs? It sure will on x86. It's also unneeded if kernel entry or exit
involves a barrier (not true for x86, so probably not for anything else
either).

> +#else /* #ifdef CONFIG_MEMBARRIER */
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> +		struct mm_struct *oldmm)
> +{
> +}
> +#endif /* #else #ifdef CONFIG_MEMBARRIER */
> +
>  /*
>   * context_switch - switch to the new MM and the new thread's register state.
>   */
> @@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>
>  	mm = next->mm;
>  	oldmm = prev->active_mm;
> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
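
------------------------------------------------------------------------

For reference, here is a minimal sketch of how a caller might exercise
the proposed command from user space, assuming the RFC's command value
of (1 << 3) survives review. The raw-syscall wrapper and the #ifndef
guard are assumptions (the uapi header and glibc know nothing of the
new command at this point); the query-then-use pattern follows the
existing MEMBARRIER_CMD_QUERY convention:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED
#define MEMBARRIER_CMD_PRIVATE_EXPEDITED (1 << 3)	/* from the RFC above */
#endif

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int mask;

	/* MEMBARRIER_CMD_QUERY returns a bitmask of supported commands. */
	mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
	if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
		fprintf(stderr, "private expedited membarrier unsupported\n");
		return 1;
	}

	/*
	 * Writer side of an asymmetric barrier: once this returns,
	 * every currently running thread of this process has executed
	 * a full memory barrier, so readers can get away with mere
	 * compiler barriers.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
		perror("membarrier");
	return 0;
}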
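
As background on the "trylock + mprotect" fallback mentioned earlier in
the thread: the trick is to downgrade and restore the protection of a
dummy page, which obliges the kernel to shoot down stale TLB entries on
every CPU running this mm. On architectures that do the shootdown by
IPI (x86 among them), the interrupt doubles as a full memory barrier on
each running thread. The sketch below reconstructs the general
technique under those assumptions; it is not the actual implementation
from the thread, it simplifies the trylock detail to a plain mutex, and
it breaks on ARM, where TLB maintenance is broadcast in hardware with
no IPI at all:

#include <sys/mman.h>
#include <pthread.h>
#include <stdlib.h>

#define PAGE_SZ 4096	/* assumed; real code would use sysconf(_SC_PAGESIZE) */

static void *dummy_page;
static pthread_mutex_t mb_lock = PTHREAD_MUTEX_INITIALIZER;

static void barrier_init(void)
{
	dummy_page = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dummy_page == MAP_FAILED)
		abort();
	*(volatile char *)dummy_page = 0;	/* fault the page in */
}

/*
 * Force a full memory barrier on all running threads of this process.
 * Removing write permission forces a TLB flush on every CPU using this
 * mm; where that flush is delivered by IPI, the interrupt serializes
 * each interrupted thread.
 */
static void force_mb_all_threads(void)
{
	pthread_mutex_lock(&mb_lock);
	mprotect(dummy_page, PAGE_SZ, PROT_READ);
	mprotect(dummy_page, PAGE_SZ, PROT_READ | PROT_WRITE);
	pthread_mutex_unlock(&mb_lock);
}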