Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751667AbdG0U4e (ORCPT ); Thu, 27 Jul 2017 16:56:34 -0400
Received: from mail.efficios.com ([167.114.142.141]:56438 "EHLO
	mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751517AbdG0U4d (ORCPT );
	Thu, 27 Jul 2017 16:56:33 -0400
Date: Thu, 27 Jul 2017 20:58:38 +0000 (UTC)
From: Mathieu Desnoyers
To: "Paul E. McKenney"
Cc: Avi Kivity, maged michael, Andrew Hunter, gromer@google.com,
	linux-kernel
Message-ID: <1035920775.29109.1501189118277.JavaMail.zimbra@efficios.com>
In-Reply-To: <20170727203706.GO3730@linux.vnet.ibm.com>
References: <20170727181250.GA20183@linux.vnet.ibm.com>
	<5c8c6946-ce3a-6183-76a2-027823a9948a@scylladb.com>
	<20170727194322.GL3730@linux.vnet.ibm.com>
	<5fe39d32-5fc1-3a59-23fc-9bdb1d90edf9@scylladb.com>
	<20170727203706.GO3730@linux.vnet.ibm.com>
Subject: Re: Udpated sys_membarrier() speedup patch, FYI
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [167.114.142.141]
X-Mailer: Zimbra 8.7.9_GA_1794 (ZimbraWebClient - FF52 (Linux)/8.7.9_GA_1794)
Thread-Topic: Udpated sys_membarrier() speedup patch, FYI
Thread-Index: Zq6dVh+YeLSLW0CGGUGiA/6gU3H9sw==
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 6901
Lines: 190

----- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
>> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
>> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
>> >>>Hello!
>> >>>
>> >>>Please see below for a prototype sys_membarrier() speedup patch.
>> >>>Please note that there is some controversy on this subject, so the final
>> >>>version will probably be quite a bit different than this prototype.
>> >>>
>> >>>But my main question is whether the throttling shown below is acceptable
>> >>>for your use cases, namely only one expedited sys_membarrier() permitted
>> >>>per scheduling-clock period (1 millisecond on many platforms), with any
>> >>>excess being silently converted to non-expedited form. The reason for
>> >>>the throttling is concerns about DoS attacks based on user code with a
>> >>>tight loop invoking this system call.
>> >>>
>> >>>Thoughts?
>> >>Silent throttling would render it useless for me. -EAGAIN is a
>> >>little better, but I'd be forced to spin until either I get kicked
>> >>out of my loop, or it succeeds.
>> >>
>> >>IPIing only running threads of my process would be perfect. In fact
>> >>I might even be able to make use of "membarrier these threads
>> >>please" to reduce IPIs, when I change the topology from fully
>> >>connected to something more sparse, on larger machines.
>> >>
>> >>My previous implementations were a signal (but that's horrible on
>> >>large machines) and trylock + mprotect (but that doesn't work on
>> >>ARM).
>> >OK, how about the following patch, which IPIs only the running
>> >threads of the process doing the sys_membarrier()?
>>
>> Works for me.
>
> Thank you for testing! I expect that Mathieu will have a v2 soon,
> hopefully CCing you guys. (If not, I will forward it.)
>

Will do!

> Mathieu, please note Avi's feedback below. More below,
>
> Thanx, Paul
>
>> >------------------------------------------------------------------------
>> >
>> >From: Mathieu Desnoyers
>> >To: Peter Zijlstra
>> >Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers,
>> >	"Paul E . McKenney", Boqun Feng
>> >Subject: [RFC PATCH] membarrier: expedited private command
>> >Date: Thu, 27 Jul 2017 14:59:43 -0400
>> >Message-Id: <20170727185943.11570-1-mathieu.desnoyers@efficios.com>
>> >
>> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>> >from all runqueues for which current thread's mm is the same as our own.
>> >
>> >Scheduler-wise, it requires that we add a memory barrier after context
>> >switching between processes (which have different mm).
>> >
>> >It would be interesting to benchmark the overhead of this added barrier
>> >on the performance of context switching between processes. If the
>> >preexisting overhead of switching between mm is high enough, the
>> >overhead of adding this extra barrier may be insignificant.
>> >
>> >[ Compile-tested only! ]
>> >
>> >CC: Peter Zijlstra
>> >CC: Paul E. McKenney
>> >CC: Boqun Feng
>> >Signed-off-by: Mathieu Desnoyers
>> >---
>> > include/uapi/linux/membarrier.h |  8 +++--
>> > kernel/membarrier.c             | 76 ++++++++++++++++++++++++++++++++++++++++-
>> > kernel/sched/core.c             | 21 ++++++++++++
>> > 3 files changed, 102 insertions(+), 3 deletions(-)
>> >
>> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> >index e0b108bd2624..6a33c5852f6b 100644
>> >--- a/include/uapi/linux/membarrier.h
>> >+++ b/include/uapi/linux/membarrier.h
>> >@@ -40,14 +40,18 @@
>> >  * (non-running threads are de facto in such a
>> >  * state). This covers threads from all processes
>> >  * running on the system. This command returns 0.
>> >+ * TODO: documentation.
>> >  *
>> >  * Command to be passed to the membarrier system call. The commands need to
>> >  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> >  * the value 0.
>> >  */
>> > enum membarrier_cmd {
>> >-	MEMBARRIER_CMD_QUERY = 0,
>> >-	MEMBARRIER_CMD_SHARED = (1 << 0),
>> >+	MEMBARRIER_CMD_QUERY			= 0,
>> >+	MEMBARRIER_CMD_SHARED			= (1 << 0),
>> >+	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>> >+	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>> >+	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>> > };
>> >
>> > #endif /* _UAPI_LINUX_MEMBARRIER_H */
>> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>> >index 9f9284f37f8d..8c6c0f96f617 100644
>> >--- a/kernel/membarrier.c
>> >+++ b/kernel/membarrier.c
>> >@@ -19,10 +19,81 @@
>> > #include
>> >
>> > /*
>> >+ * XXX For cpu_rq(). Should we rather move
>> >+ * membarrier_private_expedited() to sched/core.c or create
>> >+ * sched/membarrier.c ?
>> >+ */
>> >+#include "sched/sched.h"
>> >+
>> >+/*
>> >  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>> >  * except MEMBARRIER_CMD_QUERY.
>> >  */
>> >-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
>> >+#define MEMBARRIER_CMD_BITMASK \
>> >+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
>> >+
>> >> > rcu_read_unlock();
>> >+	}
>> >+}
>> >+
>> >+static void membarrier_private_expedited(void)
>> >+{
>> >+	int cpu, this_cpu;
>> >+	cpumask_var_t tmpmask;
>> >+
>> >+	if (num_online_cpus() == 1)
>> >+		return;
>> >+
>> >+	/*
>> >+	 * Matches memory barriers around rq->curr modification in
>> >+	 * scheduler.
>> >+	 */
>> >+	smp_mb();	/* system call entry is not a mb. */
>> >+
>> >+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> >+		/* Fallback for OOM. */
>> >+		membarrier_private_expedited_ipi_each();
>> >+		goto end;
>> >+	}
>> >+
>> >+	this_cpu = raw_smp_processor_id();
>> >+	for_each_online_cpu(cpu) {
>> >+		struct task_struct *p;
>> >+
>> >+		if (cpu == this_cpu)
>> >+			continue;
>> >+		rcu_read_lock();
>> >+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> >+		if (p && p->mm == current->mm)
>> >+			__cpumask_set_cpu(cpu, tmpmask);
>>
>> This gets you some false positives, if the CPU idled then mm will
>> not have changed.
>
> Good point! The battery-powered embedded guys would probably prefer
> we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle
> state in nohz_full cases. Not sure if is_idle_task() can be used
> safely, given things like play_idle().

Would changing the check in this loop to:

	if (p && !is_idle_task(p) && p->mm == current->mm) {

work for you ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
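
For illustration only, here is a rough userspace model of the check proposed
above. It is not kernel code and was not part of the patch: struct task, its
is_idle flag, and select_ipi_targets() are hypothetical names made up for
this sketch, standing in for rq->curr, is_idle_task() and the cpumask in the
quoted loop. A plain 64-bit mask is assumed in place of cpumask_var_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's task_struct: only the two
 * properties the proposed check looks at.
 */
struct task {
	const void *mm;	/* address-space identity, compared by pointer */
	int is_idle;	/* models the result of is_idle_task(p) */
};

/*
 * Return a bitmask with bit N set when CPU N would be sent an IPI:
 * it is not the calling CPU, it has a current task, that task is not
 * the idle task, and it shares the caller's mm.
 */
uint64_t select_ipi_targets(const struct task *const *curr, size_t ncpus,
			    size_t this_cpu, const void *caller_mm)
{
	uint64_t mask = 0;
	size_t cpu;

	assert(ncpus <= 64);	/* one bit per CPU in this toy model */

	for (cpu = 0; cpu < ncpus; cpu++) {
		const struct task *p = curr[cpu];

		if (cpu == this_cpu)
			continue;
		/* Same shape as the proposed kernel-side condition. */
		if (p && !p->is_idle && p->mm == caller_mm)
			mask |= UINT64_C(1) << cpu;
	}
	return mask;
}
```

In this model a CPU whose current task is flagged idle is skipped even when
its mm pointer compares equal to the caller's, which is the false positive
being discussed. Whether is_idle_task() is safe to use this way in the
kernel is the open question raised above; the sketch does not answer it.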