Date: Thu, 27 Jul 2017 13:37:06 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Avi Kivity
Cc: maged.michael@gmail.com, ahh@google.com, gromer@google.com,
	linux-kernel@vger.kernel.org, mathieu.desnoyers@efficios.com
Subject: Re: Updated sys_membarrier() speedup patch, FYI
Message-Id: <20170727203706.GO3730@linux.vnet.ibm.com>
In-Reply-To: <5fe39d32-5fc1-3a59-23fc-9bdb1d90edf9@scylladb.com>

On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> >>>Hello!
> >>>
> >>>Please see below for a prototype sys_membarrier() speedup patch.
> >>>Please note that there is some controversy on this subject, so the
> >>>final version will probably be quite a bit different than this
> >>>prototype.
> >>>
> >>>But my main question is whether the throttling shown below is
> >>>acceptable for your use cases, namely only one expedited
> >>>sys_membarrier() permitted per scheduling-clock period (1 millisecond
> >>>on many platforms), with any excess being silently converted to
> >>>non-expedited form.  The reason for the throttling is concerns about
> >>>DoS attacks based on user code with a tight loop invoking this
> >>>system call.
> >>>
> >>>Thoughts?
> >>
> >>Silent throttling would render it useless for me.  -EAGAIN is a
> >>little better, but I'd be forced to spin until either I get kicked
> >>out of my loop, or it succeeds.
> >>
> >>IPIing only running threads of my process would be perfect.  In fact
> >>I might even be able to make use of "membarrier these threads
> >>please" to reduce IPIs, when I change the topology from fully
> >>connected to something more sparse, on larger machines.
> >>
> >>My previous implementations were a signal (but that's horrible on
> >>large machines) and trylock + mprotect (but that doesn't work on
> >>ARM).
> >
> >OK, how about the following patch, which IPIs only the running
> >threads of the process doing the sys_membarrier()?
>
> Works for me.

Thank you for testing!  I expect that Mathieu will have a v2 soon,
hopefully CCing you guys.  (If not, I will forward it.)

Mathieu, please note Avi's feedback below.

							Thanx, Paul
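To make the use case above concrete, here is a minimal user-space sketch
(illustrative only, not from this thread) of driving the proposed command
as the slow-path fence of an asymmetric barrier.  The
MEMBARRIER_CMD_PRIVATE_EXPEDITED name and its (1 << 3) value are taken
from the RFC quoted below and were not in released uapi headers at the
time; error handling is omitted.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Value from the RFC below; not in released uapi headers at the time. */
#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED
#define MEMBARRIER_CMD_PRIVATE_EXPEDITED	(1 << 3)
#endif

static int have_private_expedited;

static void membarrier_detect(void)
{
	/* MEMBARRIER_CMD_QUERY returns a bitmask of supported commands. */
	long cmds = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	if (cmds > 0 && (cmds & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
		have_private_expedited = 1;
}

/*
 * Writer-side (slow-path) fence: IPIs only the CPUs currently running
 * threads of this process, so reader-side fast paths can make do with
 * compiler ordering alone.
 */
static void heavy_fence(void)
{
	if (have_private_expedited) {
		syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
		return;
	}
	/* Fallback: the existing, much slower non-expedited command. */
	syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
}

This is exactly the pairing that the signal- and mprotect-based
workarounds mentioned above were trying to approximate.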
> >------------------------------------------------------------------------
> >
> >From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >To: Peter Zijlstra <peterz@infradead.org>
> >Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers
> >	<mathieu.desnoyers@efficios.com>,
> >	"Paul E . McKenney" <paulmck@linux.vnet.ibm.com>,
> >	Boqun Feng <boqun.feng@gmail.com>
> >Subject: [RFC PATCH] membarrier: expedited private command
> >Date: Thu, 27 Jul 2017 14:59:43 -0400
> >Message-Id: <20170727185943.11570-1-mathieu.desnoyers@efficios.com>
> >
> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> >from all runqueues for which current thread's mm is the same as our own.
> >
> >Scheduler-wise, it requires that we add a memory barrier after context
> >switching between processes (which have different mm).
> >
> >It would be interesting to benchmark the overhead of this added barrier
> >on the performance of context switching between processes.  If the
> >preexisting overhead of switching between mm is high enough, the
> >overhead of adding this extra barrier may be insignificant.
> >
> >[ Compile-tested only! ]
> >
> >CC: Peter Zijlstra <peterz@infradead.org>
> >CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >CC: Boqun Feng <boqun.feng@gmail.com>
> >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >---
> > include/uapi/linux/membarrier.h |  8 +++--
> > kernel/membarrier.c             | 76 ++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/core.c             | 21 ++++++++++++
> > 3 files changed, 102 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> >index e0b108bd2624..6a33c5852f6b 100644
> >--- a/include/uapi/linux/membarrier.h
> >+++ b/include/uapi/linux/membarrier.h
> >@@ -40,14 +40,18 @@
> >  *                          (non-running threads are de facto in such a
> >  *                          state). This covers threads from all processes
> >  *                          running on the system. This command returns 0.
> >+ * TODO: documentation.
> >  *
> >  * Command to be passed to the membarrier system call. The commands need to
> >  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> >  * the value 0.
> >  */
> > enum membarrier_cmd {
> >-	MEMBARRIER_CMD_QUERY = 0,
> >-	MEMBARRIER_CMD_SHARED = (1 << 0),
> >+	MEMBARRIER_CMD_QUERY			= 0,
> >+	MEMBARRIER_CMD_SHARED			= (1 << 0),
> >+	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> >+	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> >+	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
> > };
> >
> > #endif /* _UAPI_LINUX_MEMBARRIER_H */
> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> >index 9f9284f37f8d..8c6c0f96f617 100644
> >--- a/kernel/membarrier.c
> >+++ b/kernel/membarrier.c
> >@@ -19,10 +19,81 @@
> > #include <linux/tick.h>
> >
> > /*
> >+ * XXX For cpu_rq(). Should we rather move
> >+ * membarrier_private_expedited() to sched/core.c or create
> >+ * sched/membarrier.c ?
> >+ */
> >+#include "sched/sched.h"
> >+
> >+/*
> >  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> >  * except MEMBARRIER_CMD_QUERY.
> >  */
> >-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> >+#define MEMBARRIER_CMD_BITMASK	\
> >+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> >+
> >		rcu_read_unlock();
> >+	}
> >+}
> >+
> >+static void membarrier_private_expedited(void)
> >+{
> >+	int cpu, this_cpu;
> >+	cpumask_var_t tmpmask;
> >+
> >+	if (num_online_cpus() == 1)
> >+		return;
> >+
> >+	/*
> >+	 * Matches memory barriers around rq->curr modification in
> >+	 * scheduler.
> >+	 */
> >+	smp_mb();	/* system call entry is not a mb. */
> >+
> >+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >+		/* Fallback for OOM. */
> >+		membarrier_private_expedited_ipi_each();
> >+		goto end;
> >+	}
> >+
> >+	this_cpu = raw_smp_processor_id();
> >+	for_each_online_cpu(cpu) {
> >+		struct task_struct *p;
> >+
> >+		if (cpu == this_cpu)
> >+			continue;
> >+		rcu_read_lock();
> >+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> >+		if (p && p->mm == current->mm)
> >+			__cpumask_set_cpu(cpu, tmpmask);
>
> This gets you some false positives, if the CPU idled then mm will
> not have changed.

Good point!  The battery-powered embedded guys would probably prefer
we not needlessly IPI idle CPUs.  We cannot rely on RCU's dyntick-idle
state in nohz_full cases.  Not sure if is_idle_task() can be used
safely, given things like play_idle().
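If is_idle_task() did turn out to be safe to apply to a remote CPU's
rq->curr, the refinement Paul alludes to might look like the following
drop-in variant of the three quoted lines above (purely illustrative and
untested; note also that the idle task is a kernel thread whose ->mm is
NULL, so the existing p->mm == current->mm test may already keep idle
CPUs out of the mask):

		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
		/* Skip CPUs that are merely running the idle task. */
		if (p && !is_idle_task(p) && p->mm == current->mm)
			__cpumask_set_cpu(cpu, tmpmask);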
> >+		rcu_read_unlock();
> >+	}
> >+	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> >+	free_cpumask_var(tmpmask);
> >+end:
> >+	/*
> >+	 * Memory barrier on the caller thread _after_ we finished
> >+	 * waiting for the last IPI. Matches memory barriers around
> >+	 * rq->curr modification in scheduler.
> >+	 */
> >+	smp_mb();	/* exit from system call is not a mb */
> >+}
> >
> > /**
> >  * sys_membarrier - issue memory barriers on a set of threads
> >@@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> > 		if (num_online_cpus() > 1)
> > 			synchronize_sched();
> > 		return 0;
> >+	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> >+		membarrier_private_expedited();
> >+		return 0;
> > 	default:
> > 		return -EINVAL;
> > 	}
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index 17c667b427b4..f171d2aaaf82 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> > 		put_user(task_pid_vnr(current), current->set_child_tid);
> > }
> >
> >+#ifdef CONFIG_MEMBARRIER
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+		struct mm_struct *oldmm)
> >+{
> >+	if (likely(mm == oldmm))
> >+		return;	/* Thread context switch, same mm. */
> >+	/*
> >+	 * When switching between processes, membarrier expedited
> >+	 * private requires a memory barrier after we set the current
> >+	 * task.
> >+	 */
> >+	smp_mb();
> >+}
>
> Won't the actual page table switch generate a barrier, at least on
> many archs?

It sure will on x86.  There are apparently at least a few architectures
that don't.

> It's also unneeded if kernel entry or exit involve a barrier (not
> true for x86, so probably not for anything else either).
>
> >+#else /* #ifdef CONFIG_MEMBARRIER */
> >+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >+		struct mm_struct *oldmm)
> >+{
> >+}
> >+#endif /* #else #ifdef CONFIG_MEMBARRIER */
> >+
> > /*
> >  * context_switch - switch to the new MM and the new thread's register state.
> >  */
> >@@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >
> > 	mm = next->mm;
> > 	oldmm = prev->active_mm;
> >+	membarrier_expedited_mb_after_set_current(mm, oldmm);
> > 	/*
> > 	 * For paravirt, this is coupled with an exit in switch_to to
> > 	 * combine the page table reload and the switch backend into
> >
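For readers wondering about the "trylock + mprotect" workaround Avi
mentions above: his exact scheme is not shown in this thread, but the
general shape of that trick is sketched below.  It is illustrative only.
It leans on the kernel delivering TLB-shootdown IPIs to every CPU on
which the process's mm is active, and those IPIs happening to act as
full memory barriers on x86; neither is an architectural guarantee, and
on ARM the TLB invalidation is broadcast in hardware with no IPIs at
all, which is presumably why it "doesn't work on ARM".

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static pthread_mutex_t mb_lock = PTHREAD_MUTEX_INITIALIZER;
static void *mb_page;
static size_t mb_pagesize;

static void mprotect_barrier_init(void)
{
	mb_pagesize = sysconf(_SC_PAGESIZE);
	mb_page = mmap(NULL, mb_pagesize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mb_page == MAP_FAILED)
		abort();
	/* Touch the page so later permission changes have a TLB entry to shoot down. */
	*(volatile char *)mb_page = 0;
}

/*
 * Poor man's process-wide barrier: downgrading the page's protection
 * forces a TLB invalidation on every CPU where this mm is active.  On
 * x86 that invalidation is carried by IPIs, which also order memory on
 * the interrupted CPUs; the second mprotect() restores the mapping.
 * The lock keeps concurrent callers from interleaving their protection
 * flips (a trylock variant could let waiters share one flip instead).
 */
static void mprotect_barrier(void)
{
	pthread_mutex_lock(&mb_lock);
	mprotect(mb_page, mb_pagesize, PROT_READ);
	mprotect(mb_page, mb_pagesize, PROT_READ | PROT_WRITE);
	pthread_mutex_unlock(&mb_lock);
}

The proposed MEMBARRIER_CMD_PRIVATE_EXPEDITED provides the same
running-threads-only IPI behavior without depending on those TLB
implementation details.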