Date: Thu, 27 Jul 2017 12:43:22 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Avi Kivity
Cc: maged.michael@gmail.com, ahh@google.com, gromer@google.com,
	linux-kernel@vger.kernel.org, mathieu.desnoyers@efficios.com
Subject: Re: Updated sys_membarrier() speedup patch, FYI
Message-Id: <20170727194322.GL3730@linux.vnet.ibm.com>
In-Reply-To: <5c8c6946-ce3a-6183-76a2-027823a9948a@scylladb.com>

On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
> On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
> >Hello!
> >
> >Please see below for a prototype sys_membarrier() speedup patch.
> >Please note that there is some controversy on this subject, so the
> >final version will probably be quite a bit different than this
> >prototype.
> >
> >But my main question is whether the throttling shown below is
> >acceptable for your use cases, namely only one expedited
> >sys_membarrier() permitted per scheduling-clock period (1 millisecond
> >on many platforms), with any excess being silently converted to
> >non-expedited form.  The reason for the throttling is concerns about
> >DoS attacks based on user code with a tight loop invoking this
> >system call.
> >
> >Thoughts?
>
> Silent throttling would render it useless for me.  -EAGAIN is a
> little better, but I'd be forced to spin until either I get kicked
> out of my loop, or it succeeds.
>
> IPIing only running threads of my process would be perfect.  In fact
> I might even be able to make use of "membarrier these threads
> please" to reduce IPIs, when I change the topology from fully
> connected to something more sparse, on larger machines.
>
> My previous implementations were a signal (but that's horrible on
> large machines) and trylock + mprotect (but that doesn't work on
> ARM).

OK, how about the following patch, which IPIs only the running threads
of the process doing the sys_membarrier()?
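For illustration, user code would probe for the new command and invoke
it roughly as follows.  This is an untested sketch of mine, not part of
the patch: the wrapper and fallback helper are made up, and the command
value is simply lifted from the uapi change below.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Proposed in the patch below; not in any released uapi header yet. */
#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED
#define MEMBARRIER_CMD_PRIVATE_EXPEDITED (1 << 3)
#endif

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/*
 * Issue a memory barrier on the running threads of this process only,
 * falling back to the slower all-processes command on kernels that
 * lack the expedited private command.
 */
static int process_membarrier(void)
{
	int have = membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (have >= 0 && (have & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
		return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	return membarrier(MEMBARRIER_CMD_SHARED, 0);
}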
							Thanx, Paul

------------------------------------------------------------------------

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Boqun Feng
Subject: [RFC PATCH] membarrier: expedited private command
Date: Thu, 27 Jul 2017 14:59:43 -0400
Message-Id: <20170727185943.11570-1-mathieu.desnoyers@efficios.com>

Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as our own.

Scheduler-wise, it requires that we add a memory barrier after context
switching between processes (which have different mm).

It would be interesting to benchmark the overhead of this added barrier
on the performance of context switching between processes.  If the
preexisting overhead of switching between mm is high enough, the
overhead of adding this extra barrier may be insignificant.

[ Compile-tested only! ]

CC: Peter Zijlstra
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/uapi/linux/membarrier.h |  8 +++--
 kernel/membarrier.c             | 76 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             | 21 ++++++++++++
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..6a33c5852f6b 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,14 +40,18 @@
  *                          (non-running threads are de facto in such a
  *                          state). This covers threads from all processes
  *                          running on the system. This command returns 0.
+ * TODO: documentation.
  *
  * Command to be passed to the membarrier system call. The commands need to
  * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
  * the value 0.
  */
 enum membarrier_cmd {
-	MEMBARRIER_CMD_QUERY = 0,
-	MEMBARRIER_CMD_SHARED = (1 << 0),
+	MEMBARRIER_CMD_QUERY			= 0,
+	MEMBARRIER_CMD_SHARED			= (1 << 0),
+	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
+	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
+	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
 };
 
 #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..8c6c0f96f617 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -19,10 +19,81 @@
 #include <linux/tick.h>
 
 /*
+ * XXX For cpu_rq(). Should we rather move
+ * membarrier_private_expedited() to sched/core.c or create
+ * sched/membarrier.c ?
+ */
+#include "sched/sched.h"
+
+/*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK	\
+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+
+static void ipi_mb(void *info)
+{
+	smp_mb();	/* IPIs should be serializing but paranoid. */
+}
+
+static void membarrier_private_expedited_ipi_each(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		if (p && p->mm == current->mm)
+			smp_call_function_single(cpu, ipi_mb, NULL, 1);
+		rcu_read_unlock();
+	}
+}
+
+static void membarrier_private_expedited(void)
+{
+	int cpu, this_cpu;
+	cpumask_var_t tmpmask;
+
+	if (num_online_cpus() == 1)
+		return;
+
+	/*
+	 * Matches memory barriers around rq->curr modification in
+	 * scheduler.
+	 */
+	smp_mb();	/* system call entry is not a mb. */
+
+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		/* Fallback for OOM. */
+		membarrier_private_expedited_ipi_each();
+		goto end;
+	}
+
+	this_cpu = raw_smp_processor_id();
+	for_each_online_cpu(cpu) {
+		struct task_struct *p;
+
+		if (cpu == this_cpu)
+			continue;
+		rcu_read_lock();
+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		if (p && p->mm == current->mm)
+			__cpumask_set_cpu(cpu, tmpmask);
+		rcu_read_unlock();
+	}
+	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+	free_cpumask_var(tmpmask);
+end:
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers around
+	 * rq->curr modification in scheduler.
+	 */
+	smp_mb();	/* exit from system call is not a mb */
+}
 
 /**
  * sys_membarrier - issue memory barriers on a set of threads
@@ -64,6 +135,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
 		if (num_online_cpus() > 1)
 			synchronize_sched();
 		return 0;
+	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
+		membarrier_private_expedited();
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b427b4..f171d2aaaf82 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2724,6 +2724,26 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 		put_user(task_pid_vnr(current), current->set_child_tid);
 }
 
+#ifdef CONFIG_MEMBARRIER
+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
+		struct mm_struct *oldmm)
+{
+	if (likely(mm == oldmm))
+		return;	/* Thread context switch, same mm. */
+	/*
+	 * When switching between processes, membarrier expedited
+	 * private requires a memory barrier after we set the current
+	 * task.
+	 */
+	smp_mb();
+}
+#else /* #ifdef CONFIG_MEMBARRIER */
+static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
+		struct mm_struct *oldmm)
+{
+}
+#endif /* #else #ifdef CONFIG_MEMBARRIER */
+
 /*
  * context_switch - switch to the new MM and the new thread's register state.
  */
@@ -2737,6 +2757,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	mm = next->mm;
 	oldmm = prev->active_mm;
+	membarrier_expedited_mb_after_set_current(mm, oldmm);
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
-- 
2.11.0
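To make the intended use concrete, here is a rough user-space sketch
(mine, untested, and deliberately simplified) of the asymmetric
fast-path/slow-path pattern this command enables, along the lines of
what liburcu does: hot-path threads replace a hardware memory fence
with a mere compiler barrier, and the cold path restores the pairing by
invoking the expedited membarrier, which runs smp_mb() on every running
thread of the process.

#include <linux/membarrier.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED
#define MEMBARRIER_CMD_PRIVATE_EXPEDITED (1 << 3)	/* value from the patch */
#endif

static _Atomic int fast_active;	/* hot-path thread inside its critical work */
static _Atomic int slow_active;	/* cold path wants exclusive access */

static void fast_path(void)
{
	atomic_store_explicit(&fast_active, 1, memory_order_relaxed);
	/*
	 * Compiler barrier only: the store-load fence that Dekker-style
	 * exclusion needs here is supplied by the slow path's
	 * sys_membarrier() call.
	 */
	atomic_signal_fence(memory_order_seq_cst);
	if (!atomic_load_explicit(&slow_active, memory_order_relaxed)) {
		/* ... fast-path critical work ... */
	}
	atomic_store_explicit(&fast_active, 0, memory_order_release);
}

static void slow_path(void)
{
	atomic_store_explicit(&slow_active, 1, memory_order_relaxed);
	/* Acts as smp_mb() on all running threads of this process. */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	while (atomic_load_explicit(&fast_active, memory_order_acquire))
		;	/* wait for the hot path to drain */
	/* ... slow-path exclusive work ... */
	atomic_store_explicit(&slow_active, 0, memory_order_release);
}

Real code needs the retry and fairness logic omitted here; the point is
only that the fence moves off the hot path, which is also why a silent
downgrade to the non-expedited form would hurt the use cases above.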