Date: Thu, 27 Jul 2017 12:06:37 -0700
From: "Paul E. McKenney"
To: Andrew Hunter
Cc: avi@scylladb.com, Maged Michael, Geoffrey Romer, lkml
Subject: Re: Udpated sys_membarrier() speedup patch, FYI
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170727181250.GA20183@linux.vnet.ibm.com>
Message-Id: <20170727190637.GK3730@linux.vnet.ibm.com>

On Thu, Jul 27, 2017 at 11:36:38AM -0700, Andrew Hunter wrote:
> On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney wrote:
> > Hello!
> >
> > But my main question is whether the throttling shown below is acceptable
> > for your use cases, namely only one expedited sys_membarrier() permitted
> > per scheduling-clock period (1 millisecond on many platforms), with any
> > excess being silently converted to non-expedited form.
>
> Google doesn't use sys_membarrier (that I know of...), but we do use
> RSEQ fences, which implement membarrier plus a little extra to interrupt
> RSEQ critical sections (via IPI--smp_call_function_many()).  One
> important optimization here is that we only send IPIs to CPUs running
> the same mm as current (or a subset if requested by userspace), as
> this is sufficient for the API guarantees we provide.  I suspect a
> similar optimization would largely mitigate DoS concerns, no?  I don't
> know if there are use cases not covered.  To answer your question:
> throttling these (or our equivalents) would be fine in terms of
> userspace throughput.  We haven't noticed performance problems
> requiring such an intervention, however.

IPIing only those CPUs running threads in the same process as the thread
invoking membarrier() would be very nice!  There is some LKML discussion
on this topic, which is currently circling around making this
determination reliable across all CPU families.  ARM and x86 are thought
to be OK, PowerPC is thought to require a smallish patch, MIPS is a big
question mark, and so on.

Good to hear that the throttling would be OK for your workloads, thank you!
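In case it helps make that concrete, here is a rough (and utterly untested)
sketch of what an mm-filtered IPI might look like in the kernel.  The
function names and the use of cpu_rq() are illustrative assumptions only,
not the actual RSEQ-fence or membarrier implementation:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include "sched.h"	/* For cpu_rq(); assumes kernel/sched/ internals. */

/* IPI handler: the IPI itself should order, smp_mb() is belt and braces. */
static void ipi_mb(void *info)
{
	smp_mb();
}

/* Send an IPI to each online CPU currently running a thread of @mm. */
static void membarrier_ipi_mm(struct mm_struct *mm)
{
	int cpu;
	cpumask_var_t tmpmask;

	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return;		/* A real implementation would fall back. */

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		rcu_read_lock();
		p = rcu_dereference(cpu_rq(cpu)->curr);
		if (p && p->mm == mm)
			cpumask_set_cpu(cpu, tmpmask);
		rcu_read_unlock();
	}
	preempt_disable();	/* smp_call_function_many() requires this. */
	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
	preempt_enable();
	cpus_read_unlock();

	free_cpumask_var(tmpmask);
}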
> Furthermore: I wince a bit at the silent downgrade; I'd almost prefer
> -EAGAIN or -EBUSY.  In particular, again for RSEQ fences, the downgrade
> simply wouldn't work; rcu_sched_qs() gets called at many points that
> aren't sufficiently quiescent for RSEQ (in particular, when userspace
> code is running!).  This is solvable, but worth thinking about.

Good point!  One approach would be to unconditionally return
-EAGAIN/-EBUSY, and another would be to have a separate command or flag
saying what to do if expedited processing wasn't currently available.
My thought would be to add a separate expedited command, so that one
did the fallback and the other returned the error (see the sketch at
the end of this message).

But I am surprised that you say the downgrade would not work, at least
if you are not running with nohz_full CPUs.  The rcu_sched_qs() function
simply sets a per-CPU quiescent-state flag.  The needed strong ordering
is instead supplied by the combination of the code starting the grace
period, the code reporting the setting of the quiescent-state flag to
core RCU, and the code completing the grace period.  Each non-idle CPU
will execute full memory barriers either in RCU_SOFTIRQ context, on
entry to idle, on exit from idle, or within the grace-period kthread.
In particular, a CPU running the same usermode thread for the entire
grace period will execute the needed memory barriers in RCU_SOFTIRQ
context shortly after taking a scheduling-clock interrupt.

So are you running nohz_full CPUs?  Or is there something else that I
am missing?

							Thanx, Paul
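For concreteness, a minimal userspace sketch of the separate-command idea,
assuming a hypothetical MEMBARRIER_CMD_SHARED_EXPEDITED that fails with
-EBUSY or -EAGAIN when the expedited path is throttled.  The command name,
its value, and the errno behavior are assumptions for illustration only,
not current kernel API:

#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no membarrier() wrapper, so invoke the syscall directly. */
static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Hypothetical command number, standing in for a future expedited variant. */
#define MEMBARRIER_CMD_SHARED_EXPEDITED	(1 << 30)

/*
 * Try the (hypothetical) expedited fence; if it is currently throttled,
 * fall back to the existing unexpedited MEMBARRIER_CMD_SHARED.
 */
static int fence_other_threads(void)
{
	if (!membarrier(MEMBARRIER_CMD_SHARED_EXPEDITED, 0))
		return 0;
	if (errno == EBUSY || errno == EAGAIN)
		return membarrier(MEMBARRIER_CMD_SHARED, 0);
	return -1;
}

Either way, returning the error rather than silently downgrading keeps the
fallback decision in userspace, which appears to be what the RSEQ-fence use
case needs.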