Date: Sun, 17 Sep 2017 15:36:08 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: peterz@infradead.org, mathieu.desnoyers@efficios.com,
    will.deacon@arm.com, stern@rowland.harvard.edu
Cc: luto@kernel.org, mpe@ellerman.id.au, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org, davejwatson@fb.com, maged.michael@gmail.com
Subject: Rough notes from sys_membarrier() lightning BoF
Message-Id: <20170917223608.GA14577@linux.vnet.ibm.com>

Hello!

Rough notes from our discussion last Thursday.  Please reply to the
group with any needed elaborations or corrections.

Adding Andy and Michael on CC since this most closely affects their
architectures.  Also adding Dave Watson and Maged Michael because the
preferred approach requires that processes wanting to use the lightweight
sys_membarrier() do a registration step.

                                                        Thanx, Paul

------------------------------------------------------------------------

Problem:

1.  The current sys_membarrier() introduces an smp_mb() that is not
    otherwise required on powerpc.

2.  The envisioned JIT variant of sys_membarrier() assumes that the
    return-to-user instruction sequence handles any change to the
    usermode instruction stream, and Andy Lutomirski's upcoming changes
    invalidate this assumption.  It is believed that powerpc has a
    similar issue.

Here are diagrams indicating the memory-ordering requirements:

Scenario 1: Access preceding sys_membarrier() must see changes from
    thread that concurrently switches in.

    ----------------------------------------------------------------
    Scheduler                               sys_membarrier()
    ---------                               ----------------
    smp_mb();
                                            usermode load or store to Y
                                            /* begin system call */
                                            sys_membarrier()
                                            smp_mb();
                                            Check rq->curr
    rq->curr = new_thread;
    smp_mb(); /* not powerpc! */
    /* return to user */
    usermode load or store to X
                                            smp_mb();
    ----------------------------------------------------------------

Due to the fr (from-read) link from the check of rq->curr to the
scheduler's write, we need full memory barriers on both sides.  However,
we don't want to lose the powerpc optimization, at least not in the
common case.

Scenario 2: Access following sys_membarrier() must see changes from
    thread that concurrently switches out.

    ----------------------------------------------------------------
    Scheduler                               sys_membarrier()
    ---------                               ----------------
                                            /* begin system call */
                                            sys_membarrier()
                                            smp_mb();
    usermode load or store to X
    /* Schedule from user */
    smp_mb();
    rq->curr = new_thread;
                                            Check rq->curr
                                            smp_mb();
    smp_mb(); /* not powerpc! */
    /* return to user */
                                            usermode load or store to Y
    ----------------------------------------------------------------

Here less ordering is required because the read returns the value
previously written (an rf link rather than an fr link).  Weaker barriers
could therefore be used, but full memory barriers are in place in any
case.
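For concreteness, here is a minimal sketch of the rq->curr scan that the
"Check rq->curr" lines in the diagrams refer to.  This is not the in-tree
implementation: cpu_rq() is private to kernel/sched/, and the helper
names membarrier_ipi() and membarrier_sketch() are made up.  The point
is the placement of the two smp_mb() calls, which are the barriers shown
in the sys_membarrier() column above:

#include <linux/sched.h>
#include <linux/smp.h>
/* Sketch only: cpu_rq() lives in kernel/sched/sched.h, so real code
 * of this form would have to live in kernel/sched/. */

static void membarrier_ipi(void *info)
{
	smp_mb();	/* Order accesses on the interrupted CPU. */
}

static void membarrier_sketch(void)
{
	int cpu;

	smp_mb();	/* Order caller's prior accesses before the scan. */
	for_each_online_cpu(cpu) {
		struct task_struct *p = READ_ONCE(cpu_rq(cpu)->curr);

		/* IPI only the CPUs now running a thread of this process. */
		if (p && p->mm == current->mm)
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
	}
	smp_mb();	/* Order the scan before caller's later accesses. */
}

A CPU that is not running one of the caller's threads is skipped, which
is exactly why the scheduler-side barriers around the rq->curr update
matter in the two scenarios above.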
Potential resolutions, including known stupid ones:

A.  IPI all CPUs all the time.  Not so good for real-time workloads,
    and a usermode-induced set of IPIs could potentially be used for
    a denial-of-service (DoS) attack.

B.  Lock all runqueues all the time.  This could potentially also be
    used in a usermode-induced DoS attack.

C.  Explicitly interact with all threads rather than with CPUs.  This
    can be quite expensive for the surprisingly common case where
    applications have very large numbers of threads.  (Java, we are
    looking at you!!!)

D.  Just keep the redundant smp_mb() and say "no" to Andy's x86
    optimizations.  We would like to avoid the performance degradation
    in both cases.

E.  Require that threads register before using sys_membarrier() for
    private or JIT usage.  (The historical implementation using
    synchronize_sched() would continue to -not- require registration,
    both for compatibility and because there is no need to do so.)

    For x86 and powerpc, this registration would set a TIF flag on
    all of the current process's threads.  This flag would be inherited
    by any later thread creation within that process, and would be
    cleared by fork() and exec().  When this TIF flag is set, the
    return-to-user path would execute additional code that would ensure
    that ordering and newly JITed code were handled correctly.  We
    believe that checks for these TIF flags could be combined with
    existing checks to avoid adding any overhead in the common case
    where the process was not using these sys_membarrier() features.

    For all other architectures, the registration step would be a
    no-op.  (A rough userspace sketch of this two-step usage appears
    at the end of this message.)

Does anyone have any better solution?  If so, please don't keep it
a secret!

                                                        Thanx, Paul
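------------------------------------------------------------------------

For concreteness, a rough userspace sketch of the registration flow
proposed in (E).  The membarrier() wrapper and the command names and
values are hypothetical, since no registration ABI exists as of this
writing; only the register-then-use pattern is the point:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical command values; no such ABI exists yet. */
#define MEMBARRIER_CMD_REGISTER_PRIVATE	(1 << 3)
#define MEMBARRIER_CMD_PRIVATE		(1 << 4)

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* One-time registration, e.g., at process start.  This is
	 * where the TIF flag would be set on all of the process's
	 * threads. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE, 0)) {
		perror("membarrier register");
		exit(EXIT_FAILURE);
	}

	/* Later, wherever a process-wide barrier is needed, for
	 * example on the slow path of a userspace-RCU-style
	 * algorithm. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE, 0)) {
		perror("membarrier");
		exit(EXIT_FAILURE);
	}

	return 0;
}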