Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752135AbbFXXr4 (ORCPT ); Wed, 24 Jun 2015 19:47:56 -0400 Received: from mail-ig0-f173.google.com ([209.85.213.173]:35688 "EHLO mail-ig0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751614AbbFXXrs (ORCPT ); Wed, 24 Jun 2015 19:47:48 -0400 Subject: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections From: Paul Turner To: Peter Zijlstra , "Paul E. McKenney" , Mathieu Desnoyers Cc: Andrew Hunter , Andi Kleen , Lai Jiangshan , linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Steven Rostedt , Josh Triplett , Ingo Molnar , Andrew Morton , Andy Lutomirski , Linus Torvalds , Chris Lameter Date: Wed, 24 Jun 2015 15:26:09 -0700 Message-ID: <20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3927 Lines: 77 This is a fairly small series demonstrating a feature we've found to be quite powerful in practice, "restartable sequences". Most simply: these sequences comprise small snippets of user-code that are guaranteed to be (effectively) executed serially, with support for restart (or other handling) in the event of preemption or interruption. This (when combined with an in-memory cpu-id; potentially optional on some architectures) can be used to implement extremely fast per-cpu operations that do not rely on the use of actual processor atomics. We've used this to back performance critical code such as our malloc implementation with good results. We previously discussed this feature at LPC: http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Mathieu Desnoyers posted an alternate implementation based on the ideas above at: https://lkml.org/lkml/2015/5/21/472 This RFC is based on the approach we currently use internally. However, I'll likely posit that neither this approach, nor the one posted above, is what we should ultimately adopt (provided sufficient interest exists). The implementation in this series can be summarized as follows: - We allow a single (per-address) space critical section (and associated handler) to be defined. - When a thread with RSEQ configured (via new syscall) is interrupted within a critical section, we modify its return address. Either within signal delivery, or the notify-resume path. The scheduler touchpoint is only a preempt_notifier which (potentially, dependent on config) examines the kernel copy of pt_regs. There are a few core requirements which motivate the approach taken here: 1) We must not interfere with regular scheduling. Unlike preemption protection (or similar), there are no special considerations for code running within a critical region beyond that we must move to the restart handler if interrupted. 2) The code executing in scheduler context must be tightly constrained, both in terms of overhead and that it must not require access to user memory. 3) The fast-path must be fast. That is, user entry/execution/exit of a non-interrupted critical section is the most important case. The restart handler is a 'slow' path that should only happen comparatively infrequently. We're strongly motivated here by high-performance, low-level primitives: think malloc, or rcu_read_lock. While the contained implementation performs well under these constraints, it has some notable limitations which we should consider for more general usage: 1) The definition of a single region works well for statically linked binaries but can be challenging when shared-libraries want to use this feature. This is partially mitigated in our experience that a library of primitives is generally more useful than a myriad of custom sequences, but it's still a major concern. 2) Due to the nature of restart and explicit location requirements it's only really reasonable to write this critical section in assembly; which makes porting and maintenance less convenient. (One of the included tests shows a reasonable example of what this looks like.) 3) TLS spreads the entrails of its implementation all over compile _and_ link. This makes properly handling it within the critical section cumbersome in the shared binary case. We've been thinking about how to address these issues and are considering some alternate ABIs that still satisfy (1)-(3), but have a more convenient programming model. I'll send this as a follow-up, but wanted to first share the approach we've used to date. Thanks, - Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/