Received: by 10.192.165.148 with SMTP id m20csp183514imm; Tue, 1 May 2018 20:54:28 -0700 (PDT) X-Google-Smtp-Source: AB8JxZpLd+jXLWbM7wCNTwR1hZ9zYrlaslAjpDI549cnd6vGmIlmJPDd1wdtwlFakiXJuktMi2+F X-Received: by 2002:a17:902:6113:: with SMTP id t19-v6mr18097086plj.372.1525233268818; Tue, 01 May 2018 20:54:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1525233268; cv=none; d=google.com; s=arc-20160816; b=lpiGezDY1Y9JaAL4wIGcVRB9yNkoqAMTIyJu8I+km3VN0E1GPu2WILLrjLR4tuqr/b 2bJ9WDOLbe5ceMfnbsaYr/XpdhbKptQIpaQwC9WEZYwSDw7n8X4of3i0SrUaWEGa4aun pumN/EA5T7nIQ2p+jj/Nz9JPFSS0ze2Y2aSJXnEeASLtJ3Y6Q3Jq5E4zHeW7zHDZRJ5m wB9YLHlUC/bJhKHKn5SPKNcdLt3JGcEWz5jo4QtAMYs27eF1xspjLmJLhq2db/UHa5AX J7ExnXiKFAjPqetsgqx0grQzrIEEGZfm/0wqoz/x2Fi+rbBgR2vlYsiZxYzbvSQtKku+ iOTw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature :arc-authentication-results; bh=zlnpnDRrfWVC6nHAVJsV8f9+JvIWD6y4dqVwEivVB10=; b=hYN3ER7iQ4HGbhQyBflI+O+ra0r/UgdqHf0QH3xxqJyShXKxWvTXZjIaA1868ODs9R dlzfxZa2OGpJMmOZMMIwWePlUixv0iQBEGsvXu2Hgp1YXJ7WUz0iyqWQWDLYLK8s9fyZ ePYW0vd7tO9m/wJrkcCCicGaeruuJft+BRcgmctPX83PGMv8JETpEc7TJsXMtijFqZvW qcKg9CVdxRCU1uvD2DP48P35jXUeVEKsyFrTxJiOZ/xYXVCUqU1jYG7FMXlWlYx7IvmJ cROErY/KIEN7vJOnsBk+nLwm+lAKRhR6iLzAYxdwlt/e2s+sEBgae+BlJOvK/yHjTtx7 eLPw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=KC/DXuuZ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u6-v6si6201949pgv.420.2018.05.01.20.54.14; Tue, 01 May 2018 20:54:28 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=KC/DXuuZ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751004AbeEBDyC (ORCPT + 99 others); Tue, 1 May 2018 23:54:02 -0400 Received: from mail-it0-f48.google.com ([209.85.214.48]:55687 "EHLO mail-it0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750822AbeEBDx7 (ORCPT ); Tue, 1 May 2018 23:53:59 -0400 Received: by mail-it0-f48.google.com with SMTP id 144-v6so15889815iti.5 for ; Tue, 01 May 2018 20:53:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=zlnpnDRrfWVC6nHAVJsV8f9+JvIWD6y4dqVwEivVB10=; b=KC/DXuuZK3EHlhKhiKMuEq3DjS+0Izf1UCKnnfehUY3jEvkeHVIeUo8cdyXNTPJSJF XylclmE5bOzxGeXrqinDyDjlaVv/eOsdA0nfbP9H1uUVC6gzcCuC14fq2n2lP2v2RzS/ /WKnYsJSj15TPS0bNjlY2KWX/EDpMlRos/aIkdQkleXT0vrO9of/MS59tzt2oiOlKA6H 8EKw1wp8pihhgQOOFCjzrAmlecfgpRHw8rhbx46/iN0iiUyVEJBl0bYWr9KtjfuFPYlV FkCMyoBivhf29VA8B9jVb7TqvcaAsnO55KwcWNRgIRfRFjgMfp/Uq9Lw/eJw7evN87YR By8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=zlnpnDRrfWVC6nHAVJsV8f9+JvIWD6y4dqVwEivVB10=; b=gW1Fz/m9kGA6LuJWB4ADquRIikkTC5F+PHB/gOzWiNiAYkSYWnxptTUk3VnSlq+zf7 mAfVbf0lN0zXTQ0zrCxBdwUVT/QXpkrzluc1sXdZtMxOnUZtprTuETaRyHnYMIVmZlFj saOiQG/ah6U1E5Gcg/uVMudfzC+NzUN1H0lO2AFAA/Ci828WJAY4aOIW4u/cUtQAE10l NpscSuTF/YCtCgfhoY29iL3YBd72NRmP5OsqvNMPtP0iB56x86JwZ3hQU4QfJqqn8Vwd kZdDIOTmFymffNrdQhs5lPOpltBdeWGXanX8Z/ilcbQqdoHKqHcXKj8RJECQyJDnfV/i Kp/w== X-Gm-Message-State: ALQs6tBAk0oyWv2aKJO98sg7IHw2obDfjd+aCcLZu3JIIQzksvuErxel uJofqX3QIgfvFeGz0r9pw4qTMS1EHineBe8FQYoWNQ== X-Received: by 2002:a24:3ccf:: with SMTP id m198-v6mr7997128ita.113.1525233238693; Tue, 01 May 2018 20:53:58 -0700 (PDT) MIME-Version: 1.0 References: <20180430224433.17407-1-mathieu.desnoyers@efficios.com> In-Reply-To: <20180430224433.17407-1-mathieu.desnoyers@efficios.com> From: Daniel Colascione Date: Wed, 02 May 2018 03:53:47 +0000 Message-ID: Subject: Re: [RFC PATCH for 4.18 00/14] Restartable Sequences To: mathieu.desnoyers@efficios.com Cc: Peter Zijlstra , paulmck@linux.vnet.ibm.com, boqun.feng@gmail.com, luto@amacapital.net, davejwatson@fb.com, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, pjt@google.com, Andrew Morton , linux@arm.linux.org.uk, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, Andrew Hunter , andi@firstfloor.org, cl@linux.com, bmaurer@fb.com, rostedt@goodmis.org, josh@joshtriplett.org, torvalds@linux-foundation.org, catalin.marinas@arm.com, will.deacon@arm.com, mtk.manpages@gmail.com, Joel Fernandes Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mathieu: this work looks very cool. See inline. On Mon, Apr 30, 2018 at 3:48 PM Mathieu Desnoyers < mathieu.desnoyers@efficios.com> wrote: > Hi, > Here is an updated RFC round of the Restartable Sequences patchset > based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing > only the rseq system call, keeping the rest for later. > This already enables speeding up the Facebook jemalloc and arm64 PMC > read from user-space use-cases, as well as speedup of use-cases relying > on getting the current cpu number from user-space. We'll have to wait > until a more complete solution is introduced before the LTTng-UST > tracer can replace its ring buffer atomic instructions with rseq > though. But let's proceed one step at a time. I like the general theme of the kernel using its "superpowers" (in this case, knowledge of preemption) to help userspace do a better job without userspace code needing to enter the kernel to benefit. The per-CPU data structures this patch enables help in a lot of use cases, but I think there's another use case that you might not have considered, one that can benefit from a extension to your proposed API. Consider mutexes: in the kernel, for mutual exclusion, we can use a spinlock, which in the kernel ends up being simpler and (in a lot of scenarios) more efficient than a mutex: a core that takes a spinlock promises to keep the lock held for only a very short time, and it disables interrupts to delay asynchronous work that might unexpectedly lengthen the critical section. A different core that wants to grab that spinlock can just spin on the lock word, confident that its spin will be short because any core owning the lock is guaranteed to release it very quickly. (Long spins would be very bad for power.) The overall result is a lock that's much lighter than a mutex. (A spinlock can also be used in places we can't sleep, but this ability isn't relevant to the discussion below.) Userspace doesn't have a good equivalent to a lightweight spinlock. While you can build a spinlock in userspace, the result ends up having serious problems because of preemption: first, a thread owning such a lock can be preempted in its critical section, lengthening the critical section arbitrarily. Second, a thread spinning on a lock will keep spinning even when the owning thread isn't scheduled anywhere. Userspace can just implement a mutex as a try-acquire and a FUTEX_WAIT on failure. This approach works fine when there's no contention, but a system call is a pretty heavy operation. Why pay for a system call on occasional light congestion with a short critical section. Can we do better? The usual approach to "better" is an "adaptive mutex". Such a thing, when it attempts to acquire a lock another thread owns, spins for some number of iterations, then falls back to futex. I guess that's a little better than immediately jumping to futex, but it's not optimal. We can still spin when the lock owner isn't scheduled, and the spin count is usually some guess (either specified manually or estimated statistically) that's not guaranteed to produce decent results. Even if we do pick a good spin count, we run a very good chance of under- or over-spinning on any given lock-acquire. We always want to sleep when spinning would be pointless. One important case of the spin-while-not-scheduled problem is operation on a uniprocessor system: on such a system, only a single task can be scheduled at a time, making all spins maximally pointless. The usual approach to avoiding wasted spins for adaptive mutexes is for the adaptive mutex library to ask upon initialization "How many cores are in this system?", and if the answer comes back as "1", disable spinning. This approach is inadequate: CPU affinity can change at arbitrary times, and CPU affinity can produce the same spin pessimization that a uniprocessor system does. I think a small enhancement to rseq would let us build a perfect userspace mutex, one that spins on lock-acquire only when the lock owner is running and that sleeps otherwise, freeing userspace from both specifying ad-hoc spin counts and from trying to detect situations in which spinning is generally pointless. It'd work like this: in the per-thread rseq data structure, we'd include a description of a futex operation for the kernel would perform (in the context of the preempted thread) upon preemption, immediately before schedule(). If the futex operation itself sleeps, that's no problem: we will have still accomplished our goal of running some other thread instead of the preempted thread. Suppose we make a userspace mutex implemented with a lock word having three bits: acquired, sleep_mode, and wait_pending, with the rest of the word not being relevant at the moment. We'd implement lock-acquire the usual way, CASing the acquired bit into the lock and deeming the lock taken if we're successful. Except that before attempting the CAS, we'd configure the current thread's rseq area to bitwise-or the sleep_mode bit into the lock word if the current thread is scheduled. Now, imagine that we're a different thread that wants to take the lock while the first thread owns it. We'll attempt a CAS as before. The CAS will fail. That's fine --- we'll spin by retrying the CAS. Here's where we differ from a conventional from a conventional adaptive mutex. A normal adaptive mutex will execute a fixed maximum number of CAS attempts, then FUTEX_WAIT. We, instead, keep spinning until we either grab the lock or we notice the sleep_mode bit set in the lock word. (On every CAS attempt, we update our local cached copy of the lock word.) When we do notice the sleep_mode bit set, we'll fall back to the usual sleeping strategy: CAS the wait_pending bit into the lock word and FUTEX_WAIT. Back in the owning thread, when we release the model, we'll CAS again to reset the acquired bit and (if set) sleep_mode bit, and if we see wait_pending, FUTEX_WAKE any waiters. At this point, we can disable the rseq registration. (If we're preempted after the unlock but before the rseq disarm, we'll spuriously set sleep_mode, but that's fine, since we'll reset it on next lock-acquire.) This scheme gives us optimal spinning behavior. We spin on lock-acquire only as long as the owning thread is actually running. We don't spin at all on uniprocessor machines or in situations where we've set up affinity to create the moral equivalent of a uniprocessor system. We correctly fall back to sleeping when the owner itself schedules (which indicates that the critical section is likely to last a while). And we don't need to choose some arbitrary constant or use some estimation function to guess how many times we want to spin. We can stop spinning as soon as we know it'll be unproductive. In practice, I think you'd want to impose a maximum spin count anyway to guard against 1) unexpected non-scheduling critical section lengthening via bugs, and 2) the possibility that the futex-on-schedule operation sleeps before setting sleep_mode. If you don't think the futex-on-schedule thing is a good idea, there are other ways to accomplish the same basic task. For example, you could add an is_running field to struct rseq, and the kernel would take of making this field true only when the task owning the struct rseq is, in fact, running. A sufficiently clever runtime system could stash the owning thread ID in the lockword and provide a way to find a thread's struct rseq given its thread ID. On lock contention, instead of switching to FUTEX_WAIT when we notice sleep_mode set in the lock word, we'd switch to FUTEX_WAIT when we notice is_running in the owning thread's struct rseq become false. This approach is probably simpler, but makes each spin a bit heavier due to the need to fetch two separate memory locations (the lockword and the is_running field). Anyway, I'm sure there are other variations on the general theme of the rseq mechanism helping to optimize mutex acquisition. What do you think?