From: Jann Horn
Date: Tue, 21 Jan 2020 18:20:47 +0100
Subject: Re: [RFC PATCH v1] pin_on_cpu: Introduce thread CPU pinning system call
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Thomas Gleixner, kernel list, Joel Fernandes,
    Ingo Molnar, Catalin Marinas, Dave Watson, Will Deacon, Shuah Khan,
    Andi Kleen, "open list:KERNEL SELFTEST FRAMEWORK", "H . Peter Anvin",
    Chris Lameter, Russell King, Michael Kerrisk, "Paul E . McKenney",
    Paul Turner, Boqun Feng, Josh Triplett, Steven Rostedt, Ben Maurer,
    Linux API, Andy Lutomirski
In-Reply-To: <20200121160312.26545-1-mathieu.desnoyers@efficios.com>

On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers wrote:
> There is an important use-case which is not possible with the
> "rseq" (Restartable Sequences) system call, which was left as
> future work.
>
> That use-case is to modify user-space per-cpu data structures
> belonging to specific CPUs which may be brought offline and
> online again by CPU hotplug. This can be used by memory
> allocators to migrate free memory pools when CPUs are brought
> offline, or by ring buffer consumers to target specific per-CPU
> buffers, even when CPUs are brought offline.
>
> A few rather complex prior attempts were made to solve this.
> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> That complexity was generally frowned upon, even by their author.
>
> This patch fulfills this use-case in a refreshingly simple way:
> it introduces a "pin_on_cpu" system call, which allows user-space
> threads to pin themselves on a specific CPU (which needs to be
> present in the thread's allowed cpu mask), and then clear this
> pinned state.

[...]

> For instance, this allows implementing this userspace library API
> for incrementing a per-cpu counter for a specific cpu number
> received as parameter:
>
> static inline __attribute__((always_inline))
> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> {
>         int ret;
>
>         ret = rseq_addv(v, count, cpu);
> check:
>         if (rseq_unlikely(ret)) {
>                 pin_on_cpu_set(cpu);
>                 ret = rseq_addv(v, count, percpu_current_cpu());
>                 pin_on_cpu_clear();
>                 goto check;
>         }
>         return 0;
> }

What does userspace have to do if the set of allowed CPUs switches all
the time? For example, on Android, if you first open Chrome and then
look at its allowed CPUs, Chrome is allowed to use all CPU cores
because it's running in the foreground:

walleye:/ # ps -AZ | grep 'android.chrome$'
u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472 197868 SyS_epoll_wait f09c0194 S com.android.chrome
walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list /proc/7845/status
3:cpuset:/top-app
Cpus_allowed_list: 0-7

But if you then switch to the home screen, the application is moved
into a different cgroup, and is restricted to two CPU cores:

walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list /proc/7845/status
3:cpuset:/background
Cpus_allowed_list: 0-1

At the same time, I also wonder whether it is a good idea to allow
userspace to stay active on a CPU even after the task has been told to
move to another CPU core - that's probably not exactly a big deal, but
seems suboptimal to me.

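To make the cpuset question above concrete, here is a minimal C sketch,
assuming glibc's sched_getaffinity() wrapper, of how userspace could
re-check whether a given CPU is still in the calling thread's allowed
mask. The helper names get_allowed_cpus() and cpu_is_allowed() are
hypothetical and are not part of the patch. Any such check is only a
snapshot: the mask can change again immediately after the call
returns, for example when the task is moved to another cpuset cgroup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): copy the calling thread's
 * current allowed-CPU mask into *set. Returns 0 on success, -1 on error.
 * A fixed-size cpu_set_t covers CPU_SETSIZE CPUs; on larger machines
 * the call can fail and a dynamically sized set would be needed.
 */
static int get_allowed_cpus(cpu_set_t *set)
{
	assert(set != NULL);
	CPU_ZERO(set);
	/* pid 0 means "the calling thread". */
	return sched_getaffinity(0, sizeof(*set), set);
}

/*
 * Hypothetical helper: report whether `cpu` was in the allowed mask at
 * the time of the call. The answer may already be stale when it is used.
 */
static bool cpu_is_allowed(int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return false;
	if (get_allowed_cpus(&set) != 0)
		return false;
	return CPU_ISSET(cpu, &set);
}
```

A caller could test cpu_is_allowed(cpu) before trying to pin itself to
that CPU, but it would still have to handle the pin request failing,
because the check and the pin are not atomic with respect to cpuset
changes.
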
I'm wondering whether it might be possible to rework this mechanism
such that, instead of moving the current task onto a target CPU, it
prevents all *other* threads of the current process from running on
that CPU (either entirely or in user mode). That might be the easiest
way to take care of issues like CPU hotplugging and changing cpusets
all at once? The only potential issue I see with that approach would
be that you wouldn't be able to use it for inter-process
communication; and I have no idea whether this would be good or bad
performance-wise.
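
As a rough illustration of the exclusion idea only, and not of any
existing or proposed kernel interface, the following C sketch computes
the CPU mask that the other threads of a process would be confined to
while one CPU is reserved. The helper name mask_excluding_cpu() is
hypothetical; the code uses only glibc's cpu_set_t macros.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the process's allowed CPUs and one CPU to
 * reserve, store "allowed minus reserved_cpu" in *others.
 * Returns the number of CPUs left for the other threads, or -1 if
 * reserved_cpu is out of range or not in the allowed set.
 */
static int mask_excluding_cpu(const cpu_set_t *allowed, int reserved_cpu,
			      cpu_set_t *others)
{
	assert(allowed != NULL && others != NULL);

	if (reserved_cpu < 0 || reserved_cpu >= CPU_SETSIZE ||
	    !CPU_ISSET(reserved_cpu, allowed))
		return -1;

	*others = *allowed;
	CPU_CLR(reserved_cpu, others);
	return CPU_COUNT(others);
}
```

With the background cpuset from the example above (CPUs 0-1),
reserving CPU 1 would leave only CPU 0 for all other threads, which
hints at the performance question raised here.
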