From: Jann Horn
Date: Tue, 21 Jan 2020 21:35:33 +0100
Subject: Re: [RFC PATCH v1] pin_on_cpu: Introduce thread CPU pinning system call
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, Joel Fernandes,
 Ingo Molnar, Catalin Marinas, Dave Watson, Will Deacon, shuah,
 Andi Kleen, linux-kselftest, "H. Peter Anvin", Chris Lameter,
 Russell King, Michael Kerrisk, "Paul E. McKenney", Paul Turner,
 Boqun Feng, Josh Triplett, rostedt, Ben Maurer, linux-api,
 Andy Lutomirski
In-Reply-To: <430172781.596271.1579636021412.JavaMail.zimbra@efficios.com>
References: <20200121160312.26545-1-mathieu.desnoyers@efficios.com>
 <430172781.596271.1579636021412.JavaMail.zimbra@efficios.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 21, 2020 at 8:47 PM Mathieu Desnoyers wrote:
>
> ----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:
>
> > On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers
> > wrote:
> >> There is an important use-case which is not possible with the
> >> "rseq" (Restartable Sequences) system call, which was left as
> >> future work.
> >>
> >> That use-case is to modify user-space per-cpu data structures
> >> belonging to specific CPUs which may be brought offline and
> >> online again by CPU hotplug.
> >> This can be used by memory
> >> allocators to migrate free memory pools when CPUs are brought
> >> offline, or by ring buffer consumers to target specific per-CPU
> >> buffers, even when CPUs are brought offline.
> >>
> >> A few rather complex prior attempts were made to solve this.
> >> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> >> That complexity was generally frowned upon, even by their author.
> >>
> >> This patch fulfills this use-case in a refreshingly simple way:
> >> it introduces a "pin_on_cpu" system call, which allows user-space
> >> threads to pin themselves on a specific CPU (which needs to be
> >> present in the thread's allowed cpu mask), and then clear this
> >> pinned state.
>
> [...]
> >> For instance, this allows implementing this userspace library API
> >> for incrementing a per-cpu counter for a specific cpu number
> >> received as parameter:
> >>
> >> static inline __attribute__((always_inline))
> >> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> >> {
> >>         int ret;
> >>
> >>         ret = rseq_addv(v, count, cpu);
> >> check:
> >>         if (rseq_unlikely(ret)) {
> >>                 pin_on_cpu_set(cpu);
> >>                 ret = rseq_addv(v, count, percpu_current_cpu());
> >>                 pin_on_cpu_clear();
> >>                 goto check;
> >>         }
> >>         return 0;
> >> }
> >
> > What does userspace have to do if the set of allowed CPUs switches all
> > the time?
> > For example, on Android, if you first open Chrome and then
> > look at its allowed CPUs, Chrome is allowed to use all CPU cores
> > because it's running in the foreground:
> >
> > walleye:/ # ps -AZ | grep 'android.chrome$'
> > u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
> > 197868 SyS_epoll_wait f09c0194 S com.android.chrome
> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> > /proc/7845/status
> > 3:cpuset:/top-app
> > Cpus_allowed_list: 0-7
> >
> > But if you then switch to the home screen, the application is moved
> > into a different cgroup, and is restricted to two CPU cores:
> >
> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> > /proc/7845/status
> > 3:cpuset:/background
> > Cpus_allowed_list: 0-1
>
> Then at that point, pin_on_cpu() would only be allowed to pin on
> CPUs 0 and 1.

Which means that you can't actually reliably use pin_on_cpu_set() to
manipulate percpu data structures, since you have to call it with the
assumption that it might randomly fail at any time, right? And then
userspace needs to code a fallback path that somehow halts all the
threads with thread-directed signals or something?

Especially if the task trying to use pin_on_cpu_set() isn't allowed to
pin to the target CPU, but all the other tasks using the shared data
structure are allowed to do that. Or if the CPU you want to pin to is
always removed from your affinity mask immediately before
pin_on_cpu_set() and added back immediately afterwards.

> > At the same time, I also wonder whether it is a good idea to allow
> > userspace to stay active on a CPU even after the task has been told to
> > move to another CPU core - that's probably not exactly a big deal, but
> > seems suboptimal to me.
>
> Do you mean the following scenario for a given task ?
>
> 1) set affinity to CPU 0,1,2
> 2) pin on CPU 2
> 3) set affinity to CPU 0,1
>
> In the patch I propose here, step (3) is forbidden by this added check in
> __set_cpus_allowed_ptr, which is used by sched_setaffinity(2):
>
> +	/* Prevent removing the currently pinned CPU from the allowed cpu mask. */
> +	if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

How does that interact with things like cgroups? What happens when root
tries to move a process from the cpuset:/top-app cgroup to the
cpuset:/background cgroup? Does that also just throw an error code?

> > I'm wondering whether it might be possible to rework this mechanism
> > such that, instead of moving the current task onto a target CPU, it
> > prevents all *other* threads of the current process from running on
> > that CPU (either entirely or in user mode). That might be the easiest
> > way to take care of issues like CPU hotplugging and changing cpusets
> > all at once? The only potential issue I see with that approach would
> > be that you wouldn't be able to use it for inter-process
> > communication; and I have no idea whether this would be good or bad
> > performance-wise.
>
> Firstly, inter-process communication over shared memory is one of my use-cases
> (for lttng-ust's ring buffer).
>
> I'm not convinced that applying constraints on all other threads belonging to
> the current process would be easier or faster than migrating the current thread
> over to the target CPU. I'm unsure how those additional constraints would
> fit with other threads already having their own cpu affinity masks (which
> could generate an empty cpumask by adding an extra constraint).

Hm - is an empty cpumask a problem here?
If one task is in the middle of a critical section for performing
maintenance on a percpu data structure, isn't it a nice design property
to exclude concurrent access from other tasks to that data structure
automatically (by preventing those tasks from running on that CPU)? That
way the (presumably rarely-executed) update path doesn't have to be
rseq-reentrancy-safe.
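To make the race described above concrete, here is a minimal userspace
sketch. It is illustrative only: pin_on_cpu_set() is the proposed (not
yet existing) syscall, so the sketch only demonstrates the pre-check
against the current affinity mask, using the real sched_getaffinity(2)
interface via Python's os.sched_getaffinity; the can_pin() helper is
made up for this example.

```python
import os

def can_pin(target_cpu: int) -> bool:
    """Check whether target_cpu is currently in this process's allowed mask.

    Even when this returns True, a cgroup/cpuset change (like the
    foreground-to-background move in the Chrome example) can shrink the
    mask before a hypothetical pin_on_cpu_set(target_cpu) call runs, so
    the caller would still need a fallback path: the check is TOCTOU-racy.
    """
    allowed = os.sched_getaffinity(0)  # 0 = the calling process
    return target_cpu in allowed

# The lowest currently-allowed CPU is, by definition, pinnable
# at the instant of the check - but only at that instant.
lowest = min(os.sched_getaffinity(0))
print(can_pin(lowest))
```

The point of the sketch is that no amount of pre-checking closes the
window between the affinity query and the pin attempt, which is why a
pin_on_cpu_set() caller must treat failure as possible at any time.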