From: Jann Horn
Date: Wed, 22 Jan 2020 09:23:00 +0100
Subject: Re: [RFC PATCH v1] pin_on_cpu: Introduce thread CPU pinning system call
In-Reply-To: <2049164886.596497.1579641536619.JavaMail.zimbra@efficios.com>
References: <20200121160312.26545-1-mathieu.desnoyers@efficios.com>
 <430172781.596271.1579636021412.JavaMail.zimbra@efficios.com>
 <2049164886.596497.1579641536619.JavaMail.zimbra@efficios.com>
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, Joel Fernandes,
 Ingo Molnar, Catalin Marinas, Dave Watson, Will Deacon, shuah,
 Andi Kleen, linux-kselftest,
 "H. Peter Anvin", Chris Lameter, Russell King, Michael Kerrisk, Paul,
 Paul Turner, Boqun Feng, Josh Triplett, rostedt, Ben Maurer, linux-api,
 Andy Lutomirski
List-ID: linux-kernel@vger.kernel.org

On Tue, Jan 21, 2020 at 10:18 PM Mathieu Desnoyers wrote:
> ----- On Jan 21, 2020, at 3:35 PM, Jann Horn jannh@google.com wrote:
>
> > On Tue, Jan 21, 2020 at 8:47 PM Mathieu Desnoyers wrote:
> >>
> >> ----- On Jan 21, 2020, at 12:20 PM, Jann Horn jannh@google.com wrote:
> >>
> >> > On Tue, Jan 21, 2020 at 5:13 PM Mathieu Desnoyers wrote:
> >> >> There is an important use-case which is not possible with the
> >> >> "rseq" (Restartable Sequences) system call, which was left as
> >> >> future work.
> >> >>
> >> >> That use-case is to modify user-space per-cpu data structures
> >> >> belonging to specific CPUs which may be brought offline and
> >> >> online again by CPU hotplug. This can be used by memory
> >> >> allocators to migrate free memory pools when CPUs are brought
> >> >> offline, or by ring buffer consumers to target specific per-CPU
> >> >> buffers, even when CPUs are brought offline.
> >> >>
> >> >> A few rather complex prior attempts were made to solve this.
> >> >> Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
> >> >> That complexity was generally frowned upon, even by their author.
> >> >>
> >> >> This patch fulfills this use-case in a refreshingly simple way:
> >> >> it introduces a "pin_on_cpu" system call, which allows user-space
> >> >> threads to pin themselves on a specific CPU (which needs to be
> >> >> present in the thread's allowed cpu mask), and then clear this
> >> >> pinned state.
> >> > [...]
> >> >> For instance, this allows implementing this userspace library API
> >> >> for incrementing a per-cpu counter for a specific cpu number
> >> >> received as parameter:
> >> >>
> >> >> static inline __attribute__((always_inline))
> >> >> int percpu_addv(intptr_t *v, intptr_t count, int cpu)
> >> >> {
> >> >>         int ret;
> >> >>
> >> >>         ret = rseq_addv(v, count, cpu);
> >> >> check:
> >> >>         if (rseq_unlikely(ret)) {
> >> >>                 pin_on_cpu_set(cpu);
> >> >>                 ret = rseq_addv(v, count, percpu_current_cpu());
> >> >>                 pin_on_cpu_clear();
> >> >>                 goto check;
> >> >>         }
> >> >>         return 0;
> >> >> }
> >> >
> >> > What does userspace have to do if the set of allowed CPUs switches all
> >> > the time? For example, on Android, if you first open Chrome and then
> >> > look at its allowed CPUs, Chrome is allowed to use all CPU cores
> >> > because it's running in the foreground:
> >> >
> >> > walleye:/ # ps -AZ | grep 'android.chrome$'
> >> > u:r:untrusted_app:s0:c145,c256,c512,c768 u0_a145 7845 805 1474472
> >> > 197868 SyS_epoll_wait f09c0194 S com.android.chrome
> >> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> >> > /proc/7845/status
> >> > 3:cpuset:/top-app
> >> > Cpus_allowed_list: 0-7
> >> >
> >> > But if you then switch to the home screen, the application is moved
> >> > into a different cgroup, and is restricted to two CPU cores:
> >> >
> >> > walleye:/ # grep cpuset /proc/7845/cgroup; grep Cpus_allowed_list
> >> > /proc/7845/status
> >> > 3:cpuset:/background
> >> > Cpus_allowed_list: 0-1
> >>
> >> Then at that point, pin_on_cpu() would only be allowed to pin on
> >> CPUs 0 and 1.
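As a point of reference for the scenario above: the allowed set that pin_on_cpu()
would be restricted to at any given instant is the same set that sched_getaffinity(2)
reports for the calling thread. A minimal sketch (plain glibc API, nothing from the
proposed patch) that dumps it, with the caveat that the set may change again the
moment the call returns:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;

		/* Query the calling thread's current affinity mask. */
		if (sched_getaffinity(0, sizeof(set), &set) == -1) {
			perror("sched_getaffinity");
			return 1;
		}
		for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
			if (CPU_ISSET(cpu, &set))
				printf("allowed: cpu %d\n", cpu);
		return 0;
	}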
> >
> > Which means that you can't actually reliably use pin_on_cpu_set() to
> > manipulate percpu data structures since you have to call it with the
> > assumption that it might randomly fail at any time, right?
>
> Only if the cpu affinity of the thread is being changed concurrently
> by another thread, which is a possibility in some applications, indeed.

Not just some applications, but also some environments, right? See the
Android example - the set of permitted CPUs is changed not by the
application itself, but by a management process that uses cgroup
modifications to indirectly change the set of permitted CPUs. I wouldn't
be surprised if the same could happen in e.g. container environments.

> > And then
> > userspace needs to code a fallback path that somehow halts all the
> > threads with thread-directed signals or something?
>
> The example use of pin_on_cpu() did not include handling of the return
> value in that case (-1, errno=EINVAL) for conciseness. But yes, the
> application would have to handle this.
>
> It's not so different from error handling which is required when using
> sched_setaffinity(), which can fail with -1, errno=EINVAL in the
> following scenario:
>
>        EINVAL The affinity bit mask mask contains no processors that are
>               currently physically on the system and permitted to the
>               thread according to any restrictions that may be imposed
>               by the "cpuset" mechanism described in cpuset(7).

Except that sched_setaffinity() is normally just a performance
optimization, right? Whereas pin_on_cpu() is more of a correctness thing?

> > Especially if the task trying to use pin_on_cpu_set() isn't allowed to
> > pin to the target CPU, but all the other tasks using the shared data
> > structure are allowed to do that. Or if the CPU you want to pin to is
> > always removed from your affinity mask immediately before
> > pin_on_cpu_set() and added back immediately afterwards.
>
> I am tempted to state that using pin_on_cpu() targeting a disallowed cpu
> should be considered a programming error and handled accordingly by the
> application.

How can it be a programming error if that situation can be triggered by
legitimate external modifications to CPU affinity?

[...]

> >> > I'm wondering whether it might be possible to rework this mechanism
> >> > such that, instead of moving the current task onto a target CPU, it
> >> > prevents all *other* threads of the current process from running on
> >> > that CPU (either entirely or in user mode). That might be the easiest
> >> > way to take care of issues like CPU hotplugging and changing cpusets
> >> > all at once? The only potential issue I see with that approach would
> >> > be that you wouldn't be able to use it for inter-process
> >> > communication; and I have no idea whether this would be good or bad
> >> > performance-wise.
> >>
> >> Firstly, inter-process communication over shared memory is one of my
> >> use-cases (for lttng-ust's ring buffer).
> >>
> >> I'm not convinced that applying constraints on all other threads
> >> belonging to the current process would be easier or faster than
> >> migrating the current thread over to the target CPU. I'm unsure how
> >> those additional constraints would fit with other threads already
> >> having their own cpu affinity masks (which could generate an empty
> >> cpumask by adding an extra constraint).
> >
> > Hm - is an empty cpumask a problem here?
> > If one task is in the middle
> > of a critical section for performing maintenance on a percpu data
> > structure, isn't it a nice design property to exclude concurrent
> > access from other tasks to that data structure automatically (by
> > preventing those tasks from running on that CPU)? That way the
> > (presumably rarely-executed) update path doesn't have to be
> > rseq-reentrancy-safe.
>
> Given we already have rseq, simply using it to protect against other
> threads trying to touch the same per-cpu data seems rather more
> lightweight than to try to exclude all other threads from that CPU for
> a possibly unbounded amount of time.

That only works if the cpu-targeted maintenance operation is something
that can be implemented in rseq, right? I was thinking that it might be
nice to avoid that limitation - but I don't know much about the kinds of
data structures that one might want to build on top of rseq, so maybe
that's a silly idea.

> Allowing a completely empty cpumask could effectively allow those
> critical sections to prevent execution of possibly higher priority
> threads on the system, which ends up being the definition of a priority
> inversion, which I'd like to avoid.

Linux does have the infrastructure for RT futexes, right? Maybe that
could be useful here.
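The "RT futexes" mentioned here presumably refer to the priority-inheritance futex
operations (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI) documented in futex(2), which let the
kernel boost the lock owner's priority while a higher-priority waiter is blocked on
it; glibc exposes the same machinery via PTHREAD_PRIO_INHERIT mutexes. A minimal
sketch of how a userspace lock could use the raw operations (illustrative only, not
part of the patch under discussion):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <linux/futex.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long futex(uint32_t *uaddr, int op, uint32_t val)
	{
		return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
	}

	/*
	 * PI futex word protocol: 0 means unlocked; otherwise it holds the
	 * owner's TID, possibly with FUTEX_WAITERS set by the kernel once
	 * someone blocks on the lock.
	 */
	static void pi_lock(_Atomic uint32_t *lock)
	{
		uint32_t tid = syscall(SYS_gettid);
		uint32_t expected = 0;

		/* Uncontended fast path: install our TID as the owner. */
		if (atomic_compare_exchange_strong(lock, &expected, tid))
			return;
		/* Contended: the kernel queues us and priority-boosts the owner. */
		while (futex((uint32_t *)lock, FUTEX_LOCK_PI, 0) == -1 &&
		       errno == EINTR)
			;
	}

	static void pi_unlock(_Atomic uint32_t *lock)
	{
		uint32_t tid = syscall(SYS_gettid);
		uint32_t expected = tid;

		/* Fast path: no waiters were recorded, just clear the word. */
		if (atomic_compare_exchange_strong(lock, &expected, 0))
			return;
		/* Slow path: let the kernel hand the lock to the top waiter. */
		futex((uint32_t *)lock, FUTEX_UNLOCK_PI, 0);
	}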