Date: Wed, 9 Mar 2016 22:34:46 -0500
From: Rich Felker
To: Linus Torvalds
Cc: Ingo Molnar, Andy Lutomirski, the arch/x86 maintainers,
    Linux Kernel Mailing List, Borislav Petkov,
    "musl@lists.openwall.com", Andrew Morton, Thomas Gleixner,
    Peter Zijlstra
Subject: Re: [musl] Re: [RFC PATCH] x86/vdso/32: Add AT_SYSINFO cancellation helpers
Message-ID: <20160310033446.GL9349@brightrain.aerifal.cx>
References: <06079088639eddd756e2092b735ce4a682081308.1457486598.git.luto@kernel.org>
    <20160309085631.GA3247@gmail.com>
    <20160309113449.GZ29662@port70.net>

On Wed, Mar 09, 2016 at 11:47:30AM -0800, Linus Torvalds wrote:
> On Wed, Mar 9, 2016 at 3:34 AM, Szabolcs Nagy wrote:
> >>
> >> Could someone remind me why cancellation points matter to user-space?
> >
> > because of standards.
>
> So quite frankly, if we have to do kernel support for this, then let's
> do it right, instead of just perpetuating a hack that was done in user
> space in a new way.
>
> We already have support for cancelling blocking system calls early: we
> do it for fatal signals (exactly because we know that it's ok to
> return -EINTR without failing POSIX semantics - the dying thread will
> never actually *see* the -EINTR because it's dying).
>
> I suspect that what you guys want is the same semantics as a fatal
> signal (return early with -EINTR), but without the actual fatality
> (you want to do cleanup in the cancelled thread).

No, the semantics need to be identical to EINTR -- you can't cancel an
operation where some work has already been done. This is both a POSIX
requirement and a conceptual requirement. When a thread is cancelled,
the process is not terminating abnormally; it's continuing. It needs
to be able to know whether some work was completed, because that
changes what the cleanup code needs to do in order for a consistent
state to be maintained. This is most critical with syscalls that
allocate or free resources -- open, close, recvmsg accepting file
descriptors, etc. -- but it can even matter for reads and writes. This
is the whole reason we need a race-free cancellation rather than the
buggy implementation glibc historically used (which they are in the
process of fixing too).

Anyway, in the case where some but not all work was completed already
at the time the cancellation request was made, the function needs to
return and report whatever was successful.

> I suspect that we could fairly easily give those kinds of semantics.
> We could add a new flag to the sigaction (sa_flags) that says "this
> signal interrupts even uninterruptible system calls".

This would not help, because whether the system call should be
cancellable is a function of the caller, not the system call; some
syscalls are cancellable when used in one place but not in others.

Also it does not solve the race condition; it's possible that the
signal is delivered _after_ userspace checks the cancellation flag,
but _before_ the syscall is made. Thus we need a way to probe whether
the program counter is in a range between the userspace flag check and
the syscall instruction.
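
A minimal sketch of the program-counter probe described above, written
as plain C so the decision logic can be read on its own. It is not the
actual musl or glibc code. The names `cp_region` and `cancel_acts_now`
are hypothetical; a real implementation would take `begin` and `end`
from assembly labels around the syscall stub and read the interrupted
program counter from the signal handler's context, both of which are
architecture-specific and left out here. The sketch also assumes that
`end` is the address immediately after the syscall instruction, so a
syscall that has already returned falls outside the range.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical description of one cancellable syscall stub.
 * begin: address of the userspace cancellation-flag check.
 * end:   address just past the syscall instruction (assumption).
 */
struct cp_region {
    uintptr_t begin;
    uintptr_t end;
};

/*
 * Called conceptually from the cancellation signal handler with the
 * program counter at which the thread was interrupted.
 *
 * Returns true when the thread was between the flag check and the
 * syscall instruction (or blocked at it), i.e. no work has been done
 * yet and it is safe to act on the cancellation request now.
 * Returns false when the thread was elsewhere; in that case the
 * request stays pending and is noticed at the next flag check.
 */
bool cancel_acts_now(const struct cp_region *r, uintptr_t pc)
{
    return pc >= r->begin && pc < r->end;
}
```

With a half-open range like this, a signal that arrives after the
syscall has returned a result is left alone, so the caller still sees
whatever work was completed.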

I believe a new kernel cancellation API with a sticky cancellation
flag (rather than a signal), and a flag or'd onto the syscall number
to make it cancellable at the call point, could work, but then
userspace needs to support fairly different old and new kernel APIs in
order to be able to run on old kernels while also taking advantage of
new ones, and it's not clear to me that it would actually be
worthwhile to do so. I could see doing it for a completely new syscall
API, but as a second syscall API for a system that already has one it
seems gratuitous.

From my perspective the existing approach (checking program counter
from signal handler) is very clean and simple. After all it made
enough sense that I was able to convince the glibc folks to adopt it.

Rich