Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933109AbcCKAS7 (ORCPT ); Thu, 10 Mar 2016 19:18:59 -0500 Received: from port70.net ([81.7.13.123]:52233 "EHLO port70.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932844AbcCKAS5 (ORCPT ); Thu, 10 Mar 2016 19:18:57 -0500 Date: Fri, 11 Mar 2016 01:18:54 +0100 From: Szabolcs Nagy To: Rich Felker Cc: Ingo Molnar , Linus Torvalds , Andy Lutomirski , the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , "musl@lists.openwall.com" , Andrew Morton , Thomas Gleixner , Peter Zijlstra Subject: Re: [musl] Re: [RFC PATCH] x86/vdso/32: Add AT_SYSINFO cancellation helpers Message-ID: <20160311001853.GA10198@port70.net> Mail-Followup-To: Rich Felker , Ingo Molnar , Linus Torvalds , Andy Lutomirski , the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , "musl@lists.openwall.com" , Andrew Morton , Thomas Gleixner , Peter Zijlstra References: <06079088639eddd756e2092b735ce4a682081308.1457486598.git.luto@kernel.org> <20160309085631.GA3247@gmail.com> <20160309113449.GZ29662@port70.net> <20160310033446.GL9349@brightrain.aerifal.cx> <20160310111646.GA13102@gmail.com> <20160310164104.GM9349@brightrain.aerifal.cx> <20160310180331.GB15940@gmail.com> <20160310232819.GR9349@brightrain.aerifal.cx> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160310232819.GR9349@brightrain.aerifal.cx> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6736 Lines: 124 * Rich Felker [2016-03-10 18:28:20 -0500]: > On Thu, Mar 10, 2016 at 07:03:31PM +0100, Ingo Molnar wrote: > > > > * Rich Felker wrote: > > > > > > So instead of a sticky cancellation flag, we could introduce a sticky > > > > cancellation signal. > > > > > > > > A 'sticky signal' is not cleared from signal_pending() when the signal handler > > > > executes, but it's automatically blocked so no signal handler recursion > > > > occurs. (A sticky signal could still be cleared via a separate mechanism, by > > > > the cancellation cleanup code.) > > > > > > > > Such a 'sticky cancellation signal' would, in the racy situation, cause new > > > > blocking system calls to immediately return with -EINTR. Non-blocking syscalls > > > > could still be used. (So the cancellation signal handler itself would still > > > > have access to various fundamental system calls.) > > > > > > > > I think this would avoid messy coupling between the kernel's increasingly more > > > > varied system call entry code and C libraries. > > > > > > > > Sticky signals could be requested via a new SA_ flag. > > > > > > > > What do you think? > > > > > > This still doesn't address the issue that the code making the syscall needs to > > > be able to control whether it's cancellable or not. Not only do some syscalls > > > whose public functions are cancellation points need to be used internally in > > > non-cancellable ways; there's also the pthread_setcancelstate interface that > > > allows deferring cancellation so that it's possible to call functions which are > > > cancellation points without invoking cancellation. > > > > I don't think there's a problem - but I might be wrong: > > > > One way I think it would work is the following: a sticky signal is not the > > cancellation flag - it's a helper construct to implement the flag in user-space in > > a race-free way. > > > > Say you have RT signal-32 as the cancellation signal, and it's a sticky signal. > > > > When pthread_cancel() wants to cancel another thread, it first (atomically) sets > > the desired cancel state of the target thread. If that state signals that the > > thread is cancellable right now, and that we initiated its cancellation, then we > > send signal-32. I.e. the signal only ever gets sent if the thread is in a > > cancellable state. > > > > libc internal functions and the pthread_setcancelstate() API can temporarily > > change the cancel state of a thread to non-cancellable - but pthread_cancel() > > observes those state transitions. > > > > The 'sticky' nature of signal-32 will make a difference in the following race > > condition, if the cancellation flag is checked before a system call by the C > > library, and signal-32 arrives before the system call is executed. In that case > > the 'sticky' nature of the signal makes sure that all subsequent system calls > > return immediately. > > > > The sticky signal is only ever sent when the thread is in cancellable state - and > > if the target thread notices the cancellation request before the signal arrives, > > it first waits for its arrival before executing any new system calls (as part of > > the teardown, etc.). > > > > So the C library never has to do complex work with a sticky signal pending. > > > > Does that make more sense to you? > > No, it doesn't work. Cancellability of the target thread at the time > of the cancellation request (when you would decide whether or not to > send the signal) has no relation to cancellability at the time of > calling the cancellation point. Consider 2 threads A and B and the > following sequence of events: > > 1. A has cancellation enabled > 2. B calls pthread_cancel(A) and sets sticky pending signal > 3. A disables cancellation > 4. A calls cancellation point and syscall wrongly gets interrupted > > This can be solved with more synchronization in pthread_cancel and > pthread_setcancelstate, but it seems costly. pthread_setcancelstate > would have to clear pending sticky cancellation signals, and any > internal non-cancellable syscalls would have to be made using the same > mechanism (effectively calling pthread_setcancelstate). A naive > implementation of such clearing would involve a syscall itself, i think a syscall in setcancelstate in case of pending sticky signal is not that bad given that cancellation is very rarely used. however maintaining two completely different cancellation designs is expensive and only the current one works on old kernels. > defeating the purpose of using the vdso syscall at all (since an extra > syscall costs a lot more than the cycles you save from sysenter vs int > $0x80). It should be possible to track the state of the pending signal > in userspace, so that syscalls to clear it can be avoided except when > it's actually pending, but this requires some very tricky locking to > implement since most of these syscalls have to be async-signal-safe > but would also need to be using locking that synchronizes with the > thread calling pthread_cancel. At worst, implementing such locking > would require blocking all signals before taking the lock, which would > again introduce the requirement of more syscalls. > > > > From my standpoint the simplest and cleanest solution is for vdso to provide a > > > predicate function that takes a ucontext_t and returns true/false for whether it > > > represents a state prior to entering (or reentering, for restart state) the vdso > > > syscall. If vdso exports this symbol libc can use vdso syscall with > > > cancellation. If not, it can just fallback to straight inline syscall like now. > > > > Offering such a flag pushes unreasonable conceptual overhead into the vDSO proper > > in the long run: right now it might be easy to implement because the code paths > > are relatively simple and we can generate the flag passively via RIP checking - > > but if the vDSO grows more complex interfaces in the future, we'd essentially have > > to track our entry/exit state dynamically which sucks ... > > I don't see what you think it would grow. We're not talking about all > functionality in the vdso, only the vdso syscall/sysenter replacement > (AT_SYSINFO) to be used in place of int $0x80. The only way it would > get more complex is if whole syscalls were being fast-pathed in > userspace, but I think it was already determined that this approach > was wrong and that the vdso should export public symbols (like > __vdso_clock_gettime) instead of transparently fast-pathing them in > userspace via the AT_SYSINFO function. Also, any function that would > be a candidate for fast-pathing in userspace would be a > non-cancellation-point anyway. > > Rich