Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756767AbbDWHgT (ORCPT ); Thu, 23 Apr 2015 03:36:19 -0400 Received: from relay3-d.mail.gandi.net ([217.70.183.195]:58191 "EHLO relay3-d.mail.gandi.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752057AbbDWHgO (ORCPT ); Thu, 23 Apr 2015 03:36:14 -0400 X-Originating-IP: 50.43.43.179 Date: Thu, 23 Apr 2015 00:36:02 -0700 From: Josh Triplett To: Ingo Molnar Cc: Linus Torvalds , Andy Lutomirski , Denys Vlasenko , Steven Rostedt , Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Frederic Weisbecker , Alexei Starovoitov , Will Drewry , Kees Cook , X86 ML , "linux-kernel@vger.kernel.org" , Andrew Morton , Peter Zijlstra , Thomas Gleixner Subject: Re: [PATCH 2/2] x86/asm/entry/32: Remove unnecessary optimization in stub32_clone Message-ID: <20150423073602.GA7217@x> References: <1429720808-7173-1-git-send-email-dvlasenk@redhat.com> <1429720808-7173-2-git-send-email-dvlasenk@redhat.com> <20150422171005.GA1020@jtriplet-mobl1> <20150422201241.GA954@jtriplet-mobl1> <20150423062438.GB28898@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150423062438.GB28898@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7212 Lines: 161 On Thu, Apr 23, 2015 at 08:24:38AM +0200, Ingo Molnar wrote: > * Josh Triplett wrote: > > On Wed, Apr 22, 2015 at 11:22:02AM -0700, Linus Torvalds wrote: > > > On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett wrote: > > > > > > > > I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing > > > > this > > > > > > Ugh, I absolutely detesrt that patch. > > > > > > Don't make random crazy function signatures that depend on some config > > > option. That's just evil. The patch is a mess of #ifdef's and should > > > be shot in the head and staked with a silver stake to make sure it > > > never re-appears. > > > > > > Either: > > > > > > (a) make the change for every architecture > > > > > > (b) have side-by-side interfaces. With different names! > > > > ...that's exactly what I did. They're called copy_thread and > > copy_thread_tls; I very intentionally did not conditionally change > > the signature of copy_thread, for exactly that reason. Those > > functions are implemented in architecture-specific code, so the > > config option just specifies which of the two functions the > > architecture provides. > > > > *sys_clone* has different function signatures based on config > > options, but I didn't touch that other than fixing the type of the > > tls argument. That's historical baggage that we can't throw away > > without breaking userspace. > > So you want to add a new clone() parameter. I strongly suspect that > you won't be the last person who wants to do this. > > So why not leave the compatibility baggage alone and introduce a new > clean clone syscall with a flexible, backwards and forwards compatible > parameter ABI? ...that's also exactly what I did, in the clone4 patch series, which uses exactly the arg-structure approach you're describing. :) Take a look at the clone4 patch series (v2). We'll be updating it to v3 to address some feedback, and we're hoping to aim for the 4.2 merge window. > Something like: > > SYSCALL_DEFINE1(clone_params, struct clone_params __user *, params) clone4 passes the size outside the params structure, since otherwise you'd need to copy the first few bytes to know how much more to copy. clone4 also passes flags outside the params structure, to allow for a potential future flag that might indicate a completely different interpretation of the params structure. But otherwise, yes, clone4 moves all the parameters into a structure. (Differences between clone4_args and your clone_params structure: size and flags passed outside, type of the tls parameter is "unsigned long" since it's pointer-sized, and the structure includes the stack_size argument needed on some architectures.) > The only real cost of this approach is that this parameter structure > has to be copied (once): should be fast as it fits into a single cache > line and is a constant size copy. Also note how this parameter block > can be passed down by const reference from that point on, instead of > copying/shuffling 5-6 parameters into increasingly long function > parameter lists. So it might in fact turn out to be slightly faster as > well. Agreed. > Note how easily extensible it is: a new field can be added by > appending to the structure, and compatibility is achieved in the > kernel without explicit versioning, by checking params->size: > > params->size == sizeof(*params): > > Good, kernel and userspace ABI version matches. This is a simple > check and the most typical situation - it will be the fastest as > well. > > > params->size < sizeof(*params): > > Old binary calls into new kernel. Missing fields are set to 0. > Binaries will be forward compatible without any versioning hacks. I combined these two cases by just copying the userspace struct over a pre-zeroed params structure; then any fields not copied over will remain zero. > params->size > sizeof(*params): > > New user-space calls into old kernel. System call can still be > serviced if the new field(s) are all zero. We return -ENOSYS if > any field is non-zero. (i.e. if new user-space tries to make use > of a new syscall feature that this kernel has not implemented > yet.) This way user-space can figure out whether a particular > new parameter is supported by a kernel, without having to add new > system calls or versioning. In theory clone4 could have had this special case for "userspace passed extra parameters this kernel doesn't understand, but they're all zero", but since this is a brand new ABI, it seems far easier for userspace to simply pass only the size needed for its non-zero parameters, and then the kernel can reject structures larger than it expects. Only pass the new version of the args structure if you need to pass non-zero values for the new fields in that version. Since clone4 doesn't even copy the structure if the size exceeds what it understands, that also avoids additional special cases such as the user passing in an obscenely large size. > Also note how 'fool proof' this kind of 'versioning' is: the parameter > block is extended by appending to the end, cleanly extending the > kernel internals to handle the new parameter - end of story. The > zeroing logic on size mismatch will take care of the rest > automatically, it's all ABI forwards and backwards compatible to the > maximum possible extent. Right. > It's relatively easy to use this same interface from 32-bit compat > environments as well: they can use the very same parameter block, the > kernel's compat layer possibly just needs to do a minor amount of > pointer translation for the _addr fields, for example to truncate them > down to 32 bits for security, by filling in the high words of pointers > with 0. (This too can be optimized further if needed, by open coding > the copy and zeroing.) This too is forwards and backwards ABI > compatible to the maximum possible extent. That's precisely what the compat version of clone4 does, using the kernel's existing compat types and conversion functions. > So introduce a single new syscall with the right ABI architecture and > you can leave the old mistakes alone. Not quite. Because copy_thread digs the tls argument out of the saved syscall registers rather than via C parameter passing, with the current implementation, no syscall can call down into copy_thread (by way of copy_process) unless it has the tls parameter passed in the architecture-specific register corresponding to the same syscall argument sys_clone passes it in. That's exactly why I submitted the HAVE_COPY_THREAD_TLS series (the first two patches of the clone4 series): to eliminate that hack and allow for a replacement clone syscall. Would you consider reviewing the clone4 v2 patch series, which should provide exactly the new, easily-versioned, cross-architecture interface you're asking for? - Josh Triplett -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/