Date: Thu, 23 Apr 2015 00:36:02 -0700
From: Josh Triplett <josh@joshtriplett.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Andy Lutomirski <luto@amacapital.net>,
        Denys Vlasenko <dvlasenk@redhat.com>,
        Steven Rostedt <rostedt@goodmis.org>, Borislav Petkov <bp@alien8.de>,
        "H. Peter Anvin" <hpa@zytor.com>, Oleg Nesterov <oleg@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Alexei Starovoitov <ast@plumgrid.com>, Will Drewry <wad@chromium.org>,
        Kees Cook <keescook@chromium.org>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 2/2] x86/asm/entry/32: Remove unnecessary optimization in
 stub32_clone
Message-ID: <20150423073602.GA7217@x>
References: <1429720808-7173-1-git-send-email-dvlasenk@redhat.com>
 <1429720808-7173-2-git-send-email-dvlasenk@redhat.com>
 <CALCETrV1f8oTqeKfnzpCxAHX-d_CwqdrZrZsV+TX=Y1OSd6qdQ@mail.gmail.com>
 <20150422171005.GA1020@jtriplet-mobl1>
 <CA+55aFxTmMSUbtQCP_ob-n4c2oGKWtGeQ8A165hVmt8-dHJrxA@mail.gmail.com>
 <20150422201241.GA954@jtriplet-mobl1>
 <20150423062438.GB28898@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150423062438.GB28898@gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7212
Lines: 161

On Thu, Apr 23, 2015 at 08:24:38AM +0200, Ingo Molnar wrote:
> * Josh Triplett <josh@joshtriplett.org> wrote:
> > On Wed, Apr 22, 2015 at 11:22:02AM -0700, Linus Torvalds wrote:
> > > On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > > >
> > > > I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> > > > this
> > > 
> > > Ugh, I absolutely detesrt that patch.
> > > 
> > > Don't make random crazy function signatures that depend on some config
> > > option. That's just evil. The patch is a mess of #ifdef's and should
> > > be shot in the head and staked with a silver stake to  make sure it
> > > never re-appears.
> > > 
> > > Either:
> > > 
> > >  (a) make the change for every architecture
> > > 
> > >  (b) have side-by-side interfaces. With different names!
> > 
> > ...that's exactly what I did.  They're called copy_thread and 
> > copy_thread_tls; I very intentionally did not conditionally change 
> > the signature of copy_thread, for exactly that reason.  Those 
> > functions are implemented in architecture-specific code, so the 
> > config option just specifies which of the two functions the 
> > architecture provides.
> > 
> > *sys_clone* has different function signatures based on config 
> > options, but I didn't touch that other than fixing the type of the 
> > tls argument. That's historical baggage that we can't throw away 
> > without breaking userspace.
> 
> So you want to add a new clone() parameter. I strongly suspect that 
> you won't be the last person who wants to do this.
> 
> So why not leave the compatibility baggage alone and introduce a new 
> clean clone syscall with a flexible, backwards and forwards compatible 
> parameter ABI?

...that's also exactly what I did, in the clone4 patch series, which
uses exactly the arg-structure approach you're describing. :)

Take a look at the clone4 patch series (v2).  We'll be updating it to v3
to address some feedback, and we're hoping to aim for the 4.2 merge
window.

> Something like:
> 
>   SYSCALL_DEFINE1(clone_params, struct clone_params __user *, params)

clone4 passes the size outside the params structure, since otherwise
you'd need to copy the first few bytes to know how much more to copy.

clone4 also passes flags outside the params structure, to allow for a
potential future flag that might indicate a completely different
interpretation of the params structure.

But otherwise, yes, clone4 moves all the parameters into a structure.

(Differences between clone4_args and your clone_params structure: size
and flags passed outside, type of the tls parameter is "unsigned long"
since it's pointer-sized, and the structure includes the stack_size
argument needed on some architectures.)

> The only real cost of this approach is that this parameter structure 
> has to be copied (once): should be fast as it fits into a single cache 
> line and is a constant size copy. Also note how this parameter block 
> can be passed down by const reference from that point on, instead of 
> copying/shuffling 5-6 parameters into increasingly long function 
> parameter lists. So it might in fact turn out to be slightly faster as 
> well.

Agreed.

> Note how easily extensible it is: a new field can be added by 
> appending to the structure, and compatibility is achieved in the 
> kernel without explicit versioning, by checking params->size:
> 
>   params->size == sizeof(*params):
> 
>      Good, kernel and userspace ABI version matches. This is a simple
>      check and the most typical situation - it will be the fastest as
>      well.
> 
> 
>   params->size < sizeof(*params):
> 
>      Old binary calls into new kernel. Missing fields are set to 0.
>      Binaries will be forward compatible without any versioning hacks.

I combined these two cases by just copying the userspace struct over a
pre-zeroed params structure; then any fields not copied over will remain
zero.

>   params->size > sizeof(*params):
> 
>      New user-space calls into old kernel. System call can still be
>      serviced if the new field(s) are all zero. We return -ENOSYS if 
>      any field is non-zero. (i.e. if new user-space tries to make use
>      of a new syscall feature that this kernel has not implemented
>      yet.) This way user-space can figure out whether a particular
>      new parameter is supported by a kernel, without having to add new
>      system calls or versioning.

In theory clone4 could have had this special case for "userspace passed
extra parameters this kernel doesn't understand, but they're all zero",
but since this is a brand new ABI, it seems far easier for userspace to
simply pass only the size needed for its non-zero parameters, and then
the kernel can reject structures larger than it expects.  Only pass the
new version of the args structure if you need to pass non-zero values
for the new fields in that version.

Since clone4 doesn't even copy the structure if the size exceeds what it
understands, that also avoids additional special cases such as the user
passing in an obscenely large size.

> Also note how 'fool proof' this kind of 'versioning' is: the parameter 
> block is extended by appending to the end, cleanly extending the 
> kernel internals to handle the new parameter - end of story. The 
> zeroing logic on size mismatch will take care of the rest 
> automatically, it's all ABI forwards and backwards compatible to the 
> maximum possible extent.

Right.

> It's relatively easy to use this same interface from 32-bit compat 
> environments as well: they can use the very same parameter block, the 
> kernel's compat layer possibly just needs to do a minor amount of 
> pointer translation for the _addr fields, for example to truncate them 
> down to 32 bits for security, by filling in the high words of pointers 
> with 0. (This too can be optimized further if needed, by open coding 
> the copy and zeroing.) This too is forwards and backwards ABI 
> compatible to the maximum possible extent.

That's precisely what the compat version of clone4 does, using the
kernel's existing compat types and conversion functions.

> So introduce a single new syscall with the right ABI architecture and 
> you can leave the old mistakes alone.

Not quite.  Because copy_thread digs the tls argument out of the saved
syscall registers rather than via C parameter passing, with the current
implementation, no syscall can call down into copy_thread (by way of
copy_process) unless it has the tls parameter passed in the
architecture-specific register corresponding to the same syscall
argument sys_clone passes it in.  That's exactly why I submitted the
HAVE_COPY_THREAD_TLS series (the first two patches of the clone4
series): to eliminate that hack and allow for a replacement clone
syscall.

Would you consider reviewing the clone4 v2 patch series, which should
provide exactly the new, easily-versioned, cross-architecture interface
you're asking for?

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/