The reason for copying %r8 to %rcx is quite non-obvious.
Add a comment which explains why it is done.
Fix indentation and trailing whitespace while at it.
Signed-off-by: Denys Vlasenko <[email protected]>
---
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
arch/x86/ia32/ia32entry.S | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 2ca052e..8e72256 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -562,9 +562,17 @@ GLOBAL(\label)
ALIGN
GLOBAL(stub32_clone)
- leaq sys_clone(%rip),%rax
+ leaq sys_clone(%rip), %rax
+ /*
+ * 32-bit clone API is clone(..., int tls_val, int *child_tidptr).
+ * 64-bit clone API is clone(..., int *child_tidptr, int tls_val).
+ * Native 64-bit kernel's sys_clone() implements the latter.
+ * We need to swap args here. But since tls_val is in fact ignored
+ * by sys_clone(), we can get away with an assignment
+ * (arg4 = arg5) instead of a full swap:
+ */
mov %r8, %rcx
- jmp ia32_ptregs_common
+ jmp ia32_ptregs_common
ALIGN
ia32_ptregs_common:
--
1.8.1.4
Really swap arguments #4 and #5 in stub32_clone instead of "optimizing"
it into a move.
Yes, tls_val is currently unused. Yes, on some CPUs XCHG is a little bit
more expensive than MOV. But a cycle or two on an expensive syscall like
clone() is way below the noise floor, and the obfuscation of logic introduced
by this optimization is simply not worth it.
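For reference, the argument-order difference being worked around looks
roughly like this (a sketch; argument names are illustrative, not the
exact kernel declarations):

	/* 32-bit (CLONE_BACKWARDS-style) order: tls before child_tidptr */
	long sys_clone(unsigned long flags, unsigned long newsp,
		       int __user *parent_tidptr, unsigned long tls_val,
		       int __user *child_tidptr);

	/* Native 64-bit order, as implemented by sys_clone() */
	long sys_clone(unsigned long flags, unsigned long newsp,
		       int __user *parent_tidptr, int __user *child_tidptr,
		       unsigned long tls_val);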
Signed-off-by: Denys Vlasenko <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/ia32/ia32entry.S | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 8e72256..0c302d0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -567,11 +567,9 @@ GLOBAL(stub32_clone)
* 32-bit clone API is clone(..., int tls_val, int *child_tidptr).
* 64-bit clone API is clone(..., int *child_tidptr, int tls_val).
* Native 64-bit kernel's sys_clone() implements the latter.
- * We need to swap args here. But since tls_val is in fact ignored
- * by sys_clone(), we can get away with an assignment
- * (arg4 = arg5) instead of a full swap:
+ * We need to swap args here:
*/
- mov %r8, %rcx
+ xchg %r8, %rcx
jmp ia32_ptregs_common
ALIGN
--
1.8.1.4
On Wed, Apr 22, 2015 at 9:40 AM, Denys Vlasenko <[email protected]> wrote:
> The reason for copying %r8 to %rcx is quite non-obvious.
> Add a comment which explains why it is done.
>
> Fix indentation and trailing whitespace while at it.
Seems reasonable, but I think that Josh's clone patch is even better.
--Andy
On Wed, Apr 22, 2015 at 9:40 AM, Denys Vlasenko <[email protected]> wrote:
> Really swap arguments #4 and #5 in stub32_clone instead of "optimizing"
> it into a move.
>
> Yes, tls_val is currently unused. Yes, on some CPUs XCHG is a little bit
> more expensive than MOV. But a cycle or two on an expensive syscall like
> clone() is way below the noise floor, and the obfuscation of logic introduced
> by this optimization is simply not worth it.
Ditto re: Josh's patch.
--Andy
>
> Signed-off-by: Denys Vlasenko <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Borislav Petkov <[email protected]>
> CC: "H. Peter Anvin" <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> CC: Alexei Starovoitov <[email protected]>
> CC: Will Drewry <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> arch/x86/ia32/ia32entry.S | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 8e72256..0c302d0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -567,11 +567,9 @@ GLOBAL(stub32_clone)
> * 32-bit clone API is clone(..., int tls_val, int *child_tidptr).
> * 64-bit clone API is clone(..., int *child_tidptr, int tls_val).
> * Native 64-bit kernel's sys_clone() implements the latter.
> - * We need to swap args here. But since tls_val is in fact ignored
> - * by sys_clone(), we can get away with an assignment
> - * (arg4 = arg5) instead of a full swap:
> + * We need to swap args here:
> */
> - mov %r8, %rcx
> + xchg %r8, %rcx
> jmp ia32_ptregs_common
>
> ALIGN
> --
> 1.8.1.4
>
--
Andy Lutomirski
AMA Capital Management, LLC
On Wed, Apr 22, 2015 at 09:54:24AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 22, 2015 at 9:40 AM, Denys Vlasenko <[email protected]> wrote:
> > Really swap arguments #4 and #5 in stub32_clone instead of "optimizing"
> > it into a move.
> >
> > Yes, tls_val is currently unused. Yes, on some CPUs XCHG is a little bit
> > more expensive than MOV. But a cycle or two on an expensive syscall like
> > clone() is way below the noise floor, and the obfuscation of logic introduced
> > by this optimization is simply not worth it.
>
> Ditto re: Josh's patch.
I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
this, but I'd like to see the final version of Denys' comment added on
top of it (with an update to the type and name of the tls argument to
match the changes to sys_clone).
Denys, would you consider submitting a patch adding your comment on top
of the two-patch series I just sent?
Thanks,
Josh Triplett
> --Andy
>
> >
> > Signed-off-by: Denys Vlasenko <[email protected]>
> > CC: Linus Torvalds <[email protected]>
> > CC: Steven Rostedt <[email protected]>
> > CC: Ingo Molnar <[email protected]>
> > CC: Borislav Petkov <[email protected]>
> > CC: "H. Peter Anvin" <[email protected]>
> > CC: Andy Lutomirski <[email protected]>
> > CC: Oleg Nesterov <[email protected]>
> > CC: Frederic Weisbecker <[email protected]>
> > CC: Alexei Starovoitov <[email protected]>
> > CC: Will Drewry <[email protected]>
> > CC: Kees Cook <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > arch/x86/ia32/ia32entry.S | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> > index 8e72256..0c302d0 100644
> > --- a/arch/x86/ia32/ia32entry.S
> > +++ b/arch/x86/ia32/ia32entry.S
> > @@ -567,11 +567,9 @@ GLOBAL(stub32_clone)
> > * 32-bit clone API is clone(..., int tls_val, int *child_tidptr).
> > * 64-bit clone API is clone(..., int *child_tidptr, int tls_val).
> > * Native 64-bit kernel's sys_clone() implements the latter.
> > - * We need to swap args here. But since tls_val is in fact ignored
> > - * by sys_clone(), we can get away with an assignment
> > - * (arg4 = arg5) instead of a full swap:
> > + * We need to swap args here:
> > */
> > - mov %r8, %rcx
> > + xchg %r8, %rcx
> > jmp ia32_ptregs_common
> >
> > ALIGN
> > --
> > 1.8.1.4
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
On 04/22/2015 07:10 PM, Josh Triplett wrote:
> On Wed, Apr 22, 2015 at 09:54:24AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 22, 2015 at 9:40 AM, Denys Vlasenko <[email protected]> wrote:
>>> Really swap arguments #4 and #5 in stub32_clone instead of "optimizing"
>>> it into a move.
>>>
>>> Yes, tls_val is currently unused. Yes, on some CPUs XCHG is a little bit
>>> more expensive than MOV. But a cycle or two on an expensive syscall like
>>> clone() is way below the noise floor, and the obfuscation of logic introduced
>>> by this optimization is simply not worth it.
>>
>> Ditto re: Josh's patch.
>
> I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> this, but I'd like to see the final version of Denys' comment added on
> top of it (with an update to the type and name of the tls argument to
> match the changes to sys_clone).
>
> Denys, would you consider submitting a patch adding your comment on top
> of the two-patch series I just sent?
Okay.
On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett <[email protected]> wrote:
>
> I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> this
Ugh, I absolutely detest that patch.
Don't make random crazy function signatures that depend on some config
option. That's just evil. The patch is a mess of #ifdef's and should
be shot in the head and staked with a silver stake to make sure it
never re-appears.
Either:
(a) make the change for every architecture
(b) have side-by-side interfaces. With different names!
but not that disgusting "the calling conventions of these random
functions are different on different architectures and we use a config
flag to distinguish the cases".
Linus
On Wed, Apr 22, 2015 at 11:22:02AM -0700, Linus Torvalds wrote:
> On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett <[email protected]> wrote:
> >
> > I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> > this
>
> Ugh, I absolutely detest that patch.
>
> Don't make random crazy function signatures that depend on some config
> option. That's just evil. The patch is a mess of #ifdef's and should
> be shot in the head and staked with a silver stake to make sure it
> never re-appears.
>
> Either:
>
> (a) make the change for every architecture
>
> (b) have side-by-side interfaces. With different names!
...that's exactly what I did. They're called copy_thread and
copy_thread_tls; I very intentionally did not conditionally change the
signature of copy_thread, for exactly that reason. Those functions are
implemented in architecture-specific code, so the config option just
specifies which of the two functions the architecture provides.
*sys_clone* has different function signatures based on config options,
but I didn't touch that other than fixing the type of the tls argument.
That's historical baggage that we can't throw away without breaking
userspace.
- Josh Triplett
* Josh Triplett <[email protected]> wrote:
> On Wed, Apr 22, 2015 at 11:22:02AM -0700, Linus Torvalds wrote:
> > On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett <[email protected]> wrote:
> > >
> > > I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> > > this
> >
> > Ugh, I absolutely detest that patch.
> >
> > Don't make random crazy function signatures that depend on some config
> > option. That's just evil. The patch is a mess of #ifdef's and should
> > be shot in the head and staked with a silver stake to make sure it
> > never re-appears.
> >
> > Either:
> >
> > (a) make the change for every architecture
> >
> > (b) have side-by-side interfaces. With different names!
>
> ...that's exactly what I did. They're called copy_thread and
> copy_thread_tls; I very intentionally did not conditionally change
> the signature of copy_thread, for exactly that reason. Those
> functions are implemented in architecture-specific code, so the
> config option just specifies which of the two functions the
> architecture provides.
>
> *sys_clone* has different function signatures based on config
> options, but I didn't touch that other than fixing the type of the
> tls argument. That's historical baggage that we can't throw away
> without breaking userspace.
So you want to add a new clone() parameter. I strongly suspect that
you won't be the last person who wants to do this.
So why not leave the compatibility baggage alone and introduce a new
clean clone syscall with a flexible, backwards and forwards compatible
parameter ABI?
Something like:
SYSCALL_DEFINE1(clone_params, struct clone_params __user *, params)
struct clone_params {
__u32 size; /* User-space sets it to sizeof(struct clone_params) */
__u32 tls_val; /* Slightly out of order, for optimal structure packing */
__u64 clone_flags;
__u64 new_sp_addr;
__u64 parent_tid_addr;
__u64 child_tid_addr;
__u64 tls;
};
The only real cost of this approach is that this parameter structure
has to be copied (once): should be fast as it fits into a single cache
line and is a constant size copy. Also note how this parameter block
can be passed down by const reference from that point on, instead of
copying/shuffling 5-6 parameters into increasingly long function
parameter lists. So it might in fact turn out to be slightly faster as
well.
Note how easily extensible it is: a new field can be added by
appending to the structure, and compatibility is achieved in the
kernel without explicit versioning, by checking params->size:
params->size == sizeof(*params):
Good, kernel and userspace ABI version matches. This is a simple
check and the most typical situation - it will be the fastest as
well.
params->size < sizeof(*params):
Old binary calls into new kernel. Missing fields are set to 0.
Binaries will be forward compatible without any versioning hacks.
params->size > sizeof(*params):
New user-space calls into old kernel. System call can still be
serviced if the new field(s) are all zero. We return -ENOSYS if
any field is non-zero. (i.e. if new user-space tries to make use
of a new syscall feature that this kernel has not implemented
yet.) This way user-space can figure out whether a particular
new parameter is supported by a kernel, without having to add new
system calls or versioning.
Also note how 'fool proof' this kind of 'versioning' is: the parameter
block is extended by appending to the end, cleanly extending the
kernel internals to handle the new parameter - end of story. The
zeroing logic on size mismatch will take care of the rest
automatically, it's all ABI forwards and backwards compatible to the
maximum possible extent.
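In (pseudo-)code the kernel side would be something like this - just a
sketch, with illustrative names:

	struct clone_params kparams = { 0 };	/* unset fields default to 0 */
	u32 usize;

	if (get_user(usize, &params->size))
		return -EFAULT;
	if (usize > sizeof(kparams))
		return -ENOSYS;	/* full version: OK if extra fields are 0 */
	if (copy_from_user(&kparams, params, usize))
		return -EFAULT;
	/* usize < sizeof(kparams): the missing fields stay zero. */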
It's relatively easy to use this same interface from 32-bit compat
environments as well: they can use the very same parameter block, the
kernel's compat layer possibly just needs to do a minor amount of
pointer translation for the _addr fields, for example to truncate them
down to 32 bits for security, by filling in the high words of pointers
with 0. (This too can be optimized further if needed, by open coding
the copy and zeroing.) This too is forwards and backwards ABI
compatible to the maximum possible extent.
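For example (illustrative only, reusing the kparams copy from the
sketch above):

	/* Compat path: clamp pointer-carrying fields from a 32-bit
	 * task to their low 32 bits, zero-filling the high words: */
	kparams.new_sp_addr     &= 0xffffffffULL;
	kparams.parent_tid_addr &= 0xffffffffULL;
	kparams.child_tid_addr  &= 0xffffffffULL;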
So introduce a single new syscall with the right ABI architecture and
you can leave the old mistakes alone.
Thanks,
Ingo
On Thu, Apr 23, 2015 at 08:24:38AM +0200, Ingo Molnar wrote:
> * Josh Triplett <[email protected]> wrote:
> > On Wed, Apr 22, 2015 at 11:22:02AM -0700, Linus Torvalds wrote:
> > > On Wed, Apr 22, 2015 at 10:10 AM, Josh Triplett <[email protected]> wrote:
> > > >
> > > > I do think my two-patch HAVE_COPY_THREAD_TLS series should go in fixing
> > > > this
> > >
> > > Ugh, I absolutely detest that patch.
> > >
> > > Don't make random crazy function signatures that depend on some config
> > > option. That's just evil. The patch is a mess of #ifdef's and should
> > > be shot in the head and staked with a silver stake to make sure it
> > > never re-appears.
> > >
> > > Either:
> > >
> > > (a) make the change for every architecture
> > >
> > > (b) have side-by-side interfaces. With different names!
> >
> > ...that's exactly what I did. They're called copy_thread and
> > copy_thread_tls; I very intentionally did not conditionally change
> > the signature of copy_thread, for exactly that reason. Those
> > functions are implemented in architecture-specific code, so the
> > config option just specifies which of the two functions the
> > architecture provides.
> >
> > *sys_clone* has different function signatures based on config
> > options, but I didn't touch that other than fixing the type of the
> > tls argument. That's historical baggage that we can't throw away
> > without breaking userspace.
>
> So you want to add a new clone() parameter. I strongly suspect that
> you won't be the last person who wants to do this.
>
> So why not leave the compatibility baggage alone and introduce a new
> clean clone syscall with a flexible, backwards and forwards compatible
> parameter ABI?
...that's also exactly what I did, in the clone4 patch series, which
uses the arg-structure approach you're describing. :)
Take a look at the clone4 patch series (v2). We'll be updating it to v3
to address some feedback, and we're hoping to aim for the 4.2 merge
window.
> Something like:
>
> SYSCALL_DEFINE1(clone_params, struct clone_params __user *, params)
clone4 passes the size outside the params structure, since otherwise
you'd need to copy the first few bytes to know how much more to copy.
clone4 also passes flags outside the params structure, to allow for a
potential future flag that might indicate a completely different
interpretation of the params structure.
But otherwise, yes, clone4 moves all the parameters into a structure.
(Differences between clone4_args and your clone_params structure: size
and flags passed outside, type of the tls parameter is "unsigned long"
since it's pointer-sized, and the structure includes the stack_size
argument needed on some architectures.)
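So the entry point has roughly this shape (shape only, not the literal
clone4 v2 prototype - see the series for the real one):

	long clone4(unsigned long flags, unsigned long args_size,
		    struct clone4_args __user *args);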
> The only real cost of this approach is that this parameter structure
> has to be copied (once): should be fast as it fits into a single cache
> line and is a constant size copy. Also note how this parameter block
> can be passed down by const reference from that point on, instead of
> copying/shuffling 5-6 parameters into increasingly long function
> parameter lists. So it might in fact turn out to be slightly faster as
> well.
Agreed.
> Note how easily extensible it is: a new field can be added by
> appending to the structure, and compatibility is achieved in the
> kernel without explicit versioning, by checking params->size:
>
> params->size == sizeof(*params):
>
> Good, kernel and userspace ABI version matches. This is a simple
> check and the most typical situation - it will be the fastest as
> well.
>
>
> params->size < sizeof(*params):
>
> Old binary calls into new kernel. Missing fields are set to 0.
> Binaries will be forward compatible without any versioning hacks.
I combined these two cases by just copying the userspace struct over a
pre-zeroed params structure; then any fields not copied over will remain
zero.
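Roughly (a sketch, not the literal code from the series):

	struct clone4_args kargs = {};	/* pre-zeroed */

	if (args_size > sizeof(kargs))
		return -EINVAL;		/* larger than we understand */
	if (copy_from_user(&kargs, args, args_size))
		return -EFAULT;
	/* Fields beyond args_size keep their zero defaults. */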
> params->size > sizeof(*params):
>
> New user-space calls into old kernel. System call can still be
> serviced if the new field(s) are all zero. We return -ENOSYS if
> any field is non-zero. (i.e. if new user-space tries to make use
> of a new syscall feature that this kernel has not implemented
> yet.) This way user-space can figure out whether a particular
> new parameter is supported by a kernel, without having to add new
> system calls or versioning.
In theory clone4 could have had this special case for "userspace passed
extra parameters this kernel doesn't understand, but they're all zero",
but since this is a brand new ABI, it seems far easier for userspace to
simply pass only the size needed for its non-zero parameters, and then
the kernel can reject structures larger than it expects. Only pass the
new version of the args structure if you need to pass non-zero values
for the new fields in that version.
Since clone4 doesn't even copy the structure if the size exceeds what it
understands, that also avoids additional special cases such as the user
passing in an obscenely large size.
> Also note how 'fool proof' this kind of 'versioning' is: the parameter
> block is extended by appending to the end, cleanly extending the
> kernel internals to handle the new parameter - end of story. The
> zeroing logic on size mismatch will take care of the rest
> automatically, it's all ABI forwards and backwards compatible to the
> maximum possible extent.
Right.
> It's relatively easy to use this same interface from 32-bit compat
> environments as well: they can use the very same parameter block, the
> kernel's compat layer possibly just needs to do a minor amount of
> pointer translation for the _addr fields, for example to truncate them
> down to 32 bits for security, by filling in the high words of pointers
> with 0. (This too can be optimized further if needed, by open coding
> the copy and zeroing.) This too is forwards and backwards ABI
> compatible to the maximum possible extent.
That's precisely what the compat version of clone4 does, using the
kernel's existing compat types and conversion functions.
> So introduce a single new syscall with the right ABI architecture and
> you can leave the old mistakes alone.
Not quite. With the current implementation, copy_thread digs the tls
argument out of the saved syscall registers rather than receiving it
via C parameter passing, so no syscall can call down into copy_thread
(by way of copy_process) unless it passes tls in the
architecture-specific register corresponding to the same syscall
argument in which sys_clone passes it. That's exactly why I submitted the
HAVE_COPY_THREAD_TLS series (the first two patches of the clone4
series): to eliminate that hack and allow for a replacement clone
syscall.
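For clarity, the difference in shape (a sketch; the x86 copy_thread
really does fish tls out of the child's saved registers):

	/* Old interface: tls is not a C parameter; e.g. x86 digs it
	 * out of the saved user registers: */
	int copy_thread(unsigned long clone_flags, unsigned long sp,
			unsigned long arg, struct task_struct *p);

	/* New interface: tls is passed explicitly, so a new syscall
	 * can supply it from any register it likes: */
	int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
			    unsigned long arg, struct task_struct *p,
			    unsigned long tls);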
Would you consider reviewing the clone4 v2 patch series, which should
provide exactly the new, easily-versioned, cross-architecture interface
you're asking for?
- Josh Triplett