2019-07-08 19:28:22

by Christian Brauner

[permalink] [raw]
Subject: [GIT PULL] clone3 for v5.3

Hi Linus,

This is the PR for the clone3 system call.

The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:

Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)

are available in the Git repository at:

[email protected]:pub/scm/linux/kernel/git/brauner/linux tags/clone3-v5.3

for you to fetch changes up to d68dbb0c9ac8b1ff52eb09aa58ce6358400fa939:

arch: handle arches who do not yet define clone3 (2019-06-21 01:54:53 +0200)

/* Summary */
This adds the clone3 syscall which is an extensible successor to clone
after we snagged the last flag with CLONE_PIDFD during the 5.2 merge window
for clone(). It cleanly supports all of the flags from clone() and thus all
legacy workloads.

There are few user visible differences between clone3 and clone. First,
CLONE_DETACHED will cause EINVAL with clone3 so we can reuse this flag.
Second, the CSIGNAL flag is deprecated and will cause EINVAL to be
reported. It is superseeded by a dedicated "exit_signal" argument in struct
clone_args thus freeing up even more flags. And third, clone3 gives
CLONE_PIDFD a dedicated return argument in struct clone_args instead of
abusing CLONE_PARENT_SETTID's parent_tidptr argument.

The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};

and a separate kernel struct is used that uses proper kernel typing:

/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};

The system call comes with a size argument which enables the kernel to
detect what version of clone_args userspace is passing in. clone3 validates
that any additional bytes a given kernel does not know about are set to
zero and that the size never exceeds a page.

A nice feature is that this patchset allowed us to cleanup and simplify
various core kernel codepaths in kernel/fork.c by making the internal
_do_fork() function take struct kernel_clone_args even for legacy clone().

This patch also unblocks the time namespace patchset which wants to
introduce a new CLONE_TIMENS flag.

Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
xtensa. These were the architectures that did not require special
massaging. Other architectures treat fork-like system calls individually
and after some back and forth neither Arnd nor I felt confident that we
dared to add clone3 unconditionally to all architectures. We agreed to
leave this up to individual architecture maintainers. This is why there's
an additional patch that introduces __ARCH_WANT_SYS_CLONE3 which any
architecture can set once it has implemented support for clone3. The patch
also adds a cond_syscall(clone3) for architectures such as nios2 or h8300
that generate their syscall table by simply including asm-generic/unistd.h.
The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather
soon.

I sent this patchset separately from the pidfd patches for polling support
and pidfd_open() mostly because they were not necessarily related but also
so you can easily decide what you want to pull.

/* Testing */
The patch is based on v5.2-rc1 and have been sitting in linux-next since
then.

/* Conflicts with v5.2 */
A test-merge of my tree into pristine v5.2 revealed a conflict in
kernel/fork.c. It's not a very difficult conflict to resolve but it's also
not trivial. The fix that was carried in linux-next and that I carry
locally can be seen in [1]. I can also provide a fixed/rebased tree if
needed.

/* Conflicts with other trees */
I am not aware of any conflicts with other trees based on linux-next.

/* Syscall number 435 */
clone3() uses syscall number 435 and is coordinated with pidfd_open() which
uses syscall number 434. I'm not aware of any other syscall targeted for
5.3 that has chosen the same number.

/* New signing subkey */
So that there are no suprises, please note that I signed-off the tag with a
new signing key: 0x91C61BC06578DCA2
It's an ed25519 signing subkey that I moved to a Nitrokey and that's
available from the new, safer keys.openpgp.org keyserver.
Given the SKS churn I figured it might be a good idea to use a new,
dedicated kernel-only subkey instead of my general signing subkey.

Thanks!
Christian

[1]:

diff --cc kernel/fork.c
index fe83343da24b,98abea995629..000000000000
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@@ -1753,7 -1768,7 +1748,8 @@@ static __latent_entropy struct task_str
int pidfd = -1, retval;
struct task_struct *p;
struct multiprocess_signals delayed;
+ struct file *pidfile = NULL;
+ u64 clone_flags = args->flags;

/*
* Don't allow sharing the root directory with processes in a different
@@@ -2031,17 -2044,7 +2025,17 @@@
goto bad_fork_free_pid;

pidfd = retval;
+
+ pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+ O_RDWR | O_CLOEXEC);
+ if (IS_ERR(pidfile)) {
+ put_unused_fd(pidfd);
+ retval = PTR_ERR(pidfile);
+ goto bad_fork_free_pid;
+ }
+ get_pid(pid); /* held by pidfile now */
+
- retval = put_user(pidfd, parent_tidptr);
+ retval = put_user(pidfd, args->pidfd);
if (retval)
goto bad_fork_put_pidfd;
}

----------------------------------------------------------------
clone3-v5.3

----------------------------------------------------------------
Christian Brauner (3):
fork: add clone3
arch: wire-up clone3() syscall
arch: handle arches who do not yet define clone3

arch/arm/include/asm/unistd.h | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 3 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/sys_ia32.c | 12 +-
arch/x86/include/asm/unistd.h | 1 +
arch/xtensa/include/asm/unistd.h | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/sched/task.h | 17 ++-
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 16 +++
kernel/fork.c | 203 +++++++++++++++++++++-------
kernel/sys_ni.c | 2 +
17 files changed, 218 insertions(+), 53 deletions(-)


2019-07-11 05:29:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] clone3 for v5.3

On Mon, Jul 8, 2019 at 8:05 AM Christian Brauner <[email protected]> wrote:
>
> /* Syscall number 435 */
> clone3() uses syscall number 435 and is coordinated with pidfd_open() which
> uses syscall number 434. I'm not aware of any other syscall targeted for
> 5.3 that has chosen the same number.

You say that, and 434/435 would make sense, but that's not what the
code I see in the pull request actually does.

It seems to use syscall 436.

I think it's because openat2() is looking to use 435, but I'm a bit
nervous about the conflict between the code and your commentary..

Linus

2019-07-11 05:36:15

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] clone3 for v5.3

On Wed, Jul 10, 2019 at 10:24:26PM -0700, Linus Torvalds wrote:
> On Mon, Jul 8, 2019 at 8:05 AM Christian Brauner <[email protected]> wrote:
> >
> > /* Syscall number 435 */
> > clone3() uses syscall number 435 and is coordinated with pidfd_open() which
> > uses syscall number 434. I'm not aware of any other syscall targeted for
> > 5.3 that has chosen the same number.
>
> You say that, and 434/435 would make sense, but that's not what the
> code I see in the pull request actually does.
>
> It seems to use syscall 436.
>
> I think it's because openat2() is looking to use 435, but I'm a bit
> nervous about the conflict between the code and your commentary..

Sorry, that was just me being dumb and forgetting that there was
close_range() which had a chance of going through Al's tree. So I left a
hole for it.

I don't terribly mind if it's 435 or 436. People pointed out you might
even renumber yourself if something makes more sense to you.

Christian

2019-07-11 13:47:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] clone3 for v5.3

On Thu, Jul 11, 2019 at 07:34:28AM +0200, Christian Brauner wrote:
> On Wed, Jul 10, 2019 at 10:24:26PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 8, 2019 at 8:05 AM Christian Brauner <[email protected]> wrote:
> > >
> > > /* Syscall number 435 */
> > > clone3() uses syscall number 435 and is coordinated with pidfd_open() which
> > > uses syscall number 434. I'm not aware of any other syscall targeted for
> > > 5.3 that has chosen the same number.
> >
> > You say that, and 434/435 would make sense, but that's not what the
> > code I see in the pull request actually does.
> >
> > It seems to use syscall 436.
> >
> > I think it's because openat2() is looking to use 435, but I'm a bit
> > nervous about the conflict between the code and your commentary..
>
> Sorry, that was just me being dumb and forgetting that there was
> close_range() which had a chance of going through Al's tree. So I left a
> hole for it.
>
> I don't terribly mind if it's 435 or 436. People pointed out you might
> even renumber yourself if something makes more sense to you.

Just in case you prefer pulling from a __rebased__ tree. I prepared one.
It does use syscall number 435 as advertised here and has the merge
conflict I pointed out in the PR resolved:

The following changes since commit 5450e8a316a64cddcbc15f90733ebc78aa736545:

Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux (2019-07-10 22:17:21 -0700)

are available in the Git repository at:

[email protected]:pub/scm/linux/kernel/git/brauner/linux tags/clone3-rebased-v5.3

for you to fetch changes up to 611e04b559151c441deedc79f1e57471c953bda7:

arch: handle arches who do not yet define clone3 (2019-07-11 14:59:15 +0200)

Christian

2019-07-11 17:09:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] clone3 for v5.3

On Thu, Jul 11, 2019 at 6:45 AM Christian Brauner <[email protected]> wrote:
>
> Just in case you prefer pulling from a __rebased__ tree. I prepared one.

No, I'd much rather just resolve the conflict. I just wanted to know
what the deal was with the number confusion.

I'll make it use 435, we can sort out openat2 and close_range() when
those happen.

Linus

2019-07-11 18:36:09

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] clone3 for v5.3

The pull request you sent on Mon, 8 Jul 2019 17:00:42 +0200:

> [email protected]:pub/scm/linux/kernel/git/brauner/linux tags/clone3-v5.3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8f6ccf6159aed1f04c6d179f61f6fb2691261e84

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker