2022-08-30 20:06:03

by Alexey Izbyshev

Subject: Potentially undesirable interactions between vfork() and time namespaces

Hi,

I've looked at Andrei's patch[1] that permitted vfork() after
unshare(CLONE_NEWTIME) and noticed a couple of odd things that I'd like
to point out.

/*
* If the new process will be in a different time namespace
* do not allow it to share VM or a thread group with the forking task.
+ *
+ * On vfork, the child process enters the target time namespace only
+ * after exec.
*/
- if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
+ if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
if (nsp->time_ns != nsp->time_ns_for_children)
return ERR_PTR(-EINVAL);
}

This change permits not only a normal vfork(), but also
clone(CLONE_VM|CLONE_VFORK|CLONE_SIGHAND|CLONE_THREAD). I'm not sure
whether it can cause real harm, but it's pretty inconsistent to forbid
creation of normal threads after unshare(CLONE_NEWTIME), but permit such
weird ones, so maybe the check should be strengthened.

Also, if such a thread execs, no time namespace switch will happen
because its vfork_done field will be cleared when its creator (a
sibling thread) is killed by de_thread().

+ vfork = !!tsk->vfork_done;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
if (old_mm)
@@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
+
+ if (vfork)
+ timens_on_fork(tsk->nsproxy, tsk);
+

Similarly, even after a normal vfork(), time namespace switch could be
silently skipped if the parent dies before "tsk->vfork_done" is read.
Again, I don't know whether anybody cares, but this behavior seems
non-obvious and probably unintended to me.

Thanks,
Alexey

[1]
https://lore.kernel.org/all/[email protected]/


2022-08-31 01:35:28

by Andrei Vagin

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Tue, Aug 30, 2022 at 12:49 PM Alexey Izbyshev <[email protected]> wrote:
>
> Hi,
>
> I've looked at Andrei's patch[1] that permitted vfork() after
> unshare(CLONE_NEWTIME) and noticed a couple of odd things that I'd like
> to point out.
>
> /*
> * If the new process will be in a different time namespace
> * do not allow it to share VM or a thread group with the forking task.
> + *
> + * On vfork, the child process enters the target time namespace only
> + * after exec.
> */
> - if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
> + if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> if (nsp->time_ns != nsp->time_ns_for_children)
> return ERR_PTR(-EINVAL);
> }
>
> This change permits not only a normal vfork(), but also
> clone(CLONE_VM|CLONE_VFORK|CLONE_SIGHAND|CLONE_THREAD). I'm not sure
> whether it can cause real harm, but it's pretty inconsistent to forbid
> creation of normal threads after unshare(CLONE_NEWTIME), but permit such
> weird ones, so maybe the check should be strengthened.

Good catch. I was not aware that CLONE_VFORK is allowed to be used with
CLONE_THREAD. I will send a fix. Thanks.

>
> Also, if such a thread execs, no time namespace switch will happen
> because its vfork_done field will be cleared when its creator (a
> sibling thread) is killed by de_thread().
>
> + vfork = !!tsk->vfork_done;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
> if (old_mm)
> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
> tsk->mm->vmacache_seqnum = 0;
> vmacache_flush(tsk);
> task_unlock(tsk);
> +
> + if (vfork)
> + timens_on_fork(tsk->nsproxy, tsk);
> +
>
> Similarly, even after a normal vfork(), time namespace switch could be
> silently skipped if the parent dies before "tsk->vfork_done" is read.
> Again, I don't know whether anybody cares, but this behavior seems
> non-obvious and probably unintended to me.

This is the more interesting case. I will try to find out how we can
handle it properly.

Thanks,
Andrei

2022-09-01 04:30:42

by Andrei Vagin

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
<snip>
>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>> tsk->mm->vmacache_seqnum = 0;
>> vmacache_flush(tsk);
>> task_unlock(tsk);
>> +
>> + if (vfork)
>> + timens_on_fork(tsk->nsproxy, tsk);
>> +
>>
>> Similarly, even after a normal vfork(), time namespace switch could be
>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>> I don't know whether anybody cares, but this behavior seems non-obvious and
>> probably unintended to me.
> This is the more interesting case. I will try to find out how we can
> handle it properly.

It might not be a good idea to use vfork_done in this case. Let's
think about what we have and what we want to change. We don't want to
allow switching timens if a process mm is used by someone else. But we
forgot to handle execve that creates a new mm, and we can't change this
behavior right now because it can affect current users. Right?

So maybe the best choice, in this case, is to change behavior by adding
a new control that enables it. The first interface that comes to my mind
is to introduce a new ioctl for a namespace file descriptor. Here is a
draft patch below that should help to understand what I mean.

---
fs/exec.c | 4 +---
fs/nsfs.c | 3 +++
include/linux/proc_ns.h | 1 +
include/linux/time_namespace.h | 1 +
include/uapi/linux/nsfs.h | 2 ++
kernel/fork.c | 3 ++-
kernel/time/namespace.c | 15 +++++++++++++++
tools/testing/selftests/timens/vfork_exec.c | 14 +++++++++++++-
8 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..961348084257 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
- bool vfork;
int ret;

/* Notify parent that we're no longer interested in the old VM */
tsk = current;
- vfork = !!tsk->vfork_done;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
if (old_mm)
@@ -1030,7 +1028,7 @@ static int exec_mmap(struct mm_struct *mm)
vmacache_flush(tsk);
task_unlock(tsk);

- if (vfork)
+ if (READ_ONCE(tsk->nsproxy->time_ns_for_children->switch_on_exec))
timens_on_fork(tsk->nsproxy, tsk);

if (old_mm) {
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 800c1d0eb0d0..723ab5f69bcd 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -11,6 +11,7 @@
#include <linux/user_namespace.h>
#include <linux/nsfs.h>
#include <linux/uaccess.h>
+#include <linux/nsfs.h>

#include "internal.h"

@@ -210,6 +211,8 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
uid = from_kuid_munged(current_user_ns(), user_ns->owner);
return put_user(uid, argp);
default:
+ if (ns->ops->ioctl)
+ return ns->ops->ioctl(ns, ioctl, arg);
return -ENOTTY;
}
}
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 75807ecef880..b690eb1a3468 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -22,6 +22,7 @@ struct proc_ns_operations {
int (*install)(struct nsset *nsset, struct ns_common *ns);
struct user_namespace *(*owner)(struct ns_common *ns);
struct ns_common *(*get_parent)(struct ns_common *ns);
+ long (*ioctl)(struct ns_common *ns, unsigned int ioctl, unsigned long arg);
} __randomize_layout;

extern const struct proc_ns_operations netns_operations;
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 3146f1c056c9..6569300d68ce 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -24,6 +24,7 @@ struct time_namespace {
struct page *vvar_page;
/* If set prevents changing offsets after any task joined namespace. */
bool frozen_offsets;
+ bool switch_on_exec;
} __randomize_layout;

extern struct time_namespace init_time_ns;
diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
index a0c8552b64ee..ce3a9f9b1bcf 100644
--- a/include/uapi/linux/nsfs.h
+++ b/include/uapi/linux/nsfs.h
@@ -16,4 +16,6 @@
/* Get owner UID (in the caller's user namespace) for a user namespace */
#define NS_GET_OWNER_UID _IO(NSIO, 0x4)

+#define TIMENS_SET_SWITCH_ON_EXEC _IO(NSIO, 0x100)
+
#endif /* __LINUX_NSFS_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 90c85b17bf69..1f7bf2a087e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2050,7 +2050,8 @@ static __latent_entropy struct task_struct *copy_process(
* On vfork, the child process enters the target time namespace only
* after exec.
*/
- if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
+ if ((clone_flags & CLONE_THREAD) ||
+ (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
if (nsp->time_ns != nsp->time_ns_for_children)
return ERR_PTR(-EINVAL);
}
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index aec832801c26..9966e0bdefa7 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -17,6 +17,7 @@
#include <linux/cred.h>
#include <linux/err.h>
#include <linux/mm.h>
+#include <linux/nsfs.h>

#include <vdso/datapage.h>

@@ -439,6 +440,18 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
return err;
}

+static long timens_ioctl(struct ns_common *ns, unsigned int ioctl, unsigned long arg)
+{
+ struct time_namespace *time_ns = to_time_ns(ns);
+
+ switch (ioctl) {
+ case TIMENS_SET_SWITCH_ON_EXEC:
+ WRITE_ONCE(time_ns->switch_on_exec, true);
+ return 0;
+ }
+ return -ENOTTY;
+}
+
const struct proc_ns_operations timens_operations = {
.name = "time",
.type = CLONE_NEWTIME,
@@ -446,6 +459,7 @@ const struct proc_ns_operations timens_operations = {
.put = timens_put,
.install = timens_install,
.owner = timens_owner,
+ .ioctl = timens_ioctl,
};

const struct proc_ns_operations timens_for_children_operations = {
@@ -456,6 +470,7 @@ const struct proc_ns_operations timens_for_children_operations = {
.put = timens_put,
.install = timens_install,
.owner = timens_owner,
+ .ioctl = timens_ioctl,
};

struct time_namespace init_time_ns = {
diff --git a/tools/testing/selftests/timens/vfork_exec.c b/tools/testing/selftests/timens/vfork_exec.c
index e6ccd900f30a..5f4e2043e0a7 100644
--- a/tools/testing/selftests/timens/vfork_exec.c
+++ b/tools/testing/selftests/timens/vfork_exec.c
@@ -12,6 +12,11 @@
#include <time.h>
#include <unistd.h>
#include <string.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/nsfs.h>
+
+#define TIMENS_SET_SWITCH_ON_EXEC _IO(NSIO, 0x100)

#include "log.h"
#include "timens.h"
@@ -21,7 +26,7 @@
int main(int argc, char *argv[])
{
struct timespec now, tst;
- int status, i;
+ int status, i, nsfd;
pid_t pid;

if (argc > 1) {
@@ -45,6 +50,13 @@ int main(int argc, char *argv[])
if (unshare_timens())
return 1;

+ nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+ if (nsfd < 0)
+ return pr_perror("open");
+ if (ioctl(nsfd, TIMENS_SET_SWITCH_ON_EXEC))
+ return pr_perror("ioctl");
+ close(nsfd);
+
if (_settime(CLOCK_MONOTONIC, OFFSET))
return 1;

--
2.37.2

2022-09-01 04:31:29

by Florian Weimer

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

* Andrei Vagin:

> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> <snip>
>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>> tsk->mm->vmacache_seqnum = 0;
>>> vmacache_flush(tsk);
>>> task_unlock(tsk);
>>> +
>>> + if (vfork)
>>> + timens_on_fork(tsk->nsproxy, tsk);
>>> +
>>>
>>> Similarly, even after a normal vfork(), time namespace switch could be
>>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>> probably unintended to me.
>> This is the more interesting case. I will try to find out how we can
>> handle it properly.
>
> It might not be a good idea to use vfork_done in this case. Let's
> think about what we have and what we want to change. We don't want to
> allow switching timens if a process mm is used by someone else. But we
> forgot to handle execve that creates a new mm, and we can't change this
> behavior right now because it can affect current users. Right?
>
> So maybe the best choice, in this case, is to change behavior by adding
> a new control that enables it. The first interface that comes to my mind
> is to introduce a new ioctl for a namespace file descriptor. Here is a
> draft patch below that should help to understand what I mean.

Doesn't this bring back the old posix_spawn (vfork) failure?

Thanks,
Florian

2022-09-01 16:01:38

by Alexey Izbyshev

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On 2022-09-01 06:45, Andrei Vagin wrote:
> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> <snip>
>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>> tsk->mm->vmacache_seqnum = 0;
>>> vmacache_flush(tsk);
>>> task_unlock(tsk);
>>> +
>>> + if (vfork)
>>> + timens_on_fork(tsk->nsproxy, tsk);
>>> +
>>>
>>> Similarly, even after a normal vfork(), time namespace switch could be
>>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>> probably unintended to me.
>> This is the more interesting case. I will try to find out how we can
>> handle it properly.
>
> It might not be a good idea to use vfork_done in this case. Let's
> think about what we have and what we want to change. We don't want to
> allow switching timens if a process mm is used by someone else. But we
> forgot to handle execve that creates a new mm, and we can't change this
> behavior right now because it can affect current users. Right?
>
> So maybe the best choice, in this case, is to change behavior by adding
> a new control that enables it. The first interface that comes to my mind
> is to introduce a new ioctl for a namespace file descriptor. Here is a
> draft patch below that should help to understand what I mean.
>
While I'm not a user of time namespaces (at least yet), I welcome a
change that makes time namespace switching and inheritance semantics
easier to understand and document. Here is my understanding of how that
evolved.

Before the original patch that allowed vfork():

* Switching happens only on clone(~CLONE_VM).
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) (thereby
vfork() and pthread_create() fail).
* time_ns/time_ns_for_children is preserved across execve().

After that patch:

* Switching happens on clone(~CLONE_VM).
* Switching also happens on execve() if the current task is a
vfork-child whose creator task is still alive (because of reliance on
"vfork_done").
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) unless it's
clone(CLONE_VM|CLONE_VFORK), in which case time_ns/time_ns_for_children
is inherited.
* time_ns/time_ns_for_children is preserved across execve() unless
switched as described above.

Note that switching conditions on execve() are very subtle. Apart from
the motivating use case of "unshare(CLONE_NEWTIME) -> vfork() ->
execve()", it would also happen on e.g. "vfork() ->
unshare(CLONE_NEWTIME) -> execve()", because unshare(CLONE_NEWTIME) is
not forbidden for tasks which share mm.

With the current patch:

* Switching happens on clone(~CLONE_VM).
* Switching also happens on execve() if ioctl(TIMENS_SET_SWITCH_ON_EXEC)
was called on time_ns_for_children.
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) unless it's
clone(CLONE_VM|CLONE_VFORK) without CLONE_THREAD, in which case
time_ns/time_ns_for_children is inherited. Thereby vfork() is permitted,
while pthread_create() is not.
* time_ns/time_ns_for_children is preserved across execve() unless
switched as described above.

So in terms of cognitive complexity it seems like a clear improvement
that regains some of the simplicity of the initial implementation.

However, I'd like to point out that while for a narrow fix of the
original issue (vfork() doesn't work when fork() does) time ns switching
on execve() is not required at all, removing "automatic" switching in
posix_spawn()-like cases could potentially surprise time namespace
users. In the initial time ns implementation, "unshare(CLONE_NEWTIME);
posix_spawn(...)" would either succeed with the expected effect (an
executable is running in a new time ns) or fail, depending on whether
posix_spawn() uses fork() or vfork(). With the first patch, vfork-based
posix_spawn() would *usually* behave as a fork-based one (modulo the
parent death issue). But with the current patch, unless user space is
modified to set switch_on_exec, vfork-based posix_spawn() will succeed
but the exe will be running in the parent's time ns. I'm not in a
position to estimate whether any actual time ns users are affected,
though it still looks like something that could affect *future* time ns
users that are not careful enough.

Regarding the interface to control switching on execve(), one possible
alternative to ioctl() is a separate file in /proc like
/proc/$PID/setgroups that was added in a somewhat similar situation
(fixing a problem with user namespaces implementation). Regardless of
the interface, it'd probably be nice to also have the ability to get the
current value of switch_on_exec flag.

Thanks,
Alexey


2022-09-01 18:28:40

by Eric W. Biederman

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

Andrei Vagin <[email protected]> writes:

> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> <snip>
>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>> tsk->mm->vmacache_seqnum = 0;
>>> vmacache_flush(tsk);
>>> task_unlock(tsk);
>>> +
>>> + if (vfork)
>>> + timens_on_fork(tsk->nsproxy, tsk);
>>> +
>>>
>>> Similarly, even after a normal vfork(), time namespace switch could be
>>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>> probably unintended to me.
>> This is the more interesting case. I will try to find out how we can
>> handle it properly.
>
> It might not be a good idea to use vfork_done in this case. Let's
> think about what we have and what we want to change. We don't want to
> allow switching timens if a process mm is used by someone else. But we
> forgot to handle execve that creates a new mm, and we can't change this
> behavior right now because it can affect current users. Right?

What we can't change are things that will break existing programs. If
existing programs don't care we can change the behavior of the kernel.

> So maybe the best choice, in this case, is to change behavior by adding
> a new control that enables it. The first interface that comes to my mind
> is to introduce a new ioctl for a namespace file descriptor. Here is a
> draft patch below that should help to understand what I mean.

I don't think adding a new control works, because programs that are
calling vfork or posix_spawn today will stop working.

We should recognize that basing things off of CLONE_VFORK was a bad idea
as CLONE_VFORK is all about waiting for the created task to exec or
exit, and really has nothing to do with creating a new mm.

Instead I think the rule should be that a new time namespace is
installed as soon as we have a new mm.

That will be a behavioral change if the time ns is unshared and then the
program exec's instead of forking children, but I suspect it is the
proper behavior all the same, and that existing userspace won't care.
Especially since all of the vfork_done work is new behavior as
of v6.0-rc1.

Ugh. I just spotted another bug. The function timens_on_fork as
written is not safe to call without first creating a fresh copy
of the nsproxy, and we don't do that during exec. Because nsproxy
is shared between tasks and processes, updating the values needs to
create a new nsproxy, or other tasks/processes can be affected.
Not hard to handle, just something that needs to be addressed.

Say something like this:

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..8a6947e631dd 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
- bool vfork;
int ret;

/* Notify parent that we're no longer interested in the old VM */
tsk = current;
- vfork = !!tsk->vfork_done;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
if (old_mm)
@@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
vmacache_flush(tsk);
task_unlock(tsk);

- if (vfork)
- timens_on_fork(tsk->nsproxy, tsk);
-
if (old_mm) {
mmap_read_unlock(old_mm);
BUG_ON(active_mm != old_mm);
@@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)

bprm->mm = NULL;

+ retval = exec_task_namespaces();
+ if (retval)
+ goto out_unlock;
+
#ifdef CONFIG_POSIX_TIMERS
spin_lock_irq(&me->sighand->siglock);
posix_cpu_timers_exit(me);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..fee881cded01 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
int copy_namespaces(unsigned long flags, struct task_struct *tsk);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+int exec_task_namespaces(void);
void free_nsproxy(struct nsproxy *ns);
int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
struct cred *, struct fs_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 90c85b17bf69..b4a799d9c50f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}

- /*
- * If the new process will be in a different time namespace
- * do not allow it to share VM or a thread group with the forking task.
- *
- * On vfork, the child process enters the target time namespace only
- * after exec.
- */
- if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
- if (nsp->time_ns != nsp->time_ns_for_children)
- return ERR_PTR(-EINVAL);
- }
-
if (clone_flags & CLONE_PIDFD) {
/*
* - CLONE_DETACHED is blocked so that we can potentially
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b4cbb406bc28..b6647846fe42 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
switch_task_namespaces(p, NULL);
}

+int exec_task_namespaces(void)
+{
+ struct task_struct *tsk = current;
+ struct nsproxy *new;
+
+ if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
+ return 0;
+
+ new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ timens_on_fork(new, tsk);
+ switch_task_namespaces(tsk, new);
+ return 0;
+}
+
+
static int check_setns_flags(unsigned long flags)
{
if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |



To keep things from being too confusing it probably makes sense to
rename the nsproxy variable from time_ns_for_children to
time_ns_for_new_mm. Likewise timens_on_fork can be renamed
timens_on_new_mm.

But that would be follow up work.

How does the above change sound to folks?

Eric

2022-09-02 16:18:31

by Andrei Vagin

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Thu, Sep 01, 2022 at 01:11:37PM -0500, Eric W. Biederman wrote:
> Andrei Vagin <[email protected]> writes:
>
> > On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
> >>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> > <snip>
> >>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
> >>> tsk->mm->vmacache_seqnum = 0;
> >>> vmacache_flush(tsk);
> >>> task_unlock(tsk);
> >>> +
> >>> + if (vfork)
> >>> + timens_on_fork(tsk->nsproxy, tsk);
> >>> +
> >>>
> >>> Similarly, even after a normal vfork(), time namespace switch could be
> >>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
> >>> I don't know whether anybody cares, but this behavior seems non-obvious and
> >>> probably unintended to me.
> >> This is the more interesting case. I will try to find out how we can
> >> handle it properly.
> >
> > It might not be a good idea to use vfork_done in this case. Let's
> > think about what we have and what we want to change. We don't want to
> > allow switching timens if a process mm is used by someone else. But we
> > forgot to handle execve that creates a new mm, and we can't change this
> > behavior right now because it can affect current users. Right?
>
> What we can't change are things that will break existing programs. If
> existing programs don't care we can change the behavior of the kernel.

I agree that it is very unlikely that anyone will notice
these changes. And it is hard to imagine that anyone uses the old
behavior intentionally.

>
> > So maybe the best choice, in this case, is to change behavior by adding
> > a new control that enables it. The first interface that comes to my mind
> > is to introduce a new ioctl for a namespace file descriptor. Here is a
> > draft patch below that should help to understand what I mean.
>
> I don't think adding a new control works, because programs that are
> calling vfork or posix_spawn today will stop working.
>
> We should recognize that basing things off of CLONE_VFORK was a bad idea
> as CLONE_VFORK is all about waiting for the created task to exec or
> exit, and really has nothing to do with creating a new mm.
>
> Instead I think the rule should be that a new time namespace is
> installed as soon as we have a new mm.
>
> That will be a behavioral change if the time ns is unshared and then the
> program exec's instead of forking children, but I suspect it is the
> proper behavior all the same, and that existing userspace won't care.
> Especially since all of the vfork_done work is new behavior as
> of v6.0-rc1.
>
> Ugh. I just spotted another bug. The function timens_on_fork as
> written is not safe to call without first creating a fresh copy
> of the nsproxy, and we don't do that during exec. Because nsproxy
> is shared between tasks and processes updating the values needs to
> create a new nsproxy or other tasks/processes can be affected.
> Not hard to handle just something that needs to be addressed.

You are right. Thanks.

>
> Say something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 9a5ca7b82bfc..8a6947e631dd 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> - bool vfork;
> int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> - vfork = !!tsk->vfork_done;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
> if (old_mm)
> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
> vmacache_flush(tsk);
> task_unlock(tsk);
>
> - if (vfork)
> - timens_on_fork(tsk->nsproxy, tsk);
> -
> if (old_mm) {
> mmap_read_unlock(old_mm);
> BUG_ON(active_mm != old_mm);
> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> bprm->mm = NULL;
>
> + retval = exec_task_namespaces();
> + if (retval)
> + goto out_unlock;
> +
> #ifdef CONFIG_POSIX_TIMERS
> spin_lock_irq(&me->sighand->siglock);
> posix_cpu_timers_exit(me);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index cdb171efc7cb..fee881cded01 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> void exit_task_namespaces(struct task_struct *tsk);
> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
> +int exec_task_namespaces(void);
> void free_nsproxy(struct nsproxy *ns);
> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
> struct cred *, struct fs_struct *);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 90c85b17bf69..b4a799d9c50f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct *copy_process(
> return ERR_PTR(-EINVAL);
> }
>
> - /*
> - * If the new process will be in a different time namespace
> - * do not allow it to share VM or a thread group with the forking task.
> - *
> - * On vfork, the child process enters the target time namespace only
> - * after exec.
> - */
> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> - if (nsp->time_ns != nsp->time_ns_for_children)
> - return ERR_PTR(-EINVAL);
> - }

Please don't remove this part. One of the concerns was that vfork()
doesn't work after unshare(CLONE_NEWTIME), even though it is one of the
standard ways of creating a new process. For example, posix_spawn() uses it.

> -
> if (clone_flags & CLONE_PIDFD) {
> /*
> * - CLONE_DETACHED is blocked so that we can potentially
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index b4cbb406bc28..b6647846fe42 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
> switch_task_namespaces(p, NULL);
> }
>
> +int exec_task_namespaces(void)
> +{
> + struct task_struct *tsk = current;
> + struct nsproxy *new;
> +
> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
> + return 0;
> +
> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> + if (IS_ERR(new))
> + return PTR_ERR(new);
> +
> + timens_on_fork(new, tsk);
> + switch_task_namespaces(tsk, new);
> + return 0;
> +}
> +
> +
> static int check_setns_flags(unsigned long flags)
> {
> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>
>
>
> To keep things from being too confusing it probably makes sense to
> rename the nsproxy variable from time_ns_for_children to
> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
> timens_on_new_mm.
>
> But that would be follow up work.
>
> How does the above change sound to folks?

It looks good to me.

Thanks,
Andrei
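
The nsproxy-sharing concern discussed in this message can be illustrated with a small user-space model. This is only a sketch: `ns_proxy`, `ns_task`, `switch_in_place` and `switch_with_copy` are hypothetical stand-ins invented for illustration, not kernel types or functions, and the refcounting is deliberately simplistic. It shows why updating a shared proxy in place leaks the change to every task that points at it, while copying first (the shape of the proposed exec_task_namespaces()) does not.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel's structures. */
struct ns_proxy {
    int refcount;              /* number of tasks pointing at this proxy */
    int time_ns;               /* id of the active time namespace */
    int time_ns_for_children;  /* id of the namespace awaiting a switch */
};

struct ns_task {
    struct ns_proxy *nsproxy;
};

/* Unsafe variant: edits the proxy in place, so every sharer sees the change. */
void switch_in_place(struct ns_task *t)
{
    t->nsproxy->time_ns = t->nsproxy->time_ns_for_children;
}

/* Safe variant: copy first, modify the copy, then repoint only this task. */
int switch_with_copy(struct ns_task *t)
{
    struct ns_proxy *old = t->nsproxy;
    struct ns_proxy *fresh;

    assert(old->refcount > 0);
    if (old->time_ns_for_children == old->time_ns)
        return 0;              /* nothing to switch */

    fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return -1;

    *fresh = *old;
    fresh->refcount = 1;
    fresh->time_ns = fresh->time_ns_for_children;

    t->nsproxy = fresh;
    if (--old->refcount == 0)
        free(old);
    return 0;
}
```

With two tasks sharing one proxy, calling `switch_with_copy()` on the first leaves the second task's `time_ns` untouched, whereas `switch_in_place()` would change it for both.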

2022-09-02 16:45:15

by Alexey Izbyshev

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On 2022-09-02 19:14, Andrei Vagin wrote:
> On Thu, Sep 01, 2022 at 01:11:37PM -0500, Eric W. Biederman wrote:
>> Andrei Vagin <[email protected]> writes:
>>
>> > On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>> >>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
>> > <snip>
>> >>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>> >>> tsk->mm->vmacache_seqnum = 0;
>> >>> vmacache_flush(tsk);
>> >>> task_unlock(tsk);
>> >>> +
>> >>> + if (vfork)
>> >>> + timens_on_fork(tsk->nsproxy, tsk);
>> >>> +
>> >>>
>> >>> Similarly, even after a normal vfork(), time namespace switch could be
>> >>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>> >>> I don't know whether anybody cares, but this behavior seems non-obvious and
>> >>> probably unintended to me.
>> >> This is the more interesting case. I will try to find out how we can
>> >> handle it properly.
>> >
>> > It might not be a good idea to use vfork_done in this case. Let's
>> > think about what we have and what we want to change. We don't want to
>> > allow switching timens if a process mm is used by someone else. But we
>> > forgot to handle execve that creates a new mm, and we can't change this
>> > behavior right now because it can affect current users. Right?
>>
>> What we can't changes are things that will break existing programs.
>> If
>> existing programs don't care we can change the behavior of the kernel.
>
> I agree that it is very unlikely that anyone will notice
> these changes. And it is hard to imagine that anyone uses the old
> behavior intentionally.
>
>>
>> > So maybe the best choice, in this case, is to change behavior by adding
>> > a new control that enables it. The first interface that comes to my mind
>> > is to introduce a new ioctl for a namespace file descriptor. Here is a
>> > draft patch below that should help to understand what I mean.
>>
>> I don't think adding a new control works, because programs that are
>> calling vfork or posix_spawn today will stop working.
>>
>> We should recognize that basing things off of CLONE_VFORK was a bad
>> idea
>> as CLONE_VFORK is all about waiting for the created task to exec or
>> exit, and really has nothing to do with creating a new mm.
>>
>> Instead I think the rule should be that a new time namespaces is
>> installed as soon as we have a new mm.
>>
>> That will be a behavioral change if the time ns is unshared and then
>> the
>> program exec's instead of forking children, but I suspect it is the
>> proper behavior all the same, and that existing userspace won't care.
>> Especially since all of the vfork_done work is new behavior as
>> of v6.0-rc1.
>>
>> Ugh. I just spotted another bug. The function timens_on_fork as
>> written is not safe to call without first creating a fresh copy
>> of the nsproxy, and we don't do that during exec. Because nsproxy
>> is shared between tasks and processes updating the values needs to
>> create a new nsproxy or other tasks/processes can be affected.
>> Not hard to handle just something that needs to be addressed.
>
> You are right. Thanks.
>
>>
>> Say something like this:
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 9a5ca7b82bfc..8a6947e631dd 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
>> {
>> struct task_struct *tsk;
>> struct mm_struct *old_mm, *active_mm;
>> - bool vfork;
>> int ret;
>>
>> /* Notify parent that we're no longer interested in the old VM */
>> tsk = current;
>> - vfork = !!tsk->vfork_done;
>> old_mm = current->mm;
>> exec_mm_release(tsk, old_mm);
>> if (old_mm)
>> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
>> vmacache_flush(tsk);
>> task_unlock(tsk);
>>
>> - if (vfork)
>> - timens_on_fork(tsk->nsproxy, tsk);
>> -
>> if (old_mm) {
>> mmap_read_unlock(old_mm);
>> BUG_ON(active_mm != old_mm);
>> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>>
>> bprm->mm = NULL;
>>
>> + retval = exec_task_namespaces();
>> + if (retval)
>> + goto out_unlock;
>> +
>> #ifdef CONFIG_POSIX_TIMERS
>> spin_lock_irq(&me->sighand->siglock);
>> posix_cpu_timers_exit(me);
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index cdb171efc7cb..fee881cded01 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset
>> *set)
>> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
>> void exit_task_namespaces(struct task_struct *tsk);
>> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy
>> *new);
>> +int exec_task_namespaces(void);
>> void free_nsproxy(struct nsproxy *ns);
>> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
>> struct cred *, struct fs_struct *);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 90c85b17bf69..b4a799d9c50f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct
>> *copy_process(
>> return ERR_PTR(-EINVAL);
>> }
>>
>> - /*
>> - * If the new process will be in a different time namespace
>> - * do not allow it to share VM or a thread group with the forking
>> task.
>> - *
>> - * On vfork, the child process enters the target time namespace only
>> - * after exec.
>> - */
>> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
>> - if (nsp->time_ns != nsp->time_ns_for_children)
>> - return ERR_PTR(-EINVAL);
>> - }
>
> pls don't remove this part. It was one of the concerns that vfork
> doesn't work after unshare(CLONE_NEWTIME), but it is one of the
> standard
> ways of creating a new process. For example, posix_spawn uses it.
>
What do you mean? On the contrary, removing this restriction of the
original time namespace implementation allows vfork(), pthread_create()
and the like, solving the issue with posix_spawn() as well.

Thanks,
Alexey
>> -
>> if (clone_flags & CLONE_PIDFD) {
>> /*
>> * - CLONE_DETACHED is blocked so that we can potentially
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index b4cbb406bc28..b6647846fe42 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
>> switch_task_namespaces(p, NULL);
>> }
>>
>> +int exec_task_namespaces(void)
>> +{
>> + struct task_struct *tsk = current;
>> + struct nsproxy *new;
>> +
>> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
>> + return 0;
>> +
>> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>> + if (IS_ERR(new))
>> + return PTR_ERR(new);
>> +
>> + timens_on_fork(new, tsk);
>> + switch_task_namespaces(tsk, new);
>> + return 0;
>> +}
>> +
>> +
>> static int check_setns_flags(unsigned long flags)
>> {
>> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>>
>>
>>
>> To keep things from being too confusing it probably makes sense to
>> rename the nsproxy variable from time_ns_for_children to
>> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
>> timens_on_new_mm.
>>
>> But that would be follow up work.
>>
>> How does the above change sound to folks?
>
> It looks good to me.
>
> Thanks,
> Andrei

2022-09-02 17:17:44

by Alexey Izbyshev

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On 2022-09-01 21:11, Eric W. Biederman wrote:
> Andrei Vagin <[email protected]> writes:
>
>> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
>> <snip>
>>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>> tsk->mm->vmacache_seqnum = 0;
>>>> vmacache_flush(tsk);
>>>> task_unlock(tsk);
>>>> +
>>>> + if (vfork)
>>>> + timens_on_fork(tsk->nsproxy, tsk);
>>>> +
>>>>
>>>> Similarly, even after a normal vfork(), time namespace switch could
>>>> be
>>>> silently skipped if the parent dies before "tsk->vfork_done" is
>>>> read. Again,
>>>> I don't know whether anybody cares, but this behavior seems
>>>> non-obvious and
>>>> probably unintended to me.
>>> This is the more interesting case. I will try to find out how we can
>>> handle it properly.
>>
>> It might not be a good idea to use vfork_done in this case. Let's
>> think about what we have and what we want to change. We don't want to
>> allow switching timens if a process mm is used by someone else. But we
>> forgot to handle execve that creates a new mm, and we can't change
>> this
>> behavior right now because it can affect current users. Right?
>
> What we can't changes are things that will break existing programs. If
> existing programs don't care we can change the behavior of the kernel.
>
>> So maybe the best choice, in this case, is to change behavior by
>> adding
>> a new control that enables it. The first interface that comes to my
>> mind
>> is to introduce a new ioctl for a namespace file descriptor. Here is a
>> draft patch below that should help to understand what I mean.
>
> I don't think adding a new control works, because programs that are
> calling vfork or posix_spawn today will stop working.
>
> We should recognize that basing things off of CLONE_VFORK was a bad
> idea
> as CLONE_VFORK is all about waiting for the created task to exec or
> exit, and really has nothing to do with creating a new mm.
>
> Instead I think the rule should be that a new time namespaces is
> installed as soon as we have a new mm.
>
> That will be a behavioral change if the time ns is unshared and then
> the
> program exec's instead of forking children, but I suspect it is the
> proper behavior all the same, and that existing userspace won't care.
> Especially since all of the vfork_done work is new behavior as
> of v6.0-rc1.
>
While vfork_done work is indeed new, preservation of
time_ns_for_children on execve() instead of switching to it is how time
namespaces were originally implemented in 5.6. If this can be changed
even now, thereby fixing the original design, that's great, I just want
to point out that it's not the recent 6.0 work that is being fixed.
Fixes/clarifications for man pages[1][2], which talk about "subsequently
created children", will also be needed.

[1] https://man7.org/linux/man-pages/man7/time_namespaces.7.html
[2] https://man7.org/linux/man-pages/man2/unshare.2.html

> Ugh. I just spotted another bug. The function timens_on_fork as
> written is not safe to call without first creating a fresh copy
> of the nsproxy, and we don't do that during exec. Because nsproxy
> is shared between tasks and processes updating the values needs to
> create a new nsproxy or other tasks/processes can be affected.
> Not hard to handle just something that needs to be addressed.
>
> Say something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 9a5ca7b82bfc..8a6947e631dd 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> - bool vfork;
> int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> - vfork = !!tsk->vfork_done;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
> if (old_mm)
> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
> vmacache_flush(tsk);
> task_unlock(tsk);
>
> - if (vfork)
> - timens_on_fork(tsk->nsproxy, tsk);
> -
> if (old_mm) {
> mmap_read_unlock(old_mm);
> BUG_ON(active_mm != old_mm);
> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> bprm->mm = NULL;
>
> + retval = exec_task_namespaces();
> + if (retval)
> + goto out_unlock;
> +
> #ifdef CONFIG_POSIX_TIMERS
> spin_lock_irq(&me->sighand->siglock);
> posix_cpu_timers_exit(me);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index cdb171efc7cb..fee881cded01 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset
> *set)
> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> void exit_task_namespaces(struct task_struct *tsk);
> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy
> *new);
> +int exec_task_namespaces(void);
> void free_nsproxy(struct nsproxy *ns);
> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
> struct cred *, struct fs_struct *);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 90c85b17bf69..b4a799d9c50f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct
> *copy_process(
> return ERR_PTR(-EINVAL);
> }
>
> - /*
> - * If the new process will be in a different time namespace
> - * do not allow it to share VM or a thread group with the forking
> task.
> - *
> - * On vfork, the child process enters the target time namespace only
> - * after exec.
> - */
> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> - if (nsp->time_ns != nsp->time_ns_for_children)
> - return ERR_PTR(-EINVAL);
> - }
> -
> if (clone_flags & CLONE_PIDFD) {
> /*
> * - CLONE_DETACHED is blocked so that we can potentially
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index b4cbb406bc28..b6647846fe42 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
> switch_task_namespaces(p, NULL);
> }
>
> +int exec_task_namespaces(void)
> +{
> + struct task_struct *tsk = current;
> + struct nsproxy *new;
> +
> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
> + return 0;
> +
> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> + if (IS_ERR(new))
> + return PTR_ERR(new);
> +
> + timens_on_fork(new, tsk);
> + switch_task_namespaces(tsk, new);
> + return 0;
> +}
> +
> +
> static int check_setns_flags(unsigned long flags)
> {
> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>
>
>
> To keep things from being too confusing it probably makes sense to
> rename the nsproxy variable from time_ns_for_children to
> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
> timens_on_new_mm.
>
Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or
will it be preserved for compatibility?

Thanks,
Alexey

> But that would be follow up work.
>
> How does the above change sound to folks?
>
> Eric

2022-09-02 17:28:42

by Andrei Vagin

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Fri, Sep 02, 2022 at 07:39:28PM +0300, Alexey Izbyshev wrote:

<snip>

> > > @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct
> > > *copy_process(
> > > return ERR_PTR(-EINVAL);
> > > }
> > >
> > > - /*
> > > - * If the new process will be in a different time namespace
> > > - * do not allow it to share VM or a thread group with the forking
> > > task.
> > > - *
> > > - * On vfork, the child process enters the target time namespace only
> > > - * after exec.
> > > - */
> > > - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> > > - if (nsp->time_ns != nsp->time_ns_for_children)
> > > - return ERR_PTR(-EINVAL);
> > > - }
> >
> > pls don't remove this part. It was one of the concerns that vfork
> > doesn't work after unshare(CLONE_NEWTIME), but it is one of the standard
> > ways of creating a new process. For example, posix_spawn uses it.
> >
> What do you mean? On the contrary, removing this restriction of the original
> time namespace implementation allows vfork(), pthread_create() and the like,
> solving the issue with posix_spawn() as well.
>

Sorry, I was not fully awake and assumed that it just reverted the
change that allows vfork(). Now I see that it removes this restriction
completely. So it looks good to me.

Thanks,
Andrei.
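
The three variants of the copy_process() check mentioned in this thread can be compared side by side in a user-space sketch. The assumptions are: the `MODEL_CLONE_*` constants mirror the values in the Linux UAPI header `<linux/sched.h>`, `pending` stands for `nsp->time_ns != nsp->time_ns_for_children`, and the function names are made up for illustration. The "original" check is the one shown as removed in the diff at the start of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the Linux UAPI <linux/sched.h>; prefixed to avoid clashes. */
#define MODEL_CLONE_VM     0x00000100UL
#define MODEL_CLONE_VFORK  0x00004000UL
#define MODEL_CLONE_THREAD 0x00010000UL

/* Original check: any CLONE_VM or CLONE_THREAD request is refused. */
bool rejected_by_original_check(unsigned long flags, bool pending)
{
    return pending && (flags & (MODEL_CLONE_THREAD | MODEL_CLONE_VM)) != 0;
}

/* v6.0-rc1 check: CLONE_VM is refused only when CLONE_VFORK is absent. */
bool rejected_by_vfork_check(unsigned long flags, bool pending)
{
    return pending &&
           (flags & (MODEL_CLONE_VM | MODEL_CLONE_VFORK)) == MODEL_CLONE_VM;
}

/* Proposed patch: the check is removed, so nothing is refused here. */
bool rejected_by_proposed_patch(unsigned long flags, bool pending)
{
    (void)flags;
    (void)pending;
    return false;
}

/* Small self-check covering the vfork() and thread-creation cases. */
void check_flag_models(void)
{
    unsigned long vfork_flags = MODEL_CLONE_VM | MODEL_CLONE_VFORK;
    unsigned long thread_flags = MODEL_CLONE_VM | MODEL_CLONE_THREAD;

    assert(rejected_by_original_check(vfork_flags, true));
    assert(!rejected_by_vfork_check(vfork_flags, true));
    assert(rejected_by_vfork_check(thread_flags, true));
    assert(!rejected_by_proposed_patch(thread_flags, true));
}
```

The model also reproduces the oddity from the first message: `CLONE_VM | CLONE_VFORK | CLONE_THREAD` passes the v6.0-rc1 check even though an ordinary thread does not.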

2022-09-02 18:06:49

by Andrei Vagin

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Fri, Sep 02, 2022 at 08:01:57PM +0300, Alexey Izbyshev wrote:
> On 2022-09-01 21:11, Eric W. Biederman wrote:

<snip>

> > static int check_setns_flags(unsigned long flags)
> > {
> > if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> >
> >
> >
> > To keep things from being too confusing it probably makes sense to
> > rename the nsproxy variable from time_ns_for_children to
> > time_ns_for_new_mm. Likewise timens_on_fork can be renamed
> > timens_on_new_mm.
> >
> Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or will it
> be preserved for compatibility?

I don't think this is possible. It is used by a few tools already.
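
For context, a tool that relies on these proc files might look roughly like the sketch below. It is an assumption-laden illustration rather than code from any existing tool: the helper name `timens_switch_pending` is invented here, and it only relies on the documented behavior that readlink() on a `/proc/[pid]/ns/*` entry returns a string of the form `type:[inode]`.

```c
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the calling process has a time
 * namespace switch pending (time != time_for_children), 0 if not, and
 * -1 if the links cannot be read (for example on kernels before 5.6).
 */
int timens_switch_pending(void)
{
    char cur[64], pending[64];
    ssize_t n, m;

    n = readlink("/proc/self/ns/time", cur, sizeof(cur) - 1);
    m = readlink("/proc/self/ns/time_for_children", pending,
                 sizeof(pending) - 1);
    if (n < 0 || m < 0)
        return -1;

    /* readlink() does not NUL-terminate, so do it by hand. */
    cur[n] = '\0';
    pending[m] = '\0';

    /* Different link targets mean different namespace inodes. */
    return strcmp(cur, pending) != 0;
}
```

Any rename of `time_for_children` would break code of this shape, which is the compatibility concern raised above.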

2022-09-06 22:25:19

by Eric W. Biederman

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

Alexey Izbyshev <[email protected]> writes:

> On 2022-09-01 21:11, Eric W. Biederman wrote:
>> Andrei Vagin <[email protected]> writes:
>>
>>> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]> wrote:
>>>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
>>> <snip>
>>>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> tsk->mm->vmacache_seqnum = 0;
>>>>> vmacache_flush(tsk);
>>>>> task_unlock(tsk);
>>>>> +
>>>>> + if (vfork)
>>>>> + timens_on_fork(tsk->nsproxy, tsk);
>>>>> +
>>>>> Similarly, even after a normal vfork(), time namespace switch could
>>>>> be
>>>>> silently skipped if the parent dies before "tsk->vfork_done" is
>>>>> read. Again,
>>>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>>>> probably unintended to me.
>>>> This is the more interesting case. I will try to find out how we can
>>>> handle it properly.
>>> It might not be a good idea to use vfork_done in this case. Let's
>>> think about what we have and what we want to change. We don't want to
>>> allow switching timens if a process mm is used by someone else. But we
>>> forgot to handle execve that creates a new mm, and we can't change this
>>> behavior right now because it can affect current users. Right?
>> What we can't changes are things that will break existing programs. If
>> existing programs don't care we can change the behavior of the kernel.
>>
>>> So maybe the best choice, in this case, is to change behavior by adding
>>> a new control that enables it. The first interface that comes to my mind
>>> is to introduce a new ioctl for a namespace file descriptor. Here is a
>>> draft patch below that should help to understand what I mean.
>> I don't think adding a new control works, because programs that are
>> calling vfork or posix_spawn today will stop working.
>> We should recognize that basing things off of CLONE_VFORK was a bad
>> idea
>> as CLONE_VFORK is all about waiting for the created task to exec or
>> exit, and really has nothing to do with creating a new mm.
>> Instead I think the rule should be that a new time namespaces is
>> installed as soon as we have a new mm.
>> That will be a behavioral change if the time ns is unshared and then
>> the
>> program exec's instead of forking children, but I suspect it is the
>> proper behavior all the same, and that existing userspace won't care.
>> Especially since all of the vfork_done work is new behavior as
>> of v6.0-rc1.
>>
> While vfork_done work is indeed new, preservation of time_ns_for_children on
> execve() instead of switching to it is how time namespaces were originally
> implemented in 5.6. If this can be changed even now, thereby fixing the original
> design, that's great, I just want to point out that it's not the recent 6.0 work
> that is being fixed. Fixes/clarifications for man pages[1][2], which talk about
> "subsequently created children", will also be needed.
>
> [1] https://man7.org/linux/man-pages/man7/time_namespaces.7.html
> [2] https://man7.org/linux/man-pages/man2/unshare.2.html

Sorry, yes.

That is something to be double checked.

I can't see where it would make sense to unshare a time namespace and
then call exec, instead of calling exit. So I suspect we can just
change this behavior and no one will notice.

>> Ugh. I just spotted another bug. The function timens_on_fork as
>> written is not safe to call without first creating a fresh copy
>> of the nsproxy, and we don't do that during exec. Because nsproxy
>> is shared between tasks and processes updating the values needs to
>> create a new nsproxy or other tasks/processes can be affected.
>> Not hard to handle just something that needs to be addressed.
>> Say something like this:
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 9a5ca7b82bfc..8a6947e631dd 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
>> {
>> struct task_struct *tsk;
>> struct mm_struct *old_mm, *active_mm;
>> - bool vfork;
>> int ret;
>> /* Notify parent that we're no longer interested in the old VM */
>> tsk = current;
>> - vfork = !!tsk->vfork_done;
>> old_mm = current->mm;
>> exec_mm_release(tsk, old_mm);
>> if (old_mm)
>> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
>> vmacache_flush(tsk);
>> task_unlock(tsk);
>> - if (vfork)
>> - timens_on_fork(tsk->nsproxy, tsk);
>> -
>> if (old_mm) {
>> mmap_read_unlock(old_mm);
>> BUG_ON(active_mm != old_mm);
>> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>> bprm->mm = NULL;
>> + retval = exec_task_namespaces();
>> + if (retval)
>> + goto out_unlock;
>> +
>> #ifdef CONFIG_POSIX_TIMERS
>> spin_lock_irq(&me->sighand->siglock);
>> posix_cpu_timers_exit(me);
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index cdb171efc7cb..fee881cded01 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
>> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
>> void exit_task_namespaces(struct task_struct *tsk);
>> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
>> +int exec_task_namespaces(void);
>> void free_nsproxy(struct nsproxy *ns);
>> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
>> struct cred *, struct fs_struct *);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 90c85b17bf69..b4a799d9c50f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct
>> *copy_process(
>> return ERR_PTR(-EINVAL);
>> }
>> - /*
>> - * If the new process will be in a different time namespace
>> - * do not allow it to share VM or a thread group with the forking task.
>> - *
>> - * On vfork, the child process enters the target time namespace only
>> - * after exec.
>> - */
>> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
>> - if (nsp->time_ns != nsp->time_ns_for_children)
>> - return ERR_PTR(-EINVAL);
>> - }
>> -
>> if (clone_flags & CLONE_PIDFD) {
>> /*
>> * - CLONE_DETACHED is blocked so that we can potentially
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index b4cbb406bc28..b6647846fe42 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
>> switch_task_namespaces(p, NULL);
>> }
>> +int exec_task_namespaces(void)
>> +{
>> + struct task_struct *tsk = current;
>> + struct nsproxy *new;
>> +
>> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
>> + return 0;
>> +
>> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>> + if (IS_ERR(new))
>> + return PTR_ERR(new);
>> +
>> + timens_on_fork(new, tsk);
>> + switch_task_namespaces(tsk, new);
>> + return 0;
>> +}
>> +
>> +
>> static int check_setns_flags(unsigned long flags)
>> {
>> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>>
>> To keep things from being too confusing it probably makes sense to
>> rename the nsproxy variable from time_ns_for_children to
>> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
>> timens_on_new_mm.
>>
> Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or will it be
> preserved for compatibility?

Unfortunately I don't think we can change that one. We could add
another, better-named one and update the tools to use it. Then wait a
couple of millennia and remove the current name. Depending on the
circumstances it might be worth it, but only if you have a lot of patience.

We should get the implementation details sorted out first, and the
in-kernel name before touching the proc files.

Eric

2022-09-07 05:46:38

by Alexey Izbyshev

Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On 2022-09-07 01:16, Eric W. Biederman wrote:
> Alexey Izbyshev <[email protected]> writes:
>
>> On 2022-09-01 21:11, Eric W. Biederman wrote:
>>> Andrei Vagin <[email protected]> writes:
>>>
>>>> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <[email protected]>
>>>> wrote:
>>>>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
>>>> <snip>
>>>>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>> tsk->mm->vmacache_seqnum = 0;
>>>>>> vmacache_flush(tsk);
>>>>>> task_unlock(tsk);
>>>>>> +
>>>>>> + if (vfork)
>>>>>> + timens_on_fork(tsk->nsproxy, tsk);
>>>>>> +
>>>>>> Similarly, even after a normal vfork(), time namespace switch
>>>>>> could
>>>>>> be
>>>>>> silently skipped if the parent dies before "tsk->vfork_done" is
>>>>>> read. Again,
>>>>>> I don't know whether anybody cares, but this behavior seems
>>>>>> non-obvious and
>>>>>> probably unintended to me.
>>>>> This is the more interesting case. I will try to find out how we
>>>>> can
>>>>> handle it properly.
>>>> It might not be a good idea to use vfork_done in this case. Let's
>>>> think about what we have and what we want to change. We don't want
>>>> to
>>>> allow switching timens if a process mm is used by someone else. But
>>>> we
>>>> forgot to handle execve that creates a new mm, and we can't change
>>>> this
>>>> behavior right now because it can affect current users. Right?
>>> What we can't changes are things that will break existing programs.
>>> If
>>> existing programs don't care we can change the behavior of the
>>> kernel.
>>>
>>>> So maybe the best choice, in this case, is to change behavior by
>>>> adding
>>>> a new control that enables it. The first interface that comes to my
>>>> mind
>>>> is to introduce a new ioctl for a namespace file descriptor. Here is
>>>> a
>>>> draft patch below that should help to understand what I mean.
>>> I don't think adding a new control works, because programs that are
>>> calling vfork or posix_spawn today will stop working.
>>> We should recognize that basing things off of CLONE_VFORK was a bad
>>> idea
>>> as CLONE_VFORK is all about waiting for the created task to exec or
>>> exit, and really has nothing to do with creating a new mm.
>>> Instead I think the rule should be that a new time namespaces is
>>> installed as soon as we have a new mm.
>>> That will be a behavioral change if the time ns is unshared and then
>>> the
>>> program exec's instead of forking children, but I suspect it is the
>>> proper behavior all the same, and that existing userspace won't care.
>>> Especially since all of the vfork_done work is new behavior as
>>> of v6.0-rc1.
>>>
>> While vfork_done work is indeed new, preservation of
>> time_ns_for_children on execve() instead of switching to it is how
>> time namespaces were originally implemented in 5.6. If this can be
>> changed even now, thereby fixing the original design, that's great, I
>> just want to point out that it's not the recent 6.0 work that is being
>> fixed. Fixes/clarifications for man pages[1][2], which talk about
>> "subsequently created children", will also be needed.
>>
>> [1] https://man7.org/linux/man-pages/man7/time_namespaces.7.html
>> [2] https://man7.org/linux/man-pages/man2/unshare.2.html
>
> Sorry, yes.
>
> That is something to be double checked.
>
> I can't see where it would make sense to unshare a time namespace and
> then call exec, instead of calling exit. So I suspect we can just
> change this behavior and no one will notice.
>
One can imagine a helper binary that calls unshare, forks some children
in new namespaces, and then calls exec to hand off actual work to
another binary (which might not expect being in the new time namespace).
I'm purely theorizing here, however. Keeping a special case for vfork()
based only on FUD is likely a net negative, so it'd be nice to hear
actual time namespace users speak up, and switch to the solution you
suggested if they don't care.

The "unshare" tool from util-linux will also change behavior if called
without "--fork" (e.g. "unshare --user --time"), but that would be
unusual usage (just as for "--pid"), so most people probably don't do
that (or don't care about the time namespace of the exec'ed process, but
care only about its children).
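
For anyone who wants to check which case they are in, here is a minimal
userspace sketch in C. It compares the two /proc links by device and inode
number, which is the usual way to tell whether two namespace files refer to
the same namespace. The helper names same_ns() and timens_switch_pending()
are made up for illustration and are not part of any existing tool.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Compare two paths by (st_dev, st_ino).
 * Returns 1 if both refer to the same object, 0 if they differ,
 * and -1 if either path cannot be stat()ed.
 */
int same_ns(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a && b);
	if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Returns 1 if the calling process has unshared a time namespace that it
 * has not entered itself (time != time_for_children), 0 if both links
 * match, and -1 if the links are unavailable (for example on a kernel
 * without time namespace support).
 */
int timens_switch_pending(void)
{
	int same = same_ns("/proc/self/ns/time",
			   "/proc/self/ns/time_for_children");

	return same < 0 ? -1 : !same;
}
```

With the current behavior, a program started via "unshare --user --time"
(no "--fork") would be expected to see timens_switch_pending() return 1,
since exec preserves the old time namespace. With the proposed change it
would be expected to return 0 after exec.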

>>> Ugh. I just spotted another bug. The function timens_on_fork as
>>> written is not safe to call without first creating a fresh copy
>>> of the nsproxy, and we don't do that during exec. Because nsproxy
>>> is shared between tasks and processes updating the values needs to
>>> create a new nsproxy or other tasks/processes can be affected.
>>> Not hard to handle just something that needs to be addressed.
>>> Say something like this:
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index 9a5ca7b82bfc..8a6947e631dd 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
>>> {
>>> struct task_struct *tsk;
>>> struct mm_struct *old_mm, *active_mm;
>>> - bool vfork;
>>> int ret;
>>> /* Notify parent that we're no longer interested in the old VM */
>>> tsk = current;
>>> - vfork = !!tsk->vfork_done;
>>> old_mm = current->mm;
>>> exec_mm_release(tsk, old_mm);
>>> if (old_mm)
>>> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
>>> vmacache_flush(tsk);
>>> task_unlock(tsk);
>>> - if (vfork)
>>> - timens_on_fork(tsk->nsproxy, tsk);
>>> -
>>> if (old_mm) {
>>> mmap_read_unlock(old_mm);
>>> BUG_ON(active_mm != old_mm);
>>> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>>> bprm->mm = NULL;
>>> + retval = exec_task_namespaces();
>>> + if (retval)
>>> + goto out_unlock;
>>> +
>>> #ifdef CONFIG_POSIX_TIMERS
>>> spin_lock_irq(&me->sighand->siglock);
>>> posix_cpu_timers_exit(me);
>>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>>> index cdb171efc7cb..fee881cded01 100644
>>> --- a/include/linux/nsproxy.h
>>> +++ b/include/linux/nsproxy.h
>>> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
>>> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
>>> void exit_task_namespaces(struct task_struct *tsk);
>>> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
>>> +int exec_task_namespaces(void);
>>> void free_nsproxy(struct nsproxy *ns);
>>> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
>>> struct cred *, struct fs_struct *);
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 90c85b17bf69..b4a799d9c50f 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct *copy_process(
>>> return ERR_PTR(-EINVAL);
>>> }
>>> - /*
>>> - * If the new process will be in a different time namespace
>>> - * do not allow it to share VM or a thread group with the forking task.
>>> - *
>>> - * On vfork, the child process enters the target time namespace only
>>> - * after exec.
>>> - */
>>> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
>>> - if (nsp->time_ns != nsp->time_ns_for_children)
>>> - return ERR_PTR(-EINVAL);
>>> - }
>>> -
>>> if (clone_flags & CLONE_PIDFD) {
>>> /*
>>> * - CLONE_DETACHED is blocked so that we can potentially
>>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>>> index b4cbb406bc28..b6647846fe42 100644
>>> --- a/kernel/nsproxy.c
>>> +++ b/kernel/nsproxy.c
>>> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
>>> switch_task_namespaces(p, NULL);
>>> }
>>> +int exec_task_namespaces(void)
>>> +{
>>> + struct task_struct *tsk = current;
>>> + struct nsproxy *new;
>>> +
>>> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
>>> + return 0;
>>> +
>>> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>>> + if (IS_ERR(new))
>>> + return PTR_ERR(new);
>>> +
>>> + timens_on_fork(new, tsk);
>>> + switch_task_namespaces(tsk, new);
>>> + return 0;
>>> +}
>>> +
>>> +
>>> static int check_setns_flags(unsigned long flags)
>>> {
>>> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>>>
>>> To keep things from being too confusing it probably makes sense to
>>> rename the nsproxy variable from time_ns_for_children to
>>> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
>>> timens_on_new_mm.
>>>
>> Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or
>> will it be preserved for compatibility?
>
> Unfortunately I don't think we can change that one. We could add
> another better named one, update the tools to use it. Then wait a
> couple of millennia and remove the current name. Depending, it might be
> worth it, but only if you have a lot of patience.
>
I agree with you and Andrei that the name in /proc shouldn't be changed.
I was asking only to understand the scope of changes that you suggested.

> We should get the implementation details sorted out first, and the
> in-kernel name before touching the proc files.
>
FWIW, your patch looks good to me. I've also run some simple manual
tests with it applied on top of 6.0.0-rc4, and it works as expected.

I've also noticed one missed optimization in copy_namespaces().

	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
			      CLONE_NEWPID | CLONE_NEWNET |
			      CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
		if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
			get_nsproxy(old_ns);
			return 0;
		}
	} else if (!ns_capable(user_ns, CAP_SYS_ADMIN))
		return -EPERM;

The time ns comparison on the fast path was originally added together
with time namespace support, and back then clone(CLONE_VM) wasn't
allowed with non-matching time_ns and time_ns_for_children. Then
Andrei's patch 133e2d3e81 allowed clone(CLONE_VM|CLONE_VFORK) in this
case, and your patch removes CLONE_VM restriction altogether, so
non-matching time_ns/time_ns_for_children are simply inherited if
CLONE_VM is set. However, the fast path didn't learn about that, so
copy_namespaces() will uselessly create a new nsproxy even though
timens_on_fork() won't be called. Probably the fast path check should be
fixed.
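
To spell out what I mean, here is a small userspace model of the fast-path
decision in C. It is only a sketch: struct model_nsproxy, the MODEL_*
constants and both function names are invented for illustration (the flag
values mirror the uapi ones), and the "suggested" variant is just one
possible shape of the fix, not a tested kernel patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the relevant clone flag values, for the model only. */
#define MODEL_CLONE_NEWTIME	0x00000080UL
#define MODEL_CLONE_VM		0x00000100UL
#define MODEL_CLONE_NEWNS	0x00020000UL
#define MODEL_CLONE_NEWCGROUP	0x02000000UL
#define MODEL_CLONE_NEWUTS	0x04000000UL
#define MODEL_CLONE_NEWIPC	0x08000000UL
#define MODEL_CLONE_NEWPID	0x20000000UL
#define MODEL_CLONE_NEWNET	0x40000000UL

#define MODEL_NS_FLAGS (MODEL_CLONE_NEWNS | MODEL_CLONE_NEWUTS | \
			MODEL_CLONE_NEWIPC | MODEL_CLONE_NEWPID | \
			MODEL_CLONE_NEWNET | MODEL_CLONE_NEWCGROUP | \
			MODEL_CLONE_NEWTIME)

/* Stand-in for struct nsproxy: namespaces are represented by plain ids. */
struct model_nsproxy {
	int time_ns;
	int time_ns_for_children;
};

/* Current check: the old nsproxy is shared only if the time ns match. */
bool fast_path_current(unsigned long flags, const struct model_nsproxy *ns)
{
	assert(ns);
	if (flags & MODEL_NS_FLAGS)
		return false;
	return ns->time_ns == ns->time_ns_for_children;
}

/*
 * Possible fixed check: with CLONE_VM no time namespace switch happens,
 * so the old nsproxy can be shared even if the time ns differ.
 */
bool fast_path_suggested(unsigned long flags, const struct model_nsproxy *ns)
{
	assert(ns);
	if (flags & MODEL_NS_FLAGS)
		return false;
	return (flags & MODEL_CLONE_VM) ||
	       ns->time_ns == ns->time_ns_for_children;
}
```

The two variants differ only for clone(CLONE_VM) with non-matching time_ns
and time_ns_for_children: the current check falls through to the slow path
and copies the nsproxy, while the suggested one keeps sharing it.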

Thanks,
Alexey

2022-09-07 17:29:02

by Andrei Vagin

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Wed, Sep 07, 2022 at 08:33:20AM +0300, Alexey Izbyshev wrote:
> >
> > That is something to be double checked.
> >
> > I can't see where it would make sense to unshare a time namespace and
> > then call exec, instead of calling exit. So I suspect we can just
> > change this behavior and no one will notice.
> >
> One can imagine a helper binary that calls unshare, forks some children in
> new namespaces, and then calls exec to hand off actual work to another
> binary (which might not expect being in the new time namespace). I'm purely
> theorizing here, however. Keeping a special case for vfork() based only on
> FUD is likely a net negative, so it'd be nice to hear actual time namespace
> users speak up, and switch to the solution you suggested if they don't care.

I can speak for one tool that uses time namespaces for the right
reasons. It is CRIU. When a process is restored, the monotonic and
boottime clocks have to be adjusted to match old values. That is what
the timens was designed for. These changes don't affect CRIU.

Honestly, I haven't heard about other users of timens yet. I don't take
into account tools like unshare.

>
> The "unshare" tool from util-linux will also change behavior if called
> without "--fork" (e.g. "unshare --user --time"), but that would be unusual
> usage (just as for "--pid"), so most people probably don't do that (or don't
> care about the time namespace of the exec'ed process, but care only about
> its children).


2022-09-08 08:34:51

by Christian Brauner

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Wed, Sep 07, 2022 at 10:15:51AM -0700, Andrei Vagin wrote:
> On Wed, Sep 07, 2022 at 08:33:20AM +0300, Alexey Izbyshev wrote:
> > >
> > > That is something to be double checked.
> > >
> > > I can't see where it would make sense to unshare a time namespace and
> > > then call exec, instead of calling exit. So I suspect we can just
> > > change this behavior and no one will notice.
> > >
> > One can imagine a helper binary that calls unshare, forks some children in
> > new namespaces, and then calls exec to hand off actual work to another
> > binary (which might not expect being in the new time namespace). I'm purely
> > theorizing here, however. Keeping a special case for vfork() based only on
> > FUD is likely a net negative, so it'd be nice to hear actual time namespace
> > users speak up, and switch to the solution you suggested if they don't care.
>
> I can speak for one tool that uses time namespaces for the right
> reasons. It is CRIU. When a process is restored, the monotonic and
> boottime clocks have to be adjusted to match old values. That is what
> the timens was designed for. These changes don't affect CRIU.
>
> Honestly, I haven't heard about other users of timens yet. I don't take
> into account tools like unshare.

LXC/LXD does

unshare(CLONE_NEWTIME)
// write offsets to /proc/self/timens_offsets
timens_fd = open("/proc/self/ns/time_for_children", O_RDONLY | O_CLOEXEC)
setns(timens_fd, CLONE_NEWTIME)
exec(payload)

so I agree, don't change the uapi, please.

But as you can see what we do is basically emulating changing time
namespace during exec via the setns() prior to the exec call.
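
Spelled out in C, the sequence above looks roughly like the sketch below.
This is not the actual LXC/LXD code: format_offset() and enter_new_timens()
are hypothetical helper names, only the monotonic clock is adjusted, and
the caller is assumed to be single-threaded and to hold CAP_SYS_ADMIN in
its user namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* fallback for older libc headers */
#endif

/*
 * Build one "<clock> <secs> <nsecs>\n" line for /proc/self/timens_offsets.
 * Returns the line length, or -1 if it does not fit into buf.
 */
int format_offset(char *buf, size_t len, const char *clock,
		  long long secs, long nsecs)
{
	int n;

	assert(buf && clock);
	n = snprintf(buf, len, "%s %lld %ld\n", clock, secs, nsecs);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * unshare a time namespace, set the monotonic offset, then enter the new
 * namespace via setns() so that a following exec runs inside it.
 * Returns 0 on success and -1 on error.
 */
int enter_new_timens(long long monotonic_secs)
{
	char line[64];
	int n, fd, ret;

	if (unshare(CLONE_NEWTIME) < 0)
		return -1;

	/* Offsets can only be written before any task enters the ns. */
	n = format_offset(line, sizeof(line), "monotonic", monotonic_secs, 0);
	if (n < 0)
		return -1;
	fd = open("/proc/self/timens_offsets", O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = write(fd, line, (size_t)n) == (ssize_t)n ? 0 : -1;
	close(fd);
	if (ret < 0)
		return -1;

	fd = open("/proc/self/ns/time_for_children", O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = setns(fd, CLONE_NEWTIME);
	close(fd);
	return ret;
}
```

A caller would invoke enter_new_timens() and then exec the payload. If exec
itself switched to time_for_children, the open()/setns() pair at the end
would become unnecessary.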

2022-09-08 22:37:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

Christian Brauner <[email protected]> writes:

> On Wed, Sep 07, 2022 at 10:15:51AM -0700, Andrei Vagin wrote:
>> On Wed, Sep 07, 2022 at 08:33:20AM +0300, Alexey Izbyshev wrote:
>> > >
>> > > That is something to be double checked.
>> > >
>> > > I can't see where it would make sense to unshare a time namespace and
>> > > then call exec, instead of calling exit. So I suspect we can just
>> > > change this behavior and no one will notice.
>> > >
>> > One can imagine a helper binary that calls unshare, forks some children in
>> > new namespaces, and then calls exec to hand off actual work to another
>> > binary (which might not expect being in the new time namespace). I'm purely
>> > theorizing here, however. Keeping a special case for vfork() based only on
>> > FUD is likely a net negative, so it'd be nice to hear actual time namespace
>> > users speak up, and switch to the solution you suggested if they don't care.
>>
>> I can speak for one tool that uses time namespaces for the right
>> reasons. It is CRIU. When a process is restored, the monotonic and
>> boottime clocks have to be adjusted to match old values. That is what
>> the timens was designed for. These changes don't affect CRIU.
>>
>> Honestly, I haven't heard about other users of timens yet. I don't take
>> into account tools like unshare.
>
> LXC/LXD does
>
> unshare(CLONE_NEWTIME)
> // write offsets to /proc/self/timens_offsets
> timens_fd = open("/proc/self/ns/time_for_children", O_RDONLY | O_CLOEXEC)
> setns(timens_fd, CLONE_NEWTIME)
> exec(payload)
>
> so I agree don't change the uapi, please.
>
> But as you can see what we do is basically emulating changing time
> namespace during exec via the setns() prior to the exec call.

If I understand the description of lxc/lxd correctly, the proposed change
will not affect lxc/lxd, as the time namespace is already installed
before exec. If anything, what is proposed would potentially allow
lxc/lxd to be simplified in the future by removing the setns.

Are you then requesting that the behavior of the time namespace not
change when the proposed change will not affect lxc/lxd?

Eric

2022-09-09 08:33:45

by Christian Brauner

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Thu, Sep 08, 2022 at 05:13:08PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Wed, Sep 07, 2022 at 10:15:51AM -0700, Andrei Vagin wrote:
> >> On Wed, Sep 07, 2022 at 08:33:20AM +0300, Alexey Izbyshev wrote:
> >> > >
> >> > > That is something to be double checked.
> >> > >
> >> > > I can't see where it would make sense to unshare a time namespace and
> >> > > then call exec, instead of calling exit. So I suspect we can just
> >> > > change this behavior and no one will notice.
> >> > >
> >> > One can imagine a helper binary that calls unshare, forks some children in
> >> > new namespaces, and then calls exec to hand off actual work to another
> >> > binary (which might not expect being in the new time namespace). I'm purely
> >> > theorizing here, however. Keeping a special case for vfork() based only on
> >> > FUD is likely a net negative, so it'd be nice to hear actual time namespace
> >> > users speak up, and switch to the solution you suggested if they don't care.
> >>
> >> I can speak for one tool that uses time namespaces for the right
> >> reasons. It is CRIU. When a process is restored, the monotonic and
> >> boottime clocks have to be adjusted to match old values. That is what
> >> the timens was designed for. These changes don't affect CRIU.
> >>
> >> Honestly, I haven't heard about other users of timens yet. I don't take
> >> into account tools like unshare.
> >
> > LXC/LXD does
> >
> > unshare(CLONE_NEWTIME)
> > // write offsets to /proc/self/timens_offsets
> > timens_fd = open("/proc/self/ns/time_for_children", O_RDONLY | O_CLOEXEC)
> > setns(timens_fd, CLONE_NEWTIME)
> > exec(payload)
> >
> > so I agree don't change the uapi, please.
> >
> > But as you can see what we do is basically emulating changing time
> > namespace during exec via the setns() prior to the exec call.
>
> If I understand the description of lxc/lxd correctly, the proposed change
> will not affect lxc/lxd, as the time namespace is already installed
> before exec. If anything, what is proposed would potentially allow
> lxc/lxd to be simplified in the future by removing the setns.
>
> Are you then requesting that the behavior of the time namespace not
> change when the proposed change will not affect lxc/lxd?

Don't change /proc/self/ns/time_for_children to a different name.
As stated above, we already emulate the proposed exec behavior in
userspace, so that part is fine.

2022-09-11 15:33:50

by Kees Cook

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Fri, Sep 09, 2022 at 09:51:58AM +0200, Christian Brauner wrote:
> As stated above the proposed exec behavior we currently clearly emulate
> in userspace. So that part is fine.

It's not clear to me yet what the right solution is from this thread so
far... what's needed for v6.0 release (since we're quickly running out
of release candidates)?

--
Kees Cook

2022-09-11 22:54:27

by Andrei Vagin

[permalink] [raw]
Subject: Re: Potentially undesirable interactions between vfork() and time namespaces

On Sun, Sep 11, 2022 at 8:12 AM Kees Cook <[email protected]> wrote:
>
> On Fri, Sep 09, 2022 at 09:51:58AM +0200, Christian Brauner wrote:
> > As stated above the proposed exec behavior we currently clearly emulate
> > in userspace. So that part is fine.
>
> It's not clear to me yet what the right solution is from this thread so
> far... what's needed for v6.0 release (since we're quickly running out
> of release candidates)?

Kees,

I think we reached a consensus to go with Eric's idea. We will send
the patch shortly.

Thanks,
Andrei