From: Andrei Vagin <[email protected]>
Changing a time namespace requires remapping a vvar page, so we don't want
to allow doing that if any other tasks can use the same mm.
Currently, we install a time namespace when a task is created with a new
vm. exec() is another case when a task gets a new mm and so it can switch
a time namespace safely, but it isn't handled now.
One more issue of the current interface is that clone() with CLONE_VM isn't
allowed if the current task has unshared a time namespace
(timens_for_children doesn't match the current timens).
Both these issues make some inconvenience for users. For example, Alexey
and Florian reported that posix_spawn() uses vfork+exec and this pattern
doesn't work with time namespaces due to the both described issues.
LXC needed to workaround the exec() issue by calling setns.
In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on
vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But
it adds extra complexity and some undesirable side effects. Eric suggested
fixing the issues properly because here are all the reasons to suppose that
there are no users that depend on the old behavior.
Cc: Alexey Izbyshev <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Kees Cook <[email protected]>
Suggested-by: "Eric W. Biederman" <[email protected]>
Origin-author: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
fs/exec.c | 5 +++++
include/linux/nsproxy.h | 1 +
kernel/fork.c | 9 ---------
kernel/nsproxy.c | 23 +++++++++++++++++++++--
4 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index d046dbb9cbd0..71284188b96d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
#include <linux/io_uring.h>
#include <linux/syscall_user_dispatch.h>
#include <linux/coredump.h>
+#include <linux/time_namespace.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1296,6 +1297,10 @@ int begin_new_exec(struct linux_binprm * bprm)
bprm->mm = NULL;
+ retval = exec_task_namespaces();
+ if (retval)
+ goto out_unlock;
+
#ifdef CONFIG_POSIX_TIMERS
spin_lock_irq(&me->sighand->siglock);
posix_cpu_timers_exit(me);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..fee881cded01 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
int copy_namespaces(unsigned long flags, struct task_struct *tsk);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+int exec_task_namespaces(void);
void free_nsproxy(struct nsproxy *ns);
int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
struct cred *, struct fs_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b6bd511c6ed..4eb803f75225 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2044,15 +2044,6 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}
- /*
- * If the new process will be in a different time namespace
- * do not allow it to share VM or a thread group with the forking task.
- */
- if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
- if (nsp->time_ns != nsp->time_ns_for_children)
- return ERR_PTR(-EINVAL);
- }
-
if (clone_flags & CLONE_PIDFD) {
/*
* - CLONE_DETACHED is blocked so that we can potentially
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index eec72ca962e2..a487ff24129b 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -157,7 +157,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
- if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
+ if ((flags & CLONE_VM) ||
+ likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
get_nsproxy(old_ns);
return 0;
}
@@ -179,7 +180,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);
- timens_on_fork(new_ns, tsk);
+ if ((flags & CLONE_VM) == 0)
+ timens_on_fork(new_ns, tsk);
tsk->nsproxy = new_ns;
return 0;
@@ -254,6 +256,23 @@ void exit_task_namespaces(struct task_struct *p)
switch_task_namespaces(p, NULL);
}
+int exec_task_namespaces(void)
+{
+ struct task_struct *tsk = current;
+ struct nsproxy *new;
+
+ if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
+ return 0;
+
+ new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ timens_on_fork(new, tsk);
+ switch_task_namespaces(tsk, new);
+ return 0;
+}
+
static int check_setns_flags(unsigned long flags)
{
if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
--
2.37.3.968.ga6b4b080e4-goog
On Tue, Sep 20, 2022 at 05:31:19PM -0700, Andrei Vagin wrote:
> From: Andrei Vagin <[email protected]>
>
> Changing a time namespace requires remapping a vvar page, so we don't want
> to allow doing that if any other tasks can use the same mm.
>
> Currently, we install a time namespace when a task is created with a new
> vm. exec() is another case when a task gets a new mm and so it can switch
> a time namespace safely, but it isn't handled now.
>
> One more issue of the current interface is that clone() with CLONE_VM isn't
> allowed if the current task has unshared a time namespace
> (timens_for_children doesn't match the current timens).
>
> Both these issues make some inconvenience for users. For example, Alexey
> and Florian reported that posix_spawn() uses vfork+exec and this pattern
> doesn't work with time namespaces due to the both described issues.
> LXC needed to workaround the exec() issue by calling setns.
>
> In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on
> vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But
> it adds extra complexity and some undesirable side effects. Eric suggested
> fixing the issues properly because here are all the reasons to suppose that
> there are no users that depend on the old behavior.
>
> Cc: Alexey Izbyshev <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Florian Weimer <[email protected]>
> Cc: Kees Cook <[email protected]>
> Suggested-by: "Eric W. Biederman" <[email protected]>
> Origin-author: "Eric W. Biederman" <[email protected]>
> Signed-off-by: Andrei Vagin <[email protected]>
This looks good -- my intention is for this to go into -next after the
v6.1 merge window closes. Does that match everyone's expectations?
Thanks!
-Kees
--
Kees Cook
On Wed, Sep 21, 2022 at 8:29 PM Kees Cook <[email protected]> wrote:
...
>
> This looks good -- my intention is for this to go into -next after the
> v6.1 merge window closes. Does that match everyone's expectations?
>
This works for me.
If everything goes well, it will be merged to v6.2. Do I understand this right?
Thanks,
Andrei
On Thu, Sep 22, 2022 at 06:47:46PM -0700, Andrei Vagin wrote:
> On Wed, Sep 21, 2022 at 8:29 PM Kees Cook <[email protected]> wrote:
> ...
> >
> > This looks good -- my intention is for this to go into -next after the
> > v6.1 merge window closes. Does that match everyone's expectations?
> >
>
> This works for me.
>
> If everything goes well, it will be merged to v6.2. Do I understand this right?
Right, that's what I'm expecting. :)
--
Kees Cook
On Tue, 20 Sep 2022 17:31:19 -0700, Andrei Vagin wrote:
> From: Andrei Vagin <[email protected]>
>
> Changing a time namespace requires remapping a vvar page, so we don't want
> to allow doing that if any other tasks can use the same mm.
>
> Currently, we install a time namespace when a task is created with a new
> vm. exec() is another case when a task gets a new mm and so it can switch
> a time namespace safely, but it isn't handled now.
>
> [...]
Applied to for-next/execve, thanks!
[1/2] fs/exec: switch timens when a task gets a new mm
https://git.kernel.org/kees/c/f8e06046dd69
[2/2] selftests/timens: add a test for vfork+exit
https://git.kernel.org/kees/c/de8517320d01
--
Kees Cook