2020-06-19 15:39:46

by Christian Brauner

Subject: [PATCH 0/3] nsproxy: support CLONE_NEWTIME with setns()

Hey,

So far setns() was missing time namespace support. This was partially
due to it simply not being implemented but also because
vdso_join_timens() could still fail which made switching to multiple
namespaces atomically problematic. This series first fixes
vdso_join_timens() to never fail, introduces timens_commit() and finally
adds CLONE_NEWTIME support for setns().
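
For illustration, here is a minimal userspace sketch of what a caller could
do once this series is applied. It assumes a kernel where setns() accepts a
pidfd, and join_timens_of() is a hypothetical helper name, not something
this series adds. The fallback values for CLONE_NEWTIME and SYS_pidfd_open
are assumptions for older headers (434 is the syscall number on most
architectures).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080 /* assumed fallback for older headers */
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumed fallback, valid on most architectures */
#endif

/*
 * Hypothetical helper: move the calling task into the time namespace of
 * @pid, optionally together with the namespaces named in @extra_flags
 * (e.g. CLONE_NEWNET). Returns 0 on success or a negative errno value.
 */
int join_timens_of(pid_t pid, int extra_flags)
{
	int ret = 0;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -errno;

	/* With a pidfd, either all requested namespaces are switched or none. */
	if (setns(pidfd, CLONE_NEWTIME | extra_flags) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

The caller needs to be single-threaded and sufficiently privileged, since
timens_install() returns -EUSERS or -EPERM otherwise.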

Please note that arm is currently in the process of adding
vdso_join_timens() support (cf. [1]) so it might make sense to split the
vdso_join_timens() change out and route it to mainline as a fix so both
my series and the arm support can be rebased on top of it. I've Cced the
relevant people and I'm also replying to the arm thread now.

[1]: https://lore.kernel.org/lkml/[email protected]/

Thanks!
Christian

Christian Brauner (3):
timens: make vdso_join_timens() always succeed
timens: add timens_commit() helper
nsproxy: support CLONE_NEWTIME with setns()

arch/x86/entry/vdso/vma.c | 6 ++----
include/linux/time_namespace.h | 13 +++++++++----
kernel/nsproxy.c | 21 +++++++++++++++++++--
kernel/time/namespace.c | 22 ++++++++--------------
4 files changed, 38 insertions(+), 24 deletions(-)


base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407
--
2.27.0


2020-06-19 15:41:34

by Christian Brauner

Subject: [PATCH 1/3] timens: make vdso_join_timens() always succeed

As discussed on-list (cf. [1]), in order to make setns() support time
namespaces properly we need to tweak vdso_join_timens() to always succeed.
So switch vdso_join_timens() from mmap_write_lock_killable() to
mmap_write_lock().

Last cycle setns() was changed to support attaching to multiple namespaces
atomically. This requires all namespaces to have a point of no return where
they can't fail anymore. Specifically, <namespace-type>_install() is
allowed to perform permission checks and install the namespace into the new
struct nsset that it has been given but it is not allowed to make visible
changes to the affected task. Once <namespace-type>_install() returns,
any additional setup that the given namespace type requires should ideally
be done in a function that can't fail or whose failure is not fatal. For
time namespaces the relevant functions that fall into this category are
timens_set_vvar_page() and vdso_join_timens().
Currently the latter can fail but doesn't need to. With this we can go on
to implement a timens_commit() helper in a follow up patch to be used by
setns().
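
The install/commit split described above can be modelled in plain userspace
C. This is only a toy sketch of the pattern, not kernel code: the toy_*
names and the integer "namespaces" are made up for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of a task with two namespaces, identified by plain integers. */
struct toy_task {
	int net_ns;
	int time_ns;
};

/* Staging area, loosely modelled on the role struct nsset plays. */
struct toy_nsset {
	int net_ns;
	int time_ns;
};

/* Prepare phase: may fail, and only ever touches the staging area. */
int toy_install(int *slot, int new_ns, bool permitted)
{
	if (!permitted)
		return -EPERM;
	*slot = new_ns;
	return 0;
}

/* Commit phase: returns void, so it has no way to report a failure. */
void toy_commit(struct toy_task *task, const struct toy_nsset *set)
{
	task->net_ns = set->net_ns;
	task->time_ns = set->time_ns;
}

/* Switch both namespaces or neither. */
int toy_setns(struct toy_task *task, int net_ns, bool net_ok,
	      int time_ns, bool time_ok)
{
	struct toy_nsset set = { task->net_ns, task->time_ns };
	int ret;

	ret = toy_install(&set.net_ns, net_ns, net_ok);
	if (ret)
		return ret;
	ret = toy_install(&set.time_ns, time_ns, time_ok);
	if (ret)
		return ret;	/* the task is still untouched here */

	toy_commit(task, &set);	/* point of no return */
	return 0;
}
```

If any commit step could fail, a task might end up with only some of the
requested namespaces switched, which is why vdso_join_timens() must not be
able to fail.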

[1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein
Cc: Will Deacon <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
arch/x86/entry/vdso/vma.c | 6 ++----
include/linux/time_namespace.h | 7 +++----
kernel/time/namespace.c | 10 ++--------
3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index ea7c1f0b79df..be3f542e419c 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -139,13 +139,12 @@ static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
* corresponding layout.
* See also the comment near timens_setup_vdso_data() for details.
*/
-int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+void vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;

- if (mmap_write_lock_killable(mm))
- return -EINTR;
+ mmap_write_lock(mm);

for (vma = mm->mmap; vma; vma = vma->vm_next) {
unsigned long size = vma->vm_end - vma->vm_start;
@@ -155,7 +154,6 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
}

mmap_write_unlock(mm);
- return 0;
}
#else
static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 824d54e057eb..4d1768c6f836 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -31,8 +31,8 @@ struct time_namespace {
extern struct time_namespace init_time_ns;

#ifdef CONFIG_TIME_NS
-extern int vdso_join_timens(struct task_struct *task,
- struct time_namespace *ns);
+extern void vdso_join_timens(struct task_struct *task,
+ struct time_namespace *ns);

static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
@@ -90,10 +90,9 @@ static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
}

#else
-static inline int vdso_join_timens(struct task_struct *task,
+static inline void vdso_join_timens(struct task_struct *task,
struct time_namespace *ns)
{
- return 0;
}

static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 5d9fc22d836a..e5af6fe87af8 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -284,7 +284,6 @@ static int timens_install(struct nsset *nsset, struct ns_common *new)
{
struct nsproxy *nsproxy = nsset->nsproxy;
struct time_namespace *ns = to_time_ns(new);
- int err;

if (!current_is_single_threaded())
return -EUSERS;
@@ -295,9 +294,7 @@ static int timens_install(struct nsset *nsset, struct ns_common *new)

timens_set_vvar_page(current, ns);

- err = vdso_join_timens(current, ns);
- if (err)
- return err;
+ vdso_join_timens(current, ns);

get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
@@ -313,7 +310,6 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
{
struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
struct time_namespace *ns = to_time_ns(nsc);
- int err;

/* create_new_namespaces() already incremented the ref counter */
if (nsproxy->time_ns == nsproxy->time_ns_for_children)
@@ -321,9 +317,7 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)

timens_set_vvar_page(tsk, ns);

- err = vdso_join_timens(tsk, ns);
- if (err)
- return err;
+ vdso_join_timens(tsk, ns);

get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
--
2.27.0

2020-06-20 00:45:56

by Christian Brauner

Subject: [PATCH 2/3] timens: add timens_commit() helper

Wrap the calls to timens_set_vvar_page() and vdso_join_timens() in
timens_on_fork() and timens_install() in a new timens_commit() helper.
We'll use this helper in a follow-up patch in nsproxy too.

Cc: Will Deacon <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
kernel/time/namespace.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index e5af6fe87af8..aa7b90aac2a7 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -280,6 +280,12 @@ static void timens_put(struct ns_common *ns)
put_time_ns(to_time_ns(ns));
}

+static void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+{
+ timens_set_vvar_page(tsk, ns);
+ vdso_join_timens(tsk, ns);
+}
+
static int timens_install(struct nsset *nsset, struct ns_common *new)
{
struct nsproxy *nsproxy = nsset->nsproxy;
@@ -292,9 +298,8 @@ static int timens_install(struct nsset *nsset, struct ns_common *new)
!ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
return -EPERM;

- timens_set_vvar_page(current, ns);

- vdso_join_timens(current, ns);
+ timens_commit(current, ns);

get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
@@ -315,14 +320,12 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
if (nsproxy->time_ns == nsproxy->time_ns_for_children)
return 0;

- timens_set_vvar_page(tsk, ns);
-
- vdso_join_timens(tsk, ns);
-
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;

+ timens_commit(tsk, ns);
+
return 0;
}

--
2.27.0

2020-06-20 00:45:56

by Christian Brauner

Subject: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

So far setns() was missing time namespace support. This was partially due
to it simply not being implemented but also because vdso_join_timens()
could still fail which made switching to multiple namespaces atomically
problematic. This is now fixed so support CLONE_NEWTIME with setns().

Cc: Thomas Gleixner <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Andrei Vagin <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
include/linux/time_namespace.h | 6 ++++++
kernel/nsproxy.c | 21 +++++++++++++++++++--
kernel/time/namespace.c | 5 +----
3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 4d1768c6f836..d308a3812f79 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -33,6 +33,7 @@ extern struct time_namespace init_time_ns;
#ifdef CONFIG_TIME_NS
extern void vdso_join_timens(struct task_struct *task,
struct time_namespace *ns);
+extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);

static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
@@ -95,6 +96,11 @@ static inline void vdso_join_timens(struct task_struct *task,
{
}

+static inline void timens_commit(struct task_struct *tsk,
+ struct time_namespace *ns)
+{
+}
+
static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
return NULL;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b03df67621d0..f12231c41b69 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -262,8 +262,8 @@ void exit_task_namespaces(struct task_struct *p)
static int check_setns_flags(unsigned long flags)
{
if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID |
- CLONE_NEWCGROUP)))
+ CLONE_NEWNET | CLONE_NEWTIME | CLONE_NEWUSER |
+ CLONE_NEWPID | CLONE_NEWCGROUP)))
return -EINVAL;

#ifndef CONFIG_USER_NS
@@ -290,6 +290,10 @@ static int check_setns_flags(unsigned long flags)
if (flags & CLONE_NEWNET)
return -EINVAL;
#endif
+#ifndef CONFIG_TIME_NS
+ if (flags & CLONE_NEWTIME)
+ return -EINVAL;
+#endif

return 0;
}
@@ -464,6 +468,14 @@ static int validate_nsset(struct nsset *nsset, struct pid *pid)
}
#endif

+#ifdef CONFIG_TIME_NS
+ if (flags & CLONE_NEWTIME) {
+ ret = validate_ns(nsset, &nsp->time_ns->ns);
+ if (ret)
+ goto out;
+ }
+#endif
+
out:
if (pid_ns)
put_pid_ns(pid_ns);
@@ -507,6 +519,11 @@ static void commit_nsset(struct nsset *nsset)
exit_sem(me);
#endif

+#ifdef CONFIG_TIME_NS
+ if (flags & CLONE_NEWTIME)
+ timens_commit(me, nsset->nsproxy->time_ns);
+#endif
+
/* transfer ownership */
switch_task_namespaces(me, nsset->nsproxy);
nsset->nsproxy = NULL;
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index aa7b90aac2a7..afc65e6be33e 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -280,7 +280,7 @@ static void timens_put(struct ns_common *ns)
put_time_ns(to_time_ns(ns));
}

-static void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
{
timens_set_vvar_page(tsk, ns);
vdso_join_timens(tsk, ns);
@@ -298,9 +298,6 @@ static int timens_install(struct nsset *nsset, struct ns_common *new)
!ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
return -EPERM;

-
- timens_commit(current, ns);
-
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;
--
2.27.0

2020-06-23 11:59:38

by Christian Brauner

Subject: Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>
> ---

Andrei,
Dmitry,

A little off-topic since it's not related to the patch here but I've been
going through the current time namespace semantics and I just want to
confirm something with you:

Afaict, unshare(CLONE_NEWTIME) currently works similarly to
unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
but does _not_ change the {pid, time} namespace of the caller itself.
For pid namespaces that makes a lot of sense but I'm not completely
clear why you're doing this for time namespaces, especially since the
setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
pid namespace of the caller itself, it only changes it for its
children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
both the caller's and the children's time namespace is changed, i.e.
unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This also has the consequence that the unshare(CLONE_NEWTIME) +
setns(CLONE_NEWTIME) sequence can be used to change the caller's time
namespace. Is this intended?
Here's some code where you can verify this (please excuse the awful
code I'm using to illustrate this):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* zero-initialized because readlink() does not NUL-terminate */
	char buf1[4096] = { 0 }, buf2[4096] = { 0 };

	/* 0x00000080 == CLONE_NEWTIME */
	if (unshare(0x00000080))
		exit(1);

	int fd = open("/proc/self/ns/time", O_RDONLY);
	if (fd < 0)
		exit(2);

	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
	printf("unshare(CLONE_NEWTIME): time(%s) ~= time_for_children(%s)\n", buf1, buf2);

	if (setns(fd, 0x00000080))
		exit(3);

	readlink("/proc/self/ns/time", buf1, sizeof(buf1) - 1);
	readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2) - 1);
	printf("setns(self, CLONE_NEWTIME): time(%s) == time_for_children(%s)\n", buf1, buf2);

	exit(EXIT_SUCCESS);
}

which gives:

root@f2-vm:/# ./test
unshare(CLONE_NEWTIME): time(time:[4026531834]) ~= time_for_children(time:[4026532366])
setns(self, CLONE_NEWTIME): time(time:[4026531834]) == time_for_children(time:[4026531834])

Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
namespace when setns(CLONE_NEWTIME) is allowed to do this?

Christian

2020-06-25 09:46:34

by Andrei Vagin

Subject: Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

On Tue, Jun 23, 2020 at 01:55:21PM +0200, Christian Brauner wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> >
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Michael Kerrisk <[email protected]>
> > Cc: Serge Hallyn <[email protected]>
> > Cc: Dmitry Safonov <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Signed-off-by: Christian Brauner <[email protected]>
> > ---
>
> Andrei,
> Dmitry,
>
> A little off-topic since it's not related to the patch here but I've been
> going through the current time namespace semantics and I just want to
> confirm something with you:
>
> Afaict, unshare(CLONE_NEWTIME) currently works similarly to
> unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
> but does _not_ change the {pid, time} namespace of the caller itself.
> For pid namespaces that makes a lot of sense but I'm not completely
> clear why you're doing this for time namespaces, especially since the
> setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
> Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
> pid namespace of the caller itself, it only changes it for its
> children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
> both the caller's and the children's time namespace is changed, i.e.
> unshare(CLONE_NEWTIME) behaves differently from setns(CLONE_NEWTIME). Why?

This scheme allows setting clock offsets for a namespace before any
processes appear in it. Changing offsets is not allowed once any task
has joined the time namespace. We need this to avoid corner cases with
timers, and so that tasks don't need to be aware of offset changes.
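
To make this concrete, here is a rough userspace sketch of the "set
offsets" step. It relies on the /proc/PID/timens_offsets line format
"<clock-id> <offset-secs> <offset-nanosecs>" from time_namespaces(7); the
two helper names are hypothetical, and the comment about which namespace
the file refers to reflects my reading of the current behavior.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Format one line for /proc/PID/timens_offsets.
 * Returns the line length, or -1 if it does not fit into @buf.
 */
int format_timens_offset(char *buf, size_t len, const char *clock,
			 long long secs, long nsecs)
{
	int n = snprintf(buf, len, "%s %lld %ld\n", clock, secs, nsecs);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write an offset for the not-yet-populated namespace created by
 * unshare(CLONE_NEWTIME), i.e. the one behind time_for_children
 * (assumption). The write is expected to fail once a task has joined
 * that namespace. Returns 0 or a negative errno value.
 */
int set_timens_offset(const char *clock, long long secs, long nsecs)
{
	char line[64];
	ssize_t written;
	int fd, ret = 0;
	int n = format_timens_offset(line, sizeof(line), clock, secs, nsecs);

	if (n < 0)
		return -EINVAL;

	fd = open("/proc/self/timens_offsets", O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, line, (size_t)n);
	if (written < 0)
		ret = -errno;
	else if (written != n)
		ret = -EIO;

	close(fd);
	return ret;
}
```

A caller would run unshare(CLONE_NEWTIME), then something like
set_timens_offset("monotonic", 86400, 0), and only afterwards fork or
setns() into the new namespace.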

>
> This also has the consequence that the unshare(CLONE_NEWTIME) +
> setns(CLONE_NEWTIME) sequence can be used to change the caller's time
> namespace. Is this intended?
> Here's some code where you can verify this (please excuse the awful
> code I'm using to illustrate this):
>
> int main(int argc, char *argv[])
> {
> char buf1[4096], buf2[4096];
>
> if (unshare(0x00000080))
> exit(1);
>
> int fd = open("/proc/self/ns/time", O_RDONLY);
> if (fd < 0)
> exit(2);
>
> readlink("/proc/self/ns/time", buf1, sizeof(buf1));
> readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
> printf("unshare(CLONE_NEWTIME): time(%s) ~= time_for_children(%s)\n", buf1, buf2);
>
> if (setns(fd, 0x00000080))
> exit(3);

And in this example, you use the right sequence of steps: unshare, set
offsets, setns. With clone3, we will be able to do this in one call.

>
> readlink("/proc/self/ns/time", buf1, sizeof(buf1));
> readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
> printf("setns(self, CLONE_NEWTIME): time(%s) == time_for_children(%s)\n", buf1, buf2);
>
> exit(EXIT_SUCCESS);
> }
>
> which gives:
>
> root@f2-vm:/# ./test
> unshare(CLONE_NEWTIME): time(time:[4026531834]) ~= time_for_children(time:[4026532366])
> setns(self, CLONE_NEWTIME): time(time:[4026531834]) == time_for_children(time:[4026531834])
>
> Why is unshare(CLONE_NEWTIME) blocked from changing the caller's time
> namespace when setns(CLONE_NEWTIME) is allowed to do this?
>
> Christian

2020-06-25 09:54:16

by Andrei Vagin

Subject: Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>

Hi Christian,

I have reviewed this series and it looks good to me.

We decided not to change the return type of vdso_join_timens() to avoid
conflicts with the arm64 timens patchset. With this change, you can add
my Reviewed-by to all patches in this series.

Reviewed-by: Andrei Vagin <[email protected]>

Thanks,
Andrei

2020-06-25 12:49:29

by Christian Brauner

Subject: Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

On Thu, Jun 25, 2020 at 02:06:18AM -0700, Andrei Vagin wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> >
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Michael Kerrisk <[email protected]>
> > Cc: Serge Hallyn <[email protected]>
> > Cc: Dmitry Safonov <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Signed-off-by: Christian Brauner <[email protected]>
>
> Hi Christian,
>
> I have reviewed this series and it looks good to me.
>
> We decided not to change the return type of vdso_join_timens() to avoid
> conflicts with the arm64 timens patchset. With this change, you can add
> my Reviewed-by to all patches in this series.
>
> Reviewed-by: Andrei Vagin <[email protected]>

Thanks! As discussed in the thread for the arm changes, we'll defer the
return type changes!

Christian