2020-03-17 08:33:01

by Adrian Reber

[permalink] [raw]
Subject: clone3: allow creation of time namespace with offset

This is an attempt to add time namespace support to clone3(). I am not
really sure which way clone3() should handle time namespaces. The time
namespace through /proc cannot be used with clone3() because the offsets
for the time namespace need to be written before a process has been
created in that time namespace. This means it is necessary to somehow
tell clone3() the offsets for the clocks.

The time namespace offers the possibility to set offsets for
CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
'struct clone_args` with '__aligned_u64 monotonic_offset' and
'__aligned_u64 boottime_offset'. The problem with this approach was that
it was not possible to set nanoseconds for the clocks in the time
namespace.

One of the motivations for clone3() with CLONE_NEWTIME was to enable
CRIU to restore a process in a time namespace with the corresponding
offsets. And although the nanosecond value can probably never be
restored to the same value it had during checkpointing, because the
clock keeps on running between CRIU pausing all processes and CRIU
actually reading the value of the clocks, the nanosecond value is still
necessary for CRIU to not restore a process where the clock jumps back
due to CRIU restoring it with a nanonsecond value that is too small.

Requiring nanoseconds as well as seconds for two clocks during clone3()
means that it would require 4 additional members to 'struct clone_args':

__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_u64 boottime_offset_seconds;
+ __aligned_u64 boottime_offset_nanoseconds;
+ __aligned_u64 monotonic_offset_seconds;
+ __aligned_u64 monotonic_offset_nanoseconds;
};

To avoid four additional members to 'struct clone_args' this patchset
uses another approach:

__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_u64 timens_offset;
+ __aligned_u64 timens_offset_size;
};

timens_offset is a pointer to an array just as previously done with
set_tid and timens_offset_size is the size of the array.

The timens_offset array is expected to contain a struct like this:

struct set_timens_offset {
int clockid;
struct timespec val;
};

This way it is possible to pass the information of multiple clocks with
seconds and nanonseconds to clone3().

To me this seems the better approach, but I am not totally convinced
that it is the right thing. If there are other ideas how to pass two
clock offsets with seconds and nanonseconds to clone3() I would be happy
to hear other ideas.

Adrian



2020-03-17 08:34:01

by Adrian Reber

[permalink] [raw]
Subject: [PATCH 4/4] selftests: add clone3() in time namespace test

This adds a test to use clone3() with CLONE_NEWTIME. The test tries to
set CLOCK_BOOTTIME to 1 in the new process in the new time namespace and
CLOCK_MONOTONIC to +10000.

Signed-off-by: Adrian Reber <[email protected]>
---
tools/testing/selftests/timens/Makefile | 4 +-
tools/testing/selftests/timens/clone3.c | 134 ++++++++++++++++++++++++
2 files changed, 136 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/timens/clone3.c

diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index b4fd9a934654..4a50c326bdab 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,7 +1,7 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec clone3
TEST_GEN_PROGS_EXTENDED := gettime_perf

-CFLAGS := -Wall -Werror -pthread
+CFLAGS := -Wall -Werror -pthread -I../../../../usr/include/
LDLIBS := -lrt -ldl

include ../lib.mk
diff --git a/tools/testing/selftests/timens/clone3.c b/tools/testing/selftests/timens/clone3.c
new file mode 100644
index 000000000000..212b35d43fa5
--- /dev/null
+++ b/tools/testing/selftests/timens/clone3.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <sched.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "log.h"
+#include "timens.h"
+
+#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
+
+struct set_timens_offset {
+ int clockid;
+ struct timespec val;
+};
+
+static void child_exit(int ret)
+{
+ fflush(stdout);
+ fflush(stderr);
+ _exit(ret);
+}
+
+static pid_t sys_clone3(struct clone_args *args, size_t size)
+{
+ fflush(stdout);
+ fflush(stderr);
+ return syscall(__NR_clone3, args, size);
+}
+
+int main(int argc, char *argv[])
+{
+ struct set_timens_offset timens_offset[2];
+ struct timespec monotonic;
+ struct timespec boottime;
+ pid_t pid = -1;
+ int ret = 0;
+ int status;
+
+ struct clone_args args = {
+ .flags = CLONE_NEWTIME,
+ .exit_signal = SIGCHLD,
+ };
+
+ nscheck();
+
+ ksft_set_plan(1);
+
+ ret = clock_gettime(CLOCK_MONOTONIC, &monotonic);
+ if (ret)
+ goto out;
+
+ ret = clock_gettime(CLOCK_BOOTTIME, &boottime);
+ if (ret)
+ goto out;
+
+ /* Set CLOCK_BOOTTIME to be 1 (sec). */
+ timens_offset[0].clockid = CLOCK_BOOTTIME;
+ timens_offset[0].val.tv_sec = -boottime.tv_sec + 1;
+ timens_offset[0].val.tv_nsec = 42;
+ /* Set CLOCK_MONOTONIC to be 10000 (sec) larger than current. */
+ timens_offset[1].clockid = CLOCK_MONOTONIC;
+ timens_offset[1].val.tv_sec = 10000;
+ timens_offset[1].val.tv_nsec = 37;
+
+ args.timens_offset_size = 2;
+ args.timens_offset = ptr_to_u64(timens_offset);
+
+ pid = sys_clone3(&args, sizeof(struct clone_args));
+ if (pid < 0) {
+ ksft_print_msg("%s - Failed to create new process\n",
+ strerror(errno));
+ ret = -errno;
+ goto out;
+ }
+
+ if (pid == 0) {
+ struct timespec monotonic_in_ns;
+ struct timespec boottime_in_ns;
+
+ ksft_print_msg("I am the child, my PID is %d\n", getpid());
+ if (clock_gettime(CLOCK_MONOTONIC, &monotonic_in_ns))
+ return -1;
+
+ if (clock_gettime(CLOCK_BOOTTIME, &boottime_in_ns))
+ return -1;
+
+ /*
+ * CLOCK_BOOTTIME has been set to such an offset to
+ * be 1 in the time namespace after clone3().
+ */
+ if (boottime_in_ns.tv_sec > 5)
+ child_exit(EXIT_FAILURE);
+ ksft_print_msg("CLOCK_BOOTTIME %ld\n", boottime_in_ns.tv_sec);
+ if (monotonic_in_ns.tv_sec <= monotonic.tv_sec + 9998)
+ child_exit(EXIT_FAILURE);
+ ksft_print_msg("CLOCK_MONOTONIC %ld\n", monotonic_in_ns.tv_sec);
+ child_exit(EXIT_SUCCESS);
+ }
+
+ if (waitpid(pid, &status, 0) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ ret = -errno;
+ goto out;
+ }
+
+ if (!WIFEXITED(status)) {
+ ret = -1;
+ ksft_test_result_fail("Child clocks in time NS are wrong\n");
+ goto out;
+ }
+
+ if (WEXITSTATUS(status)) {
+ ret = -1;
+ ksft_test_result_fail("Child clocks in time NS are wrong\n");
+ goto out;
+ }
+
+ ret = 0;
+ ksft_test_result_pass("Clocks set successfully with clone3()\n");
+out:
+ if (ret)
+ ksft_exit_fail();
+
+ ksft_exit_pass();
+ return !!ret;
+}
--
2.24.1

2020-03-17 08:34:26

by Adrian Reber

[permalink] [raw]
Subject: [PATCH 2/4] clone3: allow creation of time namespace with offset

This extends clone3() to support the time namespace via CLONE_NEWTIME.
In addition to creating a new process in a new time namespace this
allows setting the clock offset in the newly created time namspace.

The time namespace allows to set an offset for two clocks.
CLOCK_MONOTONIC and CLOCK_BOOTTIME.

This clone3() extension also offers setting both offsets through the
newly introduced clone_args members timens_offset and
timens_offset_size.

timens_offset: Pointer to an array of clock offsets for the
newly created process in a time namespaces.
This requires that a new time namespace has been
requested via CLONE_NEWTIME. It is only possible
to set an offset for CLOCK_MONOTONIC and
CLOCK_BOOTTIME. The array can therefore never
have more than two elements.
clone3() expects the array to contain the
following struct:
struct set_timens_offset {
int clockid;
struct timespec val;
};

timens_offset_size: This defines the size of the array referenced
in timens_offset. Currently this is limited
to two elements.

To create a new process using clone3() in a new time namespace with
clock offsets, something like this can be used:

struct set_timens_offset timens_offset[2];

timens_offset[0].clockid = CLOCK_BOOTTIME;
timens_offset[0].val.tv_sec = -1000;
timens_offset[0].val.tv_nsec = 42;
timens_offset[1].clockid = CLOCK_MONOTONIC;
timens_offset[1].val.tv_sec = 1000000;
timens_offset[1].val.tv_nsec = 37;

struct _clone_args args = {
.flags = CLONE_NEWTIME,
.timens_offset = ptr_to_u64(timens_offset),
.timens_offset_size = 2;
};

Although the command-line tool unshare only supports setting the offset
in seconds and not nanoseconds, it is possible to set nanoseconds using
the /proc interface of the time namespace. Therefore (and for CRIU) it
probably makes sense to offer nanoseconds also via clone3().

The main motivation for the clone3() integration is to enable CRIU to
restore time namespaces. Not sure how useful the restoration of the
nanosecond value for CRIU is, as during the time the process is paused
by CRIU the clock will continue to run, so that the nanosecond value
will not be the same as when the process has been paused. Between
pausing the process and reading all time namespace information for all
paused process the nanosecond value will change, so that CRIU will never
be able to restore a process with exactly the same nanosecond value the
process had when it was paused. It will make sure, however, that the
nanosecond value will not jump back. So maybe it is useful for CRIU in
the end.

Signed-off-by: Adrian Reber <[email protected]>
---
include/linux/nsproxy.h | 4 +++-
include/linux/sched/task.h | 3 +++
include/linux/time_namespace.h | 7 +++++--
include/uapi/linux/sched.h | 19 +++++++++++++++++
kernel/fork.c | 35 ++++++++++++++++++++++++++++++--
kernel/nsproxy.c | 5 +++--
kernel/time/namespace.c | 11 +++++++++-
tools/include/uapi/linux/sched.h | 1 +
8 files changed, 77 insertions(+), 8 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 074f395b9ad2..3c032bf7758a 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -11,6 +11,7 @@ struct ipc_namespace;
struct pid_namespace;
struct cgroup_namespace;
struct fs_struct;
+struct set_timens_offset;

/*
* A structure to contain pointers to all per-process
@@ -67,7 +68,8 @@ extern struct nsproxy init_nsproxy;
*
*/

-int copy_namespaces(unsigned long flags, struct task_struct *tsk);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+ struct set_timens_offset *offsets, int noffsets);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
void free_nsproxy(struct nsproxy *ns);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1879884238e..45c0e83d52ec 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -9,6 +9,7 @@

#include <linux/sched.h>
#include <linux/uaccess.h>
+#include <linux/time.h>

struct task_struct;
struct rusage;
@@ -29,6 +30,8 @@ struct kernel_clone_args {
pid_t *set_tid;
/* Number of elements in *set_tid */
size_t set_tid_size;
+ struct set_timens_offset *timens_offset;
+ size_t timens_offset_size;
};

/*
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index fb4ca4402a2a..9261ca1070c9 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -53,7 +53,8 @@ struct time_namespace *copy_time_ns(unsigned long flags,
struct user_namespace *user_ns,
struct time_namespace *old_ns);
void free_time_ns(struct kref *kref);
-int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
+int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk,
+ struct set_timens_offset *offsets, int noffsets);
struct vdso_data *arch_get_vdso_data(void *vvar_page);

static inline void put_time_ns(struct time_namespace *ns)
@@ -121,7 +122,9 @@ struct time_namespace *copy_time_ns(unsigned long flags,
}

static inline int timens_on_fork(struct nsproxy *nsproxy,
- struct task_struct *tsk)
+ struct task_struct *tsk,
+ struct set_timens_offset *offsets,
+ int noffsets)
{
return 0;
}
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 2e3bc22c6f20..c9b36ef647cc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -81,6 +81,22 @@
* @set_tid_size: This defines the size of the array referenced
* in @set_tid. This cannot be larger than the
* kernel's limit of nested PID namespaces.
+ * @timens_offset: Pointer to an array of clock offsets for the
+ * newly created process in a time namespaces.
+ * This requires that a new time namespace has been
+ * requested via CLONE_NEWTIME. It is only possible
+ * to set an offset for CLOCK_MONOTONIC and
+ * CLOCK_BOOTTIME. The array can therefore never
+ * have more than two elements.
+ * clone3() expects the array to contain the
+ * following struct:
+ * struct set_timens_offset {
+ * int clockid;
+ * struct timespec val;
+ * };
+ * @timens_offset_size: This defines the size of the array referenced
+ * in @timens_offset. Currently this is limited
+ * to two elements.
*
* The structure is versioned by size and thus extensible.
* New struct members must go at the end of the struct and
@@ -97,11 +113,14 @@ struct clone_args {
__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_u64 timens_offset;
+ __aligned_u64 timens_offset_size;
};
#endif

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2 96 /* sizeof third published struct */

/*
* Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..f80aca79d263 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -94,6 +94,7 @@
#include <linux/thread_info.h>
#include <linux/stackleak.h>
#include <linux/kasan.h>
+#include <linux/time_namespace.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -2081,7 +2082,8 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_mm(clone_flags, p);
if (retval)
goto bad_fork_cleanup_signal;
- retval = copy_namespaces(clone_flags, p);
+ retval = copy_namespaces(clone_flags, p, args->timens_offset,
+ args->timens_offset_size);
if (retval)
goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
@@ -2604,6 +2606,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
int err;
struct clone_args args;
pid_t *kset_tid = kargs->set_tid;
+ struct set_timens_offset *ktimens_offset = kargs->timens_offset;

if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
@@ -2623,6 +2626,23 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
if (unlikely(args.set_tid && args.set_tid_size == 0))
return -EINVAL;

+ /*
+ * Currently the time namespace supports setting two clocks:
+ * CLOCK_MONOTONIC and CLOCK_BOOTTIME
+ */
+ if (unlikely(args.timens_offset_size > 2))
+ return -EINVAL;
+
+ if (unlikely(!args.timens_offset && args.timens_offset_size > 0))
+ return -EINVAL;
+
+ if (unlikely(args.timens_offset && args.timens_offset_size == 0))
+ return -EINVAL;
+
+ if (unlikely((args.timens_offset && args.timens_offset_size)
+ && !(args.flags & CLONE_NEWTIME)))
+ return -EINVAL;
+
/*
* Verify that higher 32bits of exit_signal are unset and that
* it is a valid signal
@@ -2641,6 +2661,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.stack_size = args.stack_size,
.tls = args.tls,
.set_tid_size = args.set_tid_size,
+ .timens_offset_size = args.timens_offset_size,
};

if (args.set_tid &&
@@ -2648,7 +2669,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
(kargs->set_tid_size * sizeof(pid_t))))
return -EFAULT;

+ if (args.timens_offset &&
+ copy_from_user(ktimens_offset,
+ u64_to_user_ptr(args.timens_offset),
+ (kargs->timens_offset_size *
+ sizeof(struct set_timens_offset))))
+ return -EFAULT;
+
kargs->set_tid = kset_tid;
+ kargs->timens_offset = ktimens_offset;

return 0;
}
@@ -2691,7 +2720,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
* - make the CLONE_DETACHED bit reuseable for clone3
* - make the CSIGNAL bits reuseable for clone3
*/
- if (kargs->flags & (CLONE_DETACHED | CSIGNAL))
+ if (kargs->flags & ((CLONE_DETACHED | CSIGNAL) & ~CLONE_NEWTIME))
return false;

if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) ==
@@ -2725,8 +2754,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)

struct kernel_clone_args kargs;
pid_t set_tid[MAX_PID_NS_LEVEL];
+ struct set_timens_offset timens_offset[2];

kargs.set_tid = set_tid;
+ kargs.timens_offset = timens_offset;

err = copy_clone_args_from_user(&kargs, uargs, size);
if (err)
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ed9882108cd2..0f041ae3313a 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -146,7 +146,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
* called from clone. This now handles copy for nsproxy and all
* namespaces therein.
*/
-int copy_namespaces(unsigned long flags, struct task_struct *tsk)
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+ struct set_timens_offset *offsets, int noffsets)
{
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
@@ -178,7 +179,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);

- ret = timens_on_fork(new_ns, tsk);
+ ret = timens_on_fork(new_ns, tsk, offsets, noffsets);
if (ret) {
free_nsproxy(new_ns);
return ret;
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 839efa7c6886..f4fdf9b0b411 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -370,7 +370,8 @@ static int timens_set_offset(struct task_struct *p, struct time_namespace *ns,
return err;
}

-int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk,
+ struct set_timens_offset *offsets, int noffsets)
{
struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
struct time_namespace *ns = to_time_ns(nsc);
@@ -380,6 +381,14 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
if (nsproxy->time_ns == nsproxy->time_ns_for_children)
return 0;

+ if (noffsets) {
+ if (!ns_capable(ns->user_ns, CAP_SYS_TIME))
+ return -EPERM;
+ err = timens_set_offset(tsk, ns, offsets, noffsets);
+ if (err)
+ return err;
+ }
+
timens_set_vvar_page(tsk, ns);

err = vdso_join_timens(tsk, ns);
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 2e3bc22c6f20..501b44e6be97 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -102,6 +102,7 @@ struct clone_args {

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2 96 /* sizeof third published struct */

/*
* Scheduling policies
--
2.24.1

2020-03-17 08:34:54

by Adrian Reber

[permalink] [raw]
Subject: [PATCH 3/4] clone3: align structs and comments

To follow the existing style this commit aligns the comments and
structure settings to have the same look as before. Just with more
spaces.

Signed-off-by: Adrian Reber <[email protected]>
---
include/uapi/linux/sched.h | 72 +++++++++++++++++++-------------------
kernel/fork.c | 18 +++++-----
2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index c9b36ef647cc..86387052de19 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -45,42 +45,42 @@
#ifndef __ASSEMBLY__
/**
* struct clone_args - arguments for the clone3 syscall
- * @flags: Flags for the new process as listed above.
- * All flags are valid except for CSIGNAL and
- * CLONE_DETACHED.
- * @pidfd: If CLONE_PIDFD is set, a pidfd will be
- * returned in this argument.
- * @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
- * child process will be returned in the child's
- * memory.
- * @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
- * the child process will be returned in the
- * parent's memory.
- * @exit_signal: The exit_signal the parent process will be
- * sent when the child exits.
- * @stack: Specify the location of the stack for the
- * child process.
- * Note, @stack is expected to point to the
- * lowest address. The stack direction will be
- * determined by the kernel and set up
- * appropriately based on @stack_size.
- * @stack_size: The size of the stack for the child process.
- * @tls: If CLONE_SETTLS is set, the tls descriptor
- * is set to tls.
- * @set_tid: Pointer to an array of type *pid_t. The size
- * of the array is defined using @set_tid_size.
- * This array is used to select PIDs/TIDs for
- * newly created processes. The first element in
- * this defines the PID in the most nested PID
- * namespace. Each additional element in the array
- * defines the PID in the parent PID namespace of
- * the original PID namespace. If the array has
- * less entries than the number of currently
- * nested PID namespaces only the PIDs in the
- * corresponding namespaces are set.
- * @set_tid_size: This defines the size of the array referenced
- * in @set_tid. This cannot be larger than the
- * kernel's limit of nested PID namespaces.
+ * @flags: Flags for the new process as listed above.
+ * All flags are valid except for CSIGNAL and
+ * CLONE_DETACHED.
+ * @pidfd: If CLONE_PIDFD is set, a pidfd will be
+ * returned in this argument.
+ * @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
+ * child process will be returned in the child's
+ * memory.
+ * @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
+ * the child process will be returned in the
+ * parent's memory.
+ * @exit_signal: The exit_signal the parent process will be
+ * sent when the child exits.
+ * @stack: Specify the location of the stack for the
+ * child process.
+ * Note, @stack is expected to point to the
+ * lowest address. The stack direction will be
+ * determined by the kernel and set up
+ * appropriately based on @stack_size.
+ * @stack_size: The size of the stack for the child process.
+ * @tls: If CLONE_SETTLS is set, the tls descriptor
+ * is set to tls.
+ * @set_tid: Pointer to an array of type *pid_t. The size
+ * of the array is defined using @set_tid_size.
+ * This array is used to select PIDs/TIDs for
+ * newly created processes. The first element in
+ * this defines the PID in the most nested PID
+ * namespace. Each additional element in the array
+ * defines the PID in the parent PID namespace of
+ * the original PID namespace. If the array has
+ * less entries than the number of currently
+ * nested PID namespaces only the PIDs in the
+ * corresponding namespaces are set.
+ * @set_tid_size: This defines the size of the array referenced
+ * in @set_tid. This cannot be larger than the
+ * kernel's limit of nested PID namespaces.
* @timens_offset: Pointer to an array of clock offsets for the
* newly created process in a time namespaces.
* This requires that a new time namespace has been
diff --git a/kernel/fork.c b/kernel/fork.c
index f80aca79d263..6e9acac1d347 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2652,15 +2652,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
return -EINVAL;

*kargs = (struct kernel_clone_args){
- .flags = args.flags,
- .pidfd = u64_to_user_ptr(args.pidfd),
- .child_tid = u64_to_user_ptr(args.child_tid),
- .parent_tid = u64_to_user_ptr(args.parent_tid),
- .exit_signal = args.exit_signal,
- .stack = args.stack,
- .stack_size = args.stack_size,
- .tls = args.tls,
- .set_tid_size = args.set_tid_size,
+ .flags = args.flags,
+ .pidfd = u64_to_user_ptr(args.pidfd),
+ .child_tid = u64_to_user_ptr(args.child_tid),
+ .parent_tid = u64_to_user_ptr(args.parent_tid),
+ .exit_signal = args.exit_signal,
+ .stack = args.stack,
+ .stack_size = args.stack_size,
+ .tls = args.tls,
+ .set_tid_size = args.set_tid_size,
.timens_offset_size = args.timens_offset_size,
};

--
2.24.1

2020-03-17 08:35:48

by Adrian Reber

[permalink] [raw]
Subject: [PATCH 1/4] ns: prepare time namespace for clone3()

To enable clone3()ing a process directly into a new time namespace with
a clock offset, this changes the time namespace code by renaming the
variable proc_timens_offset (which was used to set a timens offset via
/proc) to set_timens_offset to avoid confusion why it will be used in
clone3() without being related to /proc.

This also moves out the code of actually setting the clock offsets to
its own function to be later used via clone3().

Signed-off-by: Adrian Reber <[email protected]>
---
fs/proc/base.c | 4 +-
include/linux/time_namespace.h | 16 +++--
kernel/time/namespace.c | 126 +++++++++++++++++----------------
3 files changed, 78 insertions(+), 68 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..2ca68f11ff0e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1554,7 +1554,7 @@ static ssize_t timens_offsets_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct inode *inode = file_inode(file);
- struct proc_timens_offset offsets[2];
+ struct set_timens_offset offsets[2];
char *kbuf = NULL, *pos, *next_line;
struct task_struct *p;
int ret, noffsets;
@@ -1572,7 +1572,7 @@ static ssize_t timens_offsets_write(struct file *file, const char __user *buf,
ret = -EINVAL;
noffsets = 0;
for (pos = kbuf; pos; pos = next_line) {
- struct proc_timens_offset *off = &offsets[noffsets];
+ struct set_timens_offset *off = &offsets[noffsets];
int err;

/* Find the end of line and ensure we don't look past it */
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 824d54e057eb..fb4ca4402a2a 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -30,6 +30,15 @@ struct time_namespace {

extern struct time_namespace init_time_ns;

+/*
+ * This structure is used to set the time namespace offset
+ * via /proc as well as via clone3().
+ */
+struct set_timens_offset {
+ int clockid;
+ struct timespec64 val;
+};
+
#ifdef CONFIG_TIME_NS
extern int vdso_join_timens(struct task_struct *task,
struct time_namespace *ns);
@@ -54,13 +63,8 @@ static inline void put_time_ns(struct time_namespace *ns)

void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m);

-struct proc_timens_offset {
- int clockid;
- struct timespec64 val;
-};
-
int proc_timens_set_offset(struct file *file, struct task_struct *p,
- struct proc_timens_offset *offsets, int n);
+ struct set_timens_offset *offsets, int n);

static inline void timens_add_monotonic(struct timespec64 *ts)
{
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12858507d75a..839efa7c6886 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -307,6 +307,69 @@ static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
return 0;
}

+static int timens_set_offset(struct task_struct *p, struct time_namespace *ns,
+ struct set_timens_offset *offsets, int noffsets)
+{
+ struct timespec64 tp;
+ int i, err;
+
+ for (i = 0; i < noffsets; i++) {
+ struct set_timens_offset *off = &offsets[i];
+
+ switch (off->clockid) {
+ case CLOCK_MONOTONIC:
+ ktime_get_ts64(&tp);
+ break;
+ case CLOCK_BOOTTIME:
+ ktime_get_boottime_ts64(&tp);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (off->val.tv_sec > KTIME_SEC_MAX ||
+ off->val.tv_sec < -KTIME_SEC_MAX)
+ return -ERANGE;
+
+ tp = timespec64_add(tp, off->val);
+ /*
+ * KTIME_SEC_MAX is divided by 2 to be sure that KTIME_MAX is
+ * still unreachable.
+ */
+ if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
+ return -ERANGE;
+ }
+
+ mutex_lock(&offset_lock);
+ if (ns->frozen_offsets) {
+ err = -EACCES;
+ goto out_unlock;
+ }
+
+ err = 0;
+ /* Don't report errors after this line */
+ for (i = 0; i < noffsets; i++) {
+ struct set_timens_offset *off = &offsets[i];
+ struct timespec64 *offset = NULL;
+
+ switch (off->clockid) {
+ case CLOCK_MONOTONIC:
+ offset = &ns->offsets.monotonic;
+ break;
+ case CLOCK_BOOTTIME:
+ offset = &ns->offsets.boottime;
+ break;
+ }
+
+ *offset = off->val;
+ }
+
+out_unlock:
+ mutex_unlock(&offset_lock);
+
+ return err;
+}
+
int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
{
struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
@@ -356,12 +419,11 @@ void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m)
}

int proc_timens_set_offset(struct file *file, struct task_struct *p,
- struct proc_timens_offset *offsets, int noffsets)
+ struct set_timens_offset *offsets, int noffsets)
{
struct ns_common *ns;
struct time_namespace *time_ns;
- struct timespec64 tp;
- int i, err;
+ int err;

ns = timens_for_children_get(p);
if (!ns)
@@ -373,63 +435,7 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
return -EPERM;
}

- for (i = 0; i < noffsets; i++) {
- struct proc_timens_offset *off = &offsets[i];
-
- switch (off->clockid) {
- case CLOCK_MONOTONIC:
- ktime_get_ts64(&tp);
- break;
- case CLOCK_BOOTTIME:
- ktime_get_boottime_ts64(&tp);
- break;
- default:
- err = -EINVAL;
- goto out;
- }
-
- err = -ERANGE;
-
- if (off->val.tv_sec > KTIME_SEC_MAX ||
- off->val.tv_sec < -KTIME_SEC_MAX)
- goto out;
-
- tp = timespec64_add(tp, off->val);
- /*
- * KTIME_SEC_MAX is divided by 2 to be sure that KTIME_MAX is
- * still unreachable.
- */
- if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
- goto out;
- }
-
- mutex_lock(&offset_lock);
- if (time_ns->frozen_offsets) {
- err = -EACCES;
- goto out_unlock;
- }
-
- err = 0;
- /* Don't report errors after this line */
- for (i = 0; i < noffsets; i++) {
- struct proc_timens_offset *off = &offsets[i];
- struct timespec64 *offset = NULL;
-
- switch (off->clockid) {
- case CLOCK_MONOTONIC:
- offset = &time_ns->offsets.monotonic;
- break;
- case CLOCK_BOOTTIME:
- offset = &time_ns->offsets.boottime;
- break;
- }
-
- *offset = off->val;
- }
-
-out_unlock:
- mutex_unlock(&offset_lock);
-out:
+ err = timens_set_offset(p, time_ns, offsets, noffsets);
put_time_ns(time_ns);

return err;

base-commit: 8b614cb8f1dcac8ca77cf4dd85f46ef3055f8238
--
2.24.1

2020-03-17 08:43:09

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 17, 2020 at 09:30:40AM +0100, Adrian Reber wrote:
> This is an attempt to add time namespace support to clone3(). I am not
> really sure which way clone3() should handle time namespaces. The time
> namespace through /proc cannot be used with clone3() because the offsets
> for the time namespace need to be written before a process has been
> created in that time namespace. This means it is necessary to somehow
> tell clone3() the offsets for the clocks.
>
> The time namespace offers the possibility to set offsets for
> CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> '__aligned_u64 boottime_offset'. The problem with this approach was that
> it was not possible to set nanoseconds for the clocks in the time
> namespace.
>
> One of the motivations for clone3() with CLONE_NEWTIME was to enable
> CRIU to restore a process in a time namespace with the corresponding
> offsets. And although the nanosecond value can probably never be
> restored to the same value it had during checkpointing, because the
> clock keeps on running between CRIU pausing all processes and CRIU
> actually reading the value of the clocks, the nanosecond value is still
> necessary for CRIU to not restore a process where the clock jumps back
> due to CRIU restoring it with a nanonsecond value that is too small.
>
> Requiring nanoseconds as well as seconds for two clocks during clone3()
> means that it would require 4 additional members to 'struct clone_args':
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 boottime_offset_seconds;
> + __aligned_u64 boottime_offset_nanoseconds;
> + __aligned_u64 monotonic_offset_seconds;
> + __aligned_u64 monotonic_offset_nanoseconds;
> };
>
> To avoid four additional members to 'struct clone_args' this patchset
> uses another approach:
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 timens_offset;
> + __aligned_u64 timens_offset_size;

Hm, so for set_tid we did set_tid and set_tid_size which makes sense
because set_tid wasn't actually a struct. But I'm not a fan of
establishing a pattern whereby we always have to grow two member, the
object and it's size; at least when we're adding a struct.
So at a first glance here are two possible ideas:
- Don't add a size argument and assume that struct timens_offset won't
grow. I'm not sure how likely it is it will grow.
- Make the size the first member of struct timens_offset the size of the
struct. (See examples for this pattern in the sched syscalls.)

Christian

2020-03-17 08:45:11

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 17, 2020 at 09:41:55AM +0100, Christian Brauner wrote:
> On Tue, Mar 17, 2020 at 09:30:40AM +0100, Adrian Reber wrote:
> > This is an attempt to add time namespace support to clone3(). I am not
> > really sure which way clone3() should handle time namespaces. The time
> > namespace through /proc cannot be used with clone3() because the offsets
> > for the time namespace need to be written before a process has been
> > created in that time namespace. This means it is necessary to somehow
> > tell clone3() the offsets for the clocks.
> >
> > The time namespace offers the possibility to set offsets for
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> > 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> > '__aligned_u64 boottime_offset'. The problem with this approach was that
> > it was not possible to set nanoseconds for the clocks in the time
> > namespace.
> >
> > One of the motivations for clone3() with CLONE_NEWTIME was to enable
> > CRIU to restore a process in a time namespace with the corresponding
> > offsets. And although the nanosecond value can probably never be
> > restored to the same value it had during checkpointing, because the
> > clock keeps on running between CRIU pausing all processes and CRIU
> > actually reading the value of the clocks, the nanosecond value is still
> > necessary for CRIU to not restore a process where the clock jumps back
> > due to CRIU restoring it with a nanonsecond value that is too small.
> >
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 boottime_offset_seconds;
> > + __aligned_u64 boottime_offset_nanoseconds;
> > + __aligned_u64 monotonic_offset_seconds;
> > + __aligned_u64 monotonic_offset_nanoseconds;
> > };
> >
> > To avoid four additional members to 'struct clone_args' this patchset
> > uses another approach:
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 timens_offset;
> > + __aligned_u64 timens_offset_size;
>
> Hm, so for set_tid we did set_tid and set_tid_size which makes sense
> because set_tid wasn't actually a struct. But I'm not a fan of
> establishing a pattern whereby we always have to grow two member, the
> object and it's size; at least when we're adding a struct.
> So at a first glance here are two possible ideas:
> - Don't add a size argument and assume that struct timens_offset won't
> grow. I'm not sure how likely it is it will grow.
> - Make the size the first member of struct timens_offset the size of the
> struct. (See examples for this pattern in the sched syscalls.)

Oh, and I should point out right way that I consider this material for
the v5.8 merge window. It's too late in this cycle to land this with any
confidence in v5.7. Just so there's no disappointment. :) The good news
is that this leaves us with ample time to figure this out.

Christian

Subject: Re: clone3: allow creation of time namespace with offset

[CC += linux-api; please CC on future versions]

On Tue, 17 Mar 2020 at 09:32, Adrian Reber <[email protected]> wrote:
>
> This is an attempt to add time namespace support to clone3(). I am not
> really sure which way clone3() should handle time namespaces. The time
> namespace through /proc cannot be used with clone3() because the offsets
> for the time namespace need to be written before a process has been
> created in that time namespace. This means it is necessary to somehow
> tell clone3() the offsets for the clocks.
>
> The time namespace offers the possibility to set offsets for
> CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> '__aligned_u64 boottime_offset'. The problem with this approach was that
> it was not possible to set nanoseconds for the clocks in the time
> namespace.
>
> One of the motivations for clone3() with CLONE_NEWTIME was to enable
> CRIU to restore a process in a time namespace with the corresponding
> offsets. And although the nanosecond value can probably never be
> restored to the same value it had during checkpointing, because the
> clock keeps on running between CRIU pausing all processes and CRIU
> actually reading the value of the clocks, the nanosecond value is still
> necessary for CRIU to not restore a process where the clock jumps back
> due to CRIU restoring it with a nanonsecond value that is too small.
>
> Requiring nanoseconds as well as seconds for two clocks during clone3()
> means that it would require 4 additional members to 'struct clone_args':
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 boottime_offset_seconds;
> + __aligned_u64 boottime_offset_nanoseconds;
> + __aligned_u64 monotonic_offset_seconds;
> + __aligned_u64 monotonic_offset_nanoseconds;
> };
>
> To avoid four additional members to 'struct clone_args' this patchset
> uses another approach:
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 timens_offset;
> + __aligned_u64 timens_offset_size;
> };
>
> timens_offset is a pointer to an array just as previously done with
> set_tid and timens_offset_size is the size of the array.
>
> The timens_offset array is expected to contain a struct like this:
>
> struct set_timens_offset {
> int clockid;
> struct timespec val;
> };
>
> This way it is possible to pass the information of multiple clocks with
> seconds and nanonseconds to clone3().
>
> To me this seems the better approach, but I am not totally convinced
> that it is the right thing. If there are other ideas how to pass two
> clock offsets with seconds and nanonseconds to clone3() I would be happy
> to hear other ideas.
>
> Adrian
>
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-03-17 16:11:16

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Wed, Mar 18, 2020 at 01:23:50AM +1100, Aleksa Sarai wrote:
> On 2020-03-17, Michael Kerrisk (man-pages) <[email protected]> wrote:
> > [CC += linux-api; please CC on future versions]
> >
> > On Tue, 17 Mar 2020 at 09:32, Adrian Reber <[email protected]> wrote:
> > > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > > means that it would require 4 additional members to 'struct clone_args':
> > >
> > > __aligned_u64 tls;
> > > __aligned_u64 set_tid;
> > > __aligned_u64 set_tid_size;
> > > + __aligned_u64 boottime_offset_seconds;
> > > + __aligned_u64 boottime_offset_nanoseconds;
> > > + __aligned_u64 monotonic_offset_seconds;
> > > + __aligned_u64 monotonic_offset_nanoseconds;
> > > };
> > >
> > > To avoid four additional members to 'struct clone_args' this patchset
> > > uses another approach:
> > >
> > > __aligned_u64 tls;
> > > __aligned_u64 set_tid;
> > > __aligned_u64 set_tid_size;
> > > + __aligned_u64 timens_offset;
> > > + __aligned_u64 timens_offset_size;
> > > };
> > >
> > > timens_offset is a pointer to an array just as previously done with
> > > set_tid and timens_offset_size is the size of the array.
> > >
> > > The timens_offset array is expected to contain a struct like this:
> > >
> > > struct set_timens_offset {
> > > int clockid;
> > > struct timespec val;
> > > };
> > >
> > > This way it is possible to pass the information of multiple clocks with
> > > seconds and nanonseconds to clone3().
> > >
> > > To me this seems the better approach, but I am not totally convinced
> > > that it is the right thing. If there are other ideas how to pass two
> > > clock offsets with seconds and nanonseconds to clone3() I would be happy
> > > to hear other ideas.
>
> While I agree this does make the API cleaner, I am a little worried that
> it risks killing some of the ideas we discussed for seccomp deep
> inspection. In particular, having a pointer to variable-sized data
> inside the struct means that now the cBPF program can't just be given a
> copy of the struct data from userspace to check.

I suggested two alternative approaches in a response to this. The
easiest one would be to simple assume that the struct doesn't change
size.
(But haven't we crossed that bridge with the set_tid array already?)

2020-03-18 10:19:54

by Arnd Bergmann

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 17, 2020 at 9:32 AM Adrian Reber <[email protected]> wrote:
>
> This is an attempt to add time namespace support to clone3(). I am not
> really sure which way clone3() should handle time namespaces. The time
> namespace through /proc cannot be used with clone3() because the offsets
> for the time namespace need to be written before a process has been
> created in that time namespace. This means it is necessary to somehow
> tell clone3() the offsets for the clocks.
>
> The time namespace offers the possibility to set offsets for
> CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> '__aligned_u64 boottime_offset'. The problem with this approach was that
> it was not possible to set nanoseconds for the clocks in the time
> namespace.
>
> One of the motivations for clone3() with CLONE_NEWTIME was to enable
> CRIU to restore a process in a time namespace with the corresponding
> offsets. And although the nanosecond value can probably never be
> restored to the same value it had during checkpointing, because the
> clock keeps on running between CRIU pausing all processes and CRIU
> actually reading the value of the clocks, the nanosecond value is still
> necessary for CRIU to not restore a process where the clock jumps back
> due to CRIU restoring it with a nanonsecond value that is too small.
>
> Requiring nanoseconds as well as seconds for two clocks during clone3()
> means that it would require 4 additional members to 'struct clone_args':
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 boottime_offset_seconds;
> + __aligned_u64 boottime_offset_nanoseconds;
> + __aligned_u64 monotonic_offset_seconds;
> + __aligned_u64 monotonic_offset_nanoseconds;
> };

Wouldn't it be sufficient to have the two nanosecond values, rather
than both seconds and nanoseconds? With 64-bit nanoseconds
you can represent several hundred years, and these would
always start at zero during boot.

Regardless of this, I think you need a signed offset, not unsigned.

Arnd

2020-03-18 10:58:59

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Tue, Mar 17, 2020 at 09:30:41AM +0100, Adrian Reber wrote:
...
> +/*
> + * This structure is used to set the time namespace offset
> + * via /proc as well as via clone3().
> + */
> +struct set_timens_offset {
> + int clockid;
> + struct timespec64 val;
> +};
> +

I'm sorry, I didn't follow this series much so can't comment right now.
Still this structure made me a bit wonder -- the @val seems to be 8 byte
aligned and I guess we have a useless pad between these members. If so
can we swap them? Or it is already part of api?

2020-03-18 11:18:31

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Wed, Mar 18, 2020 at 01:57:47PM +0300, Cyrill Gorcunov wrote:
> On Tue, Mar 17, 2020 at 09:30:41AM +0100, Adrian Reber wrote:
> ...
> > +/*
> > + * This structure is used to set the time namespace offset
> > + * via /proc as well as via clone3().
> > + */
> > +struct set_timens_offset {
> > + int clockid;
> > + struct timespec64 val;
> > +};
> > +
>
> I'm sorry, I didn't follow this series much so can't comment right now.
> Still this structure made me a bit wonder -- the @val seems to be 8 byte
> aligned and I guess we have a useless pad between these members. If so
> can we swap them? Or it is already part of api?

It's not part of the api yet. We're still arguing about how and what we
want to pass down.

2020-03-18 11:29:02

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Wed, Mar 18, 2020 at 12:17:26PM +0100, Christian Brauner wrote:
> On Wed, Mar 18, 2020 at 01:57:47PM +0300, Cyrill Gorcunov wrote:
> > On Tue, Mar 17, 2020 at 09:30:41AM +0100, Adrian Reber wrote:
> > ...
> > > +/*
> > > + * This structure is used to set the time namespace offset
> > > + * via /proc as well as via clone3().
> > > + */
> > > +struct set_timens_offset {
> > > + int clockid;
> > > + struct timespec64 val;
> > > +};
> > > +
> >
> > I'm sorry, I didn't follow this series much so can't comment right now.
> > Still this structure made me a bit wonder -- the @val seems to be 8 byte
> > aligned and I guess we have a useless pad between these members. If so
> > can we swap them? Or it is already part of api?
>
> It's not part of the api yet. We're still arguing about how and what we
> want to pass down.

I see, thanks for info! So I think we could swap the members.

2020-03-18 11:58:45

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Wed, Mar 18, 2020 at 02:28:14PM +0300, Cyrill Gorcunov wrote:
> On Wed, Mar 18, 2020 at 12:17:26PM +0100, Christian Brauner wrote:
> > On Wed, Mar 18, 2020 at 01:57:47PM +0300, Cyrill Gorcunov wrote:
> > > On Tue, Mar 17, 2020 at 09:30:41AM +0100, Adrian Reber wrote:
> > > ...
> > > > +/*
> > > > + * This structure is used to set the time namespace offset
> > > > + * via /proc as well as via clone3().
> > > > + */
> > > > +struct set_timens_offset {
> > > > + int clockid;
> > > > + struct timespec64 val;
> > > > +};
> > > > +
> > >
> > > I'm sorry, I didn't follow this series much so can't comment right now.
> > > Still this structure made me a bit wonder -- the @val seems to be 8 byte
> > > aligned and I guess we have a useless pad between these members. If so
> > > can we swap them? Or it is already part of api?
> >
> > It's not part of the api yet. We're still arguing about how and what we
> > want to pass down.
>
> I see, thanks for info! So I think we could swap the members.

Yeah, _if_ we go with a version of this struct we definitely need to
ensure proper padding/alignment!

2020-03-18 11:59:10

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Wed, Mar 18, 2020 at 12:57:32PM +0100, Christian Brauner wrote:
> On Wed, Mar 18, 2020 at 02:28:14PM +0300, Cyrill Gorcunov wrote:
> > On Wed, Mar 18, 2020 at 12:17:26PM +0100, Christian Brauner wrote:
> > > On Wed, Mar 18, 2020 at 01:57:47PM +0300, Cyrill Gorcunov wrote:
> > > > On Tue, Mar 17, 2020 at 09:30:41AM +0100, Adrian Reber wrote:
> > > > ...
> > > > > +/*
> > > > > + * This structure is used to set the time namespace offset
> > > > > + * via /proc as well as via clone3().
> > > > > + */
> > > > > +struct set_timens_offset {
> > > > > + int clockid;
> > > > > + struct timespec64 val;
> > > > > +};
> > > > > +
> > > >
> > > > I'm sorry, I didn't follow this series much so can't comment right now.
> > > > Still this structure made me a bit wonder -- the @val seems to be 8 byte
> > > > aligned and I guess we have a useless pad between these members. If so
> > > > can we swap them? Or it is already part of api?
> > >
> > > It's not part of the api yet. We're still arguing about how and what we
> > > want to pass down.
> >
> > I see, thanks for info! So I think we could swap the members.
>
> Yeah, _if_ we go with a version of this struct we definitely need to
> ensure proper padding/alignment!

Input/ideas totally welcome of course!

2020-03-18 12:09:41

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 1/4] ns: prepare time namespace for clone3()

On Wed, Mar 18, 2020 at 12:58:00PM +0100, Christian Brauner wrote:
> > >
> > > I see, thanks for info! So I think we could swap the members.
> >
> > Yeah, _if_ we go with a version of this struct we definitely need to
> > ensure proper padding/alignment!
>
> Input/ideas totally welcome of course!

I'll try to take a look once time permit, but no promises though.

2020-03-18 12:14:02

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/4] clone3: allow creation of time namespace with offset

On Tue, Mar 17, 2020 at 09:30:42AM +0100, Adrian Reber wrote:
> This extends clone3() to support the time namespace via CLONE_NEWTIME.
> In addition to creating a new process in a new time namespace this
> allows setting the clock offset in the newly created time namspace.
>
> The time namespace allows to set an offset for two clocks.
> CLOCK_MONOTONIC and CLOCK_BOOTTIME.
>
> This clone3() extension also offers setting both offsets through the
> newly introduced clone_args members timens_offset and
> timens_offset_size.
>
> timens_offset: Pointer to an array of clock offsets for the
> newly created process in a time namespaces.
> This requires that a new time namespace has been
> requested via CLONE_NEWTIME. It is only possible
> to set an offset for CLOCK_MONOTONIC and
> CLOCK_BOOTTIME. The array can therefore never
> have more than two elements.
> clone3() expects the array to contain the
> following struct:
> struct set_timens_offset {
> int clockid;
> struct timespec val;
> };
>
> timens_offset_size: This defines the size of the array referenced
> in timens_offset. Currently this is limited
> to two elements.
>
> To create a new process using clone3() in a new time namespace with
> clock offsets, something like this can be used:
>
> struct set_timens_offset timens_offset[2];
>
> timens_offset[0].clockid = CLOCK_BOOTTIME;
> timens_offset[0].val.tv_sec = -1000;
> timens_offset[0].val.tv_nsec = 42;
> timens_offset[1].clockid = CLOCK_MONOTONIC;
> timens_offset[1].val.tv_sec = 1000000;
> timens_offset[1].val.tv_nsec = 37;
>
> struct _clone_args args = {
> .flags = CLONE_NEWTIME,
> .timens_offset = ptr_to_u64(timens_offset),
> .timens_offset_size = 2;
> };

In all honesty, this would be a terrible API and I think we need to come
up with something better than this. I don't want to pass down an array
of structs and in general would like to avoid this array + size pattern.
That pattern kinda made sense for the pid array because of pid
namespaces being nested but not for this case, I think. Also, why
require the additional clockid argument here? That makes sense for
clock_settime() and clock_gettime() but here we could just do:

struct timens {
struct timespec clock_bootime;
struct timespec clock_monotonic;
};

no? And since you need to expose that struct in a header somewhere
anyway you can version it by size just like clone_args. So the kernel
can apply the same pattern to be backwards compatible that we have with
struct clone_args and for openat2()'s struct open_how via
copy_struct_from_user. Then you only need one additional pointer in
struct clone_args.

What do we think?

Christian

2020-03-19 08:12:52

by Adrian Reber

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Wed, Mar 18, 2020 at 11:18:53AM +0100, Arnd Bergmann wrote:
> On Tue, Mar 17, 2020 at 9:32 AM Adrian Reber <[email protected]> wrote:
> >
> > This is an attempt to add time namespace support to clone3(). I am not
> > really sure which way clone3() should handle time namespaces. The time
> > namespace through /proc cannot be used with clone3() because the offsets
> > for the time namespace need to be written before a process has been
> > created in that time namespace. This means it is necessary to somehow
> > tell clone3() the offsets for the clocks.
> >
> > The time namespace offers the possibility to set offsets for
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> > 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> > '__aligned_u64 boottime_offset'. The problem with this approach was that
> > it was not possible to set nanoseconds for the clocks in the time
> > namespace.
> >
> > One of the motivations for clone3() with CLONE_NEWTIME was to enable
> > CRIU to restore a process in a time namespace with the corresponding
> > offsets. And although the nanosecond value can probably never be
> > restored to the same value it had during checkpointing, because the
> > clock keeps on running between CRIU pausing all processes and CRIU
> > actually reading the value of the clocks, the nanosecond value is still
> > necessary for CRIU to not restore a process where the clock jumps back
> > due to CRIU restoring it with a nanonsecond value that is too small.
> >
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 boottime_offset_seconds;
> > + __aligned_u64 boottime_offset_nanoseconds;
> > + __aligned_u64 monotonic_offset_seconds;
> > + __aligned_u64 monotonic_offset_nanoseconds;
> > };
>
> Wouldn't it be sufficient to have the two nanosecond values, rather
> than both seconds and nanoseconds? With 64-bit nanoseconds
> you can represent several hundred years, and these would
> always start at zero during boot.

I like this. Just using nanoseconds will make it easier and should
indeed be enough.

> Regardless of this, I think you need a signed offset, not unsigned.

Right, that was just a quick test at some point.

Christian and I have also been discussing this a bit and Christian
prefers a pointer to a struct. Maybe something like this:

__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_u64 timens_offset;
};

With Arnd's idea of only using nanoseconds, timens_offset would then
contain something like this:

struct timens_offset {
__aligned_s64 monotonic_offset_ns;
__aligned_s64 boottime_offset_ns;
};

I kind of prefer adding boottime and monotonic directly to struct clone_args

__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_s64 monotonic_offset_ns;
+ __aligned_s64 boottime_offset_ns;
};

But setting the time namespace offset is probably something which does
not happen very often while using clone3(), so maybe the pointer to a
struct approach is better.

I will resend the patches using the pointer to a struct approach if
there are no other ideas how to do this.

Adrian

2020-03-19 08:12:58

by Aleksa Sarai

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On 2020-03-17, Michael Kerrisk (man-pages) <[email protected]> wrote:
> [CC += linux-api; please CC on future versions]
>
> On Tue, 17 Mar 2020 at 09:32, Adrian Reber <[email protected]> wrote:
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 boottime_offset_seconds;
> > + __aligned_u64 boottime_offset_nanoseconds;
> > + __aligned_u64 monotonic_offset_seconds;
> > + __aligned_u64 monotonic_offset_nanoseconds;
> > };
> >
> > To avoid four additional members to 'struct clone_args' this patchset
> > uses another approach:
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 timens_offset;
> > + __aligned_u64 timens_offset_size;
> > };
> >
> > timens_offset is a pointer to an array just as previously done with
> > set_tid and timens_offset_size is the size of the array.
> >
> > The timens_offset array is expected to contain a struct like this:
> >
> > struct set_timens_offset {
> > int clockid;
> > struct timespec val;
> > };
> >
> > This way it is possible to pass the information of multiple clocks with
> > seconds and nanonseconds to clone3().
> >
> > To me this seems the better approach, but I am not totally convinced
> > that it is the right thing. If there are other ideas how to pass two
> > clock offsets with seconds and nanonseconds to clone3() I would be happy
> > to hear other ideas.

While I agree this does make the API cleaner, I am a little worried that
it risks killing some of the ideas we discussed for seccomp deep
inspection. In particular, having a pointer to variable-sized data
inside the struct means that now the cBPF program can't just be given a
copy of the struct data from userspace to check.

I'm sure it's a solveable problem (and it was one we were bound to run
into at some point), it'll just mean we'll need a more complicated way
of filtering such syscalls.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.28 kB)
signature.asc (235.00 B)
Download all attachments

2020-03-19 08:18:32

by Arnd Bergmann

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:

> With Arnd's idea of only using nanoseconds, timens_offset would then
> contain something like this:
>
> struct timens_offset {
> __aligned_s64 monotonic_offset_ns;
> __aligned_s64 boottime_offset_ns;
> };
>
> I kind of prefer adding boottime and monotonic directly to struct clone_args
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_s64 monotonic_offset_ns;
> + __aligned_s64 boottime_offset_ns;
> };

I would also prefer the second approach using two 64-bit integers
instead of a pointer, as it keeps the interface simpler to implement
and simpler to interpret by other tools.

Arnd

2020-03-19 10:32:28

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
>
> > With Arnd's idea of only using nanoseconds, timens_offset would then
> > contain something like this:
> >
> > struct timens_offset {
> > __aligned_s64 monotonic_offset_ns;
> > __aligned_s64 boottime_offset_ns;
> > };
> >
> > I kind of prefer adding boottime and monotonic directly to struct clone_args
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_s64 monotonic_offset_ns;
> > + __aligned_s64 boottime_offset_ns;
> > };
>
> I would also prefer the second approach using two 64-bit integers
> instead of a pointer, as it keeps the interface simpler to implement
> and simpler to interpret by other tools.

Why I don't like has two reasons. There's the scenario where we have
added new extensions after the new boottime member and then we introduce
another offset. Then you'd be looking at:

__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
+ __aligned_s64 monotonic_offset_ns;
+ __aligned_s64 boottime_offset_ns;
__aligned_s64 something_1
__aligned_s64 anything_2
+ __aligned_s64 sometime_offset_ns

which bothers me just by looking at it. That's in addition to adding two
new members to the struct when most people will never set CLONE_NEWTIME.
We'll also likely have more features in the future that will want to
pass down more info than we want to directly expose in struct
clone_args, e.g. for a long time I have been thinking about adding a
struct for CLONE_NEWUSER that allows you to specify the id mappings you
want the new user namespace to get. We surely don't want to force all
new info into the uppermost struct. So I'm not convinced we should here.

Christian

2020-03-20 18:35:03

by Andrei Vagin

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> >
> > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > contain something like this:
> > >
> > > struct timens_offset {
> > > __aligned_s64 monotonic_offset_ns;
> > > __aligned_s64 boottime_offset_ns;
> > > };
> > >
> > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > >
> > > __aligned_u64 tls;
> > > __aligned_u64 set_tid;
> > > __aligned_u64 set_tid_size;
> > > + __aligned_s64 monotonic_offset_ns;
> > > + __aligned_s64 boottime_offset_ns;
> > > };
> >
> > I would also prefer the second approach using two 64-bit integers
> > instead of a pointer, as it keeps the interface simpler to implement
> > and simpler to interpret by other tools.
>
> Why I don't like has two reasons. There's the scenario where we have
> added new extensions after the new boottime member and then we introduce
> another offset. Then you'd be looking at:
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_s64 monotonic_offset_ns;
> + __aligned_s64 boottime_offset_ns;
> __aligned_s64 something_1
> __aligned_s64 anything_2
> + __aligned_s64 sometime_offset_ns
>
> which bothers me just by looking at it. That's in addition to adding two
> new members to the struct when most people will never set CLONE_NEWTIME.
> We'll also likely have more features in the future that will want to
> pass down more info than we want to directly expose in struct
> clone_args, e.g. for a long time I have been thinking about adding a
> struct for CLONE_NEWUSER that allows you to specify the id mappings you
> want the new user namespace to get. We surely don't want to force all
> new info into the uppermost struct. So I'm not convinced we should here.

I think here we can start thinking about a netlink-like interface.

struct clone_args {
....
u64 attrs_offset;
}

struct clone_attr {
u16 cla_len;
u16 cla_type;
}


....

int parse_clone_attributes(struct kernel_clone_args *kargs, struct clone_args *args, size_t args_size)
{
u64 off = args->attrs_offset;

while (off < size) {
struct clone_attr *attr;

if (off + sizeof(struct clone_attr) uargs_size)
return -EINVAL;

attr = (struct clone_attr *) ((void *)args + off);

if (attr->cla_type > CLONE_ATTR_TYPE_MAX)
return -ENOSYS;

kargs->attrs[attr->cla_type] = CLONE_ATTR_DATA(attr);
off += CLONE_ATTR_LEN(attr);
}

return 0;
}

This interface doesn't suffer from problems what you enumerated before:

* clone_args contains only fields which are often used.
* per-feature attributes can be extended in a future without breaking
backward compatibility.
* unused features don't affect clone3 argument size.
* seccomp-friendly (I am not 100% sure about this)

2020-03-24 16:10:56

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> > On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> > >
> > > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > > contain something like this:
> > > >
> > > > struct timens_offset {
> > > > __aligned_s64 monotonic_offset_ns;
> > > > __aligned_s64 boottime_offset_ns;
> > > > };
> > > >
> > > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > > >
> > > > __aligned_u64 tls;
> > > > __aligned_u64 set_tid;
> > > > __aligned_u64 set_tid_size;
> > > > + __aligned_s64 monotonic_offset_ns;
> > > > + __aligned_s64 boottime_offset_ns;
> > > > };
> > >
> > > I would also prefer the second approach using two 64-bit integers
> > > instead of a pointer, as it keeps the interface simpler to implement
> > > and simpler to interpret by other tools.
> >
> > Why I don't like has two reasons. There's the scenario where we have
> > added new extensions after the new boottime member and then we introduce
> > another offset. Then you'd be looking at:
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_s64 monotonic_offset_ns;
> > + __aligned_s64 boottime_offset_ns;
> > __aligned_s64 something_1
> > __aligned_s64 anything_2
> > + __aligned_s64 sometime_offset_ns
> >
> > which bothers me just by looking at it. That's in addition to adding two
> > new members to the struct when most people will never set CLONE_NEWTIME.
> > We'll also likely have more features in the future that will want to
> > pass down more info than we want to directly expose in struct
> > clone_args, e.g. for a long time I have been thinking about adding a
> > struct for CLONE_NEWUSER that allows you to specify the id mappings you
> > want the new user namespace to get. We surely don't want to force all
> > new info into the uppermost struct. So I'm not convinced we should here.
>
> I think here we can start thinking about a netlink-like interface.

I think netlink is just not a great model for an API and I would not
want us to go down that route.

I kept thinking about this for a bit and I think that we will end up
growing more namespace-related functionality. So one thing that came to
my mind is the following layout:

struct {
struct {
__s64 monotonic;
__s64 boot;
} time;
} namespaces;

struct _clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
__aligned_u64 namespaces;
__aligned_u64 namespaces_size;
};

Then when we end up adding id mapping support for CLONE_NEWUSER we can
extend this with:

struct {
struct {
__aligned_u64 monotonic;
__aligned_u64 boot;
} time;

struct {
/* id mapping members */
} user;
} namespaces;

Thoughts? Other ideas?

Christian

2020-03-24 16:26:35

by Adrian Reber

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
> On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> > On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> > > On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > > > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> > > >
> > > > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > > > contain something like this:
> > > > >
> > > > > struct timens_offset {
> > > > > __aligned_s64 monotonic_offset_ns;
> > > > > __aligned_s64 boottime_offset_ns;
> > > > > };
> > > > >
> > > > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > > > >
> > > > > __aligned_u64 tls;
> > > > > __aligned_u64 set_tid;
> > > > > __aligned_u64 set_tid_size;
> > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > + __aligned_s64 boottime_offset_ns;
> > > > > };
> > > >
> > > > I would also prefer the second approach using two 64-bit integers
> > > > instead of a pointer, as it keeps the interface simpler to implement
> > > > and simpler to interpret by other tools.
> > >
> > > Why I don't like has two reasons. There's the scenario where we have
> > > added new extensions after the new boottime member and then we introduce
> > > another offset. Then you'd be looking at:
> > >
> > > __aligned_u64 tls;
> > > __aligned_u64 set_tid;
> > > __aligned_u64 set_tid_size;
> > > + __aligned_s64 monotonic_offset_ns;
> > > + __aligned_s64 boottime_offset_ns;
> > > __aligned_s64 something_1
> > > __aligned_s64 anything_2
> > > + __aligned_s64 sometime_offset_ns
> > >
> > > which bothers me just by looking at it. That's in addition to adding two
> > > new members to the struct when most people will never set CLONE_NEWTIME.
> > > We'll also likely have more features in the future that will want to
> > > pass down more info than we want to directly expose in struct
> > > clone_args, e.g. for a long time I have been thinking about adding a
> > > struct for CLONE_NEWUSER that allows you to specify the id mappings you
> > > want the new user namespace to get. We surely don't want to force all
> > > new info into the uppermost struct. So I'm not convinced we should here.
> >
> > I think here we can start thinking about a netlink-like interface.
>
> I think netlink is just not a great model for an API and I would not
> want us to go down that route.
>
> I kept thinking about this for a bit and I think that we will end up
> growing more namespace-related functionality. So one thing that came to
> my mind is the following layout:
>
> struct {
> struct {
> __s64 monotonic;
> __s64 boot;
> } time;
> } namespaces;
>
> struct _clone_args {
> __aligned_u64 flags;
> __aligned_u64 pidfd;
> __aligned_u64 child_tid;
> __aligned_u64 parent_tid;
> __aligned_u64 exit_signal;
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> __aligned_u64 namespaces;
> __aligned_u64 namespaces_size;
> };
>
> Then when we end up adding id mapping support for CLONE_NEWUSER we can
> extend this with:
>
> struct {
> struct {
> __aligned_u64 monotonic;
> __aligned_u64 boot;
> } time;
>
> struct {
> /* id mapping members */
> } user;
> } namespaces;
>
> Thoughts? Other ideas?

Works for me.

If we add the user namespace id mappings and then at some point a third
element for the time namespace appears it would also start to be mixed.
Just as you mentioned that a few mails ago.

> > > __aligned_u64 set_tid_size;
> > > + __aligned_s64 monotonic_offset_ns;
> > > + __aligned_s64 boottime_offset_ns;
> > > __aligned_s64 something_1
> > > __aligned_s64 anything_2
> > > + __aligned_s64 sometime_offset_ns

If we can live with something like this in the namespaces struct you
proposed, it works for me.

Adrian

2020-03-24 17:57:53

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
> On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
> > On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> > > On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> > > > On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > > > > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> > > > >
> > > > > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > > > > contain something like this:
> > > > > >
> > > > > > struct timens_offset {
> > > > > > __aligned_s64 monotonic_offset_ns;
> > > > > > __aligned_s64 boottime_offset_ns;
> > > > > > };
> > > > > >
> > > > > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > > > > >
> > > > > > __aligned_u64 tls;
> > > > > > __aligned_u64 set_tid;
> > > > > > __aligned_u64 set_tid_size;
> > > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > > + __aligned_s64 boottime_offset_ns;
> > > > > > };
> > > > >
> > > > > I would also prefer the second approach using two 64-bit integers
> > > > > instead of a pointer, as it keeps the interface simpler to implement
> > > > > and simpler to interpret by other tools.
> > > >
> > > > Why I don't like has two reasons. There's the scenario where we have
> > > > added new extensions after the new boottime member and then we introduce
> > > > another offset. Then you'd be looking at:
> > > >
> > > > __aligned_u64 tls;
> > > > __aligned_u64 set_tid;
> > > > __aligned_u64 set_tid_size;
> > > > + __aligned_s64 monotonic_offset_ns;
> > > > + __aligned_s64 boottime_offset_ns;
> > > > __aligned_s64 something_1
> > > > __aligned_s64 anything_2
> > > > + __aligned_s64 sometime_offset_ns
> > > >
> > > > which bothers me just by looking at it. That's in addition to adding two
> > > > new members to the struct when most people will never set CLONE_NEWTIME.
> > > > We'll also likely have more features in the future that will want to
> > > > pass down more info than we want to directly expose in struct
> > > > clone_args, e.g. for a long time I have been thinking about adding a
> > > > struct for CLONE_NEWUSER that allows you to specify the id mappings you
> > > > want the new user namespace to get. We surely don't want to force all
> > > > new info into the uppermost struct. So I'm not convinced we should here.
> > >
> > > I think here we can start thinking about a netlink-like interface.
> >
> > I think netlink is just not a great model for an API and I would not
> > want us to go down that route.
> >
> > I kept thinking about this for a bit and I think that we will end up
> > growing more namespace-related functionality. So one thing that came to
> > my mind is the following layout:
> >
> > struct {
> > struct {
> > __s64 monotonic;
> > __s64 boot;
> > } time;
> > } namespaces;
> >
> > struct _clone_args {
> > __aligned_u64 flags;
> > __aligned_u64 pidfd;
> > __aligned_u64 child_tid;
> > __aligned_u64 parent_tid;
> > __aligned_u64 exit_signal;
> > __aligned_u64 stack;
> > __aligned_u64 stack_size;
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > __aligned_u64 namespaces;
> > __aligned_u64 namespaces_size;
> > };
> >
> > Then when we end up adding id mapping support for CLONE_NEWUSER we can
> > extend this with:
> >
> > struct {
> > struct {
> > __aligned_u64 monotonic;
> > __aligned_u64 boot;

s/__aligned_u64/__s64/g

Sorry, leftover from my first draft.

> > } time;
> >
> > struct {
> > /* id mapping members */
> > } user;
> > } namespaces;
> >
> > Thoughts? Other ideas?
>
> Works for me.
>
> If we add the user namespace id mappings and then at some point a third
> element for the time namespace appears it would also start to be mixed.
> Just as you mentioned that a few mails ago.

I think you misunderstand me or I'm misunderstanding you. That new time
namespace member would go into struct time {} so

struct {
struct {
__s64 monotonic;
__s64 boot;
__s64 someothertime;
} time;

struct {
/* id mapping members */
} user;
} namespaces;

Christian

2020-03-25 08:01:03

by Adrian Reber

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
> On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
> > On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
> > > On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> > > > On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> > > > > On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > > > > > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> > > > > >
> > > > > > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > > > > > contain something like this:
> > > > > > >
> > > > > > > struct timens_offset {
> > > > > > > __aligned_s64 monotonic_offset_ns;
> > > > > > > __aligned_s64 boottime_offset_ns;
> > > > > > > };
> > > > > > >
> > > > > > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > > > > > >
> > > > > > > __aligned_u64 tls;
> > > > > > > __aligned_u64 set_tid;
> > > > > > > __aligned_u64 set_tid_size;
> > > > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > > > + __aligned_s64 boottime_offset_ns;
> > > > > > > };
> > > > > >
> > > > > > I would also prefer the second approach using two 64-bit integers
> > > > > > instead of a pointer, as it keeps the interface simpler to implement
> > > > > > and simpler to interpret by other tools.
> > > > >
> > > > > Why I don't like has two reasons. There's the scenario where we have
> > > > > added new extensions after the new boottime member and then we introduce
> > > > > another offset. Then you'd be looking at:
> > > > >
> > > > > __aligned_u64 tls;
> > > > > __aligned_u64 set_tid;
> > > > > __aligned_u64 set_tid_size;
> > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > + __aligned_s64 boottime_offset_ns;
> > > > > __aligned_s64 something_1
> > > > > __aligned_s64 anything_2
> > > > > + __aligned_s64 sometime_offset_ns
> > > > >
> > > > > which bothers me just by looking at it. That's in addition to adding two
> > > > > new members to the struct when most people will never set CLONE_NEWTIME.
> > > > > We'll also likely have more features in the future that will want to
> > > > > pass down more info than we want to directly expose in struct
> > > > > clone_args, e.g. for a long time I have been thinking about adding a
> > > > > struct for CLONE_NEWUSER that allows you to specify the id mappings you
> > > > > want the new user namespace to get. We surely don't want to force all
> > > > > new info into the uppermost struct. So I'm not convinced we should here.
> > > >
> > > > I think here we can start thinking about a netlink-like interface.
> > >
> > > I think netlink is just not a great model for an API and I would not
> > > want us to go down that route.
> > >
> > > I kept thinking about this for a bit and I think that we will end up
> > > growing more namespace-related functionality. So one thing that came to
> > > my mind is the following layout:
> > >
> > > struct {
> > > struct {
> > > __s64 monotonic;
> > > __s64 boot;
> > > } time;
> > > } namespaces;
> > >
> > > struct _clone_args {
> > > __aligned_u64 flags;
> > > __aligned_u64 pidfd;
> > > __aligned_u64 child_tid;
> > > __aligned_u64 parent_tid;
> > > __aligned_u64 exit_signal;
> > > __aligned_u64 stack;
> > > __aligned_u64 stack_size;
> > > __aligned_u64 tls;
> > > __aligned_u64 set_tid;
> > > __aligned_u64 set_tid_size;
> > > __aligned_u64 namespaces;
> > > __aligned_u64 namespaces_size;
> > > };
> > >
> > > Then when we end up adding id mapping support for CLONE_NEWUSER we can
> > > extend this with:
> > >
> > > struct {
> > > struct {
> > > __aligned_u64 monotonic;
> > > __aligned_u64 boot;
>
> s/__aligned_u64/__s64/g
>
> Sorry, leftover from my first draft.
>
> > > } time;
> > >
> > > struct {
> > > /* id mapping members */
> > > } user;
> > > } namespaces;
> > >
> > > Thoughts? Other ideas?
> >
> > Works for me.
> >
> > If we add the user namespace id mappings and then at some point a third
> > element for the time namespace appears it would also start to be mixed.
> > Just as you mentioned that a few mails ago.
>
> I think you misunderstand me or I'm misunderstanding you. That new time
> namespace member would go into struct time {} so
>
> struct {
> struct {
> __s64 monotonic;
> __s64 boot;
> __s64 someothertime;
> } time;
>
> struct {
> /* id mapping members */
> } user;
> } namespaces;

My question was about how does the kernel know how 'struct namespaces'
is structured. How can an older kernel (which only is aware of two
clocks) deal with a, like in this example, third clock. Will the size
'__aligned_u64 namespaces_size' be used for versioning?

Adrian

2020-03-25 11:28:41

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Wed, Mar 25, 2020 at 08:58:36AM +0100, Adrian Reber wrote:
> On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
> > On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
> > > On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
> > > > On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> > > > > On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> > > > > > On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> > > > > > > On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> > > > > > >
> > > > > > > > With Arnd's idea of only using nanoseconds, timens_offset would then
> > > > > > > > contain something like this:
> > > > > > > >
> > > > > > > > struct timens_offset {
> > > > > > > > __aligned_s64 monotonic_offset_ns;
> > > > > > > > __aligned_s64 boottime_offset_ns;
> > > > > > > > };
> > > > > > > >
> > > > > > > > I kind of prefer adding boottime and monotonic directly to struct clone_args
> > > > > > > >
> > > > > > > > __aligned_u64 tls;
> > > > > > > > __aligned_u64 set_tid;
> > > > > > > > __aligned_u64 set_tid_size;
> > > > > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > > > > + __aligned_s64 boottime_offset_ns;
> > > > > > > > };
> > > > > > >
> > > > > > > I would also prefer the second approach using two 64-bit integers
> > > > > > > instead of a pointer, as it keeps the interface simpler to implement
> > > > > > > and simpler to interpret by other tools.
> > > > > >
> > > > > > Why I don't like has two reasons. There's the scenario where we have
> > > > > > added new extensions after the new boottime member and then we introduce
> > > > > > another offset. Then you'd be looking at:
> > > > > >
> > > > > > __aligned_u64 tls;
> > > > > > __aligned_u64 set_tid;
> > > > > > __aligned_u64 set_tid_size;
> > > > > > + __aligned_s64 monotonic_offset_ns;
> > > > > > + __aligned_s64 boottime_offset_ns;
> > > > > > __aligned_s64 something_1
> > > > > > __aligned_s64 anything_2
> > > > > > + __aligned_s64 sometime_offset_ns
> > > > > >
> > > > > > which bothers me just by looking at it. That's in addition to adding two
> > > > > > new members to the struct when most people will never set CLONE_NEWTIME.
> > > > > > We'll also likely have more features in the future that will want to
> > > > > > pass down more info than we want to directly expose in struct
> > > > > > clone_args, e.g. for a long time I have been thinking about adding a
> > > > > > struct for CLONE_NEWUSER that allows you to specify the id mappings you
> > > > > > want the new user namespace to get. We surely don't want to force all
> > > > > > new info into the uppermost struct. So I'm not convinced we should here.
> > > > >
> > > > > I think here we can start thinking about a netlink-like interface.
> > > >
> > > > I think netlink is just not a great model for an API and I would not
> > > > want us to go down that route.
> > > >
> > > > I kept thinking about this for a bit and I think that we will end up
> > > > growing more namespace-related functionality. So one thing that came to
> > > > my mind is the following layout:
> > > >
> > > > struct {
> > > > struct {
> > > > __s64 monotonic;
> > > > __s64 boot;
> > > > } time;
> > > > } namespaces;
> > > >
> > > > struct _clone_args {
> > > > __aligned_u64 flags;
> > > > __aligned_u64 pidfd;
> > > > __aligned_u64 child_tid;
> > > > __aligned_u64 parent_tid;
> > > > __aligned_u64 exit_signal;
> > > > __aligned_u64 stack;
> > > > __aligned_u64 stack_size;
> > > > __aligned_u64 tls;
> > > > __aligned_u64 set_tid;
> > > > __aligned_u64 set_tid_size;
> > > > __aligned_u64 namespaces;
> > > > __aligned_u64 namespaces_size;
> > > > };
> > > >
> > > > Then when we end up adding id mapping support for CLONE_NEWUSER we can
> > > > extend this with:
> > > >
> > > > struct {
> > > > struct {
> > > > __aligned_u64 monotonic;
> > > > __aligned_u64 boot;
> >
> > s/__aligned_u64/__s64/g
> >
> > Sorry, leftover from my first draft.
> >
> > > > } time;
> > > >
> > > > struct {
> > > > /* id mapping members */
> > > > } user;
> > > > } namespaces;
> > > >
> > > > Thoughts? Other ideas?
> > >
> > > Works for me.
> > >
> > > If we add the user namespace id mappings and then at some point a third
> > > element for the time namespace appears it would also start to be mixed.
> > > Just as you mentioned that a few mails ago.
> >
> > I think you misunderstand me or I'm misunderstanding you. That new time
> > namespace member would go into struct time {} so
> >
> > struct {
> > struct {
> > __s64 monotonic;
> > __s64 boot;
> > __s64 someothertime;
> > } time;
> >
> > struct {
> > /* id mapping members */
> > } user;
> > } namespaces;
>
> My question was about how does the kernel know how 'struct namespaces'
> is structured. How can an older kernel (which only is aware of two
> clocks) deal with a, like in this example, third clock. Will the size
> '__aligned_u64 namespaces_size' be used for versioning?

Yes, that would be the idea.

I don't want to give the impression that I think this is the best
solution. It's one solution that I think is feasible. But if we have
something better that is also future proof I'm happy to hear ideas.

Christian

Subject: Re: clone3: allow creation of time namespace with offset

On 3/25/20 12:26 PM, Christian Brauner wrote:
> On Wed, Mar 25, 2020 at 08:58:36AM +0100, Adrian Reber wrote:
>> On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
>>> On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
>>>> On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
>>>>> On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
>>>>>> On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
>>>>>>> On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
>>>>>>>> On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> With Arnd's idea of only using nanoseconds, timens_offset would then
>>>>>>>>> contain something like this:
>>>>>>>>>
>>>>>>>>> struct timens_offset {
>>>>>>>>> __aligned_s64 monotonic_offset_ns;
>>>>>>>>> __aligned_s64 boottime_offset_ns;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> I kind of prefer adding boottime and monotonic directly to struct clone_args
>>>>>>>>>
>>>>>>>>> __aligned_u64 tls;
>>>>>>>>> __aligned_u64 set_tid;
>>>>>>>>> __aligned_u64 set_tid_size;
>>>>>>>>> + __aligned_s64 monotonic_offset_ns;
>>>>>>>>> + __aligned_s64 boottime_offset_ns;
>>>>>>>>> };
>>>>>>>>
>>>>>>>> I would also prefer the second approach using two 64-bit integers
>>>>>>>> instead of a pointer, as it keeps the interface simpler to implement
>>>>>>>> and simpler to interpret by other tools.
>>>>>>>
>>>>>>> Why I don't like has two reasons. There's the scenario where we have
>>>>>>> added new extensions after the new boottime member and then we introduce
>>>>>>> another offset. Then you'd be looking at:
>>>>>>>
>>>>>>> __aligned_u64 tls;
>>>>>>> __aligned_u64 set_tid;
>>>>>>> __aligned_u64 set_tid_size;
>>>>>>> + __aligned_s64 monotonic_offset_ns;
>>>>>>> + __aligned_s64 boottime_offset_ns;
>>>>>>> __aligned_s64 something_1
>>>>>>> __aligned_s64 anything_2
>>>>>>> + __aligned_s64 sometime_offset_ns
>>>>>>>
>>>>>>> which bothers me just by looking at it. That's in addition to adding two
>>>>>>> new members to the struct when most people will never set CLONE_NEWTIME.
>>>>>>> We'll also likely have more features in the future that will want to
>>>>>>> pass down more info than we want to directly expose in struct
>>>>>>> clone_args, e.g. for a long time I have been thinking about adding a
>>>>>>> struct for CLONE_NEWUSER that allows you to specify the id mappings you
>>>>>>> want the new user namespace to get. We surely don't want to force all
>>>>>>> new info into the uppermost struct. So I'm not convinced we should here.
>>>>>>
>>>>>> I think here we can start thinking about a netlink-like interface.
>>>>>
>>>>> I think netlink is just not a great model for an API and I would not
>>>>> want us to go down that route.
>>>>>
>>>>> I kept thinking about this for a bit and I think that we will end up
>>>>> growing more namespace-related functionality. So one thing that came to
>>>>> my mind is the following layout:
>>>>>
>>>>> struct {
>>>>> struct {
>>>>> __s64 monotonic;
>>>>> __s64 boot;
>>>>> } time;
>>>>> } namespaces;
>>>>>
>>>>> struct _clone_args {
>>>>> __aligned_u64 flags;
>>>>> __aligned_u64 pidfd;
>>>>> __aligned_u64 child_tid;
>>>>> __aligned_u64 parent_tid;
>>>>> __aligned_u64 exit_signal;
>>>>> __aligned_u64 stack;
>>>>> __aligned_u64 stack_size;
>>>>> __aligned_u64 tls;
>>>>> __aligned_u64 set_tid;
>>>>> __aligned_u64 set_tid_size;
>>>>> __aligned_u64 namespaces;
>>>>> __aligned_u64 namespaces_size;
>>>>> };
>>>>>
>>>>> Then when we end up adding id mapping support for CLONE_NEWUSER we can
>>>>> extend this with:
>>>>>
>>>>> struct {
>>>>> struct {
>>>>> __aligned_u64 monotonic;
>>>>> __aligned_u64 boot;
>>>
>>> s/__aligned_u64/__s64/g
>>>
>>> Sorry, leftover from my first draft.
>>>
>>>>> } time;
>>>>>
>>>>> struct {
>>>>> /* id mapping members */
>>>>> } user;
>>>>> } namespaces;
>>>>>
>>>>> Thoughts? Other ideas?
>>>>
>>>> Works for me.
>>>>
>>>> If we add the user namespace id mappings and then at some point a third
>>>> element for the time namespace appears it would also start to be mixed.
>>>> Just as you mentioned that a few mails ago.
>>>
>>> I think you misunderstand me or I'm misunderstanding you. That new time
>>> namespace member would go into struct time {} so
>>>
>>> struct {
>>> struct {
>>> __s64 monotonic;
>>> __s64 boot;
>>> __s64 someothertime;
>>> } time;
>>>
>>> struct {
>>> /* id mapping members */
>>> } user;
>>> } namespaces;

So far, this seems like the least worst approach to me :-).

I think it's reasonable to assume that there will be another
time NS offset to add one day. I don't think anyone expected
CLOCK_BOOTIME (added in 2011) at the time that CLOCK_MONOTONIC
appeared (as part of the POSIX timers API in Linux 2.6.0 2003);
similarly, we probably can't conceive now what clock might be
added in the future that should also be governed by time
namespaces.


But...

>> My question was about how does the kernel know how 'struct namespaces'
>> is structured. How can an older kernel (which only is aware of two
>> clocks) deal with a, like in this example, third clock. Will the size
>> '__aligned_u64 namespaces_size' be used for versioning?
>
> Yes, that would be the idea.

The idea implied here (if I understand correctly) of "binary chop
on the structure size to see what your kernel supports" is pretty
clunky, IMO. It's worth at least considering alternatives.
For example, seccomp has a number of interesting interfaces to
discover what the running kernel supports [1]. Maybe it is worth
considering something similar for clone3()?

Cheers,

Michael

[1]
/proc/sys/kernel/seccomp/actions_avail (see seccomp(2))
SECCOMP_GET_ACTION_AVAIL (see secommp(2))
cgroups also provides something similar in the form of
/sys/kernel/cgroup/features (see cgroups(7))

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-04-01 11:47:13

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Wed, Apr 01, 2020 at 01:40:25PM +0200, Michael Kerrisk (man-pages) wrote:
> On 3/25/20 12:26 PM, Christian Brauner wrote:
> > On Wed, Mar 25, 2020 at 08:58:36AM +0100, Adrian Reber wrote:
> >> On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
> >>> On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
> >>>> On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
> >>>>> On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
> >>>>>> On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
> >>>>>>> On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
> >>>>>>>> On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> With Arnd's idea of only using nanoseconds, timens_offset would then
> >>>>>>>>> contain something like this:
> >>>>>>>>>
> >>>>>>>>> struct timens_offset {
> >>>>>>>>> __aligned_s64 monotonic_offset_ns;
> >>>>>>>>> __aligned_s64 boottime_offset_ns;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> I kind of prefer adding boottime and monotonic directly to struct clone_args
> >>>>>>>>>
> >>>>>>>>> __aligned_u64 tls;
> >>>>>>>>> __aligned_u64 set_tid;
> >>>>>>>>> __aligned_u64 set_tid_size;
> >>>>>>>>> + __aligned_s64 monotonic_offset_ns;
> >>>>>>>>> + __aligned_s64 boottime_offset_ns;
> >>>>>>>>> };
> >>>>>>>>
> >>>>>>>> I would also prefer the second approach using two 64-bit integers
> >>>>>>>> instead of a pointer, as it keeps the interface simpler to implement
> >>>>>>>> and simpler to interpret by other tools.
> >>>>>>>
> >>>>>>> Why I don't like has two reasons. There's the scenario where we have
> >>>>>>> added new extensions after the new boottime member and then we introduce
> >>>>>>> another offset. Then you'd be looking at:
> >>>>>>>
> >>>>>>> __aligned_u64 tls;
> >>>>>>> __aligned_u64 set_tid;
> >>>>>>> __aligned_u64 set_tid_size;
> >>>>>>> + __aligned_s64 monotonic_offset_ns;
> >>>>>>> + __aligned_s64 boottime_offset_ns;
> >>>>>>> __aligned_s64 something_1
> >>>>>>> __aligned_s64 anything_2
> >>>>>>> + __aligned_s64 sometime_offset_ns
> >>>>>>>
> >>>>>>> which bothers me just by looking at it. That's in addition to adding two
> >>>>>>> new members to the struct when most people will never set CLONE_NEWTIME.
> >>>>>>> We'll also likely have more features in the future that will want to
> >>>>>>> pass down more info than we want to directly expose in struct
> >>>>>>> clone_args, e.g. for a long time I have been thinking about adding a
> >>>>>>> struct for CLONE_NEWUSER that allows you to specify the id mappings you
> >>>>>>> want the new user namespace to get. We surely don't want to force all
> >>>>>>> new info into the uppermost struct. So I'm not convinced we should here.
> >>>>>>
> >>>>>> I think here we can start thinking about a netlink-like interface.
> >>>>>
> >>>>> I think netlink is just not a great model for an API and I would not
> >>>>> want us to go down that route.
> >>>>>
> >>>>> I kept thinking about this for a bit and I think that we will end up
> >>>>> growing more namespace-related functionality. So one thing that came to
> >>>>> my mind is the following layout:
> >>>>>
> >>>>> struct {
> >>>>> struct {
> >>>>> __s64 monotonic;
> >>>>> __s64 boot;
> >>>>> } time;
> >>>>> } namespaces;
> >>>>>
> >>>>> struct _clone_args {
> >>>>> __aligned_u64 flags;
> >>>>> __aligned_u64 pidfd;
> >>>>> __aligned_u64 child_tid;
> >>>>> __aligned_u64 parent_tid;
> >>>>> __aligned_u64 exit_signal;
> >>>>> __aligned_u64 stack;
> >>>>> __aligned_u64 stack_size;
> >>>>> __aligned_u64 tls;
> >>>>> __aligned_u64 set_tid;
> >>>>> __aligned_u64 set_tid_size;
> >>>>> __aligned_u64 namespaces;
> >>>>> __aligned_u64 namespaces_size;
> >>>>> };
> >>>>>
> >>>>> Then when we end up adding id mapping support for CLONE_NEWUSER we can
> >>>>> extend this with:
> >>>>>
> >>>>> struct {
> >>>>> struct {
> >>>>> __aligned_u64 monotonic;
> >>>>> __aligned_u64 boot;
> >>>
> >>> s/__aligned_u64/__s64/g
> >>>
> >>> Sorry, leftover from my first draft.
> >>>
> >>>>> } time;
> >>>>>
> >>>>> struct {
> >>>>> /* id mapping members */
> >>>>> } user;
> >>>>> } namespaces;
> >>>>>
> >>>>> Thoughts? Other ideas?
> >>>>
> >>>> Works for me.
> >>>>
> >>>> If we add the user namespace id mappings and then at some point a third
> >>>> element for the time namespace appears it would also start to be mixed.
> >>>> Just as you mentioned that a few mails ago.
> >>>
> >>> I think you misunderstand me or I'm misunderstanding you. That new time
> >>> namespace member would go into struct time {} so
> >>>
> >>> struct {
> >>> struct {
> >>> __s64 monotonic;
> >>> __s64 boot;
> >>> __s64 someothertime;
> >>> } time;
> >>>
> >>> struct {
> >>> /* id mapping members */
> >>> } user;
> >>> } namespaces;
>
> So far, this seems like the least worst approach to me :-).
>
> I think it's reasonable to assume that there will be another
> time NS offset to add one day. I don't think anyone expected
> CLOCK_BOOTIME (added in 2011) at the time that CLOCK_MONOTONIC
> appeared (as part of the POSIX timers API in Linux 2.6.0 2003);
> similarly, we probably can't conceive now what clock might be
> added in the future that should also be governed by time
> namespaces.
>
>
> But...
>
> >> My question was about how does the kernel know how 'struct namespaces'
> >> is structured. How can an older kernel (which only is aware of two
> >> clocks) deal with a, like in this example, third clock. Will the size
> >> '__aligned_u64 namespaces_size' be used for versioning?
> >
> > Yes, that would be the idea.
>
> The idea implied here (if I understand correctly) of "binary chop
> on the structure size to see what your kernel supports" is pretty
> clunky, IMO. It's worth at least considering alternatives.
> For example, seccomp has a number of interesting interfaces to
> discover what the running kernel supports [1]. Maybe it is worth
> considering something similar for clone3()?

Yes, that's definitely something I've been thinking about and I've also
discussed this publicly on the libc-alpha mailing list since Florian
(Weimer) wants this for glibc. _But_ I didn't want to simply go my own
way and so Aleksa and I stuck our heads together and tried to come up
with a generic way that e.g. would work for both openat2() and for
clone3(), in fact we tried to come up with a generic way that could
potentially be used by any syscall that makes use of the new
copy_struct_from_user() helper that Aleksa and I pushed upstream. We
haven't yet gotten around to actually try and implementing our ideas.
I'll try to get around to this during this cycle (/me makes note to "sit
down" with Aleksa). We'll definitely take a look at seccomp and cgroups
too.

Christian

>
> Cheers,
>
> Michael
>
> [1]
> /proc/sys/kernel/seccomp/actions_avail (see seccomp(2))
> SECCOMP_GET_ACTION_AVAIL (see secommp(2))
> cgroups also provides something similar in the form of
> /sys/kernel/cgroup/features (see cgroups(7))
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: clone3: allow creation of time namespace with offset

On 4/1/20 1:46 PM, Christian Brauner wrote:
> On Wed, Apr 01, 2020 at 01:40:25PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 3/25/20 12:26 PM, Christian Brauner wrote:
>>> On Wed, Mar 25, 2020 at 08:58:36AM +0100, Adrian Reber wrote:
>>>> On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
>>>>> On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
>>>>>> On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
>>>>>>> On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
>>>>>>>> On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
>>>>>>>>> On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
>>>>>>>>>> On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> With Arnd's idea of only using nanoseconds, timens_offset would then
>>>>>>>>>>> contain something like this:
>>>>>>>>>>>
>>>>>>>>>>> struct timens_offset {
>>>>>>>>>>> __aligned_s64 monotonic_offset_ns;
>>>>>>>>>>> __aligned_s64 boottime_offset_ns;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> I kind of prefer adding boottime and monotonic directly to struct clone_args
>>>>>>>>>>>
>>>>>>>>>>> __aligned_u64 tls;
>>>>>>>>>>> __aligned_u64 set_tid;
>>>>>>>>>>> __aligned_u64 set_tid_size;
>>>>>>>>>>> + __aligned_s64 monotonic_offset_ns;
>>>>>>>>>>> + __aligned_s64 boottime_offset_ns;
>>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> I would also prefer the second approach using two 64-bit integers
>>>>>>>>>> instead of a pointer, as it keeps the interface simpler to implement
>>>>>>>>>> and simpler to interpret by other tools.
>>>>>>>>>
>>>>>>>>> Why I don't like has two reasons. There's the scenario where we have
>>>>>>>>> added new extensions after the new boottime member and then we introduce
>>>>>>>>> another offset. Then you'd be looking at:
>>>>>>>>>
>>>>>>>>> __aligned_u64 tls;
>>>>>>>>> __aligned_u64 set_tid;
>>>>>>>>> __aligned_u64 set_tid_size;
>>>>>>>>> + __aligned_s64 monotonic_offset_ns;
>>>>>>>>> + __aligned_s64 boottime_offset_ns;
>>>>>>>>> __aligned_s64 something_1
>>>>>>>>> __aligned_s64 anything_2
>>>>>>>>> + __aligned_s64 sometime_offset_ns
>>>>>>>>>
>>>>>>>>> which bothers me just by looking at it. That's in addition to adding two
>>>>>>>>> new members to the struct when most people will never set CLONE_NEWTIME.
>>>>>>>>> We'll also likely have more features in the future that will want to
>>>>>>>>> pass down more info than we want to directly expose in struct
>>>>>>>>> clone_args, e.g. for a long time I have been thinking about adding a
>>>>>>>>> struct for CLONE_NEWUSER that allows you to specify the id mappings you
>>>>>>>>> want the new user namespace to get. We surely don't want to force all
>>>>>>>>> new info into the uppermost struct. So I'm not convinced we should here.
>>>>>>>>
>>>>>>>> I think here we can start thinking about a netlink-like interface.
>>>>>>>
>>>>>>> I think netlink is just not a great model for an API and I would not
>>>>>>> want us to go down that route.
>>>>>>>
>>>>>>> I kept thinking about this for a bit and I think that we will end up
>>>>>>> growing more namespace-related functionality. So one thing that came to
>>>>>>> my mind is the following layout:
>>>>>>>
>>>>>>> struct {
>>>>>>> struct {
>>>>>>> __s64 monotonic;
>>>>>>> __s64 boot;
>>>>>>> } time;
>>>>>>> } namespaces;
>>>>>>>
>>>>>>> struct _clone_args {
>>>>>>> __aligned_u64 flags;
>>>>>>> __aligned_u64 pidfd;
>>>>>>> __aligned_u64 child_tid;
>>>>>>> __aligned_u64 parent_tid;
>>>>>>> __aligned_u64 exit_signal;
>>>>>>> __aligned_u64 stack;
>>>>>>> __aligned_u64 stack_size;
>>>>>>> __aligned_u64 tls;
>>>>>>> __aligned_u64 set_tid;
>>>>>>> __aligned_u64 set_tid_size;
>>>>>>> __aligned_u64 namespaces;
>>>>>>> __aligned_u64 namespaces_size;
>>>>>>> };
>>>>>>>
>>>>>>> Then when we end up adding id mapping support for CLONE_NEWUSER we can
>>>>>>> extend this with:
>>>>>>>
>>>>>>> struct {
>>>>>>> struct {
>>>>>>> __aligned_u64 monotonic;
>>>>>>> __aligned_u64 boot;
>>>>>
>>>>> s/__aligned_u64/__s64/g
>>>>>
>>>>> Sorry, leftover from my first draft.
>>>>>
>>>>>>> } time;
>>>>>>>
>>>>>>> struct {
>>>>>>> /* id mapping members */
>>>>>>> } user;
>>>>>>> } namespaces;
>>>>>>>
>>>>>>> Thoughts? Other ideas?
>>>>>>
>>>>>> Works for me.
>>>>>>
>>>>>> If we add the user namespace id mappings and then at some point a third
>>>>>> element for the time namespace appears it would also start to be mixed.
>>>>>> Just as you mentioned that a few mails ago.
>>>>>
>>>>> I think you misunderstand me or I'm misunderstanding you. That new time
>>>>> namespace member would go into struct time {} so
>>>>>
>>>>> struct {
>>>>> struct {
>>>>> __s64 monotonic;
>>>>> __s64 boot;
>>>>> __s64 someothertime;
>>>>> } time;
>>>>>
>>>>> struct {
>>>>> /* id mapping members */
>>>>> } user;
>>>>> } namespaces;
>>
>> So far, this seems like the least worst approach to me :-).
>>
>> I think it's reasonable to assume that there will be another
>> time NS offset to add one day. I don't think anyone expected
>> CLOCK_BOOTIME (added in 2011) at the time that CLOCK_MONOTONIC
>> appeared (as part of the POSIX timers API in Linux 2.6.0 2003);
>> similarly, we probably can't conceive now what clock might be
>> added in the future that should also be governed by time
>> namespaces.
>>
>>
>> But...
>>
>>>> My question was about how does the kernel know how 'struct namespaces'
>>>> is structured. How can an older kernel (which only is aware of two
>>>> clocks) deal with a, like in this example, third clock. Will the size
>>>> '__aligned_u64 namespaces_size' be used for versioning?
>>>
>>> Yes, that would be the idea.
>>
>> The idea implied here (if I understand correctly) of "binary chop
>> on the structure size to see what your kernel supports" is pretty
>> clunky, IMO. It's worth at least considering alternatives.
>> For example, seccomp has a number of interesting interfaces to
>> discover what the running kernel supports [1]. Maybe it is worth
>> considering something similar for clone3()?
>
> Yes, that's definitely something I've been thinking about and I've also
> discussed this publicly on the libc-alpha mailing list since Florian
> (Weimer) wants this for glibc. _But_ I didn't want to simply go my own
> way and so Aleksa and I stuck our heads together and tried to come up
> with a generic way that e.g. would work for both openat2() and for
> clone3(), in fact we tried to come up with a generic way that could
> potentially be used by any syscall that makes use of the new
> copy_struct_from_user() helper that Aleksa and I pushed upstream. We
> haven't yet gotten around to actually try and implementing our ideas.
> I'll try to get around to this during this cycle (/me makes note to "sit
> down" with Aleksa). We'll definitely take a look at seccomp and cgroups
> too.

Great! Thanks for thinking about this. Just by the way, on
discussions such as the one you had with Florian, it would
be great if you CCed linux-api@ (and also for the patch series
that implements this idea).

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: clone3: allow creation of time namespace with offset

Hi Adrian,

If there was a revision to this patch, I missed it. Is there still a
plan to bring CLONE_NEWTIME to clone3()?

Thanks,

Michael

On Tue, 17 Mar 2020 at 09:32, Adrian Reber <[email protected]> wrote:
>
> This is an attempt to add time namespace support to clone3(). I am not
> really sure which way clone3() should handle time namespaces. The time
> namespace through /proc cannot be used with clone3() because the offsets
> for the time namespace need to be written before a process has been
> created in that time namespace. This means it is necessary to somehow
> tell clone3() the offsets for the clocks.
>
> The time namespace offers the possibility to set offsets for
> CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> '__aligned_u64 boottime_offset'. The problem with this approach was that
> it was not possible to set nanoseconds for the clocks in the time
> namespace.
>
> One of the motivations for clone3() with CLONE_NEWTIME was to enable
> CRIU to restore a process in a time namespace with the corresponding
> offsets. And although the nanosecond value can probably never be
> restored to the same value it had during checkpointing, because the
> clock keeps on running between CRIU pausing all processes and CRIU
> actually reading the value of the clocks, the nanosecond value is still
> necessary for CRIU to not restore a process where the clock jumps back
> due to CRIU restoring it with a nanonsecond value that is too small.
>
> Requiring nanoseconds as well as seconds for two clocks during clone3()
> means that it would require 4 additional members to 'struct clone_args':
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 boottime_offset_seconds;
> + __aligned_u64 boottime_offset_nanoseconds;
> + __aligned_u64 monotonic_offset_seconds;
> + __aligned_u64 monotonic_offset_nanoseconds;
> };
>
> To avoid four additional members to 'struct clone_args' this patchset
> uses another approach:
>
> __aligned_u64 tls;
> __aligned_u64 set_tid;
> __aligned_u64 set_tid_size;
> + __aligned_u64 timens_offset;
> + __aligned_u64 timens_offset_size;
> };
>
> timens_offset is a pointer to an array just as previously done with
> set_tid and timens_offset_size is the size of the array.
>
> The timens_offset array is expected to contain a struct like this:
>
> struct set_timens_offset {
> int clockid;
> struct timespec val;
> };
>
> This way it is possible to pass the information of multiple clocks with
> seconds and nanonseconds to clone3().
>
> To me this seems the better approach, but I am not totally convinced
> that it is the right thing. If there are other ideas how to pass two
> clock offsets with seconds and nanonseconds to clone3() I would be happy
> to hear other ideas.
>
> Adrian
>
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-05-29 15:15:20

by Adrian Reber

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Fri, May 29, 2020 at 02:26:13PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Adrian,
>
> If there was a revision to this patch, I missed it. Is there still a
> plan to bring CLONE_NEWTIME to clone3()?

Good that you ask. The discussion ended at the point that we do not have
a way to figure out what a syscall supports from user-space. I talked a
bit with Christian about it and he mentioned that there were some ideas
floating around how to do that. At that point it made more sense to me
to wait for such a solution to appear before continuing the clone3()
time namespace work.

Adrian

> On Tue, 17 Mar 2020 at 09:32, Adrian Reber <[email protected]> wrote:
> >
> > This is an attempt to add time namespace support to clone3(). I am not
> > really sure which way clone3() should handle time namespaces. The time
> > namespace through /proc cannot be used with clone3() because the offsets
> > for the time namespace need to be written before a process has been
> > created in that time namespace. This means it is necessary to somehow
> > tell clone3() the offsets for the clocks.
> >
> > The time namespace offers the possibility to set offsets for
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> > 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> > '__aligned_u64 boottime_offset'. The problem with this approach was that
> > it was not possible to set nanoseconds for the clocks in the time
> > namespace.
> >
> > One of the motivations for clone3() with CLONE_NEWTIME was to enable
> > CRIU to restore a process in a time namespace with the corresponding
> > offsets. And although the nanosecond value can probably never be
> > restored to the same value it had during checkpointing, because the
> > clock keeps on running between CRIU pausing all processes and CRIU
> > actually reading the value of the clocks, the nanosecond value is still
> > necessary for CRIU to not restore a process where the clock jumps back
> > due to CRIU restoring it with a nanonsecond value that is too small.
> >
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 boottime_offset_seconds;
> > + __aligned_u64 boottime_offset_nanoseconds;
> > + __aligned_u64 monotonic_offset_seconds;
> > + __aligned_u64 monotonic_offset_nanoseconds;
> > };
> >
> > To avoid four additional members to 'struct clone_args' this patchset
> > uses another approach:
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > + __aligned_u64 timens_offset;
> > + __aligned_u64 timens_offset_size;
> > };
> >
> > timens_offset is a pointer to an array just as previously done with
> > set_tid and timens_offset_size is the size of the array.
> >
> > The timens_offset array is expected to contain a struct like this:
> >
> > struct set_timens_offset {
> > int clockid;
> > struct timespec val;
> > };
> >
> > This way it is possible to pass the information of multiple clocks with
> > seconds and nanonseconds to clone3().
> >
> > To me this seems the better approach, but I am not totally convinced
> > that it is the right thing. If there are other ideas how to pass two
> > clock offsets with seconds and nanonseconds to clone3() I would be happy
> > to hear other ideas.
> >
> > Adrian
> >
> >
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
>

2020-05-29 15:18:09

by Christian Brauner

[permalink] [raw]
Subject: Re: clone3: allow creation of time namespace with offset

On Fri, May 29, 2020 at 05:10:29PM +0200, Adrian Reber wrote:
> On Fri, May 29, 2020 at 02:26:13PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hi Adrian,
> >
> > If there was a revision to this patch, I missed it. Is there still a
> > plan to bring CLONE_NEWTIME to clone3()?
>
> Good that you ask. The discussion ended at the point that we do not have
> a way to figure out what a syscall supports from user-space. I talked a
> bit with Christian about it and he mentioned that there were some ideas
> floating around how to do that. At that point it made more sense to me
> to wait for such a solution to appear before continuing the clone3()
> time namespace work.

We need to have a little talk at ksummit about some aspects of syscall
design. I'm writing something up for this. But we're not in a rush
anyway, I take it.

Christian