Discussions around time virtualization have been going on for a long time.
The first attempt to implement a time namespace was made in 2006 by Jeff Dike.
Since then, the topic has come up on and off in various discussions.
There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.
“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)
The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
their start points are not defined and differ on each running
system. When a container is migrated from one node to another, all
clocks have to be restored to a consistent state; in other words, they
have to continue running from the same points at which they were
dumped.
The main idea behind this patch set is adding per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
time of a clock, a namespace offset is added to the current value of
this clock on a host and the sum is returned.
All offsets are placed on a separate page, which allows us to map it as
part of vvar into user processes and use the offsets from vdso calls.
Currently, offsets are implemented for the CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks.
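The per-namespace offset arithmetic can be sketched in userspace C as
follows. This is only an illustration under stated assumptions: ts_add(),
ts_sub(), ns_offset_for_restore() and ns_clock_read() are hypothetical
helper names, not functions from this patch set, and all inputs are assumed
to be normalized (0 <= tv_nsec < 1000000000).

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: a + b, keeping tv_nsec in [0, NSEC_PER_SEC). */
static struct timespec ts_add(struct timespec a, struct timespec b)
{
	struct timespec r;

	assert(a.tv_nsec >= 0 && a.tv_nsec < NSEC_PER_SEC);
	assert(b.tv_nsec >= 0 && b.tv_nsec < NSEC_PER_SEC);

	r.tv_sec = a.tv_sec + b.tv_sec;
	r.tv_nsec = a.tv_nsec + b.tv_nsec;
	if (r.tv_nsec >= NSEC_PER_SEC) {
		r.tv_sec++;
		r.tv_nsec -= NSEC_PER_SEC;
	}
	return r;
}

/* Hypothetical helper: a - b; tv_sec may be negative, tv_nsec never is. */
static struct timespec ts_sub(struct timespec a, struct timespec b)
{
	struct timespec r;

	assert(a.tv_nsec >= 0 && a.tv_nsec < NSEC_PER_SEC);
	assert(b.tv_nsec >= 0 && b.tv_nsec < NSEC_PER_SEC);

	r.tv_sec = a.tv_sec - b.tv_sec;
	r.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += NSEC_PER_SEC;
	}
	return r;
}

/*
 * Offset to store in the namespace so that a restored container sees
 * its clock continue from the value it had when it was dumped.
 */
struct timespec ns_offset_for_restore(struct timespec dumped,
				      struct timespec host_now)
{
	return ts_sub(dumped, host_now);
}

/* Value a process inside the namespace observes: host clock + offset. */
struct timespec ns_clock_read(struct timespec host_now,
			      struct timespec offset)
{
	return ts_add(host_now, offset);
}
```

For example, a clock dumped at 100.2s and restored on a host whose clock
reads 5000.7s gets an offset of -4900.5s, and a read inside the namespace
right after the restore yields 100.2s again.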
Questions to discuss:
* Clone flags exhaustion. Currently there is only one unused clone flag
bit left, and it may be worth using it to extend the arguments of the clone
system call.
* Realtime clock implementation details:
Is having a simple offset enough?
What to do when date and time is changed on the host?
Is there a need to adjust vfs modification and creation times?
Implementation for adjtime() syscall.
Cc: Dmitry Safonov <[email protected]>
Cc: Adrian Reber <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Andrei Vagin (12):
ns: Introduce Time Namespace
timens: Add timens_offsets
timens: Introduce CLOCK_MONOTONIC offsets
timens: Introduce CLOCK_BOOTTIME offset
timerfd/timens: Take into account ns clock offsets
kernel: Take into account timens clock offsets in clock_nanosleep
x86/vdso/timens: Add offsets page in vvar
x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
posix-timers/timens: Take into account clock offsets
selftest/timens: Add test for timerfd
selftest/timens: Add test for clock_nanosleep
timens/selftest: Add timer offsets test
Dmitry Safonov (8):
timens: Shift /proc/uptime
x86/vdso: Restrict splitting vvar vma
x86/vdso: Purge timens page on setns()/unshare()/clone()
x86/vdso: Look for vvar vma to purge timens page
timens: Add align for timens_offsets
timens: Optimize zero-offsets
selftest: Add Time Namespace test for supported clocks
timens/selftest: Add procfs selftest
arch/Kconfig | 5 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vclock_gettime.c | 52 +++++
arch/x86/entry/vdso/vdso-layout.lds.S | 9 +-
arch/x86/entry/vdso/vdso2c.c | 3 +
arch/x86/entry/vdso/vma.c | 67 +++++++
arch/x86/include/asm/vdso.h | 2 +
fs/proc/namespaces.c | 3 +
fs/proc/uptime.c | 3 +
fs/timerfd.c | 16 +-
include/linux/nsproxy.h | 1 +
include/linux/proc_ns.h | 1 +
include/linux/time_namespace.h | 72 +++++++
include/linux/timens_offsets.h | 25 +++
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 8 +
kernel/Makefile | 1 +
kernel/fork.c | 3 +-
kernel/nsproxy.c | 19 +-
kernel/time/hrtimer.c | 8 +
kernel/time/posix-timers.c | 89 ++++++++-
kernel/time/posix-timers.h | 2 +
kernel/time_namespace.c | 230 +++++++++++++++++++++++
tools/testing/selftests/timens/.gitignore | 5 +
tools/testing/selftests/timens/Makefile | 6 +
tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/log.h | 21 +++
tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++
tools/testing/selftests/timens/timens.c | 196 +++++++++++++++++++
tools/testing/selftests/timens/timer.c | 95 ++++++++++
tools/testing/selftests/timens/timerfd.c | 96 ++++++++++
33 files changed, 1272 insertions(+), 13 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 include/linux/timens_offsets.h
create mode 100644 kernel/time_namespace.c
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/procfs.c
create mode 100644 tools/testing/selftests/timens/timens.c
create mode 100644 tools/testing/selftests/timens/timer.c
create mode 100644 tools/testing/selftests/timens/timerfd.c
--
2.13.6
This test checks that all supported clocks can be changed by
clock_settime.
Cc: [email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 5 +
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/log.h | 21 ++++
tools/testing/selftests/timens/timens.c | 196 ++++++++++++++++++++++++++++++
5 files changed, 224 insertions(+)
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/timens.c
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
new file mode 100644
index 000000000000..27a693229ce1
--- /dev/null
+++ b/tools/testing/selftests/timens/.gitignore
@@ -0,0 +1 @@
+timens
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
new file mode 100644
index 000000000000..b877efb78974
--- /dev/null
+++ b/tools/testing/selftests/timens/Makefile
@@ -0,0 +1,5 @@
+TEST_GEN_PROGS := timens
+
+CFLAGS := -Wall -Werror
+
+include ../lib.mk
diff --git a/tools/testing/selftests/timens/config b/tools/testing/selftests/timens/config
new file mode 100644
index 000000000000..4480620f6f49
--- /dev/null
+++ b/tools/testing/selftests/timens/config
@@ -0,0 +1 @@
+CONFIG_TIME_NS=y
diff --git a/tools/testing/selftests/timens/log.h b/tools/testing/selftests/timens/log.h
new file mode 100644
index 000000000000..05fec7f97870
--- /dev/null
+++ b/tools/testing/selftests/timens/log.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_TIMENS_LOG_H__
+#define __SELFTEST_TIMENS_LOG_H__
+
+#define pr_msg(fmt, lvl, ...) \
+ fprintf(stderr, "[%s] (%s:%d)\t" fmt "\n", \
+ lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...) func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...) \
+ ({ \
+ pr_msg(fmt, "ERR", ##__VA_ARGS__); \
+ -1; \
+ })
+#define pr_fail(fmt, ...) pr_msg(fmt, "FAIL", ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...) pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/timens/timens.c b/tools/testing/selftests/timens/timens.c
new file mode 100644
index 000000000000..dfa6701214b1
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME 0x00001000
+#endif
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define CLOCK_TYPES \
+ ct(CLOCK_BOOTTIME), \
+ ct(CLOCK_MONOTONIC), \
+ ct(CLOCK_MONOTONIC_COARSE), \
+ ct(CLOCK_MONOTONIC_RAW), \
+
+
+#define ct(clock) clock
+static clockid_t clocks[] = {
+ CLOCK_TYPES
+};
+#undef ct
+#define ct(clock) #clock
+static char *clock_names[] = {
+ CLOCK_TYPES
+};
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+ if (setns(fd, CLONE_NEWTIME)) {
+ pr_perror("setns()");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int init_namespaces(void)
+{
+ char path[] = "/proc/self/ns/time";
+ struct stat st1, st2;
+
+ parent_ns = open(path, O_RDONLY);
+ if (parent_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(parent_ns, &st1))
+ return pr_perror("Unable to stat the parent timens");
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ child_ns = open(path, O_RDONLY);
+ if (child_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(child_ns, &st2))
+ return pr_perror("Unable to stat the timens");
+
+ if (st1.st_ino == st2.st_ino)
+ return pr_err("The same child_ns after CLONE_NEWTIME");
+
+ return 0;
+}
+
+static int _gettime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+ int err;
+
+ if (!raw_syscall) {
+ if (clock_gettime(clk_id, res)) {
+ pr_perror("clock_gettime(%d)", (int)clk_id);
+ return -1;
+ }
+ return 0;
+ }
+
+ err = syscall(SYS_clock_gettime, clk_id, res);
+ if (err)
+ pr_perror("syscall(SYS_clock_gettime(%d))", (int)clk_id);
+
+ return err;
+}
+
+static int _settime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+ int err;
+
+ if (!raw_syscall) {
+ if (clock_settime(clk_id, res))
+ return pr_perror("clock_settime(%d)", (int)clk_id);
+ return 0;
+ }
+
+ err = syscall(SYS_clock_settime, clk_id, res);
+ if (err)
+ pr_perror("syscall(SYS_clock_settime(%d))", (int)clk_id);
+
+ return err;
+}
+
+static int test_gettime(clockid_t clock_index, bool raw_syscall, time_t offset)
+{
+ struct timespec child_ts_new, parent_ts_old, cur_ts;
+ char *entry = raw_syscall ? "syscall" : "vdso";
+ double precision = 0.0;
+
+ switch (clocks[clock_index]) {
+ case CLOCK_MONOTONIC_COARSE:
+ case CLOCK_MONOTONIC_RAW:
+ precision = -2.0;
+ break;
+ }
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (_gettime(clocks[clock_index], &parent_ts_old, raw_syscall))
+ return -1;
+
+ if (switch_ns(child_ns))
+ return pr_err("switch_ns(%d)", child_ns);
+
+ child_ts_new.tv_nsec = parent_ts_old.tv_nsec;
+ child_ts_new.tv_sec = parent_ts_old.tv_sec + offset;
+
+ if (_settime(clocks[clock_index], &child_ts_new, raw_syscall))
+ return -1;
+
+ if (_gettime(clocks[clock_index], &cur_ts, raw_syscall))
+ return -1;
+
+ if (difftime(cur_ts.tv_sec, child_ts_new.tv_sec) < precision) {
+ pr_fail("Child's %s (%s) time has not changed: %lu -> %lu [%lu]",
+ clock_names[clock_index], entry, parent_ts_old.tv_sec,
+ child_ts_new.tv_sec, cur_ts.tv_sec);
+ return -1;
+ }
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (_gettime(clocks[clock_index], &cur_ts, raw_syscall))
+ return -1;
+
+ if (difftime(cur_ts.tv_sec, parent_ts_old.tv_sec) > DAY_IN_SEC) {
+ pr_fail("Parent's %s (%s) time has changed: %lu -> %lu [%lu]",
+ clock_names[clock_index], entry, parent_ts_old.tv_sec,
+ child_ts_new.tv_sec, cur_ts.tv_sec);
+ /* Let's play nice and put it closer to original */
+ clock_settime(clocks[clock_index], &cur_ts);
+ return -1;
+ }
+
+ pr_msg("Passed for %s (%s)", "OK", clock_names[clock_index], entry);
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned int i;
+ int ret = 0;
+
+ if (init_namespaces())
+ return 1;
+
+ for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+ ret |= test_gettime(i, true, TEN_DAYS_IN_SEC);
+ ret |= test_gettime(i, true, -TEN_DAYS_IN_SEC);
+ ret |= test_gettime(i, false, TEN_DAYS_IN_SEC);
+ ret |= test_gettime(i, false, -TEN_DAYS_IN_SEC);
+ }
+
+ return !!ret;
+}
--
2.13.6
Currently this only checks uptime, but procfs checks for REALTIME might be
added in the future.
Cc: [email protected]
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++++++++++++++++++
3 files changed, 147 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/procfs.c
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 9b6c8ddac2c8..94ffdd9cead7 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,3 +1,4 @@
clock_nanosleep
+procfs
timens
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 76a1dc891184..f96f50d1fef8 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep
+TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
CFLAGS := -Wall -Werror
diff --git a/tools/testing/selftests/timens/procfs.c b/tools/testing/selftests/timens/procfs.c
new file mode 100644
index 000000000000..5067cbbddcc5
--- /dev/null
+++ b/tools/testing/selftests/timens/procfs.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <math.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME 0x00001000
+#endif
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define MAX_TEST_TIME_SEC (60*5)
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+ if (setns(fd, CLONE_NEWTIME))
+ return pr_perror("setns()");
+
+ return 0;
+}
+
+static int init_namespaces(void)
+{
+ char path[] = "/proc/self/ns/time";
+ struct stat st1, st2;
+
+ parent_ns = open(path, O_RDONLY);
+ if (parent_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(parent_ns, &st1))
+ return pr_perror("Unable to stat the parent timens");
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ child_ns = open(path, O_RDONLY);
+ if (child_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(child_ns, &st2))
+ return pr_perror("Unable to stat the timens");
+
+ if (st1.st_ino == st2.st_ino)
+ return pr_err("The same child_ns after CLONE_NEWTIME");
+
+ return 0;
+}
+
+static int read_proc_uptime(struct timespec *uptime)
+{
+ unsigned long up_sec, up_nsec;
+ FILE *proc;
+
+ proc = fopen("/proc/uptime", "r");
+ if (proc == NULL) {
+ pr_perror("Unable to open /proc/uptime");
+ return -1;
+ }
+
+ if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
+ if (errno) {
+ pr_perror("fscanf");
+ return -errno;
+ }
+ pr_err("failed to parse /proc/uptime");
+ return -1;
+ }
+ fclose(proc);
+
+ uptime->tv_sec = up_sec;
+ uptime->tv_nsec = up_nsec * 10000000; /* hundredths of a second -> ns */
+ return 0;
+}
+
+static int check_uptime(void)
+{
+ struct timespec ts_btime, uptime_new, uptime_old;
+ time_t uptime_expected;
+ double prec = MAX_TEST_TIME_SEC;
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (clock_gettime(CLOCK_BOOTTIME, &ts_btime))
+ return pr_perror("clock_gettime()");
+
+ if (read_proc_uptime(&uptime_old))
+ return 1;
+
+ ts_btime.tv_sec += TEN_DAYS_IN_SEC;
+
+ if (switch_ns(child_ns))
+ return pr_err("switch_ns(%d)", child_ns);
+
+ if (clock_settime(CLOCK_BOOTTIME, &ts_btime))
+ return pr_perror("clock_settime()");
+
+ if (read_proc_uptime(&uptime_new))
+ return 1;
+
+ uptime_expected = uptime_old.tv_sec + TEN_DAYS_IN_SEC;
+ if (fabs(difftime(uptime_new.tv_sec, uptime_expected)) > prec) {
+ pr_fail("uptime in /proc/uptime: old %ld, new %ld [%ld]",
+ uptime_old.tv_sec, uptime_new.tv_sec,
+ uptime_old.tv_sec + TEN_DAYS_IN_SEC);
+ return 1;
+ }
+
+ pr_msg("Passed for /proc/uptime", "OK");
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+
+ if (init_namespaces())
+ return 1;
+
+ ret |= check_uptime();
+
+ return ret;
+}
--
2.13.6
From: Andrei Vagin <[email protected]>
Check that clock_nanosleep() takes into account clock offsets.
Cc: [email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++++++++++++++++
3 files changed, 100 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index b609f6ee9fb9..9b6c8ddac2c8 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,2 +1,3 @@
+clock_nanosleep
timens
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 66b90cd28e5c..76a1dc891184 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd
+TEST_GEN_PROGS := timens timerfd clock_nanosleep
CFLAGS := -Wall -Werror
diff --git a/tools/testing/selftests/timens/clock_nanosleep.c b/tools/testing/selftests/timens/clock_nanosleep.c
new file mode 100644
index 000000000000..5af780b4cfe0
--- /dev/null
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -0,0 +1,98 @@
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+#define CLONE_NEWTIME 0x00001000
+#endif
+
+static long long get_elapsed_time(int clockid, struct timespec *start)
+{
+ struct timespec curr;
+ long long secs, nsecs;
+
+ if (clock_gettime(clockid, &curr) == -1)
+ return pr_perror("clock_gettime");
+
+ secs = curr.tv_sec - start->tv_sec;
+ nsecs = curr.tv_nsec - start->tv_nsec;
+ if (nsecs < 0) {
+ secs--;
+ nsecs += 1000000000;
+ }
+ if (nsecs > 1000000000) {
+ secs++;
+ nsecs -= 1000000000;
+ }
+ return secs * 1000 + nsecs / 1000000;
+}
+
+int run_test(int clockid)
+{
+ long long elapsed;
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ struct timespec now = {};
+ struct timespec start;
+
+ if (clock_gettime(clockid, &start) == -1)
+ return pr_perror("clock_gettime");
+
+
+ if (i == 1) {
+ now.tv_sec = start.tv_sec;
+ now.tv_nsec = start.tv_nsec;
+ }
+
+ printf("clock_nanosleep: %d\n", clockid);
+ now.tv_sec += 2;
+ clock_nanosleep(clockid, i ? TIMER_ABSTIME : 0, &now, NULL);
+
+ elapsed = get_elapsed_time(clockid, &start);
+ if (elapsed < 1900 || elapsed > 2100) {
+ pr_fail("elapsed %lld\n", elapsed);
+ return 1;
+ }
+ }
+
+ printf("PASS\n");
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ struct timespec tp;
+ int ret;
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ if (clock_gettime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec += 7 * 24 * 3600;
+ if (clock_settime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_settime");
+
+ if (clock_gettime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec += 9 * 24 * 3600;
+ tp.tv_nsec = 0;
+ if (clock_settime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_settime");
+
+ ret = 0;
+ ret |= run_test(CLOCK_MONOTONIC);
+ return ret;
+}
+
--
2.13.6
From: Andrei Vagin <[email protected]>
Check that timer_create takes into account clock offsets.
Cc: [email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 3 +-
tools/testing/selftests/timens/timer.c | 95 +++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/timer.c
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 94ffdd9cead7..3b7eda8f35ce 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
clock_nanosleep
procfs
timens
+timer
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index f96f50d1fef8..ae1ffd24cc43 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,5 +1,6 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
CFLAGS := -Wall -Werror
+LDFLAGS := -lrt
include ../lib.mk
diff --git a/tools/testing/selftests/timens/timer.c b/tools/testing/selftests/timens/timer.c
new file mode 100644
index 000000000000..e3a0951aadc8
--- /dev/null
+++ b/tools/testing/selftests/timens/timer.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <time.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+#define CLONE_NEWTIME 0x00001000 /* New time namespace */
+#endif
+
+int run_test(int clockid)
+{
+ struct itimerspec new_value;
+ struct timespec now;
+ long long elapsed;
+ timer_t fd;
+ int i;
+
+ if (clock_gettime(clockid, &now) == -1)
+ return pr_perror("clock_gettime");
+
+ for (i = 0; i < 2; i++) {
+ struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
+ int flags = 0;
+
+ pr_msg("timer_settime: %d", "INFO", clockid);
+ new_value.it_value.tv_sec = 3600;
+ new_value.it_value.tv_nsec = 0;
+ new_value.it_interval.tv_sec = 1;
+ new_value.it_interval.tv_nsec = 0;
+
+ if (i == 1) {
+ new_value.it_value.tv_sec += now.tv_sec;
+ new_value.it_value.tv_nsec += now.tv_nsec;
+ }
+
+ if (timer_create(clockid, &sevp, &fd) == -1)
+ return pr_perror("timer_create");
+
+ if (i == 1)
+ flags |= TIMER_ABSTIME;
+ if (timer_settime(fd, flags, &new_value, NULL) == -1)
+ return pr_perror("timer_settime");
+
+ if (timer_gettime(fd, &new_value) == -1)
+ return pr_perror("timer_gettime");
+
+ elapsed = new_value.it_value.tv_sec;
+ if (llabs(elapsed - 3600) > 60) {
+ pr_fail("elapsed: %lld\n", elapsed);
+ return 1;
+ }
+ }
+
+ printf("PASS\n");
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ struct timespec tp;
+ int ret;
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ if (clock_gettime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec -= 70 * 24 * 3600;
+ if (clock_settime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_settime");
+
+ if (clock_gettime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec -= 9 * 24 * 3600;
+ tp.tv_nsec = 0;
+ if (clock_settime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_settime");
+
+ ret = 0;
+ ret |= run_test(CLOCK_BOOTTIME);
+ ret |= run_test(CLOCK_MONOTONIC);
+ return ret;
+}
+
--
2.13.6
From: Andrei Vagin <[email protected]>
Check that timerfd_create takes into account clock offsets.
Cc: [email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/timerfd.c | 96 +++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/timerfd.c
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 27a693229ce1..b609f6ee9fb9 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1 +1,2 @@
timens
+timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index b877efb78974..66b90cd28e5c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens
+TEST_GEN_PROGS := timens timerfd
CFLAGS := -Wall -Werror
diff --git a/tools/testing/selftests/timens/timerfd.c b/tools/testing/selftests/timens/timerfd.c
new file mode 100644
index 000000000000..914a4cd9a0df
--- /dev/null
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME 0x00001000
+#endif
+
+int run_test(int clockid)
+{
+ struct itimerspec new_value;
+ struct timespec now;
+ long long elapsed;
+ int fd, i;
+
+ if (clock_gettime(clockid, &now))
+ return pr_perror("clock_gettime");
+
+ for (i = 0; i < 2; i++) {
+ int flags = 0;
+
+ pr_msg("timerfd_settime: %d", "INFO", clockid);
+ new_value.it_value.tv_sec = 3600;
+ new_value.it_value.tv_nsec = 0;
+ new_value.it_interval.tv_sec = 1;
+ new_value.it_interval.tv_nsec = 0;
+
+ if (i == 1) {
+ new_value.it_value.tv_sec += now.tv_sec;
+ new_value.it_value.tv_nsec += now.tv_nsec;
+ }
+
+ fd = timerfd_create(clockid, 0);
+ if (fd == -1)
+ return pr_perror("timerfd_create");
+
+ if (i == 1)
+ flags |= TFD_TIMER_ABSTIME;
+
+ if (timerfd_settime(fd, flags, &new_value, NULL))
+ return pr_perror("timerfd_settime");
+
+ if (timerfd_gettime(fd, &new_value))
+ return pr_perror("timerfd_gettime");
+
+ elapsed = new_value.it_value.tv_sec;
+ if (llabs(elapsed - 3600) > 60) {
+ printf("FAIL\n");
+ return 1;
+ }
+
+ close(fd);
+ }
+
+ printf("PASS\n");
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ struct timespec tp;
+ int ret;
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ if (clock_gettime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec += 7 * 24 * 3600;
+ if (clock_settime(CLOCK_MONOTONIC, &tp))
+ return pr_perror("clock_settime");
+
+ if (clock_gettime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_gettime");
+ tp.tv_sec += 9 * 24 * 3600;
+ tp.tv_nsec = 0;
+ if (clock_settime(CLOCK_BOOTTIME, &tp))
+ return pr_perror("clock_settime");
+
+ ret = 0;
+ ret |= run_test(CLOCK_BOOTTIME);
+ ret |= run_test(CLOCK_MONOTONIC);
+ return ret;
+}
+
--
2.13.6
Take the fast path on the host or in a namespace where no time offset has
been set. Add TIMENS_FALLBACK_SYSCALL, which might be wired up if timens
offsets should be hidden from userspace (this will result in a fallback to
syscalls).
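The decision made in the vdso from the flags word can be sketched in
userspace C as follows. The flag values are the ones from this patch; the
enum timens_path names and timens_choose_path() are hypothetical and exist
only for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum {
	TIMENS_USE_OFFSETS      = 1,	/* value taken from this patch */
	TIMENS_FALLBACK_SYSCALL = 2,	/* value taken from this patch */
};

/* The two flags are tested independently, so they must be distinct bits. */
static_assert((TIMENS_USE_OFFSETS & TIMENS_FALLBACK_SYSCALL) == 0,
	      "flags must not share bits");

/* Hypothetical outcome names, for illustration only. */
enum timens_path {
	TIMENS_PATH_HOST_TIME,	/* return the host time unchanged */
	TIMENS_PATH_ADD_OFFSET,	/* add the namespace offset in the vdso */
	TIMENS_PATH_SYSCALL,	/* leave the vdso and do a real syscall */
};

/* Mirrors the order of the checks in monotonic_to_ns(). */
enum timens_path timens_choose_path(uint64_t flags)
{
	/* No offset was ever set: time is the same as on the host. */
	if (!(flags & TIMENS_USE_OFFSETS))
		return TIMENS_PATH_HOST_TIME;

	/* Offsets must not be exposed to userspace: take the slow path. */
	if (flags & TIMENS_FALLBACK_SYSCALL)
		return TIMENS_PATH_SYSCALL;

	return TIMENS_PATH_ADD_OFFSET;
}
```

Note that with this ordering TIMENS_FALLBACK_SYSCALL only has an effect
once TIMENS_USE_OFFSETS is set as well.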
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 17 +++++++++++++----
include/linux/timens_offsets.h | 12 ++++++++++--
kernel/time/posix-timers.c | 21 ++++++++++++---------
kernel/time_namespace.c | 2 +-
4 files changed, 36 insertions(+), 16 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index a265e2737a9a..458cb1992e2e 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -252,17 +252,25 @@ notrace void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec)
ts->tv_nsec = nsec;
}
-notrace static __always_inline void monotonic_to_ns(struct timespec *ts)
+notrace static __always_inline int monotonic_to_ns(struct timespec *ts)
{
#ifdef CONFIG_TIME_NS
struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
struct timespec offset;
+ /* Optimization: time is the same as on host, return right away */
+ if (!(timens->flags & TIMENS_USE_OFFSETS))
+ return 0;
+
+ if (timens->flags & TIMENS_FALLBACK_SYSCALL)
+ return -1;
+
offset = timespec64_to_timespec(timens->monotonic_time_offset);
*ts = timespec_add(*ts, offset);
#endif
+ return 0;
}
notrace static int __always_inline do_monotonic(struct timespec *ts)
@@ -283,8 +291,6 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
- monotonic_to_ns(ts);
-
return mode;
}
@@ -306,7 +312,6 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
ts->tv_sec = gtod->monotonic_time_coarse_sec;
ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
- monotonic_to_ns(ts);
}
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
@@ -319,12 +324,16 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
case CLOCK_MONOTONIC:
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
+ if (monotonic_to_ns(ts))
+ goto fallback;
break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
case CLOCK_MONOTONIC_COARSE:
do_monotonic_coarse(ts);
+ if (monotonic_to_ns(ts))
+ goto fallback;
break;
default:
goto fallback;
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 92a8ea5601eb..8c43e7c3e632 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -2,6 +2,13 @@
#ifndef _LINUX_TIME_OFFSETS_H
#define _LINUX_TIME_OFFSETS_H
+enum {
+ /* We're in namespace - add offsets from vvar */
+ TIMENS_USE_OFFSETS = 1,
+ /* Don't expose host's offsets, fall back to syscall - slow */
+ TIMENS_FALLBACK_SYSCALL = 2, /* TODO if anyone actually interested */
+};
+
/*
* Time offsets need align as they're placed on vvar page,
* which should have tail paddings on ia32 vdso.
@@ -10,8 +17,9 @@
 * to timespec because of a padding occurring between the fields.
*/
struct timens_offsets {
- struct timespec64 monotonic_time_offset __aligned(8);
- struct timespec64 monotonic_boottime_offset __aligned(8);
+ u64 flags;
+ struct timespec64 monotonic_time_offset __aligned(8);
+ struct timespec64 monotonic_boottime_offset __aligned(8);
};
#endif
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 701cb0602b7a..576dbd24c498 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -210,7 +210,7 @@ static void common_timens_adjust(clockid_t which_clock, struct timespec64 *tp)
{
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
- if (!ns_offsets)
+ if (!ns_offsets || !(ns_offsets->flags & TIMENS_USE_OFFSETS))
return;
switch (which_clock) {
@@ -234,15 +234,16 @@ static int posix_ktime_set_ts(clockid_t which_clock,
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
struct timespec64 ktp;
+ if (!ns_offsets)
+ return -EINVAL;
+
if (!ns_capable(current->nsproxy->time_ns->user_ns, CAP_SYS_TIME))
return -EPERM;
ktime_get_ts64(&ktp);
- if (ns_offsets)
- ns_offsets->monotonic_time_offset = timespec64_sub(*tp, ktp);
- else
- return -EINVAL;
+ ns_offsets->monotonic_time_offset = timespec64_sub(*tp, ktp);
+ ns_offsets->flags |= TIMENS_USE_OFFSETS;
return 0;
}
@@ -296,15 +297,17 @@ static int posix_set_boottime(clockid_t which_clock, const struct timespec64 *tp
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
struct timespec64 ktp;
+ if (!ns_offsets)
+ return -EINVAL;
+
if (!ns_capable(current->nsproxy->time_ns->user_ns, CAP_SYS_TIME))
return -EPERM;
ktime_get_boottime_ts64(&ktp);
- if (ns_offsets)
- ns_offsets->monotonic_boottime_offset = timespec64_sub(*tp, ktp);
- else
- return -EINVAL;
+ ns_offsets->monotonic_boottime_offset = timespec64_sub(*tp, ktp);
+ ns_offsets->flags |= TIMENS_USE_OFFSETS;
+
return 0;
}
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index f88ae0e17d92..4052bdcec110 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -171,7 +171,7 @@ static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
struct timespec64 *offsets = NULL;
- if (!ns_offsets)
+ if (!ns_offsets || !(ns_offsets->flags & TIMENS_USE_OFFSETS))
return;
if (val->tv_sec == 0 && val->tv_nsec == 0)
--
2.13.6
Align the offsets so that the time namespace works for ia32
applications on an x86_64 host.
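The layout requirement can be checked from user space. The following is
only a sketch, not kernel code: demo_timespec64 and demo_timens_offsets
are hypothetical stand-ins for struct timespec64 and struct
timens_offsets. With the explicit 8-byte alignment, the struct's
alignment, size and member offsets come out the same for -m32 and -m64
builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct demo_timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Mirrors the patched layout: each offset is forced to 8-byte alignment. */
struct demo_timens_offsets {
	struct demo_timespec64 monotonic_time_offset __attribute__((aligned(8)));
	struct demo_timespec64 monotonic_boottime_offset __attribute__((aligned(8)));
};

/* Compile-time checks that hold for 32-bit and 64-bit builds alike. */
_Static_assert(_Alignof(struct demo_timens_offsets) == 8,
	       "struct must be 8-byte aligned");
_Static_assert(offsetof(struct demo_timens_offsets,
			monotonic_boottime_offset) == 16,
	       "second offset must start at byte 16");
_Static_assert(sizeof(struct demo_timens_offsets) == 32,
	       "no unexpected padding");
```

Building the file once with -m32 and once with -m64 is enough to
exercise the checks; a layout mismatch fails the build.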
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/timens_offsets.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 777530c46852..92a8ea5601eb 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -2,9 +2,16 @@
#ifndef _LINUX_TIME_OFFSETS_H
#define _LINUX_TIME_OFFSETS_H
+/*
+ * Time offsets need align as they're placed on vvar page,
+ * which should have tail paddings on ia32 vdso.
+ * Otherwise as u64 has align(4), vvar offsets will differ.
+ * On 64-bit big-endian systems vdso should convert timespec64
+ * to timespec because of a padding occurring between the fields.
+ */
struct timens_offsets {
- struct timespec64 monotonic_time_offset;
- struct timespec64 monotonic_boottime_offset;
+ struct timespec64 monotonic_time_offset __aligned(8);
+ struct timespec64 monotonic_boottime_offset __aligned(8);
};
#endif
--
2.13.6
Find the page with timens offsets in vvar and flush its mapping when
entering or creating another time namespace.
This prevents an application from keeping a stale mapping from the old
namespace. As the old namespace might already be destroyed at the moment
of the userspace access, it also prevents information leaks from the
kernel.
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 31 +++++++++++++++++++++++++++++++
arch/x86/include/asm/vdso.h | 1 +
kernel/time_namespace.c | 12 ++++++++++++
3 files changed, 44 insertions(+)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 0f92227a4a7e..90eadcfcb7f5 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
#include <asm/cpufeature.h>
#include <asm/mshyperv.h>
#include <asm/page.h>
+#include <asm/tlbflush.h>
#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -158,6 +159,36 @@ static int vvar_fault(const struct vm_special_mapping *sm,
return VM_FAULT_SIGBUS;
}
+static void clear_flush_timens_pte(struct mm_struct *mm, unsigned long addr)
+{
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ if (follow_pte_pmd(mm, addr, NULL, NULL, &ptep, NULL, &ptl))
+ return; /* no pte found */
+ ptep_get_and_clear(mm, addr, ptep);
+ pte_unmap_unlock(ptep, ptl);
+ flush_tlb_mm_range(mm, addr, addr + PAGE_SIZE, VM_NONE);
+}
+
+int vvar_purge_timens(struct task_struct *task)
+{
+ struct mm_struct *mm = task->mm;
+ const struct vdso_image *image;
+ unsigned long addr;
+
+ if (down_write_killable(&mm->mmap_sem))
+ return -EINTR;
+
+ image = mm->context.vdso_image;
+
+ addr = (unsigned long)mm->context.vdso + image->sym_timens_page;
+ clear_flush_timens_pte(mm, addr);
+
+ up_write(&mm->mmap_sem);
+ return 0;
+}
+
static const struct vm_special_mapping vdso_mapping = {
.name = "[vdso]",
.fault = vdso_fault,
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 619322065b8e..98b02481137c 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -45,6 +45,7 @@ extern const struct vdso_image vdso_image_32;
extern void __init init_vdso_image(const struct vdso_image *image);
extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vvar_purge_timens(struct task_struct *task);
#endif /* __ASSEMBLER__ */
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index f96871cb8124..f88ae0e17d92 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -14,6 +14,7 @@
#include <linux/proc_ns.h>
#include <linux/sched/task.h>
#include <linux/mm.h>
+#include <asm/vdso.h>
static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
{
@@ -91,9 +92,15 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
struct time_namespace *copy_time_ns(unsigned long flags,
struct user_namespace *user_ns, struct time_namespace *old_ns)
{
+ int ret;
+
if (!(flags & CLONE_NEWTIME))
return get_time_ns(old_ns);
+ ret = vvar_purge_timens(current);
+ if (ret)
+ return ERR_PTR(ret);
+
return clone_time_ns(user_ns, old_ns);
}
@@ -138,11 +145,16 @@ static void timens_put(struct ns_common *ns)
static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
{
struct time_namespace *ns = to_time_ns(new);
+ int ret;
if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;
+ ret = vvar_purge_timens(current);
+ if (ret)
+ return ret;
+
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;
--
2.13.6
From: Andrei Vagin <[email protected]>
As modern applications fetch time from the vdso without entering the
kernel, the offsets have to be made available to userspace code.
Allocate a page for timens offsets when constructing a time namespace.
As vdso mappings are platform-specific, add a Kconfig dependency on the
architecture.
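The page selection can be modelled in user space. This is a simplified
sketch under stated assumptions: the demo_* names are hypothetical,
offsets are plain nanosecond counts instead of timespec64, and a static
all-zero object plays the role of ZERO_PAGE(0). It shows why a namespace
without an offsets page sees the host value unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified offsets page: nanoseconds, not timespec64. */
struct demo_offsets_page {
	int64_t monotonic_ns;
};

struct demo_time_ns {
	struct demo_offsets_page *offsets; /* NULL: no offsets page allocated */
};

/* Plays the role of ZERO_PAGE(0): every offset reads as zero. */
static const struct demo_offsets_page demo_zero_page = { 0 };

/* Analogue of the vvar fault handler choosing which page to map. */
static const struct demo_offsets_page *
demo_map_timens_page(const struct demo_time_ns *ns)
{
	return ns->offsets ? ns->offsets : &demo_zero_page;
}

/* Analogue of the vdso fast path: host value plus the mapped offset. */
static int64_t demo_vdso_monotonic(const struct demo_time_ns *ns,
				   int64_t host_ns)
{
	return host_ns + demo_map_timens_page(ns)->monotonic_ns;
}
```

The reader never has to test for "no namespace": the zero page makes the
addition a no-op.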
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/Kconfig | 5 +++++
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vclock_gettime.c | 26 ++++++++++++++++++++++++++
arch/x86/entry/vdso/vdso-layout.lds.S | 9 ++++++++-
arch/x86/entry/vdso/vdso2c.c | 3 +++
arch/x86/entry/vdso/vma.c | 12 ++++++++++++
arch/x86/include/asm/vdso.h | 1 +
init/Kconfig | 1 +
8 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..411df0227a1d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -681,6 +681,11 @@ config HAVE_ARCH_HASH
config ISA_BUS_API
def_bool ISA
+config ARCH_HAS_VDSO_TIME_NS
+ bool
+ help
+ VDSO can add time-ns offsets without entering kernel.
+
#
# ABI hall of shame
#
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..4bcbdd1f1200 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -70,6 +70,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
+ select ARCH_HAS_VDSO_TIME_NS
select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d95c60..0594266740b9 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
#include <linux/math64.h>
#include <linux/time.h>
#include <linux/kernel.h>
+#include <linux/timens_offsets.h>
#define gtod (&VVAR(vsyscall_gtod_data))
@@ -38,6 +39,11 @@ extern u8 hvclock_page
__attribute__((visibility("hidden")));
#endif
+#ifdef CONFIG_TIME_NS
+extern u8 timens_page
+ __attribute__((visibility("hidden")));
+#endif
+
#ifndef BUILD_VDSO32
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -225,6 +231,23 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
}
+notrace static __always_inline void monotonic_to_ns(struct timespec *ts)
+{
+#ifdef CONFIG_TIME_NS
+ struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
+
+ ts->tv_sec += timens->monotonic_time_offset.tv_sec;
+ ts->tv_nsec += timens->monotonic_time_offset.tv_nsec;
+ if (ts->tv_nsec > NSEC_PER_SEC) {
+ ts->tv_nsec -= NSEC_PER_SEC;
+ ts->tv_sec++;
+ } else if (ts->tv_nsec < 0) {
+ ts->tv_nsec += NSEC_PER_SEC;
+ ts->tv_sec--;
+ }
+#endif
+}
+
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
unsigned long seq;
@@ -243,6 +266,8 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
+ monotonic_to_ns(ts);
+
return mode;
}
@@ -264,6 +289,7 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
ts->tv_sec = gtod->monotonic_time_coarse_sec;
ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
+ monotonic_to_ns(ts);
}
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index acfd5ba7d943..e5c2e9deca03 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -17,6 +17,12 @@
#define NUM_FAKE_SHDRS 13
+#ifdef CONFIG_TIME_NS
+# define TIMENS_SZ PAGE_SIZE
+#else
+# define TIMENS_SZ 0
+#endif
+
SECTIONS
{
/*
@@ -26,7 +32,7 @@ SECTIONS
* segment.
*/
- vvar_start = . - 3 * PAGE_SIZE;
+ vvar_start = . - (3 * PAGE_SIZE + TIMENS_SZ);
vvar_page = vvar_start;
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -38,6 +44,7 @@ SECTIONS
pvclock_page = vvar_start + PAGE_SIZE;
hvclock_page = vvar_start + 2 * PAGE_SIZE;
+ timens_page = vvar_start + 3 * PAGE_SIZE;
. = SIZEOF_HEADERS;
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 4674f58581a1..6c67cde7fe99 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -76,6 +76,7 @@ enum {
sym_hpet_page,
sym_pvclock_page,
sym_hvclock_page,
+ sym_timens_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
};
@@ -85,6 +86,7 @@ const int special_pages[] = {
sym_hpet_page,
sym_pvclock_page,
sym_hvclock_page,
+ sym_timens_page,
};
struct vdso_sym {
@@ -98,6 +100,7 @@ struct vdso_sym required_syms[] = {
[sym_hpet_page] = {"hpet_page", true},
[sym_pvclock_page] = {"pvclock_page", true},
[sym_hvclock_page] = {"hvclock_page", true},
+ [sym_timens_page] = {"timens_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8cc0395687b0..0f92227a4a7e 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -14,6 +14,7 @@
#include <linux/elf.h>
#include <linux/cpu.h>
#include <linux/ptrace.h>
+#include <linux/time_namespace.h>
#include <asm/pvclock.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
@@ -23,6 +24,7 @@
#include <asm/desc.h>
#include <asm/cpufeature.h>
#include <asm/mshyperv.h>
+#include <asm/page.h>
#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -138,6 +140,16 @@ static int vvar_fault(const struct vm_special_mapping *sm,
if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
ret = vm_insert_pfn(vma, vmf->address,
vmalloc_to_pfn(tsc_pg));
+ } else if (sym_offset == image->sym_timens_page) {
+ struct time_namespace *ns = current->nsproxy->time_ns;
+ unsigned long pfn;
+
+ if (!ns->offsets)
+ pfn = page_to_pfn(ZERO_PAGE(0));
+ else
+ pfn = page_to_pfn(virt_to_page(ns->offsets));
+
+ ret = vm_insert_pfn(vma, vmf->address, pfn);
}
if (ret == 0 || ret == -EBUSY)
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 27566e57e87d..619322065b8e 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -22,6 +22,7 @@ struct vdso_image {
long sym_hpet_page;
long sym_pvclock_page;
long sym_hvclock_page;
+ long sym_timens_page;
long sym_VDSO32_NOTE_MASK;
long sym___kernel_sigreturn;
long sym___kernel_rt_sigreturn;
diff --git a/init/Kconfig b/init/Kconfig
index dc2b40f7d73f..c9b250475ddb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -929,6 +929,7 @@ config UTS_NS
config TIME_NS
bool "TIME namespace"
+ depends on ARCH_HAS_VDSO_TIME_NS
default y
help
In this namespace boottime and monotonic clocks can be set.
--
2.13.6
From: Andrei Vagin <[email protected]>
Provide a helper that converts clock values into the time namespace.
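The shape of the change, reading the host clock first and then applying
an optional per-clock adjustment, can be sketched in user space. All
demo_* names, the fake host value and the offset are made up for
illustration; only the call order follows the patched clock_gettime().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified clock descriptor modelled on struct k_clock. */
struct demo_k_clock {
	int (*clock_get)(int64_t *ns);
	void (*clock_timens_adjust)(int64_t *ns); /* NULL: not namespaced */
};

static int64_t demo_ns_offset = 2000; /* made-up namespace offset */

/* Fake host clock that always reads 1000 ns. */
static int demo_get_host(int64_t *ns)
{
	*ns = 1000;
	return 0;
}

static void demo_adjust(int64_t *ns)
{
	*ns += demo_ns_offset;
}

static const struct demo_k_clock demo_monotonic = { demo_get_host, demo_adjust };
static const struct demo_k_clock demo_realtime = { demo_get_host, NULL };

/* Read the clock, then adjust once if the clock provides the hook. */
static int demo_clock_gettime(const struct demo_k_clock *kc, int64_t *ns)
{
	int error = kc->clock_get(ns);

	if (!error && kc->clock_timens_adjust)
		kc->clock_timens_adjust(ns);
	return error;
}
```

Keeping the adjustment out of clock_get() means each getter returns the
host value and the offset is applied in exactly one place.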
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
kernel/time/posix-timers.c | 52 +++++++++++++++++++++++++++++++---------------
kernel/time/posix-timers.h | 2 ++
2 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index d38835a21c5d..701cb0602b7a 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -206,12 +206,26 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
return do_adjtimex(t);
}
-static void timens_adjust_monotonic(struct timespec64 *tp)
+static void common_timens_adjust(clockid_t which_clock, struct timespec64 *tp)
{
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
- if (ns_offsets)
+ if (!ns_offsets)
+ return;
+
+ switch (which_clock) {
+ case CLOCK_MONOTONIC:
+ case CLOCK_MONOTONIC_RAW:
+ case CLOCK_MONOTONIC_COARSE:
*tp = timespec64_add(*tp, ns_offsets->monotonic_time_offset);
+ break;
+ case CLOCK_BOOTTIME:
+ *tp = timespec64_add(*tp, ns_offsets->monotonic_boottime_offset);
+ break;
+ default:
+ WARN_ONCE(1, "Time Namespace offset for %d is not realized",
+ which_clock);
+ }
}
static int posix_ktime_set_ts(clockid_t which_clock,
@@ -239,7 +253,6 @@ static int posix_ktime_set_ts(clockid_t which_clock,
static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_ts64(tp);
- timens_adjust_monotonic(tp);
return 0;
}
@@ -249,7 +262,6 @@ static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
static int posix_get_monotonic_raw(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_raw_ts64(tp);
- timens_adjust_monotonic(tp);
return 0;
}
@@ -264,7 +276,6 @@ static int posix_get_monotonic_coarse(clockid_t which_clock,
struct timespec64 *tp)
{
ktime_get_coarse_ts64(tp);
- timens_adjust_monotonic(tp);
return 0;
}
@@ -276,15 +287,7 @@ static int posix_get_coarse_res(const clockid_t which_clock, struct timespec64 *
static int posix_get_boottime(const clockid_t which_clock, struct timespec64 *tp)
{
- struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
-
ktime_get_boottime_ts64(tp);
-
- if (!ns_offsets)
- return 0;
-
- *tp = timespec64_add(*tp, ns_offsets->monotonic_boottime_offset);
-
return 0;
}
@@ -933,10 +936,6 @@ static int do_timer_settime(timer_t timer_id, int flags,
unsigned long flag;
int error = 0;
- if (!timespec64_valid(&new_spec64->it_interval) ||
- !timespec64_valid(&new_spec64->it_value))
- return -EINVAL;
-
if (old_spec64)
memset(old_spec64, 0, sizeof(*old_spec64));
retry:
@@ -944,6 +943,15 @@ static int do_timer_settime(timer_t timer_id, int flags,
if (!timr)
return -EINVAL;
+ if (flags & TIMER_ABSTIME)
+ timens_clock_to_host(timr->it_clock, &new_spec64->it_value);
+
+ if (!timespec64_valid(&new_spec64->it_interval) ||
+ !timespec64_valid(&new_spec64->it_value)) {
+ unlock_timer(timr, flag);
+ return -EINVAL;
+ }
+
kc = timr->kclock;
if (WARN_ON_ONCE(!kc || !kc->timer_set))
error = -EINVAL;
@@ -1121,6 +1129,9 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
error = kc->clock_get(which_clock, &kernel_tp);
+ if (!error && kc->clock_timens_adjust)
+ kc->clock_timens_adjust(which_clock, &kernel_tp);
+
if (!error && put_timespec64(&kernel_tp, tp))
error = -EFAULT;
@@ -1197,6 +1208,9 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock,
err = kc->clock_get(which_clock, &ts);
+ if (!err && kc->clock_timens_adjust)
+ kc->clock_timens_adjust(which_clock, &ts);
+
if (!err && compat_put_timespec64(&ts, tp))
err = -EFAULT;
@@ -1340,6 +1354,7 @@ static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_ktime_get_ts,
.clock_set = posix_ktime_set_ts,
+ .clock_timens_adjust = common_timens_adjust,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1356,6 +1371,7 @@ static const struct k_clock clock_monotonic_raw = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_monotonic_raw,
.clock_set = posix_ktime_set_ts,
+ .clock_timens_adjust = common_timens_adjust,
};
static const struct k_clock clock_realtime_coarse = {
@@ -1367,6 +1383,7 @@ static const struct k_clock clock_monotonic_coarse = {
.clock_getres = posix_get_coarse_res,
.clock_get = posix_get_monotonic_coarse,
.clock_set = posix_ktime_set_ts,
+ .clock_timens_adjust = common_timens_adjust,
};
static const struct k_clock clock_tai = {
@@ -1388,6 +1405,7 @@ static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_boottime,
.clock_set = posix_set_boottime,
+ .clock_timens_adjust = common_timens_adjust,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
diff --git a/kernel/time/posix-timers.h b/kernel/time/posix-timers.h
index ddb21145211a..308774bea32a 100644
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -8,6 +8,8 @@ struct k_clock {
const struct timespec64 *tp);
int (*clock_get)(const clockid_t which_clock,
struct timespec64 *tp);
+ void (*clock_timens_adjust)(const clockid_t which_clock,
+ struct timespec64 *tp);
int (*clock_adj)(const clockid_t which_clock, struct timex *tx);
int (*timer_create)(struct k_itimer *timer);
int (*nsleep)(const clockid_t which_clock, int flags,
--
2.13.6
From: Andrei Vagin <[email protected]>
Wire up clock_nanosleep to timens offsets.
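For reference, this is the kind of userspace call that depends on the
conversion. The sketch uses only the standard clock_gettime() and
clock_nanosleep() interfaces; demo_sleep_until() is a hypothetical
helper name. The deadline is built from the caller's view of
CLOCK_MONOTONIC, so inside a time namespace the kernel has to translate
it to the host clock before arming the timer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * Sleep until an absolute CLOCK_MONOTONIC deadline delay_ms from now.
 * Returns 0 on success or a positive error number.
 */
static int demo_sleep_until(long delay_ms)
{
	struct timespec deadline;
	int err;

	assert(delay_ms >= 0);

	if (clock_gettime(CLOCK_MONOTONIC, &deadline))
		return errno;

	deadline.tv_sec += delay_ms / 1000;
	deadline.tv_nsec += (delay_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_nsec -= 1000000000L;
		deadline.tv_sec++;
	}

	/* clock_nanosleep() returns the error number directly. */
	do {
		err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
				      &deadline, NULL);
	} while (err == EINTR);

	return err;
}
```

A relative sleep (flags == 0) needs no translation, which is why the
patch only converts when HRTIMER_MODE_REL is not set.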
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
kernel/time/hrtimer.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e1a549c9e399..4fe80c1325b2 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -51,6 +51,7 @@
#include <linux/timer.h>
#include <linux/freezer.h>
#include <linux/compat.h>
+#include <linux/time_namespace.h>
#include <linux/uaccess.h>
@@ -1730,9 +1731,16 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
{
struct restart_block *restart;
struct hrtimer_sleeper t;
+ struct timespec64 tp;
int ret = 0;
u64 slack;
+ if (!(mode & HRTIMER_MODE_REL)) {
+ tp = *rqtp;
+ rqtp = &tp;
+ timens_clock_to_host(clockid, &tp);
+ }
+
slack = current->timer_slack_ns;
if (dl_task(current) || rt_task(current))
slack = 0;
--
2.13.6
As the vvar vma may be moved away from the vdso, search for it rather
than calculating the purge address from the vdso position.
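A userspace counterpart of "search for the mapping instead of assuming
where it is" is to scan /proc/self/maps by name. This is only an
illustration with hypothetical helper names; the kernel code walks the
vma list and compares against vvar_mapping rather than parsing text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/<pid>/maps line. If it describes the mapping called
 * name (for example "[vvar]"), store its bounds and return 1; else 0.
 */
static int demo_match_mapping(const char *line, const char *name,
			      unsigned long long *start,
			      unsigned long long *end)
{
	unsigned long long s, e;

	if (sscanf(line, "%llx-%llx", &s, &e) != 2)
		return 0;
	if (!strstr(line, name))
		return 0;
	*start = s;
	*end = e;
	return 1;
}

/* Scan the current process's mappings for name; 1 if found, else 0. */
static int demo_find_mapping(const char *name, unsigned long long *start,
			     unsigned long long *end)
{
	char line[512];
	int found = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return 0;
	while (!found && fgets(line, sizeof(line), f))
		found = demo_match_mapping(line, name, start, end);
	fclose(f);
	return found;
}
```

A result of 0 from demo_find_mapping("[vvar]", ...) corresponds to the
"vvar is unmapped" case that the patch handles by skipping the flush.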
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 90eadcfcb7f5..d1e2392a4905 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -159,7 +159,18 @@ static int vvar_fault(const struct vm_special_mapping *sm,
return VM_FAULT_SIGBUS;
}
-static void clear_flush_timens_pte(struct mm_struct *mm, unsigned long addr)
+static const struct vm_special_mapping vdso_mapping = {
+ .name = "[vdso]",
+ .fault = vdso_fault,
+ .mremap = vdso_mremap,
+};
+static const struct vm_special_mapping vvar_mapping = {
+ .name = "[vvar]",
+ .fault = vvar_fault,
+ .mremap = vvar_mremap,
+};
+
+static void vvar_flush_timens_pte(struct mm_struct *mm, unsigned long addr)
{
spinlock_t *ptl;
pte_t *ptep;
@@ -175,31 +186,31 @@ int vvar_purge_timens(struct task_struct *task)
{
struct mm_struct *mm = task->mm;
const struct vdso_image *image;
+ struct vm_area_struct *vma;
unsigned long addr;
if (down_write_killable(&mm->mmap_sem))
return -EINTR;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (vma_is_special_mapping(vma, &vvar_mapping))
+ break;
+ }
+
+ /* vvar is unmapped */
+ if (!vma || !vma_is_special_mapping(vma, &vvar_mapping))
+ goto out;
+
image = mm->context.vdso_image;
- addr = (unsigned long)mm->context.vdso + image->sym_timens_page;
- clear_flush_timens_pte(mm, addr);
+ addr = vma->vm_end + image->sym_timens_page;
+ vvar_flush_timens_pte(mm, addr);
+out:
up_write(&mm->mmap_sem);
return 0;
}
-static const struct vm_special_mapping vdso_mapping = {
- .name = "[vdso]",
- .fault = vdso_fault,
- .mremap = vdso_mremap,
-};
-static const struct vm_special_mapping vvar_mapping = {
- .name = "[vvar]",
- .fault = vvar_fault,
- .mremap = vvar_mremap,
-};
-
/*
* Add vdso and vvar mappings to current process.
* @image - blob to map
--
2.13.6
From: Andrei Vagin <[email protected]>
Make timerfd respect timens offsets.
Provide two helpers, timens_clock_to_host() and timens_clock_from_host(),
that are useful for wiring up timens to different kernel subsystems.
Following patches will use timens_clock_from_host(); it is added here
for completeness.
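The sign convention of the two helpers can be sketched as follows. This
is a simplified model, not the kernel implementation: values are plain
nanosecond counts, the offset is made up, and the demo_* names are
hypothetical. As in clock_timens_fixup(), a zero value is left
untouched.

```c
#include <assert.h>
#include <stdint.h>

static int64_t demo_offset_ns = 5000; /* made-up: namespace = host + offset */

/* Shared fixup: add the offset towards the namespace, subtract towards host. */
static int64_t demo_fixup(int64_t value_ns, int to_ns)
{
	if (value_ns == 0)
		return 0;
	return to_ns ? value_ns + demo_offset_ns : value_ns - demo_offset_ns;
}

/* Namespace value -> host clock value. */
static int64_t demo_to_host(int64_t value_ns)
{
	return demo_fixup(value_ns, 0);
}

/* Host clock value -> value as seen inside the namespace. */
static int64_t demo_from_host(int64_t value_ns)
{
	return demo_fixup(value_ns, 1);
}

/* Only absolute expiries are translated; relative ones pass through. */
static int64_t demo_timer_expiry(int64_t value_ns, int absolute)
{
	return absolute ? demo_to_host(value_ns) : value_ns;
}
```

With an offset of 5000 ns, an absolute expiry of 12000 ns requested
inside the namespace is armed at 7000 ns on the host clock, while a
relative expiry of 12000 ns stays 12000 ns.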
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
fs/timerfd.c | 16 +++++++++++-----
include/linux/time_namespace.h | 11 +++++++++++
kernel/time_namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 5 deletions(-)
diff --git a/fs/timerfd.c b/fs/timerfd.c
index d69ad801eb80..001ab7a0fd8b 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
#include <linux/syscalls.h>
#include <linux/compat.h>
#include <linux/rcupdate.h>
+#include <linux/time_namespace.h>
struct timerfd_ctx {
union {
@@ -433,22 +434,27 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
}
static int do_timerfd_settime(int ufd, int flags,
- const struct itimerspec64 *new,
+ struct itimerspec64 *new,
struct itimerspec64 *old)
{
struct fd f;
struct timerfd_ctx *ctx;
int ret;
- if ((flags & ~TFD_SETTIME_FLAGS) ||
- !itimerspec64_valid(new))
- return -EINVAL;
-
ret = timerfd_fget(ufd, &f);
if (ret)
return ret;
ctx = f.file->private_data;
+ if (flags & TFD_TIMER_ABSTIME)
+ timens_clock_to_host(ctx->clockid, &new->it_value);
+
+ if ((flags & ~TFD_SETTIME_FLAGS) ||
+ !itimerspec64_valid(new)) {
+ fdput(f);
+ return -EINVAL;
+ }
+
if (isalarm(ctx) && !capable(CAP_WAKE_ALARM)) {
fdput(f);
return -EPERM;
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 4960c54f1b33..910711d1c39d 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -38,6 +38,9 @@ static inline void put_time_ns(struct time_namespace *ns)
kref_put(&ns->kref, free_time_ns);
}
+void timens_clock_to_host(int clockid, struct timespec64 *val);
+void timens_clock_from_host(int clockid, struct timespec64 *val);
+
#else
static inline void get_time_ns(struct time_namespace *ns)
{
@@ -56,6 +59,14 @@ static inline struct time_namespace *copy_time_ns(unsigned long flags,
return old_ns;
}
+static inline void timens_clock_to_host(int clockid, struct timespec64 *val)
+{
+}
+
+static inline void timens_clock_from_host(int clockid, struct timespec64 *val)
+{
+}
+
#endif
#endif /* _LINUX_TIMENS_H */
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index a985529754b4..f96871cb8124 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -154,6 +154,45 @@ static struct user_namespace *timens_owner(struct ns_common *ns)
return to_time_ns(ns)->user_ns;
}
+static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
+{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+ struct timespec64 *offsets = NULL;
+
+ if (!ns_offsets)
+ return;
+
+ if (val->tv_sec == 0 && val->tv_nsec == 0)
+ return;
+
+ switch (clockid) {
+ case CLOCK_MONOTONIC:
+ offsets = &ns_offsets->monotonic_time_offset;
+ break;
+ case CLOCK_BOOTTIME:
+ offsets = &ns_offsets->monotonic_boottime_offset;
+ break;
+ }
+
+ if (!offsets)
+ return;
+
+ if (to_ns)
+ *val = timespec64_add(*val, *offsets);
+ else
+ *val = timespec64_sub(*val, *offsets);
+}
+
+void timens_clock_to_host(int clockid, struct timespec64 *val)
+{
+ clock_timens_fixup(clockid, val, false);
+}
+
+void timens_clock_from_host(int clockid, struct timespec64 *val)
+{
+ clock_timens_fixup(clockid, val, true);
+}
+
const struct proc_ns_operations timens_operations = {
.name = "time",
.type = CLONE_NEWTIME,
--
2.13.6
As offsets differ between time namespaces, the vvar mapping for the
timens page needs to be flushed during setns(), unshare() and
clone(CLONE_NEWTIME).
Forcing userspace to mremap() either the whole vvar area or nothing, and
likewise for munmap(), simplifies searching for the timens page to
flush.
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556dbb12..8cc0395687b0 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -84,6 +84,18 @@ static int vdso_mremap(const struct vm_special_mapping *sm,
return 0;
}
+static int vvar_mremap(const struct vm_special_mapping *sm,
+ struct vm_area_struct *new_vma)
+{
+ unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
+ const struct vdso_image *image = current->mm->context.vdso_image;
+
+ if (new_size != -image->sym_vvar_start)
+ return -EINVAL;
+
+ return 0;
+}
+
static int vvar_fault(const struct vm_special_mapping *sm,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
@@ -142,6 +154,7 @@ static const struct vm_special_mapping vdso_mapping = {
static const struct vm_special_mapping vvar_mapping = {
.name = "[vvar]",
.fault = vvar_fault,
+ .mremap = vvar_mremap,
};
/*
--
2.13.6
From: Andrei Vagin <[email protected]>
The vdso computes ts->tv_nsec + offset->tv_nsec.
On 32-bit machines that sum can be larger than (1 << 31) and therefore
result in a negative value, which corrupts the result completely.
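The normalization used by the fix can be sketched in user space. This is
an illustration under assumptions: demo_timespec is a hypothetical type
with 64-bit fields, and the asm() barrier from the kernel version is
left out because it only affects code generation, not the result.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

struct demo_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Fold an arbitrary nanosecond count back into [0, NSEC_PER_SEC). */
static struct demo_timespec demo_normalize(int64_t sec, int64_t nsec)
{
	struct demo_timespec ts;

	while (nsec >= DEMO_NSEC_PER_SEC) {
		nsec -= DEMO_NSEC_PER_SEC;
		++sec;
	}
	while (nsec < 0) {
		nsec += DEMO_NSEC_PER_SEC;
		--sec;
	}
	ts.tv_sec = sec;
	ts.tv_nsec = nsec;
	return ts;
}

/* Add a clock reading and an offset using a 64-bit nanosecond sum. */
static struct demo_timespec demo_add(struct demo_timespec a,
				     struct demo_timespec b)
{
	return demo_normalize(a.tv_sec + b.tv_sec, a.tv_nsec + b.tv_nsec);
}
```

Adding {1, 999999999} and {0, 1} yields {2, 0}. The removed
"ts->tv_nsec > NSEC_PER_SEC" check would have left tv_nsec equal to
NSEC_PER_SEC in that case.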
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 0594266740b9..a265e2737a9a 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -231,20 +231,37 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
}
+notrace void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec)
+{
+ while (nsec >= NSEC_PER_SEC) {
+ /*
+ * The following asm() prevents the compiler from
+ * optimising this loop into a modulo operation. See
+ * also __iter_div_u64_rem() in include/linux/time.h
+ */
+ asm("" : "+rm"(nsec));
+ nsec -= NSEC_PER_SEC;
+ ++sec;
+ }
+ while (nsec < 0) {
+ asm("" : "+rm"(nsec));
+ nsec += NSEC_PER_SEC;
+ --sec;
+ }
+ ts->tv_sec = sec;
+ ts->tv_nsec = nsec;
+}
+
notrace static __always_inline void monotonic_to_ns(struct timespec *ts)
{
#ifdef CONFIG_TIME_NS
struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
+ struct timespec offset;
+
+ offset = timespec64_to_timespec(timens->monotonic_time_offset);
+
+ *ts = timespec_add(*ts, offset);
- ts->tv_sec += timens->monotonic_time_offset.tv_sec;
- ts->tv_nsec += timens->monotonic_time_offset.tv_nsec;
- if (ts->tv_nsec > NSEC_PER_SEC) {
- ts->tv_nsec -= NSEC_PER_SEC;
- ts->tv_sec++;
- } else if (ts->tv_nsec < 0) {
- ts->tv_nsec += NSEC_PER_SEC;
- ts->tv_sec--;
- }
#endif
}
--
2.13.6
From: Andrei Vagin <[email protected]>
Add boottime virtualisation for the time namespace.
Provide a clock_set() implementation to set the boottime clock inside a
namespace.
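The arithmetic behind setting and reading a namespaced clock is small
enough to sketch. The names and the nanosecond representation below are
assumptions made for illustration; the kernel works on timespec64 and
reads the host value with ktime_get_boottime_ts64().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-namespace state: namespace boottime minus host boottime. */
struct demo_boottime_ns {
	int64_t offset_ns;
};

/* Set analogue: remember how far the requested value is from the host. */
static void demo_set_boottime(struct demo_boottime_ns *ns,
			      int64_t host_now_ns, int64_t requested_ns)
{
	ns->offset_ns = requested_ns - host_now_ns;
}

/* Get analogue: host value plus the stored offset. */
static int64_t demo_get_boottime(const struct demo_boottime_ns *ns,
				 int64_t host_now_ns)
{
	return host_now_ns + ns->offset_ns;
}
```

If a restored container asks for 50000 ns while the host clock reads
1000 ns, the stored offset is 49000 ns; once the host reaches 1500 ns
the container reads 50500 ns, so the clock keeps advancing from the
restored value.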
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/timens_offsets.h | 1 +
kernel/time/posix-timers.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 248b0c0bb92a..777530c46852 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -4,6 +4,7 @@
struct timens_offsets {
struct timespec64 monotonic_time_offset;
+ struct timespec64 monotonic_boottime_offset;
};
#endif
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 3c1f98760dec..d38835a21c5d 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -276,7 +276,32 @@ static int posix_get_coarse_res(const clockid_t which_clock, struct timespec64 *
static int posix_get_boottime(const clockid_t which_clock, struct timespec64 *tp)
{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+
ktime_get_boottime_ts64(tp);
+
+ if (!ns_offsets)
+ return 0;
+
+ *tp = timespec64_add(*tp, ns_offsets->monotonic_boottime_offset);
+
+ return 0;
+}
+
+static int posix_set_boottime(clockid_t which_clock, const struct timespec64 *tp)
+{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+ struct timespec64 ktp;
+
+ if (!ns_capable(current->nsproxy->time_ns->user_ns, CAP_SYS_TIME))
+ return -EPERM;
+
+ ktime_get_boottime_ts64(&ktp);
+
+ if (ns_offsets)
+ ns_offsets->monotonic_boottime_offset = timespec64_sub(*tp, ktp);
+ else
+ return -EINVAL;
return 0;
}
@@ -1362,6 +1387,7 @@ static const struct k_clock clock_tai = {
static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_boottime,
+ .clock_set = posix_set_boottime,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
--
2.13.6
From: Andrei Vagin <[email protected]>
Time Namespace isolates clock values.
The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to change the date
and time in a container (CLOCK_REALTIME).
In the context of checkpoint/restore, however, the monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure
time intervals and to set timers. After restoring or migrating
processes, we have to guarantee that they never go backward. Ideally,
these clocks should behave the same way as they do when the whole system
is suspended. All this means that we need to be able to set the
CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks, which can be done by adding
per-namespace offsets for these clocks.
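For readers who want to observe the clocks discussed above, a small
Linux userspace sketch follows. It uses only the standard
clock_gettime() call; the demo_* helper names are hypothetical. Run
inside and outside a time namespace, the CLOCK_MONOTONIC and
CLOCK_BOOTTIME lines are the ones expected to differ by the
per-namespace offsets.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Print one clock; returns 0 on success, -1 if clock_gettime() fails. */
static int demo_print_clock(const char *name, clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts))
		return -1;
	printf("%-16s %lld.%09ld\n", name, (long long)ts.tv_sec,
	       (long)ts.tv_nsec);
	return 0;
}

/* Print the three clocks named in the description; 0 if all succeeded. */
static int demo_print_clocks(void)
{
	int ret = 0;

	ret |= demo_print_clock("CLOCK_REALTIME", CLOCK_REALTIME);
	ret |= demo_print_clock("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
	ret |= demo_print_clock("CLOCK_BOOTTIME", CLOCK_BOOTTIME);
	return ret ? -1 : 0;
}
```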
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
fs/proc/namespaces.c | 3 +
include/linux/nsproxy.h | 1 +
include/linux/proc_ns.h | 1 +
include/linux/time_namespace.h | 59 ++++++++++++++
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 7 ++
kernel/Makefile | 1 +
kernel/fork.c | 3 +-
kernel/nsproxy.c | 19 ++++-
kernel/time_namespace.c | 169 +++++++++++++++++++++++++++++++++++++++++
11 files changed, 262 insertions(+), 3 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 kernel/time_namespace.c
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..faee2facb4f3 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,9 @@ static const struct proc_ns_operations *ns_entries[] = {
#ifdef CONFIG_CGROUPS
&cgroupns_operations,
#endif
+#ifdef CONFIG_TIME_NS
+ &timens_operations,
+#endif
};
static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..5355229c0ce7 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct time_namespace *time_ns;
struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..b97b802ab04d 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,7 @@ extern const struct proc_ns_operations pidns_for_children_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations timens_operations;
/*
* We always define these enumerators
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
new file mode 100644
index 000000000000..bf98f35efe07
--- /dev/null
+++ b/include/linux/time_namespace.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIMENS_H
+#define _LINUX_TIMENS_H
+
+
+#include <linux/sched.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/ns_common.h>
+#include <linux/err.h>
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+struct time_namespace {
+ struct kref kref;
+ struct user_namespace *user_ns;
+ struct ucounts *ucounts;
+ struct ns_common ns;
+} __randomize_layout;
+extern struct time_namespace init_time_ns;
+
+#ifdef CONFIG_TIME_NS
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+ kref_get(&ns->kref);
+ return ns;
+}
+
+extern struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns);
+extern void free_time_ns(struct kref *kref);
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+ kref_put(&ns->kref, free_time_ns);
+}
+
+#else
+static inline void get_time_ns(struct time_namespace *ns)
+{
+}
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+}
+
+static inline struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (flags & CLONE_NEWTIME)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+#endif
+
+#endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d6b74b91096b..bf84f93dc411 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -45,6 +45,7 @@ enum ucount_type {
UCOUNT_NET_NAMESPACES,
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
+ UCOUNT_TIME_NAMESPACES,
#ifdef CONFIG_INOTIFY_USER
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..adffac53c76e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_NEWTIME 0x00001000 /* New time namespace */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..dc2b40f7d73f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -927,6 +927,13 @@ config UTS_NS
In this namespace tasks see different info provided with the
uname() system call
+config TIME_NS
+ bool "TIME namespace"
+ default y
+ help
+ In this namespace boottime and monotonic clocks can be set.
+ The time will keep going at the same pace.
+
config IPC_NS
bool "IPC namespace"
depends on (SYSVIPC || POSIX_MQUEUE)
diff --git a/kernel/Makefile b/kernel/Makefile
index 7a63d567fdb5..bc92feb6987d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
obj-$(CONFIG_UTS_NS) += utsname.o
+obj-$(CONFIG_TIME_NS) += time_namespace.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
obj-$(CONFIG_IKCONFIG) += configs.o
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b58479534f..384f88912b63 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2365,7 +2365,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
+ CLONE_NEWTIME))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..5e482e538365 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,7 @@
#include <linux/pid_namespace.h>
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
+#include <linux/time_namespace.h>
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
@@ -44,6 +45,9 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_CGROUPS
.cgroup_ns = &init_cgroup_ns,
#endif
+#ifdef CONFIG_TIME_NS
+ .time_ns = &init_time_ns,
+#endif
};
static inline struct nsproxy *create_nsproxy(void)
@@ -110,8 +114,16 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_net;
}
+ new_nsp->time_ns = copy_time_ns(flags, user_ns, tsk->nsproxy->time_ns);
+ if (IS_ERR(new_nsp->time_ns)) {
+ err = PTR_ERR(new_nsp->time_ns);
+ goto out_time;
+ }
+
return new_nsp;
+out_time:
+ put_net(new_nsp->net_ns);
out_net:
put_cgroup_ns(new_nsp->cgroup_ns);
out_cgroup:
@@ -143,7 +155,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
- CLONE_NEWCGROUP)))) {
+ CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -180,6 +192,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->time_ns)
+ put_time_ns(ns->time_ns);
put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
@@ -196,7 +210,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
+ CLONE_NEWTIME)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
new file mode 100644
index 000000000000..902cd9c22159
--- /dev/null
+++ b/kernel/time_namespace.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Andrei Vagin <[email protected]>
+ * Author: Dmitry Safonov <[email protected]>
+ */
+
+#include <linux/export.h>
+#include <linux/time.h>
+#include <linux/time_namespace.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/user_namespace.h>
+#include <linux/proc_ns.h>
+#include <linux/sched/task.h>
+
+static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
+{
+ return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
+}
+
+static void dec_time_namespaces(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
+}
+
+static struct time_namespace *create_time_ns(void)
+{
+ struct time_namespace *time_ns;
+
+ time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
+ if (time_ns)
+ kref_init(&time_ns->kref);
+ return time_ns;
+}
+
+/*
+ * Clone a new ns copying an original time namespace, setting refcount to 1
+ * @old_ns: namespace to clone
+ * Return ERR_PTR(-ENOMEM) on error (failure to allocate), new ns otherwise
+ */
+static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
+ struct time_namespace *old_ns)
+{
+ struct time_namespace *ns;
+ struct ucounts *ucounts;
+ int err;
+
+ err = -ENOSPC;
+ ucounts = inc_time_namespaces(user_ns);
+ if (!ucounts)
+ goto fail;
+
+ err = -ENOMEM;
+ ns = create_time_ns();
+ if (!ns)
+ goto fail_dec;
+
+ err = ns_alloc_inum(&ns->ns);
+ if (err)
+ goto fail_free;
+
+ ns->ucounts = ucounts;
+ ns->ns.ops = &timens_operations;
+ ns->user_ns = get_user_ns(user_ns);
+ return ns;
+
+fail_free:
+ kfree(ns);
+fail_dec:
+ dec_time_namespaces(ucounts);
+fail:
+ return ERR_PTR(err);
+}
+
+/*
+ * Copy task tsk's time namespace, or clone it if flags
+ * specifies CLONE_NEWTIME. In the latter case, changes to the
+ * time namespace of this process won't be seen by the parent, and vice
+ * versa.
+ */
+struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWTIME))
+ return get_time_ns(old_ns);
+
+ return clone_time_ns(user_ns, old_ns);
+}
+
+void free_time_ns(struct kref *kref)
+{
+ struct time_namespace *ns;
+
+ ns = container_of(kref, struct time_namespace, kref);
+ dec_time_namespaces(ns->ucounts);
+ put_user_ns(ns->user_ns);
+ ns_free_inum(&ns->ns);
+ kfree(ns);
+}
+
+static inline struct time_namespace *to_time_ns(struct ns_common *ns)
+{
+ return container_of(ns, struct time_namespace, ns);
+}
+
+static struct ns_common *timens_get(struct task_struct *task)
+{
+ struct time_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->time_ns;
+ get_time_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static void timens_put(struct ns_common *ns)
+{
+ put_time_ns(to_time_ns(ns));
+}
+
+static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
+{
+ struct time_namespace *ns = to_time_ns(new);
+
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+ !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
+ get_time_ns(ns);
+ put_time_ns(nsproxy->time_ns);
+ nsproxy->time_ns = ns;
+ return 0;
+}
+
+static struct user_namespace *timens_owner(struct ns_common *ns)
+{
+ return to_time_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations timens_operations = {
+ .name = "time",
+ .type = CLONE_NEWTIME,
+ .get = timens_get,
+ .put = timens_put,
+ .install = timens_install,
+ .owner = timens_owner,
+};
+
+struct time_namespace init_time_ns = {
+ .kref = KREF_INIT(2),
+ .user_ns = &init_user_ns,
+ .ns.inum = PROC_UTS_INIT_INO,
+#ifdef CONFIG_TIME_NS
+ .ns.ops = &timens_operations,
+#endif
+};
+
+static int __init time_ns_init(void)
+{
+ return 0;
+}
+subsys_initcall(time_ns_init);
--
2.13.6
Respect boottime inside time namespace for /proc/uptime
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
fs/proc/uptime.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index a4c2791ab70b..4421ec058472 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/time.h>
+#include <linux/time_namespace.h>
#include <linux/kernel_stat.h>
static int uptime_proc_show(struct seq_file *m, void *v)
@@ -20,6 +21,8 @@ static int uptime_proc_show(struct seq_file *m, void *v)
nsec += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE];
ktime_get_boottime_ts64(&uptime);
+ timens_clock_from_host(CLOCK_BOOTTIME, &uptime);
+
idle.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem);
idle.tv_nsec = rem;
seq_printf(m, "%lu.%02lu %lu.%02lu\n",
--
2.13.6
From: Andrei Vagin <[email protected]>
Introduce offsets for the time namespace. They will contain the
adjustments needed to convert clocks to and from the host's.
Allocate one page for each time namespace that will be premapped into
userspace with vvar pages.
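A userspace sketch of the allocation pattern, under stated assumptions:
a 4 KiB page size, and a made-up struct timens_offsets_sketch with two
fields (the real struct starts out empty in this patch and grows later
in the series). The point is that memory is mapped into userspace at
page granularity, so the offsets get a whole zero-filled page of their
own even though the struct is tiny.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Hypothetical layout, for illustration only. */
struct timens_offsets_sketch {
	struct timespec monotonic_time_offset;
	struct timespec boottime_offset;
};

/* Userspace stand-in for BUILD_BUG_ON(sizeof(*ns->offsets) > PAGE_SIZE). */
_Static_assert(sizeof(struct timens_offsets_sketch) <= SKETCH_PAGE_SIZE,
	       "offsets must fit in a single page");

/*
 * Stand-in for alloc_page(GFP_KERNEL | __GFP_ZERO): one page-aligned,
 * zero-filled page per namespace. Returns NULL on allocation failure.
 */
static struct timens_offsets_sketch *alloc_offsets_page(void)
{
	void *page = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);

	if (!page)
		return NULL;
	assert(((uintptr_t)page % SKETCH_PAGE_SIZE) == 0);
	memset(page, 0, SKETCH_PAGE_SIZE);
	return page;
}

/* Stand-in for free_page() in free_time_ns(). */
static void free_offsets_page(struct timens_offsets_sketch *offsets)
{
	free(offsets);
}
```

A zero-filled page means a freshly created namespace starts with all
offsets at zero, that is, with the same clock readings as the host.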
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/time_namespace.h | 2 ++
include/linux/timens_offsets.h | 8 ++++++++
kernel/time_namespace.c | 14 ++++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)
create mode 100644 include/linux/timens_offsets.h
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index bf98f35efe07..4960c54f1b33 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -8,6 +8,7 @@
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
#include <linux/err.h>
+#include <linux/timens_offsets.h>
struct user_namespace;
extern struct user_namespace init_user_ns;
@@ -17,6 +18,7 @@ struct time_namespace {
struct user_namespace *user_ns;
struct ucounts *ucounts;
struct ns_common ns;
+ struct timens_offsets *offsets;
} __randomize_layout;
extern struct time_namespace init_time_ns;
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
new file mode 100644
index 000000000000..7d7cb68ea778
--- /dev/null
+++ b/include/linux/timens_offsets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIME_OFFSETS_H
+#define _LINUX_TIME_OFFSETS_H
+
+struct timens_offsets {
+};
+
+#endif
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 902cd9c22159..a985529754b4 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -13,6 +13,7 @@
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
#include <linux/sched/task.h>
+#include <linux/mm.h>
static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
{
@@ -44,6 +45,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
{
struct time_namespace *ns;
struct ucounts *ucounts;
+ struct page *page;
int err;
err = -ENOSPC;
@@ -56,15 +58,22 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
if (!ns)
goto fail_dec;
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ goto fail_free;
+ ns->offsets = page_address(page);
+ BUILD_BUG_ON(sizeof(*ns->offsets) > PAGE_SIZE);
+
err = ns_alloc_inum(&ns->ns);
if (err)
- goto fail_free;
+ goto fail_page;
ns->ucounts = ucounts;
ns->ns.ops = &timens_operations;
ns->user_ns = get_user_ns(user_ns);
return ns;
-
+fail_page:
+ free_page((unsigned long)ns->offsets);
fail_free:
kfree(ns);
fail_dec:
@@ -93,6 +102,7 @@ void free_time_ns(struct kref *kref)
struct time_namespace *ns;
ns = container_of(kref, struct time_namespace, kref);
+ free_page((unsigned long)ns->offsets);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
--
2.13.6
From: Andrei Vagin <[email protected]>
Add monotonic time virtualisation for the time namespace.
Provide a clock_set() API to set monotonic time inside the namespace.
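A simplified userspace model of the dispatch this patch relies on, not
the kernel code itself. Each clock has a table of handlers, a clock
without a set handler rejects the request with -EINVAL, and setting the
monotonic clock only records an offset against the host reading. All
names here (clock_ops_sketch, fake_host_monotonic, readonly_ops and so
on) are hypothetical, times are plain nanosecond counts, and the
CAP_SYS_TIME check done by the patch is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long long ns_t; /* simplified: nanoseconds instead of timespec64 */

struct clock_ops_sketch {
	int (*clock_get)(ns_t *out);
	int (*clock_set)(ns_t requested); /* NULL: clock cannot be set */
};

static ns_t fake_host_monotonic = 5000; /* pretend host clock reading */
static ns_t ns_monotonic_offset;        /* per-namespace offset, starts at 0 */

/* Reading the clock: host value plus the namespace offset. */
static int monotonic_get(ns_t *out)
{
	assert(out != NULL);
	*out = fake_host_monotonic + ns_monotonic_offset;
	return 0;
}

/* Setting the clock only records an offset; the host clock is untouched. */
static int monotonic_set(ns_t requested)
{
	ns_monotonic_offset = requested - fake_host_monotonic;
	return 0;
}

static int readonly_get(ns_t *out)
{
	assert(out != NULL);
	*out = fake_host_monotonic;
	return 0;
}

static const struct clock_ops_sketch monotonic_ops = {
	.clock_get = monotonic_get,
	.clock_set = monotonic_set,
};

static const struct clock_ops_sketch readonly_ops = {
	.clock_get = readonly_get,
	/* no .clock_set handler */
};

/* Dispatch in the style of clock_settime(): a missing handler is -EINVAL. */
static int sketch_clock_settime(const struct clock_ops_sketch *ops, ns_t t)
{
	if (!ops || !ops->clock_set)
		return -EINVAL;
	return ops->clock_set(t);
}
```

In this model, setting the monotonic clock to 100000 while the host
reads 5000 stores an offset of 95000. Once the host advances by 250, a
read inside the namespace returns 100250.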
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/timens_offsets.h | 1 +
kernel/time/posix-timers.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 7d7cb68ea778..248b0c0bb92a 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -3,6 +3,7 @@
#define _LINUX_TIME_OFFSETS_H
struct timens_offsets {
+ struct timespec64 monotonic_time_offset;
};
#endif
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 4b9127e95430..3c1f98760dec 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -51,6 +51,7 @@
#include <linux/hashtable.h>
#include <linux/compat.h>
#include <linux/nospec.h>
+#include <linux/time_namespace.h>
#include "timekeeping.h"
#include "posix-timers.h"
@@ -205,12 +206,40 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
return do_adjtimex(t);
}
+static void timens_adjust_monotonic(struct timespec64 *tp)
+{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+
+ if (ns_offsets)
+ *tp = timespec64_add(*tp, ns_offsets->monotonic_time_offset);
+}
+
+static int posix_ktime_set_ts(clockid_t which_clock,
+ const struct timespec64 *tp)
+{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+ struct timespec64 ktp;
+
+ if (!ns_capable(current->nsproxy->time_ns->user_ns, CAP_SYS_TIME))
+ return -EPERM;
+
+ ktime_get_ts64(&ktp);
+
+ if (ns_offsets)
+ ns_offsets->monotonic_time_offset = timespec64_sub(*tp, ktp);
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
/*
* Get monotonic time for posix timers
*/
static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_ts64(tp);
+ timens_adjust_monotonic(tp);
return 0;
}
@@ -220,6 +249,7 @@ static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
static int posix_get_monotonic_raw(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_raw_ts64(tp);
+ timens_adjust_monotonic(tp);
return 0;
}
@@ -234,6 +264,7 @@ static int posix_get_monotonic_coarse(clockid_t which_clock,
struct timespec64 *tp)
{
ktime_get_coarse_ts64(tp);
+ timens_adjust_monotonic(tp);
return 0;
}
@@ -1283,6 +1314,7 @@ static const struct k_clock clock_realtime = {
static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_ktime_get_ts,
+ .clock_set = posix_ktime_set_ts,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1298,6 +1330,7 @@ static const struct k_clock clock_monotonic = {
static const struct k_clock clock_monotonic_raw = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_monotonic_raw,
+ .clock_set = posix_ktime_set_ts,
};
static const struct k_clock clock_realtime_coarse = {
@@ -1308,6 +1341,7 @@ static const struct k_clock clock_realtime_coarse = {
static const struct k_clock clock_monotonic_coarse = {
.clock_getres = posix_get_coarse_res,
.clock_get = posix_get_monotonic_coarse,
+ .clock_set = posix_ktime_set_ts,
};
static const struct k_clock clock_tai = {
--
2.13.6
On Wed, Sep 19, 2018 at 09:50:19PM +0100, Dmitry Safonov wrote:
> From: Andrei Vagin <[email protected]>
>
> Introduce offsets for time namespace. They will contain adjustment
> needed to convert clocks to/from host's.
>
> Allocate one page for each time namespace that will be premapped into
> userspace with vvar pages.
Isn't it too much?! A whole page for each clone(new-time-ns) call.
Moreover, every time it gets explicitly zeroed. Don't get me wrong,
maybe I'm missing something obvious, but an additional 4K per process, guys :)
On Thu, Sep 20, 2018 at 09:45:10PM +0300, Cyrill Gorcunov wrote:
> On Wed, Sep 19, 2018 at 09:50:19PM +0100, Dmitry Safonov wrote:
> > From: Andrei Vagin <[email protected]>
> >
> > Introduce offsets for time namespace. They will contain adjustment
> > needed to convert clocks to/from host's.
> >
> > Allocate one page for each time namespace that will be premapped into
> > userspace with vvar pages.
>
> Is not it too much?! The whole page per each clone(new-time-ns) call.
> Moreover everytime it is get explicitly zeroifyed. Don't get me wrong,
> maybe I miss something obvious, but additional 4K per process, guys :)
After talking to Andrew I think there is no better option, though.
If syscalls were free we could of course use them instead, but this
vdso stuff, sigh. I thought about modifying the vdso code so it would carry
refs inside (or adding some section into the elf loader kernel code), but
all this would simply mess up the code. Thus this 4K per namespace seems
to be an acceptable trade-off.
Dmitry Safonov <[email protected]> writes:
> Discussions around time virtualization are there for a long time.
> The first attempt to implement time namespace was in 2006 by Jeff Dike.
> From that time, the topic appears on and off in various discussions.
>
> There are two main use cases for time namespaces:
> 1. change date and time inside a container;
> 2. adjust clocks for a container restored from a checkpoint.
>
> “It seems like this might be one of the last major obstacles keeping
> migration from being used in production systems, given that not all
> containers and connections can be migrated as long as a time dependency
> is capable of messing it up.” (by github.com/dav-ell)
>
> The kernel provides access to several clocks: CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> start points for them are not defined and are different for each running
> system. When a container is migrated from one node to another, all
> clocks have to be restored into consistent states; in other words, they
> have to continue running from the same points where they have been
> dumped.
>
> The main idea behind this patch set is adding per-namespace offsets for
> system clocks. When a process in a non-root time namespace requests
> time of a clock, a namespace offset is added to the current value of
> this clock on a host and the sum is returned.
>
> All offsets are placed on a separate page, this allows up to map it as
> part of vvar into user processes and use offsets from vdso calls.
>
> Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> clocks.
>
> Questions to discuss:
>
> * Clone flags exhaustion. Currently there is only one unused clone flag
> bit left, and it may be worth to use it to extend arguments of the clone
> system call.
>
> * Realtime clock implementation details:
> Is having a simple offset enough?
> What to do when date and time is changed on the host?
> Is there a need to adjust vfs modification and creation times?
> Implementation for adjtime() syscall.
Overall I support this effort. In my quick skim this code looked good.
My feeling is that we need to be able to support running ntpd and
support one namespace doing Google's smoothing of leap seconds while
another namespace takes the leap second.
What I was imagining when I was last thinking about this was one
instance of struct timekeeper aka tk_core per time namespace. That
structure already keeps offsets for all of the various clocks from
the kernel's internal time sources. What would be needed would be to
pass in an appropriate time namespace pointer.
I could be completely wrong as I have not taken the time to completely
trace through the code. Have you looked at pushing the time namespace
down as far as tk_core?
What I think would be the big advantage (besides ntp working) is that
the bulk of the code could be reused, allowing testing of the kernel's
time code by setting up a new time namespace. So a person in production
could set up a time namespace with the time set ahead a little bit and
be able to verify that the kernel handles the upcoming leap second
properly.
I don't know about the vfs. I think the danger is being able to write
dates in the future or in the past. It appears that utimes(2) and
utimensat(2) already allow this except for status change. So it is
possible we simply don't care. I seem to remember that what nfs does
is take the time stamp from the host writing to the file.
I think the guide for filesystem timestamps should be to first ensure
we don't introduce security issues, and then do what distributed
filesystems do when dealing with hosts with different clocks.
Given those two guidelines above I don't think there is a need to
change timestamps the way the user namespace changes uid when displayed.
As for the hardware like the real time clock we definitely should not
let a root in a time namespace change it. We might even be able to get
away with leaving the real time clock out of the time namespace. If not
we need to be very careful how the real time clock is abstracted. I
would start by leaving the real time clock hardware out of the time
namespace and see if there is any part of userspace that cares.
Eric
> Cc: Dmitry Safonov <[email protected]>
> Cc: Adrian Reber <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jeff Dike <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Pavel Emelyanov <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> Andrei Vagin (12):
> ns: Introduce Time Namespace
> timens: Add timens_offsets
> timens: Introduce CLOCK_MONOTONIC offsets
> timens: Introduce CLOCK_BOOTTIME offset
> timerfd/timens: Take into account ns clock offsets
> kernel: Take into account timens clock offsets in clock_nanosleep
> x86/vdso/timens: Add offsets page in vvar
> x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
> posix-timers/timens: Take into account clock offsets
> selftest/timens: Add test for timerfd
> selftest/timens: Add test for clock_nanosleep
> timens/selftest: Add timer offsets test
>
> Dmitry Safonov (8):
> timens: Shift /proc/uptime
> x86/vdso: Restrict splitting vvar vma
> x86/vdso: Purge timens page on setns()/unshare()/clone()
> x86/vdso: Look for vvar vma to purge timens page
> timens: Add align for timens_offsets
> timens: Optimize zero-offsets
> selftest: Add Time Namespace test for supported clocks
> timens/selftest: Add procfs selftest
>
> arch/Kconfig | 5 +
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vdso/vclock_gettime.c | 52 +++++
> arch/x86/entry/vdso/vdso-layout.lds.S | 9 +-
> arch/x86/entry/vdso/vdso2c.c | 3 +
> arch/x86/entry/vdso/vma.c | 67 +++++++
> arch/x86/include/asm/vdso.h | 2 +
> fs/proc/namespaces.c | 3 +
> fs/proc/uptime.c | 3 +
> fs/timerfd.c | 16 +-
> include/linux/nsproxy.h | 1 +
> include/linux/proc_ns.h | 1 +
> include/linux/time_namespace.h | 72 +++++++
> include/linux/timens_offsets.h | 25 +++
> include/linux/user_namespace.h | 1 +
> include/uapi/linux/sched.h | 1 +
> init/Kconfig | 8 +
> kernel/Makefile | 1 +
> kernel/fork.c | 3 +-
> kernel/nsproxy.c | 19 +-
> kernel/time/hrtimer.c | 8 +
> kernel/time/posix-timers.c | 89 ++++++++-
> kernel/time/posix-timers.h | 2 +
> kernel/time_namespace.c | 230 +++++++++++++++++++++++
> tools/testing/selftests/timens/.gitignore | 5 +
> tools/testing/selftests/timens/Makefile | 6 +
> tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++
> tools/testing/selftests/timens/config | 1 +
> tools/testing/selftests/timens/log.h | 21 +++
> tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++
> tools/testing/selftests/timens/timens.c | 196 +++++++++++++++++++
> tools/testing/selftests/timens/timer.c | 95 ++++++++++
> tools/testing/selftests/timens/timerfd.c | 96 ++++++++++
> 33 files changed, 1272 insertions(+), 13 deletions(-)
> create mode 100644 include/linux/time_namespace.h
> create mode 100644 include/linux/timens_offsets.h
> create mode 100644 kernel/time_namespace.c
> create mode 100644 tools/testing/selftests/timens/.gitignore
> create mode 100644 tools/testing/selftests/timens/Makefile
> create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
> create mode 100644 tools/testing/selftests/timens/config
> create mode 100644 tools/testing/selftests/timens/log.h
> create mode 100644 tools/testing/selftests/timens/procfs.c
> create mode 100644 tools/testing/selftests/timens/timens.c
> create mode 100644 tools/testing/selftests/timens/timer.c
> create mode 100644 tools/testing/selftests/timens/timerfd.c
On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> Dmitry Safonov <[email protected]> writes:
>
> > Discussions around time virtualization are there for a long time.
> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> > From that time, the topic appears on and off in various discussions.
> >
> > There are two main use cases for time namespaces:
> > 1. change date and time inside a container;
> > 2. adjust clocks for a container restored from a checkpoint.
> >
> > “It seems like this might be one of the last major obstacles keeping
> > migration from being used in production systems, given that not all
> > containers and connections can be migrated as long as a time dependency
> > is capable of messing it up.” (by github.com/dav-ell)
> >
> > The kernel provides access to several clocks: CLOCK_REALTIME,
> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
> > start points for them are not defined and are different for each running
> > system. When a container is migrated from one node to another, all
> > clocks have to be restored into consistent states; in other words, they
> > have to continue running from the same points where they have been
> > dumped.
> >
> > The main idea behind this patch set is adding per-namespace offsets for
> > system clocks. When a process in a non-root time namespace requests
> > time of a clock, a namespace offset is added to the current value of
> > this clock on a host and the sum is returned.
> >
> > All offsets are placed on a separate page, this allows up to map it as
> > part of vvar into user processes and use offsets from vdso calls.
> >
> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> > clocks.
> >
> > Questions to discuss:
> >
> > * Clone flags exhaustion. Currently there is only one unused clone flag
> > bit left, and it may be worth to use it to extend arguments of the clone
> > system call.
> >
> > * Realtime clock implementation details:
> > Is having a simple offset enough?
> > What to do when date and time is changed on the host?
> > Is there a need to adjust vfs modification and creation times?
> > Implementation for adjtime() syscall.
>
> Overall I support this effort. In my quick skim this code looked good.
Hi Eric,
Thank you for the feedback.
>
> My feeling is that we need to be able to support running ntpd and
> support one namespace doing Google's smoothing of leap seconds while
> another namespace takes the leap second.
>
> What I was imagining when I was last thinking about this was one
> instance of struct timekeeper aka tk_core per time namespace. That
> structure already keeps offsets for all of the various clocks from
> the kernel's internal time sources. What would be needed would be to
> pass in an appropriate time namespace pointer.
>
> I could be completely wrong as I have not taken the time to completely
> trace through the code. Have you looked at pushing the time namespace
> down as far as tk_core?
>
> What I think would be the big advantage (besides ntp working) is that
> the bulk of the code could be reused. Allowing testing of the kernel's
> time code by setting up a new time namespace. So a person in production
> could setup a time namespace with the time set ahead a little bit and
> be able to verify that the kernel handles the upcoming leap second
> properly.
>
It is an interesting idea, but I have a few questions:
1. Does it mean that timekeeping_update() will be called for each
namespace? This function is called periodically; it updates times in the
timekeeper structure, updates vsyscall_gtod_data, etc. What will the
overhead of this be?
2. What will we do with vdso? It looks like we will have to have a
separate vsyscall_gtod_data for each ns and update each of them
separately.
>
>
> I don't know about the vfs. I think the danger is being able to write
> dates in the future or in the past. It appears that utimes(2) and
> utimensat(2) already allow this except for status change. So it is
> possible we simply don't care. I seem to remember that what nfs does
> is take the time stamp from the host writing to the file.
>
> I think the guide for filesystem timestamps should be to first ensure
> we don't introduce security issues, and then do what distributed
> filesystems do when dealing with hosts with different clocks.
>
> Given those two guidelines above I don't think there is a need to
> change timestamps the way the user namespace changes uid when displayed.
>
>
>
> As for the hardware like the real time clock we definitely should not
> let a root in a time namespace change it. We might even be able to get
> away with leaving the real time clock out of the time namespace. If not
> we need to be very careful how the real time clock is abstracted. I
> would start by leaving the real time clock hardware out of the time
> namespace and see if there is any part of userspace that cares.
>
> Eric
>
> > Cc: Dmitry Safonov <[email protected]>
> > Cc: Adrian Reber <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Cyrill Gorcunov <[email protected]>
> > Cc: "Eric W. Biederman" <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Jeff Dike <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Pavel Emelyanov <[email protected]>
> > Cc: Shuah Khan <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> >
> > Andrei Vagin (12):
> > ns: Introduce Time Namespace
> > timens: Add timens_offsets
> > timens: Introduce CLOCK_MONOTONIC offsets
> > timens: Introduce CLOCK_BOOTTIME offset
> > timerfd/timens: Take into account ns clock offsets
> > kernel: Take into account timens clock offsets in clock_nanosleep
> > x86/vdso/timens: Add offsets page in vvar
> > x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow
> > posix-timers/timens: Take into account clock offsets
> > selftest/timens: Add test for timerfd
> > selftest/timens: Add test for clock_nanosleep
> > timens/selftest: Add timer offsets test
> >
> > Dmitry Safonov (8):
> > timens: Shift /proc/uptime
> > x86/vdso: Restrict splitting vvar vma
> > x86/vdso: Purge timens page on setns()/unshare()/clone()
> > x86/vdso: Look for vvar vma to purge timens page
> > timens: Add align for timens_offsets
> > timens: Optimize zero-offsets
> > selftest: Add Time Namespace test for supported clocks
> > timens/selftest: Add procfs selftest
> >
> > arch/Kconfig | 5 +
> > arch/x86/Kconfig | 1 +
> > arch/x86/entry/vdso/vclock_gettime.c | 52 +++++
> > arch/x86/entry/vdso/vdso-layout.lds.S | 9 +-
> > arch/x86/entry/vdso/vdso2c.c | 3 +
> > arch/x86/entry/vdso/vma.c | 67 +++++++
> > arch/x86/include/asm/vdso.h | 2 +
> > fs/proc/namespaces.c | 3 +
> > fs/proc/uptime.c | 3 +
> > fs/timerfd.c | 16 +-
> > include/linux/nsproxy.h | 1 +
> > include/linux/proc_ns.h | 1 +
> > include/linux/time_namespace.h | 72 +++++++
> > include/linux/timens_offsets.h | 25 +++
> > include/linux/user_namespace.h | 1 +
> > include/uapi/linux/sched.h | 1 +
> > init/Kconfig | 8 +
> > kernel/Makefile | 1 +
> > kernel/fork.c | 3 +-
> > kernel/nsproxy.c | 19 +-
> > kernel/time/hrtimer.c | 8 +
> > kernel/time/posix-timers.c | 89 ++++++++-
> > kernel/time/posix-timers.h | 2 +
> > kernel/time_namespace.c | 230 +++++++++++++++++++++++
> > tools/testing/selftests/timens/.gitignore | 5 +
> > tools/testing/selftests/timens/Makefile | 6 +
> > tools/testing/selftests/timens/clock_nanosleep.c | 98 ++++++++++
> > tools/testing/selftests/timens/config | 1 +
> > tools/testing/selftests/timens/log.h | 21 +++
> > tools/testing/selftests/timens/procfs.c | 145 ++++++++++++++
> > tools/testing/selftests/timens/timens.c | 196 +++++++++++++++++++
> > tools/testing/selftests/timens/timer.c | 95 ++++++++++
> > tools/testing/selftests/timens/timerfd.c | 96 ++++++++++
> > 33 files changed, 1272 insertions(+), 13 deletions(-)
> > create mode 100644 include/linux/time_namespace.h
> > create mode 100644 include/linux/timens_offsets.h
> > create mode 100644 kernel/time_namespace.c
> > create mode 100644 tools/testing/selftests/timens/.gitignore
> > create mode 100644 tools/testing/selftests/timens/Makefile
> > create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
> > create mode 100644 tools/testing/selftests/timens/config
> > create mode 100644 tools/testing/selftests/timens/log.h
> > create mode 100644 tools/testing/selftests/timens/procfs.c
> > create mode 100644 tools/testing/selftests/timens/timens.c
> > create mode 100644 tools/testing/selftests/timens/timer.c
> > create mode 100644 tools/testing/selftests/timens/timerfd.c
Hi Dmitry,
Thanks for adding tests with the kernel changes.
On 09/19/2018 02:50 PM, Dmitry Safonov wrote:
> This test checks that all supported clocks can be changed by
> clock_settime.
It would be good to elaborate a bit more on the nature of the tests
here. Also a few things to consider.
I noticed that this test isn't added to selftests/Makefile as TARGET. If it is
an oversight, please make that change as well. If not, it is fine.
Please make sure if the test can't be run because of unmet dependencies, the test
will exit with KSFT_SKIP as opposed to an error. Dependencies include configuration,
privilege, any other unsupported conditions.
This comment applies to all the test patches in this series.
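As an illustration of the skip handling requested here, a minimal sketch of mapping a setup failure to a kselftest exit code. The KSFT_* values match tools/testing/selftests/kselftest.h and are redefined only to keep the fragment self-contained; the helper name ksft_code_for_setup_errno() and the choice of which errno values count as unmet dependencies are assumptions.

```c
#include <assert.h>
#include <errno.h>

/* Exit codes as defined in tools/testing/selftests/kselftest.h. */
#define KSFT_PASS 0
#define KSFT_FAIL 1
#define KSFT_SKIP 4

/*
 * Hypothetical helper: decide the test exit code from the errno left
 * behind by a failed setup step (for example, creating the namespace).
 */
static int ksft_code_for_setup_errno(int err)
{
    assert(err >= 0); /* errno values are never negative */

    switch (err) {
    case 0:
        return KSFT_PASS;
    case EINVAL: /* kernel does not know the namespace flag */
    case ENOSYS: /* syscall or feature not built in */
    case EPERM:  /* missing privilege */
    case EACCES:
        return KSFT_SKIP;
    default:
        return KSFT_FAIL;
    }
}
```

A test would call this right after the failing setup call, for instance `return ksft_code_for_setup_errno(errno);`, so that an unsupported kernel or an unprivileged run is reported as a skip and not as a failure.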
thanks,
-- Shuah
Andrey Vagin <[email protected]> writes:
> On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
>> Dmitry Safonov <[email protected]> writes:
>>
>>
>> Overall I support this effort. In my quick skim this code looked good.
>
> Hi Eric,
>
> Thank you for the feedback.
>
>>
>> My feeling is that we need to be able to support running ntpd and
>> support one namespace doing Google's smoothing of leap seconds while
>> another namespace takes the leap second.
>>
>> What I was imagining when I was last thinking about this was one
>> instance of struct timekeeper aka tk_core per time namespace. That
>> structure already keeps offsets for all of the various clocks from
>> the kernel internal time sources. What would be needed would be to
>> pass in an appropriate time namespace pointer.
>>
>> I could be completely wrong as I have not taken the time to completely
>> trace through the code. Have you looked at pushing the time namespace
>> down as far as tk_core?
>>
>> What I think would be the big advantage (besides ntp working) is that
>> the bulk of the code could be reused. Allowing testing of the kernel's
>> time code by setting up a new time namespace. So a person in production
>> could setup a time namespace with the time set ahead a little bit and
>> be able to verify that the kernel handles the upcoming leap second
>> properly.
>>
>
> It is an interesting idea, but I have a few questions:
>
> 1. Does it mean that timekeeping_update() will be called for each
> namespace? This function is called periodically, it updates times on the
> timekeeper structure, updates vsyscall_gtod_data, etc. What will be an
> overhead of this?
I don't know if periodically is a proper characterization. There may be
a code path that does that. But from what I can see timekeeping_update
is the guts of settimeofday (and a few related functions).
So it appears to make sense for timekeeping_update to be per namespace.
Hmm. Looking at what is updated in the vsyscall_gtod_data it does
look like you would have to periodically update things, but I don't know
how big that period would be. As long as the period is reasonably large,
or the time namespaces were sufficiently desynchronized it should not
be a problem. But that is the class of problem that could make
my ideal impractical if there is measurable overhead.
Where were you seeing timekeeping_update being called periodically?
> 2. What will we do with vdso? It looks like we will have to have a
> separate vsyscall_gtod_data for each ns and update each of them
> separately.
Yes. But you don't have to introduce another variable; just make
certain vsyscall_gtod_data is a page-aligned thing per time namespace.
If I read the summary of the existing patchset something very similar
is already going on.
Each process would only map one. And unshare of the time namespace
would need to act like the pid namespace or be limited to only being
allowed when there is only a single task using the mm.
Eric
On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> I don't know if periodically is a proper characterization. There may be
> a code path that does that. But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
>
> So it appears to make sense for timekeeping_update to be per namespace.
>
> Hmm. Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> how big that period would be. As long as the period is reasonably large,
> or the time namespaces were sufficiently desynchronized it should not
> be a problem. But that is the class of problem that could make
> my ideal impractical if there is measurable overhead.
>
> Where were you seeing timekeeping_update being called periodically?
timekeeping_update() is called HZ times per-second:
[ 67.912858] timekeeping_update.cold.26+0x5/0xa
[ 67.913332] timekeeping_advance+0x361/0x5c0
[ 67.913857] ? tick_sched_do_timer+0x55/0x70
[ 67.914409] ? tick_sched_do_timer+0x70/0x70
[ 67.914947] tick_sched_do_timer+0x55/0x70
[ 67.915505] tick_sched_timer+0x27/0x70
[ 67.916042] __hrtimer_run_queues+0x10f/0x440
[ 67.916639] hrtimer_interrupt+0x100/0x220
[ 67.917305] smp_apic_timer_interrupt+0x79/0x220
[ 67.918030] apic_timer_interrupt+0xf/0x20
>
> > 2. What will we do with vdso? It looks like we will have to have a
> > separate vsyscall_gtod_data for each ns and update each of them
> > separately.
>
> Yes. But you don't have to introduce another variable; just make
> certain vsyscall_gtod_data is a page-aligned thing per time namespace.
>
> If I read the summary of the existing patchset something very similar
> is already going on.
I mean vsyscall_gtod_data has some data which is updated often. There
are timestamps for the monotonic and wall clocks. clock_gettime() reads a
timestamp from vsyscall_gtod_data and then uses the TSC to approximate the
current value of a clock.
Actually, this is not the second question, it is a part of the first
question. update_vsyscall() is called from timekeeping_update().
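To make the cost concrete, a simplified sketch of the kind of read described here: take the timestamp published at the last update, add the cycles elapsed since then scaled to nanoseconds, and optionally add a per-namespace offset. The struct gtod_snapshot layout, the function name and the ns_offset parameter are assumptions for illustration; the real vdso code also uses a sequence counter and keeps shifted nanoseconds, which this sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the data published to the vdso. */
struct gtod_snapshot {
    uint64_t cycle_last; /* counter value at the last update */
    uint64_t base_ns;    /* clock value in ns at cycle_last */
    uint32_t mult;       /* ns per cycle, scaled by 2^shift */
    uint32_t shift;
};

/*
 * Compute the current clock value from a snapshot and a fresh counter
 * reading, then apply a (possibly negative) namespace offset.
 */
static uint64_t gtod_read_ns(const struct gtod_snapshot *g,
                             uint64_t cycles_now, int64_t ns_offset)
{
    uint64_t delta = cycles_now - g->cycle_last;
    uint64_t ns;

    assert(g->shift < 64);
    ns = g->base_ns + ((delta * g->mult) >> g->shift);

    /* Unsigned wrap-around makes this correct for negative offsets too. */
    return ns + (uint64_t)ns_offset;
}
```

With a fixed offset, only base_ns and cycle_last need refreshing, and they are shared by all namespaces. A per-namespace timekeeper would need its own snapshot refreshed on every update, which is the overhead the question is about.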
>
> Each process would only map one. And unshare of the time namespace
> would need to act like the pid namespace or be limited to only being
> allowed when there is only a single task using the mm.
>
> Eric
Andrey Vagin <[email protected]> writes:
> timekeeping_update() is called HZ times per-second:
>
> [ 67.912858] timekeeping_update.cold.26+0x5/0xa
> [ 67.913332] timekeeping_advance+0x361/0x5c0
> [ 67.913857] ? tick_sched_do_timer+0x55/0x70
> [ 67.914409] ? tick_sched_do_timer+0x70/0x70
> [ 67.914947] tick_sched_do_timer+0x55/0x70
> [ 67.915505] tick_sched_timer+0x27/0x70
> [ 67.916042] __hrtimer_run_queues+0x10f/0x440
> [ 67.916639] hrtimer_interrupt+0x100/0x220
> [ 67.917305] smp_apic_timer_interrupt+0x79/0x220
> [ 67.918030] apic_timer_interrupt+0xf/0x20
Interesting.
Reading the code the calling sequence there is:
  tick_sched_do_timer
    tick_do_update_jiffies64
      update_wall_time
        timekeeping_advance
          timekeeping_update
If I read that properly under the right nohz circumstances that update
can be delayed indefinitely.
So I think we could prototype a time namespace that was per
timekeeping_update and just had update_wall_time iterate through
all of the time namespaces.
I don't think the naive version would scale to very many time
namespaces.
At the same time using the techniques from the nohz work and a little
smarts I expect we could get the code to scale.
I think this direction is definitely worth exploring. My experience
with namespaces is that if we don't get the advanced features working
there is little to no interest from the core developers of the code,
and the namespaces don't solve additional problems. Which makes the
namespace a hard sell. Especially when it does not solve problems the
developers of the subsystem have.
The advantage of timekeeping_update per time namespace is that it allows
different lengths of seconds per time namespace. Which allows testing
ntp and the kernel in interesting ways while still having a working
production configuration on the same system.
Eric
2018-09-26 18:36 GMT+01:00 Eric W. Biederman <[email protected]>:
> The advantage of timekeeping_update per time namespace is that it allows
> different lengths of seconds per time namespace. Which allows testing
> ntp and the kernel in interesting ways while still having a working
> production configuration on the same system.
Just a quick note: the different length of second per namespace sounds
very interesting from my POV. I remember seeing this article:
http://publish.illinois.edu/science-of-security-lablet/files/2014/05/DSSnet-A-Smart-Grid-Modeling-Platform-Combining-Electrical-Power-Distributtion-System-Simulation-and-Software-Defined-Networking-Emulation.pdf
And their implementation, which simulates time running at a different speed
per pid (with the vdso disabled):
https://github.com/littlepretty/VirtualTimeKernel
Thanks,
Dmitry
On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> Reading the code the calling sequence there is:
> tick_sched_do_timer
> tick_do_update_jiffies64
> update_wall_time
> timekeeping_advance
> timekeeping_update
>
> If I read that properly under the right nohz circumstances that update
> can be delayed indefinitely.
>
> So I think we could prototype a time namespace that was per
> timekeeping_update and just had update_wall_time iterate through
> all of the time namespaces.
Please don't go there. timekeeping_update() is already heavy and walking
through a gazillion namespaces will just make it horrible.
> I don't think the naive version would scale to very many time
> namespaces.
:)
> At the same time using the techniques from the nohz work and a little
> smarts I expect we could get the code to scale.
You'd need to invoke the update when the namespace is switched in and
hasn't been updated since the last tick happened. That might be doable, but
you also need to take the wraparound constraints of the underlying
clocksources into account, which again can cause walking all name spaces
when they are all idle long enough.
From there it becomes hairy, because it's not only timekeeping,
i.e. reading time, this is also affecting all timers which are armed from a
namespace.
That gets really ugly because when you do settimeofday() or adjtimex() for
a particular namespace, then you have to search for all armed timers of
that namespace and adjust them.
The original posix timer code had the same issue because it mapped the
CLOCK_REALTIME timers to the timer wheel, so any setting of the clock caused
a full walk of all armed timers, disarming, adjusting and requeueing
them. That's horrible not only performance wise, it's also a locking
nightmare of all sorts.
Add time skew via NTP/PTP into the picture and you might have to adjust
timers as well, because you need to guarantee that they are not expiring
early.
I haven't looked through Dmitry's patches yet, but I don't see how this
can work at all without introducing subtle issues all over the place.
Thanks,
tglx
On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dmitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.
And just a quick scan tells me that this is broken. Timers will expire
early or late. The latter is acceptable to some extent, but larger delays
might come as a surprise. Expiring early is an absolute no-no.
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> Reading the code the calling sequence there is:
>> tick_sched_do_timer
>> tick_do_update_jiffies64
>> update_wall_time
>> timekeeping_advance
>> timekeeping_update
>>
>> If I read that properly under the right nohz circumstances that update
>> can be delayed indefinitely.
>>
>> So I think we could prototype a time namespace that was per
>> timekeeping_update and just had update_wall_time iterate through
>> all of the time namespaces.
>
> Please don't go there. timekeeping_update() is already heavy and walking
> through a gazillion of namespaces will just make it horrible,
>
>> I don't think the naive version would scale to very many time
>> namespaces.
>
> :)
>
>> At the same time using the techniques from the nohz work and a little
>> smarts I expect we could get the code to scale.
>
> You'd need to invoke the update when the namespace is switched in and
> hasn't been updated since the last tick happened. That might be doable, but
> you also need to take the wraparound constraints of the underlying
> clocksources into account, which again can cause walking all name spaces
> when they are all idle long enough.
The wrap around constraints being how long before the time sources wrap
around so you have to read them once per wrap around? I have not dug
deeply enough into the code to see that yet.
> From there it becomes hairy, because it's not only timekeeping,
> i.e. reading time, this is also affecting all timers which are armed from a
> namespace.
>
> That gets really ugly because when you do settimeofday() or adjtimex() for
> a particular namespace, then you have to search for all armed timers of
> that namespace and adjust them.
>
> The original posix timer code had the same issue because it mapped the
> clock realtime timers to the timer wheel so any setting of the clock caused
> a full walk of all armed timers, disarming, adjusting and requeueing
> them. That's horrible not only performance wise, it's also a locking
> nightmare of all sorts.
>
> Add time skew via NTP/PTP into the picture and you might have to adjust
> timers as well, because you need to guarantee that they are not expiring
> early.
>
> I haven't looked through Dmitry's patches yet, but I don't see how this
> can work at all without introducing subtle issues all over the place.
Then it sounds like this will take some more digging.
Please pardon me for thinking out loud.
There are one or more time sources that we use to compute the time
and for each time source we have a conversion from ticks of the
time source to nanoseconds.
Each time source needs to be sampled at least once per wrap-around
and something incremented so that we don't lose time when looking
at that time source.
There are several clocks presented to userspace and they all share the
same length of second and are all fundamentally offsets from
CLOCK_MONOTONIC.
I see two fundamental driving cases for a time namespace.
1) Migration from one node to another node in a cluster in almost
real time.
The problem is that the CLOCK_MONOTONIC values of different nodes in the
cluster have no relationship to each other (except a synchronized length of
the second). So applications that migrate can see CLOCK_MONOTONIC
and CLOCK_BOOTTIME go backwards.
This is the truly pressing problem and adding some kind of offset
sounds like it would be the solution. Possibly by allowing a boot
time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
2) Dealing with two separate time management domains. Say a machine
that needs to deal with both something inside of Google, where they
slew time to avoid leap seconds, and something in the outside
world, where proper UTC time is kept as an offset from TAI with the
occasional leap second.
In the latter case it would fundamentally require having seconds of
different length.
A pure 64bit nanosecond counter is good for over 500 years (2^64 ns is
about 584 years). So 64bit
variables can be used to hold time, and everything can be converted from
there.
This suggests we can for ticks have two values.
- The number of ticks from the time source.
- The number of times the ticks would have rolled over.
That sounds like it may be a little simplistic as it would require being
very diligent about firing a timer exactly at rollover and not losing
that, but for a handwaving argument is probably enough to generate
a 64bit tick counter.
If the focus is on a 64bit tick counter then what update_wall_time
has to do is very limited. Just deal with the accounting needed to cope with
tick rollover.
Getting the actual time looks like it would be as simple as now, with
perhaps an extra addition to account for the number of times the tick
counter has rolled over. With limited precision arithmetic and various
optimizations I don't think it is that simple to implement but it feels
like it should be very little extra work.
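The rollover accounting above can be sketched as follows, assuming a 32-bit hardware counter purely for illustration (the width, the struct tick64 layout and the function name are not taken from the kernel). The sketch is only correct if the counter is sampled at least once per wrap-around, which is exactly the diligence the handwaving argument relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for extending a wrapping 32-bit counter to 64 bits. */
struct tick64 {
    uint32_t last;      /* last raw counter value sampled */
    uint64_t rollovers; /* number of wrap-arounds observed so far */
};

/*
 * Feed a fresh raw reading and get the extended 64-bit tick count.
 * Must be called at least once per wrap-around of the raw counter,
 * otherwise a rollover goes unnoticed and time is lost.
 */
static uint64_t tick64_update(struct tick64 *t, uint32_t raw)
{
    if (raw < t->last)
        t->rollovers++; /* counter wrapped since the previous sample */
    t->last = raw;

    assert(t->rollovers < (1ULL << 32)); /* result must fit in 64 bits */
    return (t->rollovers << 32) | raw;
}
```

Reading the time then costs one extra shift and OR on top of the existing conversion from ticks to nanoseconds.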
For timers my inclination would be to assume no adjustments to the
current time parameters and set the timer to go off then. If the time
on the appropriate clock has been changed since the timer was set and
the timer is going off early reschedule so the timer fires at the
appropriate time.
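A minimal sketch of that timer policy: arm the timer as if the clock parameters never change, and when it fires compare against the namespace clock; if it is early, re-arm for the remainder. The function name and the idea of expressing the namespace clock as host time plus a signed offset are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical expiry check, called when the underlying timer fires.
 * Returns 0 if the timer may be delivered now, otherwise the number of
 * nanoseconds it has to be re-armed for so it does not expire early.
 */
static int64_t timer_rearm_delay_ns(int64_t host_now_ns, int64_t ns_offset,
                                    int64_t expires_ns)
{
    int64_t now_ns, remaining;

    assert(host_now_ns >= 0);
    now_ns = host_now_ns + ns_offset; /* current time on the namespace clock */
    remaining = expires_ns - now_ns;

    return remaining > 0 ? remaining : 0;
}
```

If the namespace clock was set backwards after the timer was armed, the function returns a positive delay and the timer is rescheduled. If the clock was set forwards, the check returns 0, but the timer has by then already fired late, which is the case the check alone cannot repair.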
With the above I think it is theoretically possible to build a time
namespace that supports multiple lengths of second, and does not have
much overhead.
Not that I think a final implementation would necessarily look like what I
have described. I just think it is possible with extreme care to evolve
the current code base into something that can efficiently handle
multiple time domains with slightly different lengths of second.
Thomas does it sound like I am completely out of touch with reality?
It does though sound like it is going to take some serious digging
through the code to understand how what everything does and how and why
everthing works the way it does. Not something grafted on top with just
a cursory understanding of how the code works.
Eric
On 19/09/2018 22:50, Dmitry Safonov wrote:
> From: Andrei Vagin <[email protected]>
>
> Time Namespace isolates clock values.
>
> The kernel provides access to several clocks CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
>
> CLOCK_REALTIME
> System-wide clock that measures real (i.e., wall-clock) time.
>
> CLOCK_MONOTONIC
> Clock that cannot be set and represents monotonic time since
> some unspecified starting point.
>
> CLOCK_BOOTTIME
> Identical to CLOCK_MONOTONIC, except it also includes any time
> that the system is suspended.
>
> For many users, the time namespace means the ability to change time in
> a container (CLOCK_REALTIME).
>
> But in a context of the checkpoint/restore functionality, monotonic and
> boottime clocks become interesting. Both clocks are monotonic with
> unspecified starting points. These clocks are widely used to measure time
> slices and set timers. After restoring or migrating processes, we have to
> guarantee that they never go backward. In an ideal case, the behavior of
> these clocks should be the same as for a case when a whole system is
> suspended. All this means that we need to be able to set CLOCK_MONOTONIC
> and CLOCK_BOOTTIME clocks, which can be done by adding per-namespace
> offsets for clocks.
>
> Link: https://criu.org/Time_namespace
> Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
> Signed-off-by: Andrei Vagin <[email protected]>
> Co-developed-by: Dmitry Safonov <[email protected]>
> Signed-off-by: Dmitry Safonov <[email protected]>
> ---
> fs/proc/namespaces.c | 3 +
> include/linux/nsproxy.h | 1 +
> include/linux/proc_ns.h | 1 +
> include/linux/time_namespace.h | 59 ++++++++++++++
> include/linux/user_namespace.h | 1 +
> include/uapi/linux/sched.h | 1 +
> init/Kconfig | 7 ++
> kernel/Makefile | 1 +
> kernel/fork.c | 3 +-
> kernel/nsproxy.c | 19 ++++-
> kernel/time_namespace.c | 169 +++++++++++++++++++++++++++++++++++++++++
> 11 files changed, 262 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/time_namespace.h
> create mode 100644 kernel/time_namespace.c
>
...
> diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
> new file mode 100644
> index 000000000000..902cd9c22159
> --- /dev/null
> +++ b/kernel/time_namespace.c
...
> +
> +struct time_namespace init_time_ns = {
> + .kref = KREF_INIT(2),
> + .user_ns = &init_user_ns,
> + .ns.inum = PROC_UTS_INIT_INO,
Do you mean PROC_TIME_INIT_INO?
> +#ifdef CONFIG_UTS_NS
> + .ns.ops = &timens_operations,
> +#endif
Do you mean CONFIG_TIME_NS?
Thanks,
Laurent
Eric,
On Fri, 28 Sep 2018, Eric W. Biederman wrote:
> Thomas Gleixner <[email protected]> writes:
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
>
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around? I have not dug
> deeply enough into the code to see that yet.
It's done by limiting the NOHZ idle time when all CPUs are going into deep
sleep for a long time, i.e. we make sure that at least one CPU comes back
sufficiently _before_ the wraparound happens and invokes the update
function.
It's not so much a problem for TSC, but not every clocksource the kernel
supports has wraparound times in the range of hundreds of years.
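To put numbers on the wraparound constraint, the following sketch
computes how long a counter of a given width and frequency runs before
it wraps, and derives an idle limit from it. The helper names and the
50% safety margin are assumptions for illustration, not the kernel's
actual computation.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds until a free-running counter of `bits` width wraps. */
uint64_t sketch_wrap_ns(unsigned int bits, uint64_t hz)
{
	/* Limit the width so (1 << bits) * 1e9 cannot overflow 64 bits. */
	assert(bits >= 1 && bits <= 32);
	assert(hz > 0);

	return (1ULL << bits) * 1000000000ULL / hz;
}

/*
 * Longest time all CPUs may stay idle if one of them must come back
 * well before the wraparound. Half the wrap time is an assumed margin.
 */
uint64_t sketch_max_idle_ns(unsigned int bits, uint64_t hz)
{
	return sketch_wrap_ns(bits, hz) / 2;
}
```

A 32bit counter at 1 MHz wraps after about 71 minutes, and a 24bit
counter at 3.579545 MHz (the ACPI PM timer rate) wraps in under five
seconds, so the idle limit matters in practice for narrow counters.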
But yes, your idea of keeping track of wraparounds might work. It is
tricky, but looks feasible at first sight. We should be aware of the
dragons, though.
> Please pardon me for thinking out load.
>
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
>
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
>
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
Yes. That's the readout side. This one is doable. But now look at timers.
If you arm the timer from a name space, then it needs to be converted to
host time in order to sort it into the hrtimer queue and at some point arm
the clockevent device for it. This works as long as host and name space
time have a constant offset and the same skew.
Once the name space time has a different skew this falls apart because the
armed timer will either expire late or early.
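A toy model of why the conversion breaks once the rates differ. It
assumes the namespace clock is host time plus a constant offset plus a
rate error in parts per million; all names and the ppm model are
illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Namespace clock: host time + constant offset + rate error in ppm. */
int64_t ns_clock_read(int64_t host_ns, int64_t offset_ns, int64_t skew_ppm)
{
	/* Bounds keep host_ns * skew_ppm inside the int64_t range. */
	assert(host_ns >= 0 && host_ns <= 1000000000000000LL);
	assert(skew_ppm >= -1000 && skew_ppm <= 1000);

	return host_ns + offset_ns + host_ns * skew_ppm / 1000000;
}

/* Host expiry for a namespace deadline. Exact only when skew is zero. */
int64_t host_expiry_for(int64_t ns_expiry, int64_t offset_ns)
{
	return ns_expiry - offset_ns;
}

/*
 * How far the namespace clock is from the deadline when the host-queued
 * timer fires. Positive means the timer fired early, negative late.
 */
int64_t early_by_ns(int64_t ns_expiry, int64_t offset_ns, int64_t skew_ppm)
{
	int64_t host_exp = host_expiry_for(ns_expiry, offset_ns);

	return ns_expiry - ns_clock_read(host_exp, offset_ns, skew_ppm);
}
```

With zero skew the result is always 0. With the namespace clock running
100 ppm slow, a one-second timer fires 100 microseconds early, which is
the case that must never happen.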
Late might be acceptable, early violates the spec. You could do an extra
check for rescheduling it if it's early, but that requires storing the
name space time accessor in the hrtimer itself, because not every timer
expiry happens in a way that lets it be checked in the name space context
(think signal based timers). We need to add this extra magic right into
__hrtimer_run_queues(), which is called from the hard and soft interrupt.
We really don't want to touch all relevant callbacks or syscalls. The
latter is not sufficient anyway for signal based timer delivery.
That's going to be interesting in terms of synchronization and might also
cause substantial overhead at least for the timers which belong to name
spaces.
But that also means that anything which is early can and probably will
cause rearming of the timer hardware possibly for a very short delta. We
need to think about whether this can be abused to create interrupt storms.
Now if you accept a bit late, which I'm not really happy about, then you
surely won't accept very late, i.e. hours, days. But that can happen when
settimeofday() comes into play. Right now with a single time domain, this
is easy. When settimeofday() or adjtimex() makes time jump, we just go and
reprogram the hardware timers accordingly, which might also result in
immediate expiry of timers.
But this does not help for time jumps in name spaces because the timer is
enqueued on the host time base.
And no, we should not think about creating per name space hrtimer queues
and then have to walk through all of them for finding the first expiring
timer in order to arm the hardware. That cannot scale.
Walking all hrtimer bases on all CPUs and checking whether each queued
timer belongs to the affected name space does not scale either.
So we'd need to keep track of queued timers belonging to a name space and
then just handle them. Interesting locking problem and also a scalability
issue because this might need to be done on all online CPUs. Haven't
thought it through, but it makes me shudder.
> I see two fundamental driving cases for a time namespace.
<SNIP>
I completely understand the problem you are trying to solve and yes, the
read out of time should be a solvable problem.
> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then. If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.
See above.
> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to evolve
> the current code base into something that can efficiently handle
> multiple time domains with slightly different lenghts of second.
Yes, it really needs some serious thought, and timekeeping is a really
complex place, especially with NTP/PTP in play. We had quite some quality
time to make it work correctly and reliably, now you come along and want to
transform it into a multidimensional puzzle. :)
> Thomas does it sound like I am completely out of touch with reality?
Which reality are you talking about? :)
> It does though sound like it is going to take some serious digging
> through the code to understand how what everything does and how and why
> everthing works the way it does. Not something grafted on top with just
> a cursory understanding of how the code works.
I fully agree and I'm happy to help with explanations and ideas and being
the one who shoots holes into yours.
Thanks,
tglx
FYI, we noticed the following commit (built with gcc-4.9):
commit: 25217c6e39560eeadb338e0140ee215410200b67 ("[RFC 13/20] posix-timers/timens: Take into account clock offsets")
url: https://github.com/0day-ci/linux/commits/Dmitry-Safonov/ns-Introduce-Time-Namespace/20180920-194322
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu qemu64,+ssse3 -smp 4 -m 8G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+---------------------------------------------------------+------------+------------+
| | fb1111e1a5 | 25217c6e39 |
+---------------------------------------------------------+------------+------------+
| boot_successes | 0 | 0 |
| boot_failures | 27 | 16 |
| BUG:KASAN:null-ptr-deref_in_p | 21 | |
| BUG:unable_to_handle_kernel | 21 | 8 |
| Oops:#[##] | 21 | 8 |
| RIP:posix_get_boottime | 21 | |
| Kernel_panic-not_syncing:Fatal_exception | 21 | 8 |
| invoked_oom-killer:gfp_mask=0x | 6 | 6 |
| Mem-Info | 6 | 6 |
| Out_of_memory_and_no_killable_processes | 6 | 6 |
| Kernel_panic-not_syncing:System_is_deadlocked_on_memory | 6 | 6 |
| BUG:KASAN:null-ptr-deref_in_c | 0 | 8 |
| RIP:common_timens_adjust | 0 | 8 |
| BUG:kernel_hang_in_boot_stage | 0 | 2 |
+---------------------------------------------------------+------------+------------+
[ 546.918732] BUG: KASAN: null-ptr-deref in common_timens_adjust+0x4e/0x270
[ 546.919884] Read of size 8 at addr 0000000000000030 by task systemd/1
[ 546.920963]
[ 546.921249] CPU: 1 PID: 1 Comm: systemd Not tainted 4.19.0-rc4-00108-g25217c6 #1
[ 546.922492] Call Trace:
[ 546.922944] dump_stack+0x138/0x1d8
[ 546.923554] ? common_timens_adjust+0x4e/0x270
[ 546.924310] kasan_report+0x26e/0x390
[ 546.924959] __asan_load8+0x54/0x90
[ 546.925569] common_timens_adjust+0x4e/0x270
[ 546.926311] __x64_sys_clock_gettime+0x10b/0x140
[ 546.927114] do_syscall_64+0x1c3/0x280
[ 546.927779] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 546.928648] RIP: 0033:0x7ffc593a1a28
[ 546.929269] Code: 2d 00 ca 9a 3b 83 c2 01 48 3d ff c9 9a 3b 77 ef 48 01 16 45 85 c0 48 89 46 08 0f 85 4b ff ff ff 48 63 ff b8 e4 00 00 00 0f 05 <5b> 5d c3 85 ff 75 ef 44 8b 0d 4a c6 ff ff 41 f6 c1 01 0f 85 e6 01
[ 546.932344] RSP: 002b:00007ffc5935d878 EFLAGS: 00000202 ORIG_RAX: 00000000000000e4
[ 546.933619] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007ffc593a1a28
[ 546.934818] RDX: ffffffffffffffff RSI: 00007ffc5935d8b0 RDI: 0000000000000007
[ 546.936012] RBP: 00007ffc5935d880 R08: 0000000000000002 R09: 000000000003b1e6
[ 546.937205] R10: 0014e3686b800000 R11: 0000000000000202 R12: 00007ffc5935d8f0
[ 546.938401] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[ 546.939622] ==================================================================
[ 546.940817] Disabling lock debugging due to kernel taint
[ 546.942018] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 546.943328] PGD 0 P4D 0
[ 546.943791] Oops: 0000 [#1] SMP KASAN PTI
[ 546.944486] CPU: 1 PID: 1 Comm: systemd Tainted: G B 4.19.0-rc4-00108-g25217c6 #1
[ 546.945962] RIP: 0010:common_timens_adjust+0x4e/0x270
[ 546.946819] Code: 00 06 00 00 48 83 ec 18 e8 ef 48 20 00 48 8b 9b 00 06 00 00 48 8d 7b 30 e8 df 48 20 00 48 8b 5b 30 48 8d 7b 30 e8 d2 48 20 00 <4c> 8b 6b 30 be 08 00 00 00 4d 85 ed 41 0f 94 c6 4c 89 f3 83 e3 01
[ 546.949841] RSP: 0018:ffff8801f5987e90 EFLAGS: 00010286
[ 546.950722] RAX: ffff8801f597e100 RBX: 0000000000000000 RCX: ffffffff812f2e5a
[ 546.951906] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000246
[ 546.953094] RBP: ffff8801f5987ed0 R08: fffffbfff066a22a R09: fffffbfff066a22a
[ 546.954275] R10: 0000000000000001 R11: fffffbfff066a229 R12: ffff8801f5987ee0
[ 546.955460] R13: 0000000000000007 R14: 00007ffc5935d8b0 R15: 0000000000000007
[ 546.956653] FS: 00007f1603e4d940(0000) GS:ffff8801f7000000(0000) knlGS:0000000000000000
[ 546.957994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 546.958955] CR2: 0000000000000030 CR3: 00000001ddcfa000 CR4: 00000000000006a0
[ 546.960133] Call Trace:
[ 546.960577] __x64_sys_clock_gettime+0x10b/0x140
[ 546.961363] do_syscall_64+0x1c3/0x280
[ 546.962015] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 546.962862] RIP: 0033:0x7ffc593a1a28
[ 546.963472] Code: 2d 00 ca 9a 3b 83 c2 01 48 3d ff c9 9a 3b 77 ef 48 01 16 45 85 c0 48 89 46 08 0f 85 4b ff ff ff 48 63 ff b8 e4 00 00 00 0f 05 <5b> 5d c3 85 ff 75 ef 44 8b 0d 4a c6 ff ff 41 f6 c1 01 0f 85 e6 01
[ 546.966532] RSP: 002b:00007ffc5935d878 EFLAGS: 00000202 ORIG_RAX: 00000000000000e4
[ 546.967796] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007ffc593a1a28
[ 546.968990] RDX: ffffffffffffffff RSI: 00007ffc5935d8b0 RDI: 0000000000000007
[ 546.970168] RBP: 00007ffc5935d880 R08: 0000000000000002 R09: 000000000003b1e6
[ 546.971337] R10: 0014e3686b800000 R11: 0000000000000202 R12: 00007ffc5935d8f0
[ 546.972516] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[ 546.973708] Modules linked in: autofs4
[ 546.974354] CR2: 0000000000000030
[ 546.974960] ---[ end trace f820e59e021274ff ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Rong Chen
FYI, we noticed the following commit (built with gcc-6):
commit: 3cc8de9dcbe53955edcc65122f169666b4f6cbd9 ("[RFC 04/20] timens: Introduce CLOCK_BOOTTIME offset")
url: https://github.com/0day-ci/linux/commits/Dmitry-Safonov/ns-Introduce-Time-Namespace/20180920-194322
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu IvyBridge -smp 4 -m 2G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+---------------------------------------------------------+------------+------------+
| | 05dd6588ce | 3cc8de9dcb |
+---------------------------------------------------------+------------+------------+
| boot_successes | 2 | 0 |
| boot_failures | 13 | 10 |
| invoked_oom-killer:gfp_mask=0x | 2 | 6 |
| Mem-Info | 2 | 6 |
| Out_of_memory_and_no_killable_processes | 2 | 6 |
| Kernel_panic-not_syncing:System_is_deadlocked_on_memory | 2 | 6 |
| BUG:unable_to_handle_kernel | 10 | 4 |
| Oops:#[##] | 11 | 4 |
| RIP:timens_adjust_monotonic | 11 | |
| Kernel_panic-not_syncing:Fatal_exception | 11 | 4 |
| RIP:posix_get_boottime | 0 | 4 |
+---------------------------------------------------------+------------+------------+
[ 9.781291] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 9.782843] PGD 0 P4D 0
[ 9.783186] Oops: 0000 [#1] PTI
[ 9.783604] CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-rc4-00099-g3cc8de9d #1
[ 9.784565] RIP: 0010:posix_get_boottime+0x3c/0xad
[ 9.785183] Code: 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 e8 20 67 03 00 48 8b 04 25 40 50 04 82 48 8b 80 c0 04 00 00 bf 01 00 00 00 48 8b 40 30 <4c> 8b 60 30 e8 10 88 ff ff 48 89 c7 e8 d9 12 ff ff 4d 85 e4 48 89
[ 9.787540] RSP: 0018:ffffc9000000be88 EFLAGS: 00010293
[ 9.788211] RAX: 0000000000000000 RBX: ffffc9000000bed0 RCX: ffff880076898040
[ 9.789112] RDX: 0000000000000000 RSI: ffffffff8113136f RDI: 0000000000000001
[ 9.790028] RBP: ffffc9000000beb8 R08: 0000000000000000 R09: 0000000000000000
[ 9.790936] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81c111a0
[ 9.791840] R13: 0000000000000007 R14: 00007ffd481e35e0 R15: 0000000000000000
[ 9.792747] FS: 00007fd2fc3c9940(0000) GS:ffffffff82041000(0000) knlGS:0000000000000000
[ 9.793769] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.794510] CR2: 0000000000000030 CR3: 000000007fb2e000 CR4: 00000000001406b0
[ 9.795413] Call Trace:
[ 9.795738] ? entry_SYSCALL_64_after_hwframe+0x59/0xbe
[ 9.796422] __se_sys_clock_gettime+0x51/0xa7
[ 9.796970] ? lockdep_hardirqs_on+0x144/0x19e
[ 9.797540] __x64_sys_clock_gettime+0x1a/0x1d
[ 9.798098] do_syscall_64+0x73/0x1d2
[ 9.798569] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9.799251] RIP: 0033:0x7ffd481eabbc
[ 9.799709] Code: f8 48 8b 15 36 c5 ff ff 48 89 16 48 8b 15 34 c5 ff ff 48 89 56 08 3b 03 0f 84 0a ff ff ff eb db 49 63 fb b8 e4 00 00 00 0f 05 <48> 83 c4 10 5b 41 5c 41 5d 5d c3 66 0f 1f 84 00 00 00 00 00 55 48
[ 9.802405] RSP: 002b:00007ffd481e3588 EFLAGS: 00000202 ORIG_RAX: 00000000000000e4
[ 9.803735] RAX: ffffffffffffffda RBX: 00007ffd481e7080 RCX: 00007ffd481eabbc
[ 9.804844] RDX: ffffffffffffffff RSI: 00007ffd481e35e0 RDI: 0000000000000007
[ 9.806155] RBP: 00007ffd481e35b0 R08: 0000000000000004 R09: 000000032ddb0949
[ 9.807568] R10: 000dc97587800000 R11: 0000000000000202 R12: 00007ffd481e3620
[ 9.808965] R13: 00007ffd481e3594 R14: 0000000000000001 R15: 0000000000000000
[ 9.810438] Modules linked in: ip_tables x_tables
[ 9.811034] CR2: 0000000000000030
[ 9.811629] ---[ end trace a935ae5a1b8f0750 ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Rong Chen
Thomas Gleixner <[email protected]> writes:
> Eric,
>
> On Fri, 28 Sep 2018, Eric W. Biederman wrote:
>> Thomas Gleixner <[email protected]> writes:
>> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
>> >> At the same time using the techniques from the nohz work and a little
>> >> smarts I expect we could get the code to scale.
>> >
>> > You'd need to invoke the update when the namespace is switched in and
>> > hasn't been updated since the last tick happened. That might be doable, but
>> > you also need to take the wraparound constraints of the underlying
>> > clocksources into account, which again can cause walking all name spaces
>> > when they are all idle long enough.
>>
>> The wrap around constraints being how long before the time sources wrap
>> around so you have to read them once per wrap around? I have not dug
>> deeply enough into the code to see that yet.
>
> It's done by limiting the NOHZ idle time when all CPUs are going into deep
> sleep for a long time, i.e. we make sure that at least one CPU comes back
> sufficiently _before_ the wraparound happens and invokes the update
> function.
>
> It's not so much a problem for TSC, but not every clocksource the kernel
> supports has wraparound times in the range of hundreds of years.
>
> But yes, your idea of keeping track of wraparounds might work. Tricky, but
> looks feasible on first sight, but we should be aware of the dragons.
Oh. Yes. Definitely. A key enabler of any namespace implementation is
figuring out how to tame the dragons.
>> Please pardon me for thinking out load.
>>
>> There are one or more time sources that we use to compute the time
>> and for each time source we have a conversion from ticks of the
>> time source to nanoseconds.
>>
>> Each time source needs to be sampled at least once per wrap-around
>> and something incremented so that we don't loose time when looking
>> at that time source.
>>
>> There are several clocks presented to userspace and they all share the
>> same length of second and are all fundamentally offsets from
>> CLOCK_MONOTONIC.
>
> Yes. That's the readout side. This one is doable. But now look at timers.
>
> If you arm the timer from a name space, then it needs to be converted to
> host time in order to sort it into the hrtimer queue and at some point arm
> the clockevent device for it. This works as long as host and name space
> time have a constant offset and the same skew.
>
> Once the name space time has a different skew this falls apart because the
> armed timer will either expire late or early.
>
> Late might be acceptable, early violates the spec. You could do an extra
> check for rescheduling it, if it's early, but that requires to store the
> name space time accessor in the hrtimer itself because not every timer
> expiry happens so that it can be checked in the name space context (think
> signal based timers). We need to add this extra magic right into
> __hrtimer_run_queues() which is called from the hard and soft interrupt. We
> really don't want to touch all relevant callbacks or syscalls. The latter
> is not sufficient anyway for signal based timer delivery.
>
> That's going to be interesting in terms of synchronization and might also
> cause substantial overhead at least for the timers which belong to name
> spaces.
>
> But that also means that anything which is early can and probably will
> cause rearming of the timer hardware possibly for a very short delta. We
> need to think about whether this can be abused to create interrupt storms.
>
> Now if you accept a bit late, which I'm not really happy about, then you
> surely won't accept very late, i.e. hours, days. But that can happen when
> settimeofday() comes into play. Right now with a single time domain, this
> is easy. When settimeofday() or adjtimex() makes time jump, we just go and
> reprogramm the hardware timers accordingly, which might also result in
> immediate expiry of timers.
>
> But this does not help for time jumps in name spaces because the timer is
> enqueued on the host time base.
>
> And no, we should not think about creating per name space hrtimer queues
> and then have to walk through all of them for finding the first expiring
> timer in order to arm the hardware. That cannot scale.
>
> Walking all hrtimer bases on all CPUs and check all queued timers whether
> they belong to the affected name space does not scale either.
>
> So we'd need to keep track of queued timers belonging to a name space and
> then just handle them. Interesting locking problem and also a scalability
> issue because this might need to be done on all online CPUs. Haven't
> thought it through, but it makes me shudder.
Yes. I can see how this is a dragon that we need to figure out how to
tame. It already exists to some extent for CLOCK_MONOTONIC vs
CLOCK_REALTIME, but still.
>> I see two fundamental driving cases for a time namespace.
>
> <SNIP>
>
> I completely understand the problem you are trying to solve and yes, the
> read out of time should be a solvable problem.
There is a simplified subproblem that I want to ask about, but I will
reply separately for that.
>> Not that I think a final implementation would necessary look like what I
>> have described. I just think it is possible with extreme care to evolve
>> the current code base into something that can efficiently handle
>> multiple time domains with slightly different lenghts of second.
>
> Yes, it really needs some serious thoughts and timekeeping is a really
> complex place especially with NTP/PTP in play. We had quite some quality
> time to make it work correctly and reliably, now you come along and want to
> transform it into a multidimensional puzzle. :)
I thought it was Einstein who pointed out what a puzzle timekeeping is,
with the rest of us just playing catch up. ;-)
>> It does though sound like it is going to take some serious digging
>> through the code to understand how what everything does and how and why
>> everthing works the way it does. Not something grafted on top with just
>> a cursory understanding of how the code works.
>
> I fully agree and I'm happy to help with explanations and ideas and being
> the one who shoots holes into yours.
Sounds good.
Eric
In the context of process migration there is a simpler subproblem that I
think is worth exploring to see if we can do something about it.
Consider a cluster of machines all running with synchronized clocks:
CLOCK_REALTIME matches, but CLOCK_MONOTONIC does not match between
machines. Not having a matching CLOCK_MONOTONIC prevents successful
process migration between nodes in that cluster.
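The mismatch is easy to observe from user space with the standard POSIX
clock_gettime() call. The helper names below are made up for this
sketch; it only reads the clocks and reports the difference between
them, which on synchronized nodes differs per machine because each node
booted at a different moment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock as nanoseconds, or return -1 on failure. */
int64_t clock_read_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return -1;

	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * CLOCK_REALTIME minus CLOCK_MONOTONIC on this machine. Two nodes with
 * NTP-synchronized CLOCK_REALTIME still report different values here
 * unless their CLOCK_MONOTONIC starting points happen to coincide.
 */
int64_t realtime_minus_monotonic_ns(void)
{
	int64_t rt = clock_read_ns(CLOCK_REALTIME);
	int64_t mono = clock_read_ns(CLOCK_MONOTONIC);

	assert(rt >= 0 && mono >= 0);
	return rt - mono;
}
```

Comparing realtime_minus_monotonic_ns() across nodes gives the
per-node offset that would have to be compensated for a migrated
process.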
Would it be possible to allow setting CLOCK_MONOTONIC at the very
beginning of time? So that all of the nodes in a cluster can be in
sync?
No change in skew just in offset for CLOCK_MONOTONIC.
There are also dragons involved in coordinating things so that
CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
know if allowing CLOCK_MONOTONIC to be set would be practical, but it
seems worth exploring all on its own.
Dmitry, would setting CLOCK_MONOTONIC exactly once at boot time solve
the problem that you are looking at a time namespace to solve?
Eric
On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> In the context of process migration there is a simpler subproblem that I
> think it is worth exploring if we can do something about.
>
> For a cluster of machines all running with synchronized
> clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> machines. Not having a matching CLOCK_MONOTONIC prevents successful
> process migration between nodes in that cluster.
>
> Would it be possible to allow setting CLOCK_MONOTONIC at the very
> beginning of time? So that all of the nodes in a cluster can be in
> sync?
>
> No change in skew just in offset for CLOCK_MONOTONIC.
>
> There are also dragons involved in coordinating things so that
> CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
> know if allowing CLOCK_MONOTONIC to be set would be practical but it
> seems work exploring all on it's own.
It's used very early on in the kernel, so that would be a major surprise
for many things including user space which has expectations on clock
monotonic.
It would be reasonably easy to add CLOCK_MONOTONIC_SYNC which can be set in
the way you described and then in name spaces make it possible to magically
map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
It still wouldn't allow different NTP/PTP time domains, but might
be a good start to address the main migration headaches.
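One way to picture the proposed mapping is a per-namespace clock id
translation done before the clock is looked up. Everything in this
sketch is hypothetical: CLOCK_MONOTONIC_SYNC does not exist, and the
enum values, the struct and the function are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative clock ids. These are NOT the kernel's clockid_t values. */
enum sketch_clock {
	SKETCH_CLOCK_REALTIME,
	SKETCH_CLOCK_MONOTONIC,
	SKETCH_CLOCK_MONOTONIC_SYNC,	/* hypothetical cluster-wide clock */
};

/* Per-namespace setting: should CLOCK_MONOTONIC be redirected? */
struct sketch_time_ns {
	bool map_monotonic_to_sync;
};

/* Translate the clock id a task asked for into the one actually read. */
enum sketch_clock sketch_map_clock(const struct sketch_time_ns *ns,
				   enum sketch_clock id)
{
	assert(ns != NULL);

	if (id == SKETCH_CLOCK_MONOTONIC && ns->map_monotonic_to_sync)
		return SKETCH_CLOCK_MONOTONIC_SYNC;

	return id;
}
```

Tasks in the initial namespace keep the unmodified CLOCK_MONOTONIC,
while tasks in a namespace with the flag set transparently read the
settable clock. Skew is untouched, so this does not give separate
NTP/PTP domains.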
Thanks,
tglx
On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
>
> In the context of process migration there is a simpler subproblem that I
> think it is worth exploring if we can do something about.
>
> For a cluster of machines all running with synchronized
> clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> machines. Not having a matching CLOCK_MONOTONIC prevents successful
> process migration between nodes in that cluster.
>
> Would it be possible to allow setting CLOCK_MONOTONIC at the very
> beginning of time? So that all of the nodes in a cluster can be in
> sync?
Here is a question about how to synchronize clocks between nodes. It
looks like we will need to have a working network for this, but the
network configuration may be non-trivial and may require running a few
processes which can use CLOCK_MONOTONIC...
>
> No change in skew just in offset for CLOCK_MONOTONIC.
>
> There are also dragons involved in coordinating things so that
> CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
> know if allowing CLOCK_MONOTONIC to be set would be practical but it
> seems work exploring all on it's own.
>
> Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
> your problem that is you are looking at a time namespace to solve?
Process migration is only one of the use cases. Another use case is
restoring from snapshots. It may be even more popular than process
migration. We can't guarantee that all snapshots will be taken in one
cluster. For example, a user hits a bug, takes a container snapshot and
attaches it to a bug report.
>
> Eric
On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> And just a quick scan tells me that this is broken. Timers will expire
> early or late. The latter is acceptible to some extent, but larger delays
> might come with surprise. Expiring early is an absolute nono.
Do you mean that we have to adjust all timers after changing the offset
for CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
monotonic and boot times will be set immediately after creating a time
namespace, before using any timers.
It is interesting to think about what the use case would be for changing
these offsets after creating timers. It may be useful for testing. A user
sets a timer to expire in an hour, then moves the clock offset forward
and checks that a test application handles the timer properly.
>
> Thanks,
>
> tglx
>
On Mon, 1 Oct 2018, Andrey Vagin wrote:
> On Mon, Oct 01, 2018 at 11:15:32AM +0200, Eric W. Biederman wrote:
> >
> > In the context of process migration there is a simpler subproblem that I
> > think it is worth exploring if we can do something about.
> >
> > For a cluster of machines all running with synchronized
> > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > machines. Not having a matching CLOCK_MONOTONIC prevents successful
> > process migration between nodes in that cluster.
> >
> > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > beginning of time? So that all of the nodes in a cluster can be in
> > sync?
>
> Here is a question about how to synchronize clocks between nodes. It
> looks like we will need to have a working network for this, but a
> network configuration may be non-trivial and it can require to run a few
> processes which can use CLOCK_MONOTNIC...
>
> >
> > No change in skew just in offset for CLOCK_MONOTONIC.
> >
> > There are also dragons involved in coordinating things so that
> > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
> > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > seems work exploring all on it's own.
> >
> > Dmitry would setting CLOCK_MONOTONIC exactly once at boot time solve
> > your problem that is you are looking at a time namespace to solve?
>
> Process migration is only one of use-cases. Another use-case is
> restoring from snapshots. It may be even more popular than process
> migration. We can't guarantee that all snapshots will be done in one
> cluster. For example, a user meets a bug, does a container snapshot and
> attaches it to a bug report.
Sure, but see my reply to Eric. That could be solved with that extra clock
id, which then gets mapped to monotonic for name spaces.
Thanks,
tglx
On Mon, 1 Oct 2018, Andrey Vagin wrote:
> On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > And just a quick scan tells me that this is broken. Timers will expire
> > early or late. The latter is acceptible to some extent, but larger delays
> > might come with surprise. Expiring early is an absolute nono.
>
> Do you mean that we have to adjust all timers after changing offset for
> CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> monotonic and boot times will be set immediately after creating a time
> namespace before using any timers.
I explained that in detail in this thread, but it's not about the initial
setting of clock mono/boot before any timers have been armed.
It's about setting the offset or clock realtime (via settimeofday) when
timers are already armed. Also having an entirely different time domain,
e.g. separate NTP adjustments, makes that necessary.
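To make the failure mode concrete, here is a small user-space model. It is a sketch under the assumption that namespace time is simply host time plus an offset, with all values in nanoseconds; the struct and function names are invented for the example and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: ns_time = host_time + offset. */
struct ns_timer {
	int64_t ns_expiry;   /* absolute expiry requested inside the namespace */
	int64_t host_expiry; /* expiry actually queued against the host clock */
};

/* Arm a timer: translate the namespace expiry to the host clock once. */
static void ns_timer_arm(struct ns_timer *t, int64_t ns_expiry, int64_t offset)
{
	assert(t != NULL);
	t->ns_expiry = ns_expiry;
	t->host_expiry = ns_expiry - offset;
}

/* Namespace time observed when the host timer fires, for a given offset. */
static int64_t ns_time_at_fire(const struct ns_timer *t, int64_t offset)
{
	return t->host_expiry + offset;
}

/* True if the timer would fire before the expiry the task asked for. */
static bool ns_timer_fires_early(const struct ns_timer *t, int64_t offset)
{
	return ns_time_at_fire(t, offset) < t->ns_expiry;
}
```

With a timer armed for namespace time 1000 under offset 100, the host expiry is 900. If the offset later drops to 50, the timer fires at namespace time 950, which is early; if it rises to 150, it fires at 1050, which is late. Avoiding either means finding and re-arming every armed timer of that namespace.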
Thanks,
tglx
On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> > In the context of process migration there is a simpler subproblem that I
> > think it is worth exploring if we can do something about.
> >
> > For a cluster of machines all running with synchronized
> > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > machines. Not having a matching CLOCK_MONOTONIC prevents successful
> > process migration between nodes in that cluster.
> >
> > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > beginning of time? So that all of the nodes in a cluster can be in
> > sync?
> >
> > No change in skew just in offset for CLOCK_MONOTONIC.
> >
> > There are also dragons involved in coordinating things so that
> > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
> > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > seems work exploring all on it's own.
>
> It's used very early on in the kernel, so that would be a major surprise
> for many things including user space which has expectations on clock
> monotonic.
>
> It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
> the way you described and then in name spaces make it possible to magically
> map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>
> It still wouldn't allow to have different NTP/PTP time domains, but might
> be a good start to address the main migration headaches.
If we make CLOCK_MONOTONIC settable this way in a namespace,
do you think that should include device drivers that report timestamps
in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
interfaces?
Examples for drivers that can report timestamps are input, sound, v4l,
and drm. I think most of these can report stamps in either monotonic
or realtime base, while socket timestamps notably are always in
realtime.
We can probably get away with not setting the timebase for those
device drivers as long as the checkpoint/restart and migration features
are not expected to restore the state of an open character device
in that way. I don't know if that is a reasonable assumption to make
for the examples I listed.
Arnd
On Tue, 2 Oct 2018, Arnd Bergmann wrote:
> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <[email protected]> wrote:
> >
> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
> > > In the context of process migration there is a simpler subproblem that I
> > > think it is worth exploring if we can do something about.
> > >
> > > For a cluster of machines all running with synchronized
> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
> > > machines. Not having a matching CLOCK_MONOTONIC prevents successful
> > > process migration between nodes in that cluster.
> > >
> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
> > > beginning of time? So that all of the nodes in a cluster can be in
> > > sync?
> > >
> > > No change in skew just in offset for CLOCK_MONOTONIC.
> > >
> > > There are also dragons involved in coordinating things so that
> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
> > > seems work exploring all on it's own.
> >
> > It's used very early on in the kernel, so that would be a major surprise
> > for many things including user space which has expectations on clock
> > monotonic.
> >
> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
> > the way you described and then in name spaces make it possible to magically
> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
> >
> > It still wouldn't allow to have different NTP/PTP time domains, but might
> > be a good start to address the main migration headaches.
>
> If we make CLOCK_MONOTONIC settable this way in a namespace,
> do you think that should include device drivers that report timestamps
> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
> interfaces?
Uurgh. That gets messy very fast.
> Examples for drivers that can report timestamps are input, sound, v4l,
> and drm. I think most of these can report stamps in either monotonic
> or realtime base, while socket timestamps notably are always in
> realtime.
>
> We can probably get away with not setting the timebase for those
> device drivers as long as the checkpoint/restart and migration features
> are not expected to restore the state of an open character device
> in that way. I don't know if that is a reasonable assumption to make
> for the examples I listed.
No idea. I'm not a container migration wizard.
Thanks,
tglx
Hi Thomas, Andrei, Eric,
On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner <[email protected]> wrote:
>
> On Mon, 1 Oct 2018, Andrey Vagin wrote:
>
> > On Thu, Sep 27, 2018 at 11:41:49PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Sep 2018, Thomas Gleixner wrote:
> > > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > > timers as well, because you need to guarantee that they are not expiring
> > > > early.
> > > >
> > > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > > can work at all without introducing subtle issues all over the place.
> > >
> > > And just a quick scan tells me that this is broken. Timers will expire
> > > early or late. The latter is acceptible to some extent, but larger delays
> > > might come with surprise. Expiring early is an absolute nono.
> >
> > Do you mean that we have to adjust all timers after changing offset for
> > CLOCK_MONOTONIC or CLOCK_BOOTTIME? Our idea is that offsets for
> > monotonic and boot times will be set immediately after creating a time
> > namespace before using any timers.
>
> I explained that in detail in this thread, but it's not about the initial
> setting of clock mono/boot before any timers have been armed.
>
> It's about setting the offset or clock realtime (via settimeofday) when
> timers are already armed. Also having a entirely different time domain,
> e.g. separate NTP adjustments, makes that necessary.
It looks like there is a bit of a misunderstanding here:
Andrei was talking about the current RFC version, where we haven't
introduced offsets for clock realtime, while Thomas, IIUC, is looking at
how to extend the time namespace to cover realtime.
As CLOCK_REALTIME virtualization raises so many complex questions,
like a different length of the second or a list of realtime timers in the ns,
we haven't added any implementation for it.
It seems like an initial introduction of timens can be extended later to cover
realtime clocks too. While it may seem incomplete, it solves issues for
restoring/migration of real-world applications like nodejs and Oracle DB server,
which fail after being restored if there is a leap in monotonic time.
While solving the mentioned issues, it doesn't bring overhead.
(well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
which will be done in v1).
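For reference, a user-space sketch of the per-namespace offset arithmetic, including a check for the all-zero offset. The struct is a local stand-in for the kernel's timespec64, and the early return is only one possible reading of the zero-offset comparison mentioned above, not the code from the series.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Local stand-in for struct timespec64; tv_nsec is kept in [0, NSEC_PER_SEC). */
struct ts64 {
	int64_t tv_sec;
	long    tv_nsec;
};

/* Add a per-namespace offset to a host clock reading.
 * A negative offset is expressed with a negative tv_sec and a
 * non-negative tv_nsec, as usual for normalized timespecs. */
static struct ts64 timens_add_offset(struct ts64 host, struct ts64 off)
{
	struct ts64 res;

	assert(host.tv_nsec >= 0 && host.tv_nsec < NSEC_PER_SEC);
	assert(off.tv_nsec >= 0 && off.tv_nsec < NSEC_PER_SEC);

	/* Zero offset (e.g. the root namespace): nothing to add. */
	if (off.tv_sec == 0 && off.tv_nsec == 0)
		return host;

	res.tv_sec = host.tv_sec + off.tv_sec;
	res.tv_nsec = host.tv_nsec + off.tv_nsec;
	if (res.tv_nsec >= NSEC_PER_SEC) {
		/* Carry the overflowing nanoseconds into the seconds field. */
		res.tv_nsec -= NSEC_PER_SEC;
		res.tv_sec++;
	}
	return res;
}
```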
Thomas, thanks much for your input - now we know that we'll need to
introduce a list of timers in the namespace when we add realtime clocks.
Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
concept than per-namespace offsets?
Thanks,
Dmitry
Dmitry,
On Tue, 2 Oct 2018, Dmitry Safonov wrote:
> On Tue, 2 Oct 2018 at 07:15, Thomas Gleixner <[email protected]> wrote:
> > I explained that in detail in this thread, but it's not about the initial
> > setting of clock mono/boot before any timers have been armed.
> >
> > It's about setting the offset or clock realtime (via settimeofday) when
> > timers are already armed. Also having a entirely different time domain,
> > e.g. separate NTP adjustments, makes that necessary.
>
> It looks like, there is a bit of misunderstanding each other:
> Andrei was talking about the current RFC version, where we haven't
> introduced offsets for clock realtime. While Thomas IIUC, is looking
> how-to expand time namespace over realtime.
>
> As CLOCK_REALTIME virtualization raises so many complex questions
> like a different length of the second or list of realtime timers in ns we
> haven't added any realization for it.
>
> It seems like an initial introduction for timens can be expanded after to cover
> realtime clocks too. While it may seem incomplete, it solves issues for
> restoring/migration of real-world applications like nodejs, Oracle DB server
> which fails after being restored if there is a leap in monotonic time.
Well, yes. But you really have to think about the full picture. Just adding
part of the overall solution right now, just because it can be glued into
the code easily, is not the best approach IMO as it might result in
substantial rework of the whole thing sooner rather than later. I really don't
want to end up with something which is not extensible and has to be
supported forever.
Just for the record, the current approach with name space offsets for
monotonic is also prone to malfunction vs. timers, unless you can prevent
changing the offset _after_ the namespace has been set up and timers have
been armed. I admit that I did not look closely enough to verify that.
> While solving the mentioned issues, it doesn't bring overhead.
> (well, Andy noted that cmp for zero-offsets on vdso can be optimized too,
> which will be done in v1).
>
> Thomas, thanks much for your input - now we know that we'll need to
> introduce list for timers in namespace when we'll add realtime clocks.
> Do you believe that CLOCK_MONOTONIC_SYNC would be an easier
> concept than offsets per-namespace?
Haven't thought it through. This was just an idea in reaction to Eric's
question whether setting clock monotonic might be feasible. But yes, it
might be worth thinking about.
I think you should really define the long term requirements for time
namespaces and perhaps set some limitations in functionality upfront.
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Tue, 2 Oct 2018, Arnd Bergmann wrote:
>> On Mon, Oct 1, 2018 at 8:53 PM Thomas Gleixner <[email protected]> wrote:
>> >
>> > On Mon, 1 Oct 2018, Eric W. Biederman wrote:
>> > > In the context of process migration there is a simpler subproblem that I
>> > > think it is worth exploring if we can do something about.
>> > >
>> > > For a cluster of machines all running with synchronized
>> > > clocks. CLOCK_REALTIME matches. CLOCK_MONOTNIC does not match between
>> > > machines. Not having a matching CLOCK_MONOTONIC prevents successful
>> > > process migration between nodes in that cluster.
>> > >
>> > > Would it be possible to allow setting CLOCK_MONOTONIC at the very
>> > > beginning of time? So that all of the nodes in a cluster can be in
>> > > sync?
>> > >
>> > > No change in skew just in offset for CLOCK_MONOTONIC.
>> > >
>> > > There are also dragons involved in coordinating things so that
>> > > CLOCK_MONOTONIC gets set before CLOCK_MONOTONIC gets used. So I don't
>> > > know if allowing CLOCK_MONOTONIC to be set would be practical but it
>> > > seems work exploring all on it's own.
>> >
>> > It's used very early on in the kernel, so that would be a major surprise
>> > for many things including user space which has expectations on clock
>> > monotonic.
>> >
>> > It would be reasonably easy to add CLOCK_MONONOTIC_SYNC which can be set in
>> > the way you described and then in name spaces make it possible to magically
>> > map CLOCK_MONOTONIC to CLOCK_MONOTONIC_SYNC.
>> >
>> > It still wouldn't allow to have different NTP/PTP time domains, but might
>> > be a good start to address the main migration headaches.
>>
>> If we make CLOCK_MONOTONIC settable this way in a namespace,
>> do you think that should include device drivers that report timestamps
>> in CLOCK_MONOTONIC base, or only the timekeeping clock and timer
>> interfaces?
>
> Uurgh. That gets messy very fast.
>
>> Examples for drivers that can report timestamps are input, sound, v4l,
>> and drm. I think most of these can report stamps in either monotonic
>> or realtime base, while socket timestamps notably are always in
>> realtime.
>>
>> We can probably get away with not setting the timebase for those
>> device drivers as long as the checkpoint/restart and migration features
>> are not expected to restore the state of an open character device
>> in that way. I don't know if that is a reasonable assumption to make
>> for the examples I listed.
>
> No idea. I'm not a container migration wizard.
Direct access to hardware/drivers and not through an abstraction like
the vfs (an abstraction over block devices) can legitimately be handled
by hotplug events. I unplug one keyboard I plug in another.
I don't know if the input layer is more of a general abstraction
or more of a hardware device. I have not dug into it but my guess
is abstraction from what I have heard.
The scary difficulty here is if after restart input is reporting times
in CLOCK_MONOTONIC and the applications in the namespace are talking
about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even
with a fixed offset the times don't match up.
So what a time namespace absolutely needs to do is figure out how to deal
with all of the kernel interfaces reporting times and figure out how to
report them in the current time namespace.
Eric
On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> Direct access to hardware/drivers and not through an abstraction like
> the vfs (an abstraction over block devices) can legitimately be handled
> by hotplug events. I unplug one keyboard I plug in another.
>
> I don't know if the input layer is more of a general abstraction
> or more of a hardware device. I have not dug into it but my guess
> is abstraction from what I have heard.
>
> The scary difficulty here is if after restart input is reporting times
> in CLOCK_MONOTONIC and the applications in the namespace are talking
> about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even
> with a fixed offset the times don't match up.
>
> So a time namespace absolutely needs to do is figure out how to deal
> with all of the kernel interfaces reporting times and figure out how to
> report them in the current time namespace.
So you want to talk to Arnd who is leading the y2038 effort. He knows how
many and which interfaces are involved aside from the obvious core timer
ones. It's quite an amount and the problem is that you really need to do
that at the interface level, because many of those time stamps are taken in
contexts which are completely oblivious of name spaces. Ditto for timeouts
and similar things which are handed in through these interfaces.
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
>> Direct access to hardware/drivers and not through an abstraction like
>> the vfs (an abstraction over block devices) can legitimately be handled
>> by hotplug events. I unplug one keyboard I plug in another.
>>
>> I don't know if the input layer is more of a general abstraction
>> or more of a hardware device. I have not dug into it but my guess
>> is abstraction from what I have heard.
>>
>> The scary difficulty here is if after restart input is reporting times
>> in CLOCK_MONOTONIC and the applications in the namespace are talking
>> about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even
>> with a fixed offset the times don't match up.
>>
>> So a time namespace absolutely needs to do is figure out how to deal
>> with all of the kernel interfaces reporting times and figure out how to
>> report them in the current time namespace.
>
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.
Yep. That sounds right.
Eric
On Wed, 3 Oct 2018, Thomas Gleixner wrote:
> On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> > Direct access to hardware/drivers and not through an abstraction like
> > the vfs (an abstraction over block devices) can legitimately be handled
> > by hotplug events. I unplug one keyboard I plug in another.
> >
> > I don't know if the input layer is more of a general abstraction
> > or more of a hardware device. I have not dug into it but my guess
> > is abstraction from what I have heard.
> >
> > The scary difficulty here is if after restart input is reporting times
> > in CLOCK_MONOTONIC and the applications in the namespace are talking
> > about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even
> > with a fixed offset the times don't match up.
> >
> > So a time namespace absolutely needs to do is figure out how to deal
> > with all of the kernel interfaces reporting times and figure out how to
> > report them in the current time namespace.
>
> So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> many and which interfaces are involved aside of the obvious core timer
> ones. It's quite an amount and the problem is that you really need to do
> that at the interface level, because many of those time stamps are taken in
> contexts which are completely oblivious of name spaces. Ditto for timeouts
> and similar things which are handed in through these interfaces.
Plus you have to make sure that any new interface will have that
treatment. For y2038 that's easy as we just require the use of timespec64 for
new ones. For your problem that's not so trivial.
Thanks,
tglx
On Wed, Oct 3, 2018 at 8:14 AM Eric W. Biederman <[email protected]> wrote:
>
> Thomas Gleixner <[email protected]> writes:
>
> > On Wed, 3 Oct 2018, Eric W. Biederman wrote:
> >> Direct access to hardware/drivers and not through an abstraction like
> >> the vfs (an abstraction over block devices) can legitimately be handled
> >> by hotplug events. I unplug one keyboard I plug in another.
> >>
> >> I don't know if the input layer is more of a general abstraction
> >> or more of a hardware device. I have not dug into it but my guess
> >> is abstraction from what I have heard.
> >>
> >> The scary difficulty here is if after restart input is reporting times
> >> in CLOCK_MONOTONIC and the applications in the namespace are talking
> >> about times in CLOCK_MONOTONIC_SYNC. Then there is an issue. As even
> >> with a fixed offset the times don't match up.
> >>
> >> So a time namespace absolutely needs to do is figure out how to deal
> >> with all of the kernel interfaces reporting times and figure out how to
> >> report them in the current time namespace.
> >
> > So you want to talk to Arnd who is leading the y2038 effort. He knowns how
> > many and which interfaces are involved aside of the obvious core timer
> > ones. It's quite an amount and the problem is that you really need to do
> > that at the interface level, because many of those time stamps are taken in
> > contexts which are completely oblivious of name spaces. Ditto for timeouts
> > and similar things which are handed in through these interfaces.
>
> Yep. That sounds right.
Let's stay with the input event example for the moment: Here, we have a
character device, and a user calls read() to retrieve one or more records
of type 'struct input_event' using the evdev_read() function. The original
timestamp gets put there using this logic:
	ktime_t time;
	struct timespec64 ts;
	time = client->clk_type == EV_CLK_REAL ?
			ktime_get_real() :
			client->clk_type == EV_CLK_MONO ?
				ktime_get() :
				ktime_get_boottime();
	ts = ktime_to_timespec64(time);
	ev.input_event_sec = ts.tv_sec;
	ev.input_event_usec = ts.tv_nsec / NSEC_PER_USEC;
clk_type can get set using an ioctl() to real, monotonic or
boottime. We have to stop using EV_CLK_REAL in the
future because that breaks in y2038, but I guess EV_CLK_MONO
and EV_CLK_BOOT should stay.
If we want this to work correctly in a namespace that has a
user defined CLOCK_MONOTONIC timebase, one way to
do it might be to always call ktime_get() when we record
the timestamp in the kernel-internal CLOCK_MONOTONIC
base, but then convert it to the correct base when copying to
user space.
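A rough user-space sketch of that "record in the internal base, convert on copy" idea. It assumes the namespace base is a fixed nanosecond offset from the host CLOCK_MONOTONIC; the struct, the helper and the mono_offset_ns parameter are hypothetical and do not correspond to existing evdev code.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000LL
#define NSEC_PER_USEC 1000LL

/* Hypothetical stand-in for the seconds/microseconds pair of an input event. */
struct ev_stamp {
	int64_t sec;
	int64_t usec;
};

/* Convert a stamp recorded in host CLOCK_MONOTONIC nanoseconds into the
 * reader's base. mono_offset_ns is the assumed per-namespace offset and
 * is 0 for a reader outside any time namespace. */
static struct ev_stamp ev_stamp_for_reader(int64_t host_mono_ns,
					   int64_t mono_offset_ns)
{
	int64_t ns = host_mono_ns + mono_offset_ns;
	struct ev_stamp st;

	assert(ns >= 0); /* the sketch does not handle pre-epoch results */
	st.sec = ns / NSEC_PER_SEC;
	st.usec = (ns % NSEC_PER_SEC) / NSEC_PER_USEC;
	return st;
}
```

Because the same offset is applied to every event a given reader sees, the elapsed time between two events is unchanged by the conversion; only the absolute values move.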
Note that AFAIU practically all users of evdev do /not/ actually
care about the time base, they only care about the elapsed
time between intervals, e.g. to track how fast a pointer should
move based on input from a trackpad. I don't see any reason
why one would compare this timestamp to a clock_gettime()
value, but of course at the moment this has well-defined
behavior that would break if we change clock_gettime(), and
we have a process in the namespace that opens
/dev/input/eventX and relies on meaningful timestamps
relative to a particular base.
Arnd
On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> Thomas Gleixner <[email protected]> writes:
>
> > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> >> Reading the code the calling sequence there is:
> >> tick_sched_do_timer
> >> tick_do_update_jiffies64
> >> update_wall_time
> >> timekeeping_advance
> >> timekeepging_update
> >>
> >> If I read that properly under the right nohz circumstances that update
> >> can be delayed indefinitely.
> >>
> >> So I think we could prototype a time namespace that was per
> >> timekeeping_update and just had update_wall_time iterate through
> >> all of the time namespaces.
> >
> > Please don't go there. timekeeping_update() is already heavy and walking
> > through a gazillion of namespaces will just make it horrible,
> >
> >> I don't think the naive version would scale to very many time
> >> namespaces.
> >
> > :)
> >
> >> At the same time using the techniques from the nohz work and a little
> >> smarts I expect we could get the code to scale.
> >
> > You'd need to invoke the update when the namespace is switched in and
> > hasn't been updated since the last tick happened. That might be doable, but
> > you also need to take the wraparound constraints of the underlying
> > clocksources into account, which again can cause walking all name spaces
> > when they are all idle long enough.
>
> The wrap around constraints being how long before the time sources wrap
> around so you have to read them once per wrap around? I have not dug
> deeply enough into the code to see that yet.
>
> > From there it becomes hairy, because it's not only timekeeping,
> > i.e. reading time, this is also affecting all timers which are armed from a
> > namespace.
> >
> > That gets really ugly because when you do settimeofday() or adjtimex() for
> > a particular namespace, then you have to search for all armed timers of
> > that namespace and adjust them.
> >
> > The original posix timer code had the same issue because it mapped the
> > clock realtime timers to the timer wheel so any setting of the clock caused
> > a full walk of all armed timers, disarming, adjusting and requeing
> > them. That's horrible not only performance wise, it's also a locking
> > nightmare of all sorts.
> >
> > Add time skew via NTP/PTP into the picture and you might have to adjust
> > timers as well, because you need to guarantee that they are not expiring
> > early.
> >
> > I haven't looked through Dimitry's patches yet, but I don't see how this
> > can work at all without introducing subtle issues all over the place.
>
> Then it sounds like this will take some more digging.
>
> Please pardon me for thinking out load.
>
> There are one or more time sources that we use to compute the time
> and for each time source we have a conversion from ticks of the
> time source to nanoseconds.
>
> Each time source needs to be sampled at least once per wrap-around
> and something incremented so that we don't loose time when looking
> at that time source.
>
> There are several clocks presented to userspace and they all share the
> same length of second and are all fundamentally offsets from
> CLOCK_MONOTONIC.
>
> I see two fundamental driving cases for a time namespace.
> 1) Migration from one node to another node in a cluster in almost
> real time.
>
> The problem is that CLOCK_MONOTONIC between nodes in the cluster
> has not relation ship to each other (except a synchronized length of
> the second). So applications that migrate can see CLOCK_MONOTONIC
> and CLOCK_BOOTTIME go backwards.
>
> This is the truly pressing problem and adding some kind of offset
> sounds like it would be the solution. Possibly by allowing a boot
> time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
>
> 2) Dealing with two separate time management domains. Say a machine
> that needes to deal with both something inside of google where they
> slew time to avoid leap time seconds and something in the outside
> world proper UTC time is kept as an offset from TAI with the
> occasional leap seconds.
>
> In the later case it would fundamentally require having seconds of
> different length.
>
I want to add that the second case should be optional.
When a container is migrated to another host, we have to restore its
monotonic and boottime clocks, but we still expect that the container
will continue using the host real-time clock.
Before starting this series, I thought about this and decided that
these cases can be solved independently. Probably, the full isolation of
the time sub-system will have much higher overhead than just offsets for
a few clocks. And the idea that isolation of the real-time clock should
be optional gives us another hint that offsets for monotonic and
boot-time clocks can be implemented independently.
Eric and Tomas, what do you think about this? If you agree that these
two cases can be implemented separately, what should we do with this
series to make it ready to be merged?
I know that we need to:
* look at device drivers that report timestamps in CLOCK_MONOTONIC base.
* forbid changing offsets after creating timers
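The second item could be modelled roughly as below. This is only a sketch of the rule "no offset changes once a timer exists"; the struct, the frozen flag and the choice of -EBUSY are assumptions for illustration, not what the series implements.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-namespace state. */
struct time_ns_model {
	int64_t mono_offset_ns; /* offset applied to CLOCK_MONOTONIC */
	bool    offsets_frozen; /* set once the first timer is created */
};

/* Called when a task in the namespace creates a timer. */
static void time_ns_timer_created(struct time_ns_model *ns)
{
	assert(ns != NULL);
	ns->offsets_frozen = true;
}

/* Set the monotonic offset; refused once any timer has been created. */
static int time_ns_set_offset(struct time_ns_model *ns, int64_t offset_ns)
{
	assert(ns != NULL);
	if (ns->offsets_frozen)
		return -EBUSY;
	ns->mono_offset_ns = offset_ns;
	return 0;
}
```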
Anything else?
Thanks,
Andrei
>
> A pure 64bit nanoseond counter is good for 500 years. So 64bit
> variables can be used to hold time, and everything can be converted from
> there.
>
> This suggests we can for ticks have two values.
> - The number of ticks from the time source.
> - The number of times the ticks would have rolled over.
>
> That sounds like it may be a little simplistic as it would require being
> very diligent about firing a timer exactly at rollover and not losing
> that, but for a handwaving argument is probably enough to generate
> a 64bit tick counter.
>
> If the focus is on a 64bit tick counter then what update_wall_time
> has to do is very limited. Just deal the accounting needed to cope with
> tick rollover.
>
> Getting the actual time looks like it would be as simple as now, with
> perhaps an extra addition to account for the number of times the tick
> counter has rolled over. With limited precision arithmetic and various
> optimizations I don't think it is that simple to implement but it feels
> like it should be very little extra work.
>
> For timers my inclination would be to assume no adjustments to the
> current time parameters and set the timer to go off then. If the time
> on the appropriate clock has been changed since the timer was set and
> the timer is going off early reschedule so the timer fires at the
> appropriate time.
>
> With the above I think it is theoretically possible to build a time
> namespace that supports multiple lengths of second, and does not have
> much overhead.
>
> Not that I think a final implementation would necessary look like what I
> have described. I just think it is possible with extreme care to evolve
> the current code base into something that can efficiently handle
> multiple time domains with slightly different lenghts of second.
>
> Thomas does it sound like I am completely out of touch with reality?
>
> It does though sound like it is going to take some serious digging
> through the code to understand how what everything does and how and why
> everthing works the way it does. Not something grafted on top with just
> a cursory understanding of how the code works.
>
> Eric
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <[email protected]> writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >> tick_do_update_jiffies64
> > >> update_wall_time
> > >> timekeeping_advance
> > >> timekeepging_update
> > >>
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> >
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out load.
> >
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> > real time.
> >
> > The problem is that CLOCK_MONOTONIC between nodes in the cluster
> > has not relation ship to each other (except a synchronized length of
> > the second). So applications that migrate can see CLOCK_MONOTONIC
> > and CLOCK_BOOTTIME go backwards.
> >
> > This is the truly pressing problem and adding some kind of offset
> > sounds like it would be the solution. Possibly by allowing a boot
> > time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> > that needs to deal with both something inside of Google, where they
> > slew time to avoid leap seconds, and something in the outside
> > world, where proper UTC time is kept as an offset from TAI with the
> > occasional leap second.
> >
> > In the latter case it would fundamentally require having seconds of
> > different length.
> >
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I thought about this and decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
>
> Eric and Tomas, what do you think about this? If you agree that these
Sorry Thomas, I mistyped your name.
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
>
> I know that we need to:
>
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
>
> Anything else?
>
> Thanks,
> Andrei
>
> >
> > A pure 64bit nanosecond counter is good for 500 years. So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> >
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> >
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
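
The two-value scheme described above can be illustrated with a minimal
sketch. It assumes a hypothetical free-running 32-bit hardware counter;
the names tick_ext and tick_extend are invented for this example and are
not taken from the kernel. It only works if the counter is sampled at
least once per wrap-around, as noted earlier in this message.

```c
#include <stdint.h>

/* Hypothetical state for widening a 32-bit counter to 64 bits. */
struct tick_ext {
    uint32_t last;   /* last raw counter value that was sampled */
    uint64_t wraps;  /* number of rollovers observed so far */
};

/*
 * Fold a new raw sample into the state and return the 64-bit tick count.
 * A raw value smaller than the previous sample is taken to mean that the
 * counter wrapped exactly once since the last call.
 */
uint64_t tick_extend(struct tick_ext *s, uint32_t now)
{
    if (now < s->last)
        s->wraps++;
    s->last = now;
    return (s->wraps << 32) | now;
}
```

If two rollovers happen between samples, this sketch silently loses
2^32 ticks, which is the "being very diligent" caveat mentioned above.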
> >
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited. Just deal with the accounting needed to cope with
> > tick rollover.
> >
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over. With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> >
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then. If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> >
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> >
> > Not that I think a final implementation would necessarily look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lengths of second.
> >
> > Thomas does it sound like I am completely out of touch with reality?
> >
> > It does, though, sound like it is going to take some serious digging
> > through the code to understand what everything does and how and why
> > everything works the way it does. Not something grafted on top with just
> > a cursory understanding of how the code works.
> >
> > Eric
Andrei,
On Sat, 20 Oct 2018, Andrei Vagin wrote:
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I thought about this and decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
>
> Eric and Tomas, what do you think about this? If you agree that these
> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
>
> I know that we need to:
>
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
and CLOCK_BOOTTIME and that's quite a few.
> * forbid changing offsets after creating timers
There are more things to think about. What about interfaces which expose
boot time or monotonic time in /proc?
Aside from that (I finally got around to looking at the series in more detail)
I'm really unhappy about the unconditional overhead once the Time namespace
config switch is enabled. This applies especially to the VDSO. We spent
quite some time recently to squeeze a few cycles out of those functions and
it would be a pity to pointlessly waste cycles for the !namespace case.
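
To make the concern concrete, here is a minimal userspace sketch of an
offset lookup that costs nothing beyond one pointer test in the root
namespace. All names (ns_clock, timens_offsets, timens_apply) are
hypothetical and times are plain 64-bit nanosecond values; this is not
the code from the series, only an illustration of the fast path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clock identifiers for this sketch only. */
enum ns_clock { NS_CLOCK_REALTIME, NS_CLOCK_MONOTONIC, NS_CLOCK_BOOTTIME };

/* Hypothetical per-namespace offsets, in nanoseconds. */
struct timens_offsets {
    int64_t monotonic_ns;
    int64_t boottime_ns;
};

/*
 * Translate a host clock reading into the caller's namespace.
 * ns == NULL stands for the root namespace and returns immediately,
 * so the !namespace case pays only for a single pointer test.
 */
int64_t timens_apply(enum ns_clock id, int64_t host_ns,
                     const struct timens_offsets *ns)
{
    if (ns == NULL)
        return host_ns;

    switch (id) {
    case NS_CLOCK_MONOTONIC:
        return host_ns + ns->monotonic_ns;
    case NS_CLOCK_BOOTTIME:
        return host_ns + ns->boottime_ns;
    default:
        /* CLOCK_REALTIME is left untouched, as in the current series. */
        return host_ns;
    }
}
```

Whether even that single test is acceptable in the VDSO is exactly the
open question; the sketch only shows where the cost would sit.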
I can see the urge for this, but please let us think it through properly
before rushing anything in which we are going to regret once we want to do
more sophisticated time domain management, e.g. support for isolated clock
real time. I'm worried that without a clear plan about the overall
picture, we end up with duct tape which is hard to disentangle after the
fact.
There have been a few other things brought up versus time management in
general, like the TSN folks utilizing grand clock masters which expose
random time instead of proper TAI. Plus some requirements for exposing some
sort of 'monotonic' clocks which are derived from external synchronization
mechanisms, but should not affect the regular time keeping clocks.
While different issues, these all fall into the category of separate time
domains, so taking a step back to the drawing board is probably the best
thing we can do now.
There are certainly a few things which can be looked at independently,
e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
kernel with these name space functions applying offsets left and right. I
would rather have dedicated core functionality which replaces/amends existing
timer functions to become time namespace aware.
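
One possible shape for such dedicated functionality, sketched in
userspace C under stated assumptions: a single pair of conversion
helpers that every caller goes through, instead of open-coded offset
arithmetic at each call site. The names time_namespace, timens_to_host,
timens_from_host and timer_expiry_to_host are made up for illustration
and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clock identifiers; NS_CLOCK_COUNT sizes the offset table. */
enum ns_clock {
    NS_CLOCK_REALTIME,
    NS_CLOCK_MONOTONIC,
    NS_CLOCK_BOOTTIME,
    NS_CLOCK_COUNT
};

/* Hypothetical namespace object: one nanosecond offset per clock. */
struct time_namespace {
    int64_t offset_ns[NS_CLOCK_COUNT];
};

/* Host time -> namespace time. A NULL ns means the root namespace. */
int64_t timens_from_host(const struct time_namespace *ns,
                         enum ns_clock id, int64_t host_ns)
{
    return ns ? host_ns + ns->offset_ns[id] : host_ns;
}

/* Namespace time -> host time, the inverse of timens_from_host(). */
int64_t timens_to_host(const struct time_namespace *ns,
                       enum ns_clock id, int64_t ns_time)
{
    return ns ? ns_time - ns->offset_ns[id] : ns_time;
}

/*
 * Convert a timer expiry supplied by a task in ns into host time.
 * Only absolute expiries depend on the offset; relative ones are
 * durations and pass through unchanged.
 */
int64_t timer_expiry_to_host(const struct time_namespace *ns,
                             enum ns_clock id, int64_t expiry_ns,
                             bool absolute)
{
    return absolute ? timens_to_host(ns, id, expiry_ns) : expiry_ns;
}
```

With a layout like this, timers would be queued in host time only, so
nothing would need to be walked and requeued as long as offsets cannot
change after timers exist.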
I'll try to find some time in the next weeks to look deeper into that, but
I can't promise anything before returning from LPC. Btw, LPC would be a
great opportunity to discuss that. Are you and the other name space wizards
there by any chance?
Thanks,
tglx
Thomas Gleixner <[email protected]> writes:
> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
>> When a container is migrated to another host, we have to restore its
>> monotonic and boottime clocks, but we still expect that the container
>> will continue using the host real-time clock.
>>
>> Before starting this series, I thought about this and decided that
>> these cases can be solved independently. Probably, the full isolation of
>> the time sub-system will have much higher overhead than just offsets for
>> a few clocks. And the idea that isolation of the real-time clock should
>> be optional gives us another hint that offsets for monotonic and
>> boot-time clocks can be implemented independently.
>>
>> Eric and Tomas, what do you think about this? If you agree that these
>> two cases can be implemented separately, what should we do with this
>> series to make it ready to be merged?
>>
>> I know that we need to:
>>
>> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
>> * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried that without a clear plan about the overall
> picture, we end up with duct tape which is hard to disentangle after the
> fact.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?
I will be and there are going to be both container and CRIU
mini-conferences. So there should be at least some of us around.
Eric
Eric,
On Mon, 29 Oct 2018, Eric W. Biederman wrote:
> Thomas Gleixner <[email protected]> writes:
> >
> > I'll try to find some time in the next weeks to look deeper into that, but
> > I can't promise anything before returning from LPC. Btw, LPC would be a
> > great opportunity to discuss that. Are you and the other name space wizards
> > there by any chance?
>
> I will be and there are going to be both container and CRIU
> mini-conferences. So there should at least some of us around.
So let's try to find a slot for a BOF or similar (there might still be
slots available for the kernel summit; I'll ask).
Thanks,
tglx
On Mon, Oct 29, 2018 at 09:33:14PM +0100, Thomas Gleixner wrote:
> Andrei,
>
> On Sat, 20 Oct 2018, Andrei Vagin wrote:
> > When a container is migrated to another host, we have to restore its
> > monotonic and boottime clocks, but we still expect that the container
> > will continue using the host real-time clock.
> >
> > Before starting this series, I thought about this and decided that
> > these cases can be solved independently. Probably, the full isolation of
> > the time sub-system will have much higher overhead than just offsets for
> > a few clocks. And the idea that isolation of the real-time clock should
> > be optional gives us another hint that offsets for monotonic and
> > boot-time clocks can be implemented independently.
> >
> > Eric and Tomas, what do you think about this? If you agree that these
> > two cases can be implemented separately, what should we do with this
> > series to make it ready to be merged?
> >
> > I know that we need to:
> >
> > * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
>
> and CLOCK_BOOTTIME and that's quite a few.
>
> > * forbid changing offsets after creating timers
>
> There are more things to think about. What about interfaces which expose
> boot time or monotonic time in /proc?
We didn't find any proc files where boot or monotonic time is reported,
but we will double check this.
>
> Aside of that (I finally came around to look at the series in more detail)
> I'm really unhappy about the unconditional overhead once the Time namespace
> config switch is enabled. This applies especially to the VDSO. We spent
> quite some time recently to squeeze a few cycles out of those functions and
> it would be a pity to pointlessly waste cycles for the !namespace case.
It is a good point. We will work on it.
>
> I can see the urge for this, but please let us think it through properly
> before rushing anything in which we are going to regret once we want to do
> more sophisticated time domain management, e.g. support for isolated clock
> real time. I'm worried that without a clear plan about the overall
> picture, we end up with duct tape which is hard to disentangle after the
> fact.
Thomas, there is no rush at all. This functionality is critical for
CRIU, but we have enough time to solve it properly.
The only thing I want is for this functionality to keep moving
forward and not be put on the back burner.
>
> There have been a few other things brought up versus time management in
> general, like the TSN folks utilizing grand clock masters which expose
> random time instead of proper TAI. Plus some requirements for exposing some
> sort of 'monotonic' clocks which are derived from external synchronization
> mechanisms, but should not affect the regular time keeping clocks.
>
> While different issues, these all fall into the category of separate time
> domains, so taking a step back to the drawing board is probably the best
> thing what we can do now.
>
> There are certainly a few things which can be looked at independently,
> e.g. the VDSO mechanics or general mechanisms to avoid plastering the whole
> kernel with these name space functions applying offsets left and right. I
> rather have dedicated core functionality which replaces/amends existing
> timer functions to become time namespace aware.
>
> I'll try to find some time in the next weeks to look deeper into that, but
> I can't promise anything before returning from LPC. Btw, LPC would be a
> great opportunity to discuss that. Are you and the other name space wizards
> there by any chance?
Dmitry and I are going to be there.
Thanks!
Andrei
>
> Thanks,
>
> tglx
>
>