2019-02-06 00:15:19

by Dmitry Safonov

Subject: [PATCH 00/32] kernel: Introduce Time Namespace

Time namespaces have been discussed for a long time. The first
attempt to implement them was made by Jeff Dike in 2006. Since then, the
topic has come up on and off in various discussions.

There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.

“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)

The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
their start points are not defined and differ from system to
system. When a container is migrated from one node to another, all
clocks have to be restored into consistent states; in other words, they
have to continue running from the same points where they were
dumped.

The main idea of this patch set is to add per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
the time of a clock, the namespace offset is added to the current value of
that clock and the sum is returned.

All offsets are placed on a separate page, which allows us to map it as
part of VVAR into user processes and use the offsets from VDSO calls.

Currently, offsets are implemented for the CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks.

v2: There are two major changes from the previous version:

* Two versions of the VDSO library to avoid a performance penalty for
host tasks outside time namespace (as suggested by Andy and Thomas).

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may mispredict
the jump. Instruction cache lines would also be wasted on cmp/jmp.

Those effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
VDSO code.

To address those problems, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside a time
namespace, with clk_to_ns() that applies the namespace offsets to the host's time.


* Allow setting clock offsets for a namespace only before any processes
appear in it.

A time namespace now resembles a pid namespace in the way it is
created: the unshare(CLONE_NEWTIME) system call creates a new time namespace
but doesn't move the current process into it. All children of
the process are then born in the new time namespace, and a process can
use the setns() system call to join a namespace.

This scheme allows creating a new time namespace, setting clock offsets
and then populating the namespace with processes.

Our performance measurements show that the overhead of VDSO's clock_gettime()
in a child time namespace is about 8% with a hot CPU cache and about 90%
with a cold CPU cache. There is no performance regression for host
processes outside a time namespace in those tests.

We wrote two small benchmarks. The first one, gettime_perf.c, calls
clock_gettime() in a loop for 3 seconds. It shows performance with
a hot CPU cache (more clock_gettime() cycles is better):

         | before    | CONFIG_TIME_NS=n | host      | inside timens
---------|-----------|------------------|-----------|--------------
cycles   | 139887013 | 139453003        | 139899785 | 128792458
diff (%) | 100       | 99.7             | 100       | 92

The second one, gettime_perf_cold.c, calls rdtsc, clock_gettime(), rdtsc
and reports the difference between the second and the first rdtsc. The binary is
run in a loop 1000 times, and the mode of the 1000 values is then calculated.
It should show performance with a cold CPU cache
(lower tsc per cycle is better):

         | before | CONFIG_TIME_NS=n | host  | inside timens
---------|--------|------------------|-------|--------------
tsc      | 6748   | 6718             | 6862  | 12682
diff (%) | 100    | 99.6             | 101.7 | 188

The numbers were gathered on an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz.

Cc: Adrian Reber <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Tucker <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Andrei Vagin (15):
ns: Introduce Time Namespace
timens: Add timens_offsets
timens: Introduce CLOCK_MONOTONIC offsets
timens: Introduce CLOCK_BOOTTIME offset
timerfd/timens: Take into account ns clock offsets
posix-timers/timens: Take into account clock offsets
timens/kernel: Take into account timens clock offsets in
clock_nanosleep
x86/vdso/timens: Add offsets page in vvar
timens/fs/proc: Introduce /proc/pid/timens_offsets
selftest/timens: Add a test for timerfd
selftest/timens: Add a test for clock_nanosleep()
selftest/timens: Add timer offsets test
selftests: Add a simple perf test for clock_gettime()
selftest/timens: Check that a right vdso is mapped after fork and exec
x86/vdso: Align VDSO functions by CPU L1 cache line

Dmitry Safonov (17):
timens: Shift /proc/uptime
x86/vdso2c: Correct err messages on file opening
x86/vdso2c: Convert iterator to unsigned
x86/vdso/Makefile: Add vobjs32
x86/vdso: Build timens .so(s)
x86/VDSO: Build VDSO with -ffunction-sections
x86/vdso2c: Optionally produce linker script for vdso entries
x86/vdso: Generate vdso{,32}-timens.lds
x86/vdso2c: Sort vdso entries by addresses for linker script
x86/vdso.lds: Align !timens (host's) vdso.so entries
x86/vdso2c: Align LOCAL symbols between vdso{-timens,}.so
x86/vdso: Initialize timens 64-bit vdso
x86/vdso: Switch image on setns()/unshare()/clone()
timens: Add align for timens_offsets
selftest/timens: Add Time Namespace test for supported clocks
selftest/timens: Add procfs selftest
x86/vdso: Restrict splitting VVAR VMA

MAINTAINERS | 3 +
arch/Kconfig | 5 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/.gitignore | 2 +
arch/x86/entry/vdso/Makefile | 61 ++-
arch/x86/entry/vdso/vclock_gettime-timens.c | 6 +
arch/x86/entry/vdso/vclock_gettime.c | 42 +++
arch/x86/entry/vdso/vdso-layout.lds.S | 21 +-
arch/x86/entry/vdso/vdso-timens.lds.S | 7 +
arch/x86/entry/vdso/vdso2c.c | 46 ++-
arch/x86/entry/vdso/vdso2c.h | 52 ++-
arch/x86/entry/vdso/vdso32/.gitignore | 1 +
arch/x86/entry/vdso/vdso32/sigreturn.S | 2 +
arch/x86/entry/vdso/vdso32/system_call.S | 2 +-
.../entry/vdso/vdso32/vclock_gettime-timens.c | 6 +
.../x86/entry/vdso/vdso32/vdso32-timens.lds.S | 8 +
arch/x86/entry/vdso/vma.c | 110 ++++++
arch/x86/include/asm/vdso.h | 8 +
fs/proc/base.c | 101 +++++
fs/proc/namespaces.c | 4 +
fs/proc/uptime.c | 3 +
fs/timerfd.c | 16 +-
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 2 +
include/linux/time_namespace.h | 91 +++++
include/linux/timens_offsets.h | 18 +
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 8 +
kernel/Makefile | 1 +
kernel/fork.c | 3 +-
kernel/nsproxy.c | 41 ++-
kernel/time/hrtimer.c | 8 +
kernel/time/posix-timers.c | 24 +-
kernel/time/posix-timers.h | 1 +
kernel/time_namespace.c | 348 ++++++++++++++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/timens/.gitignore | 7 +
tools/testing/selftests/timens/Makefile | 12 +
.../selftests/timens/clock_nanosleep.c | 99 +++++
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/exec.c | 91 +++++
tools/testing/selftests/timens/gettime_perf.c | 74 ++++
.../selftests/timens/gettime_perf_cold.c | 63 ++++
tools/testing/selftests/timens/log.h | 26 ++
tools/testing/selftests/timens/procfs.c | 142 +++++++
tools/testing/selftests/timens/timens.c | 191 ++++++++++
tools/testing/selftests/timens/timens.h | 63 ++++
tools/testing/selftests/timens/timer.c | 115 ++++++
tools/testing/selftests/timens/timerfd.c | 119 ++++++
50 files changed, 2008 insertions(+), 52 deletions(-)
create mode 100644 arch/x86/entry/vdso/vclock_gettime-timens.c
create mode 100644 arch/x86/entry/vdso/vdso-timens.lds.S
create mode 100644 arch/x86/entry/vdso/vdso32/vclock_gettime-timens.c
create mode 100644 arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S
create mode 100644 include/linux/time_namespace.h
create mode 100644 include/linux/timens_offsets.h
create mode 100644 kernel/time_namespace.c
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/exec.c
create mode 100644 tools/testing/selftests/timens/gettime_perf.c
create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/procfs.c
create mode 100644 tools/testing/selftests/timens/timens.c
create mode 100644 tools/testing/selftests/timens/timens.h
create mode 100644 tools/testing/selftests/timens/timer.c
create mode 100644 tools/testing/selftests/timens/timerfd.c

--
2.20.1



2019-02-06 00:11:54

by Dmitry Safonov

Subject: [PATCH 09/32] x86/vdso2c: Correct err messages on file opening

err() message in main() is misleading: it should print `outfilename`,
which is argv[3], not argv[2].

Correct error messages to be more precise about what failed and for
which file.

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vdso2c.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 8e470b018512..26d7177c119e 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -187,7 +187,7 @@ static void map_input(const char *name, void **addr, size_t *len, int prot)

int fd = open(name, O_RDONLY);
if (fd == -1)
- err(1, "%s", name);
+ err(1, "open(%s)", name);

tmp_len = lseek(fd, 0, SEEK_END);
if (tmp_len == (off_t)-1)
@@ -240,7 +240,7 @@ int main(int argc, char **argv)
outfilename = argv[3];
outfile = fopen(outfilename, "w");
if (!outfile)
- err(1, "%s", argv[2]);
+ err(1, "fopen(%s)", outfilename);

go(raw_addr, raw_len, stripped_addr, stripped_len, outfile, name);

--
2.20.1


2019-02-06 00:12:06

by Dmitry Safonov

Subject: [PATCH 03/32] timens: Introduce CLOCK_MONOTONIC offsets

From: Andrei Vagin <[email protected]>

Add monotonic time virtualisation for time namespace.
Introduce a timespec for the monotonic clock into timens offsets and wire
up the clock_gettime() syscall.

Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/time_namespace.h | 11 ++++++++++
include/linux/timens_offsets.h | 1 +
kernel/time/posix-timers.c | 10 +++++++++
kernel/time/posix-timers.h | 1 +
kernel/time_namespace.c | 38 ++++++++++++++++++++++++++++++++++
5 files changed, 61 insertions(+)

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index b6985aa87479..f1807d7f524d 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -41,6 +41,9 @@ static inline void put_time_ns(struct time_namespace *ns)
}


+extern void timens_clock_to_host(int clockid, struct timespec64 *val);
+extern void timens_clock_from_host(int clockid, struct timespec64 *val);
+
#else
static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
@@ -65,6 +68,14 @@ static inline int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *ts
return 0;
}

+static inline void timens_clock_to_host(int clockid, struct timespec64 *val)
+{
+}
+
+static inline void timens_clock_from_host(int clockid, struct timespec64 *val)
+{
+}
+
#endif

#endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 7d7cb68ea778..248b0c0bb92a 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -3,6 +3,7 @@
#define _LINUX_TIME_OFFSETS_H

struct timens_offsets {
+ struct timespec64 monotonic_time_offset;
};

#endif
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0e84bb72a3da..b6d5145858a3 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -30,6 +30,7 @@
#include <linux/hashtable.h>
#include <linux/compat.h>
#include <linux/nospec.h>
+#include <linux/time_namespace.h>

#include "timekeeping.h"
#include "posix-timers.h"
@@ -1041,6 +1042,9 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,

error = kc->clock_get(which_clock, &kernel_tp);

+ if (!error && kc->clock_timens_adjust)
+ timens_clock_from_host(which_clock, &kernel_tp);
+
if (!error && put_timespec64(&kernel_tp, tp))
error = -EFAULT;

@@ -1117,6 +1121,9 @@ COMPAT_SYSCALL_DEFINE2(clock_gettime, clockid_t, which_clock,

err = kc->clock_get(which_clock, &ts);

+ if (!err && kc->clock_timens_adjust)
+ timens_clock_from_host(which_clock, &ts);
+
if (!err && put_old_timespec32(&ts, tp))
err = -EFAULT;

@@ -1259,6 +1266,7 @@ static const struct k_clock clock_realtime = {
static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_ktime_get_ts,
+ .clock_timens_adjust = true,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1274,6 +1282,7 @@ static const struct k_clock clock_monotonic = {
static const struct k_clock clock_monotonic_raw = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_monotonic_raw,
+ .clock_timens_adjust = true,
};

static const struct k_clock clock_realtime_coarse = {
@@ -1284,6 +1293,7 @@ static const struct k_clock clock_realtime_coarse = {
static const struct k_clock clock_monotonic_coarse = {
.clock_getres = posix_get_coarse_res,
.clock_get = posix_get_monotonic_coarse,
+ .clock_timens_adjust = true,
};

static const struct k_clock clock_tai = {
diff --git a/kernel/time/posix-timers.h b/kernel/time/posix-timers.h
index ddb21145211a..1cf306bde639 100644
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -24,6 +24,7 @@ struct k_clock {
int (*timer_try_to_cancel)(struct k_itimer *timr);
void (*timer_arm)(struct k_itimer *timr, ktime_t expires,
bool absolute, bool sigev_none);
+ bool clock_timens_adjust;
};

extern const struct k_clock clock_posix_cpu;
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 4828447721ec..57694be9e9db 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -191,6 +191,44 @@ static struct user_namespace *timens_owner(struct ns_common *ns)
return to_time_ns(ns)->user_ns;
}

+static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
+{
+ struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+ struct timespec64 *offsets = NULL;
+
+ if (!ns_offsets)
+ return;
+
+ if (val->tv_sec == 0 && val->tv_nsec == 0)
+ return;
+
+ switch (clockid) {
+ case CLOCK_MONOTONIC:
+ case CLOCK_MONOTONIC_RAW:
+ case CLOCK_MONOTONIC_COARSE:
+ offsets = &ns_offsets->monotonic_time_offset;
+ break;
+ }
+
+ if (!offsets)
+ return;
+
+ if (to_ns)
+ *val = timespec64_add(*val, *offsets);
+ else
+ *val = timespec64_sub(*val, *offsets);
+}
+
+void timens_clock_to_host(int clockid, struct timespec64 *val)
+{
+ clock_timens_fixup(clockid, val, false);
+}
+
+void timens_clock_from_host(int clockid, struct timespec64 *val)
+{
+ clock_timens_fixup(clockid, val, true);
+}
+
const struct proc_ns_operations timens_operations = {
.name = "time",
.type = CLONE_NEWTIME,
--
2.20.1


2019-02-06 00:12:29

by Dmitry Safonov

Subject: [PATCH 24/32] selftest/timens: Add Time Namespace test for supported clocks

A test to check that all supported clocks work on the host and inside
a new time namespace. Use both ways to get the time: through the VDSO and
by entering the kernel with an explicit syscall.

Introduce a new timens directory in the selftests framework for
the upcoming timens tests.

Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 5 +
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/log.h | 26 +++
tools/testing/selftests/timens/timens.c | 191 ++++++++++++++++++++++
tools/testing/selftests/timens/timens.h | 63 +++++++
7 files changed, 288 insertions(+)
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/timens.c
create mode 100644 tools/testing/selftests/timens/timens.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1a2bd15c5b6e..cccbe89983fa 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -47,6 +47,7 @@ TARGETS += sysctl
ifneq (1, $(quicktest))
TARGETS += timers
endif
+TARGETS += timens
TARGETS += user
TARGETS += vm
TARGETS += x86
diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
new file mode 100644
index 000000000000..27a693229ce1
--- /dev/null
+++ b/tools/testing/selftests/timens/.gitignore
@@ -0,0 +1 @@
+timens
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
new file mode 100644
index 000000000000..b877efb78974
--- /dev/null
+++ b/tools/testing/selftests/timens/Makefile
@@ -0,0 +1,5 @@
+TEST_GEN_PROGS := timens
+
+CFLAGS := -Wall -Werror
+
+include ../lib.mk
diff --git a/tools/testing/selftests/timens/config b/tools/testing/selftests/timens/config
new file mode 100644
index 000000000000..4480620f6f49
--- /dev/null
+++ b/tools/testing/selftests/timens/config
@@ -0,0 +1 @@
+CONFIG_TIME_NS=y
diff --git a/tools/testing/selftests/timens/log.h b/tools/testing/selftests/timens/log.h
new file mode 100644
index 000000000000..85b54bfa50c5
--- /dev/null
+++ b/tools/testing/selftests/timens/log.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_TIMENS_LOG_H__
+#define __SELFTEST_TIMENS_LOG_H__
+
+#define pr_msg(fmt, lvl, ...) \
+ ksft_print_msg("[%s] (%s:%d)\t" fmt "\n", \
+ lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...) func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...) \
+ ({ \
+ ksft_test_result_error(fmt, ##__VA_ARGS__); \
+ -1; \
+ })
+
+#define pr_fail(fmt, ...) \
+ ({ \
+ ksft_test_result_fail(fmt, ##__VA_ARGS__); \
+ -1; \
+ })
+
+#define pr_perror(fmt, ...) pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/timens/timens.c b/tools/testing/selftests/timens/timens.c
new file mode 100644
index 000000000000..334bdefe01a3
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define CLOCK_TYPES \
+ ct(CLOCK_BOOTTIME, -1), \
+ ct(CLOCK_MONOTONIC, -1), \
+ ct(CLOCK_MONOTONIC_COARSE, 1), \
+ ct(CLOCK_MONOTONIC_RAW, 1), \
+
+
+struct test_clock {
+ clockid_t id;
+ char *name;
+ /*
+ * off_id is -1 if the clock has its own offset; otherwise it is the
+ * index in clocks[] of the clock whose offset this clock shares.
+ */
+ int off_id;
+ time_t offset;
+};
+
+#define ct(clock, off_id) { clock, #clock, off_id }
+static struct test_clock clocks[] = {
+ CLOCK_TYPES
+};
+#undef ct
+
+static int child_ns, parent_ns = -1;
+
+static int switch_ns(int fd)
+{
+ if (setns(fd, CLONE_NEWTIME)) {
+ pr_perror("setns()");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int init_namespaces(void)
+{
+ char path[] = "/proc/self/ns/time_for_children";
+ struct stat st1, st2;
+
+ if (parent_ns == -1) {
+ parent_ns = open(path, O_RDONLY);
+ if (parent_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+ }
+
+ if (fstat(parent_ns, &st1))
+ return pr_perror("Unable to stat the parent timens");
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ child_ns = open(path, O_RDONLY);
+ if (child_ns <= 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(child_ns, &st2))
+ return pr_perror("Unable to stat the timens");
+
+ if (st1.st_ino == st2.st_ino)
+ return pr_perror("The same child_ns after CLONE_NEWTIME");
+
+ return 0;
+}
+
+static int test_gettime(clockid_t clock_index, bool raw_syscall, time_t offset)
+{
+ struct timespec child_ts_new, parent_ts_old, cur_ts;
+ char *entry = raw_syscall ? "syscall" : "vdso";
+ double precision = 0.0;
+
+ switch (clocks[clock_index].id) {
+ case CLOCK_MONOTONIC_COARSE:
+ case CLOCK_MONOTONIC_RAW:
+ precision = -2.0;
+ break;
+ }
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (_gettime(clocks[clock_index].id, &parent_ts_old, raw_syscall))
+ return -1;
+
+ child_ts_new.tv_nsec = parent_ts_old.tv_nsec;
+ child_ts_new.tv_sec = parent_ts_old.tv_sec + offset;
+
+ if (switch_ns(child_ns))
+ return pr_err("switch_ns(%d)", child_ns);
+
+ if (_gettime(clocks[clock_index].id, &cur_ts, raw_syscall))
+ return -1;
+
+ if (difftime(cur_ts.tv_sec, child_ts_new.tv_sec) < precision) {
+ ksft_test_result_fail(
+ "Child's %s (%s) time has not changed: %lu -> %lu [%lu]\n",
+ clocks[clock_index].name, entry, parent_ts_old.tv_sec,
+ child_ts_new.tv_sec, cur_ts.tv_sec);
+ return -1;
+ }
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (_gettime(clocks[clock_index].id, &cur_ts, raw_syscall))
+ return -1;
+
+ if (difftime(cur_ts.tv_sec, parent_ts_old.tv_sec) > DAY_IN_SEC) {
+ ksft_test_result_fail(
+ "Parent's %s (%s) time has changed: %lu -> %lu [%lu]\n",
+ clocks[clock_index].name, entry, parent_ts_old.tv_sec,
+ child_ts_new.tv_sec, cur_ts.tv_sec);
+ /* Let's play nice and put it closer to original */
+ clock_settime(clocks[clock_index].id, &cur_ts);
+ return -1;
+ }
+
+ ksft_test_result_pass("Passed for %s (%s)\n",
+ clocks[clock_index].name, entry);
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned int i, j;
+ int ret = 0;
+
+ nscheck();
+
+ for (j = 0; j < 2; j++) {
+ time_t offset;
+
+ if (init_namespaces())
+ return 1;
+
+ /* Offsets have to be set before tasks enter the namespace. */
+ for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+ if (clocks[i].off_id != -1)
+ continue;
+ offset = TEN_DAYS_IN_SEC + i * 1000;
+ if (j > 0)
+ offset = -offset;
+ clocks[i].offset = offset;
+ if (_settime(clocks[i].id, offset))
+ return 1;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(clocks); i++) {
+ if (clocks[i].off_id != -1)
+ offset = clocks[clocks[i].off_id].offset;
+ else
+ offset = clocks[i].offset;
+ ret |= test_gettime(i, true, offset);
+ ret |= test_gettime(i, false, offset);
+ }
+ }
+
+ if (ret)
+ ksft_exit_fail();
+
+ ksft_exit_pass();
+ return !!ret;
+}
diff --git a/tools/testing/selftests/timens/timens.h b/tools/testing/selftests/timens/timens.h
new file mode 100644
index 000000000000..71a0ad78c634
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TIMENS_H__
+#define __TIMENS_H__
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+
+#include "../kselftest.h"
+
+#ifndef CLONE_NEWTIME
+# define CLONE_NEWTIME 0x00001000
+#endif
+
+static inline int _settime(clockid_t clk_id, time_t offset)
+{
+ int fd, len;
+ char buf[4096];
+
+ if (clk_id == CLOCK_MONOTONIC_COARSE || clk_id == CLOCK_MONOTONIC_RAW)
+ clk_id = CLOCK_MONOTONIC;
+
+ len = snprintf(buf, sizeof(buf), "%d %ld 0", clk_id, offset);
+
+ fd = open("/proc/self/timens_offsets", O_WRONLY);
+ if (fd < 0)
+ return pr_perror("/proc/self/timens_offsets");
+
+ if (write(fd, buf, len) != len)
+ return pr_perror("/proc/self/timens_offsets");
+
+ close(fd);
+
+ return 0;
+}
+
+static inline int _gettime(clockid_t clk_id, struct timespec *res, bool raw_syscall)
+{
+ int err;
+
+ if (!raw_syscall) {
+ if (clock_gettime(clk_id, res)) {
+ pr_perror("clock_gettime(%d)", (int)clk_id);
+ return -1;
+ }
+ return 0;
+ }
+
+ err = syscall(SYS_clock_gettime, clk_id, res);
+ if (err)
+ pr_perror("syscall(SYS_clock_gettime(%d))", (int)clk_id);
+
+ return err;
+}
+
+static inline void nscheck(void)
+{
+ if (access("/proc/self/ns/time", F_OK) < 0)
+ ksft_exit_skip("Time namespaces are not supported\n");
+}
+
+#endif
--
2.20.1


2019-02-06 00:12:34

by Dmitry Safonov

Subject: [PATCH 26/32] selftest/timens: Add a test for clock_nanosleep()

From: Andrei Vagin <[email protected]>

Check that clock_nanosleep() takes into account clock offsets.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
.../selftests/timens/clock_nanosleep.c | 99 +++++++++++++++++++
3 files changed, 101 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index b609f6ee9fb9..9b6c8ddac2c8 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,2 +1,3 @@
+clock_nanosleep
timens
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 66b90cd28e5c..76a1dc891184 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd
+TEST_GEN_PROGS := timens timerfd clock_nanosleep

CFLAGS := -Wall -Werror

diff --git a/tools/testing/selftests/timens/clock_nanosleep.c b/tools/testing/selftests/timens/clock_nanosleep.c
new file mode 100644
index 000000000000..9a1689e5a0e1
--- /dev/null
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+#include "timens.h"
+
+static long long get_elapsed_time(int clockid, struct timespec *start)
+{
+ struct timespec curr;
+ long long secs, nsecs;
+
+ if (clock_gettime(clockid, &curr) == -1)
+ return pr_perror("clock_gettime");
+
+ secs = curr.tv_sec - start->tv_sec;
+ nsecs = curr.tv_nsec - start->tv_nsec;
+ if (nsecs < 0) {
+ secs--;
+ nsecs += 1000000000;
+ }
+ if (nsecs > 1000000000) {
+ secs++;
+ nsecs -= 1000000000;
+ }
+ return secs * 1000 + nsecs / 1000000;
+}
+
+int run_test(int clockid)
+{
+ long long elapsed;
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ struct timespec now = {};
+ struct timespec start;
+
+ if (clock_gettime(clockid, &start) == -1)
+ return pr_perror("clock_gettime");
+
+
+ if (i == 1) {
+ now.tv_sec = start.tv_sec;
+ now.tv_nsec = start.tv_nsec;
+ }
+
+ now.tv_sec += 2;
+ clock_nanosleep(clockid, i ? TIMER_ABSTIME : 0, &now, NULL);
+
+ elapsed = get_elapsed_time(clockid, &start);
+ if (elapsed < 1900 || elapsed > 2100) {
+ pr_fail("clockid: %d abs: %d elapsed: %lld\n",
+ clockid, i, elapsed);
+ return 1;
+ }
+ ksft_test_result_pass("clockid: %d abs:%d\n", clockid, i);
+ }
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret, nsfd;
+
+ nscheck();
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ if (_settime(CLOCK_MONOTONIC, 7 * 24 * 3600))
+ return 1;
+ if (_settime(CLOCK_BOOTTIME, 9 * 24 * 3600))
+ return 1;
+
+ nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+ if (nsfd < 0)
+ return pr_perror("Unable to open time_for_children");
+
+ if (setns(nsfd, CLONE_NEWTIME))
+ return pr_perror("Unable to set timens");
+
+ ret = 0;
+ ret |= run_test(CLOCK_MONOTONIC);
+
+ if (ret)
+ ksft_exit_fail();
+ ksft_exit_pass();
+ return ret;
+}
+
--
2.20.1


2019-02-06 00:12:38

by Dmitry Safonov

Subject: [PATCH 30/32] selftest/timens: Check that a right vdso is mapped after fork and exec

From: Andrei Vagin <[email protected]>

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/exec.c | 91 +++++++++++++++++++++++
3 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/exec.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 6bb90fdb4519..b08b4066f5ca 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
clock_nanosleep
+exec
gettime_perf
procfs
timens
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index ef65bf96b55c..9e0edf354906 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf exec

uname_M := $(shell uname -m 2>/dev/null || echo not)
ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
diff --git a/tools/testing/selftests/timens/exec.c b/tools/testing/selftests/timens/exec.c
new file mode 100644
index 000000000000..b3a05c41e202
--- /dev/null
+++ b/tools/testing/selftests/timens/exec.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+#define OFFSET (36000)
+
+int main(int argc, char *argv[])
+{
+ struct timespec now, tst;
+ int status, i;
+ pid_t pid;
+
+ if (argc > 1) {
+ if (sscanf(argv[1], "%ld", &now.tv_sec) != 1)
+ return pr_perror("sscanf");
+
+ for (i = 0; i < 2; i++) {
+ _gettime(CLOCK_MONOTONIC, &tst, i);
+ if (abs(tst.tv_sec - now.tv_sec) > 5)
+ return pr_fail("%ld %ld\n", now.tv_sec, tst.tv_sec);
+ }
+ }
+
+ nscheck();
+
+ clock_gettime(CLOCK_MONOTONIC, &now);
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ if (_settime(CLOCK_MONOTONIC, OFFSET))
+ return 1;
+
+ for (i = 0; i < 2; i++) {
+ _gettime(CLOCK_MONOTONIC, &tst, i);
+ if (abs(tst.tv_sec - now.tv_sec) > 5)
+ return pr_fail("%ld %ld\n",
+ now.tv_sec, tst.tv_sec);
+ }
+
+ if (argc > 1)
+ return 0;
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("fork");
+
+ if (pid == 0) {
+ char now_str[64];
+ char *cargv[] = {"exec", now_str, NULL};
+ char *cenv[] = {NULL};
+
+ /* Check that a child process is in the new timens. */
+ for (i = 0; i < 2; i++) {
+ _gettime(CLOCK_MONOTONIC, &tst, i);
+ if (abs(tst.tv_sec - now.tv_sec - OFFSET) > 5)
+ return pr_fail("%ld %ld\n",
+ now.tv_sec + OFFSET, tst.tv_sec);
+ }
+
+ /* Check that a proper vdso will be mapped after execve. */
+ snprintf(now_str, sizeof(now_str), "%ld", now.tv_sec + OFFSET);
+ execve("/proc/self/exe", cargv, cenv);
+ return pr_perror("execve");
+ }
+
+ if (waitpid(pid, &status, 0) != pid)
+ return pr_perror("waitpid");
+
+ if (status)
+ ksft_exit_fail();
+
+ ksft_test_result_pass("exec\n");
+ ksft_exit_pass();
+ return 0;
+}
--
2.20.1


2019-02-06 00:12:44

by Dmitry Safonov

Subject: [PATCH 29/32] selftests: Add a simple perf test for clock_gettime()

From: Andrei Vagin <[email protected]>

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 8 +-
tools/testing/selftests/timens/gettime_perf.c | 74 +++++++++++++++++++
.../selftests/timens/gettime_perf_cold.c | 63 ++++++++++++++++
4 files changed, 145 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/gettime_perf.c
create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 3b7eda8f35ce..6bb90fdb4519 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
clock_nanosleep
+gettime_perf
procfs
timens
timer
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index ae1ffd24cc43..ef65bf96b55c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,10 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),x86_64)
+TEST_GEN_PROGS += gettime_perf_cold
+endif

CFLAGS := -Wall -Werror
LDFLAGS := -lrt
diff --git a/tools/testing/selftests/timens/gettime_perf.c b/tools/testing/selftests/timens/gettime_perf.c
new file mode 100644
index 000000000000..510d77a941d9
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <time.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#include "log.h"
+#include "timens.h"
+
+//#define TEST_SYSCALL
+
+static void test(clockid_t clockid, char *clockstr, bool in_ns)
+{
+ struct timespec tp, start;
+ long i = 0;
+ const int timeout = 3;
+
+#ifndef TEST_SYSCALL
+ clock_gettime(clockid, &start);
+#else
+ syscall(__NR_clock_gettime, clockid, &start);
+#endif
+ tp = start;
+ for (tp = start; start.tv_sec + timeout > tp.tv_sec ||
+ (start.tv_sec + timeout == tp.tv_sec &&
+ start.tv_nsec > tp.tv_nsec); i++) {
+#ifndef TEST_SYSCALL
+ clock_gettime(clockid, &tp);
+#else
+ syscall(__NR_clock_gettime, clockid, &tp);
+#endif
+ }
+
+ ksft_test_result_pass("%s:\tclock: %10s\tcycles:\t%10ld\n",
+ in_ns ? "ns" : "host", clockstr, i);
+}
+
+int main(int argc, char *argv[])
+{
+ time_t offset = 10;
+ int nsfd;
+
+ test(CLOCK_MONOTONIC, "monotonic", false);
+ test(CLOCK_BOOTTIME, "boottime", false);
+
+ nscheck();
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+ if (nsfd < 0)
+ return pr_perror("Can't open a time namespace");
+
+ if (_settime(CLOCK_MONOTONIC, offset))
+ return 1;
+ if (_settime(CLOCK_BOOTTIME, offset))
+ return 1;
+
+ if (setns(nsfd, CLONE_NEWTIME))
+ return pr_perror("setns");
+
+ test(CLOCK_MONOTONIC, "monotonic", true);
+ test(CLOCK_BOOTTIME, "boottime", true);
+
+ ksft_exit_pass();
+ return 0;
+}
diff --git a/tools/testing/selftests/timens/gettime_perf_cold.c b/tools/testing/selftests/timens/gettime_perf_cold.c
new file mode 100644
index 000000000000..f72db8a4c903
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf_cold.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <time.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <string.h>
+
+#include "log.h"
+#include "timens.h"
+
+static __inline__ unsigned long long rdtsc(void)
+{
+ unsigned hi, lo;
+
+ __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
+ return ((unsigned long long) lo) | (((unsigned long long)hi) << 32);
+}
+
+static void test(clockid_t clockid, char *clockstr)
+{
+ struct timespec tp;
+ long long s, e;
+
+ s = rdtsc();
+ clock_gettime(clockid, &tp);
+ e = rdtsc();
+ printf("%lld\n", e - s);
+ return;
+}
+
+int main(int argc, char **argv)
+{
+ time_t offset = 10;
+ int nsfd;
+
+ if (argc == 1) {
+ test(CLOCK_MONOTONIC, "monotonic");
+ return 0;
+ }
+ nscheck();
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+ if (nsfd < 0)
+ return pr_perror("Can't open a time namespace");
+
+ if (_settime(CLOCK_MONOTONIC, offset))
+ return 1;
+
+ if (setns(nsfd, CLONE_NEWTIME))
+ return pr_perror("setns");
+
+ test(CLOCK_MONOTONIC, "monotonic");
+ return 0;
+}
--
2.20.1


2019-02-06 00:12:47

by Dmitry Safonov

[permalink] [raw]
Subject: [PATCH 32/32] x86/vdso: Restrict splitting VVAR VMA

Although time namespaces can work with a split VVAR VMA, it seems worth
forbidding the split: that gives a stricter ABI and reduces the number
of corner cases to consider while working further on the VDSO.

I don't think there is any use case for a partial mremap() of vvar,
but if there is, this patch can easily be dropped.
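
For illustration only, the size check this patch adds can be modelled in
userspace as below. The helper name vvar_remap_allowed() and its
parameters are hypothetical and not part of the patch; the sketch only
assumes, as the hunk does, that sym_vvar_start is a negative offset whose
negation is the full VVAR size.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the check in vvar_mremap().
 * The VVAR area sits just below the vDSO text, so sym_vvar_start is a
 * negative offset and its negation is the full VVAR size in bytes.
 */
static bool vvar_remap_allowed(unsigned long new_start,
			       unsigned long new_end,
			       long sym_vvar_start)
{
	unsigned long new_size = new_end - new_start;

	/* Only a move of the whole area is accepted; a split is refused. */
	return new_size == (unsigned long)(-sym_vvar_start);
}
```

With a three-page VVAR (sym_vvar_start == -0x3000), a remap of
[0x7000, 0xa000) would pass this check, while a remap of only the first
page would be refused, which corresponds to -EINVAL in the patch.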

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 52c1e4c24455..dc1fca4d935a 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -87,6 +87,18 @@ static int vdso_mremap(const struct vm_special_mapping *sm,
return 0;
}

+static int vvar_mremap(const struct vm_special_mapping *sm,
+ struct vm_area_struct *new_vma)
+{
+ unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
+ const struct vdso_image *image = current->mm->context.vdso_image;
+
+ if (new_size != -image->sym_vvar_start)
+ return -EINVAL;
+
+ return 0;
+}
+
static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
@@ -149,6 +161,7 @@ static const struct vm_special_mapping vdso_mapping = {
static const struct vm_special_mapping vvar_mapping = {
.name = "[vvar]",
.fault = vvar_fault,
+ .mremap = vvar_mremap,
};

#ifdef CONFIG_TIME_NS
--
2.20.1


2019-02-06 00:12:58

by Dmitry Safonov

Subject: [PATCH 31/32] x86/vdso: Align VDSO functions by CPU L1 cache line

From: Andrei Vagin <[email protected]>

After performance testing the VDSO patches, a noticeable 20% regression
was found on the gettime_perf selftest with a cold cache.
As it turns out, before the introduction of time namespaces the VDSO
functions happened to be aligned to cache lines, but the new code that
adjusts the timens offset inside a namespace created a small shift, and
the VDSO functions became unaligned with respect to cache lines.

Align the VDSO functions with a gcc option to fix the performance drop.

Copying the resulting numbers from the cover letter:

Hot CPU cache (gettime_perf.c; more cycles is better):
| before | CONFIG_TIME_NS=n | host | inside timens
--------|------------|------------------|-------------|-------------
cycles | 139887013 | 139453003 | 139899785 | 128792458
diff (%)| 100 | 99.7 | 100 | 92

Cold cache (gettime_perf_cold.c; fewer tsc ticks per call is better):
| before | CONFIG_TIME_NS=n | host | inside timens
--------|------------|------------------|-------------|-------------
tsc | 6748 | 6718 | 6862 | 12682
diff (%)| 100 | 99.6 | 101.7 | 188

Measured on Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
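
As a rough sketch of why alignment matters here, the helpers below
compute how many cache lines a function body touches for a given start
address. All names are hypothetical and chosen for illustration; the
only assumption taken from the patch is that the cache line size is
1 << shift, with a shift of 6 giving 64 bytes. Note that GCC documents
the argument of -falign-functions as a byte count rather than a shift,
which is worth keeping in mind when reading the Makefile hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cache line size in bytes for a given shift, e.g. 6 gives 64. */
static inline uintptr_t cache_line_bytes(unsigned int shift)
{
	return (uintptr_t)1 << shift;
}

/* True if addr is the first byte of a cache line. */
static inline bool starts_on_cache_line(uintptr_t addr, unsigned int shift)
{
	return (addr & (cache_line_bytes(shift) - 1)) == 0;
}

/* Number of cache lines covered by the byte range [addr, addr + len). */
static inline uintptr_t cache_lines_touched(uintptr_t addr, uintptr_t len,
					    unsigned int shift)
{
	if (len == 0)
		return 0;
	return ((addr + len - 1) >> shift) - (addr >> shift) + 1;
}
```

A 64-byte function that starts on a line boundary touches one line; the
same function shifted by 8 bytes touches two, which on a cold cache
means one more line to fetch.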

Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/Makefile | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 4e1659619e7e..2cac4660db05 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -4,6 +4,7 @@
#

KBUILD_CFLAGS += $(DISABLE_LTO) -ffunction-sections
+KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_L1_CACHE_SHIFT)
KASAN_SANITIZE := n
UBSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y
--
2.20.1


2019-02-06 00:13:08

by Dmitry Safonov

Subject: [PATCH 28/32] selftest/timens: Add timer offsets test

From: Andrei Vagin <[email protected]>

Check that timer_create() takes into account clock offsets.
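
The arithmetic the test relies on can be sketched as follows, assuming
(as in the cover letter) that the namespace view of a clock is the host
value plus the per-namespace offset. The helper names are hypothetical
and exist only for this illustration.

```c
#include <time.h>

/* Hypothetical helper: namespace view of a host clock value. */
static time_t ns_time(time_t host_now, time_t offset)
{
	return host_now + offset;
}

/*
 * Seconds left until an absolute expiry that was given in namespace
 * time, as a kernel that honours the offset would report it.
 */
static time_t abs_timer_remaining(time_t expiry_ns, time_t host_now,
				  time_t offset)
{
	return expiry_ns - ns_time(host_now, offset);
}
```

With a 70-day offset, a timer armed with TIMER_ABSTIME at "namespace
now + 3600" should report about 3600 seconds remaining. If the offset
were ignored, the reported value would be off by roughly 70 days, far
outside the 60-second tolerance used by the test.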

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 3 +-
tools/testing/selftests/timens/timer.c | 115 ++++++++++++++++++++++
3 files changed, 118 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/timer.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 94ffdd9cead7..3b7eda8f35ce 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
clock_nanosleep
procfs
timens
+timer
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index f96f50d1fef8..ae1ffd24cc43 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,5 +1,6 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs

CFLAGS := -Wall -Werror
+LDFLAGS := -lrt

include ../lib.mk
diff --git a/tools/testing/selftests/timens/timer.c b/tools/testing/selftests/timens/timer.c
new file mode 100644
index 000000000000..aeb5623d25e4
--- /dev/null
+++ b/tools/testing/selftests/timens/timer.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <time.h>
+
+#include "log.h"
+#include "timens.h"
+
+int run_test(int clockid, struct timespec now)
+{
+ struct itimerspec new_value;
+ long long elapsed;
+ timer_t fd;
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
+ int flags = 0;
+
+ new_value.it_value.tv_sec = 3600;
+ new_value.it_value.tv_nsec = 0;
+ new_value.it_interval.tv_sec = 1;
+ new_value.it_interval.tv_nsec = 0;
+
+ if (i == 1) {
+ new_value.it_value.tv_sec += now.tv_sec;
+ new_value.it_value.tv_nsec += now.tv_nsec;
+ }
+
+ if (timer_create(clockid, &sevp, &fd) == -1)
+ return pr_perror("timer_create");
+
+ if (i == 1)
+ flags |= TIMER_ABSTIME;
+ if (timer_settime(fd, flags, &new_value, NULL) == -1)
+ return pr_perror("timer_settime");
+
+ if (timer_gettime(fd, &new_value) == -1)
+ return pr_perror("timer_gettime");
+
+ elapsed = new_value.it_value.tv_sec;
+ if (abs(elapsed - 3600) > 60) {
+ ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+ clockid, elapsed);
+ return 1;
+ }
+ }
+
+ ksft_test_result_pass("clockid=%d\n", clockid);
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret, status, len, fd;
+ char buf[4096];
+ pid_t pid;
+ struct timespec btime_now, mtime_now;
+
+ nscheck();
+
+ clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+ clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+ CLOCK_MONOTONIC, 70 * 24 * 3600,
+ CLOCK_BOOTTIME, 9 * 24 * 3600);
+ fd = open("/proc/self/timens_offsets", O_WRONLY);
+ if (fd < 0)
+ return pr_perror("/proc/self/timens_offsets");
+
+ if (write(fd, buf, len) != len)
+ return pr_perror("/proc/self/timens_offsets");
+
+ close(fd);
+ mtime_now.tv_sec += 70 * 24 * 3600;
+ btime_now.tv_sec += 9 * 24 * 3600;
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("Unable to fork");
+ if (pid == 0) {
+ ret = 0;
+ ret |= run_test(CLOCK_BOOTTIME, btime_now);
+ ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+
+ if (ret)
+ ksft_exit_fail();
+ ksft_exit_pass();
+ return ret;
+ }
+
+ if (waitpid(pid, &status, 0) != pid)
+ return pr_perror("Unable to wait the child process");
+
+ if (WIFEXITED(status))
+ return WEXITSTATUS(status);
+
+ return 1;
+}
+
--
2.20.1


2019-02-06 00:13:18

by Dmitry Safonov

Subject: [PATCH 27/32] selftest/timens: Add procfs selftest

Check that /proc/uptime is correct inside a new time namespace.
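
A minimal sketch of parsing the first field of /proc/uptime, which has
the form "seconds.hundredths", is shown below. parse_uptime() is a
hypothetical helper, not the function used by the selftest; it works on
a string so that it can be exercised without a time namespace, and it
converts the two-digit fraction to nanoseconds.

```c
#include <stdio.h>
#include <time.h>

/*
 * Parse the first field of a /proc/uptime line ("SSSS.CC ...") into a
 * timespec. Returns 0 on success and -1 if the line does not match.
 */
static int parse_uptime(const char *line, struct timespec *ts)
{
	unsigned long sec, centisec;

	if (sscanf(line, "%lu.%2lu", &sec, &centisec) != 2)
		return -1;

	ts->tv_sec = sec;
	/* One hundredth of a second is 10,000,000 ns. */
	ts->tv_nsec = (long)centisec * 10000000L;
	return 0;
}
```

Inside a namespace with a ten-day CLOCK_BOOTTIME offset, the tv_sec
parsed this way is expected to be about 864000 seconds larger than the
value read in the parent namespace.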

Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/procfs.c | 142 ++++++++++++++++++++++
3 files changed, 144 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/procfs.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 9b6c8ddac2c8..94ffdd9cead7 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,3 +1,4 @@
clock_nanosleep
+procfs
timens
timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index 76a1dc891184..f96f50d1fef8 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep
+TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs

CFLAGS := -Wall -Werror

diff --git a/tools/testing/selftests/timens/procfs.c b/tools/testing/selftests/timens/procfs.c
new file mode 100644
index 000000000000..af839688ecc6
--- /dev/null
+++ b/tools/testing/selftests/timens/procfs.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <math.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+#include <time.h>
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define MAX_TEST_TIME_SEC (60*5)
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+ if (setns(fd, CLONE_NEWTIME))
+ return pr_perror("setns()");
+
+ return 0;
+}
+
+static int init_namespaces(void)
+{
+ char path[] = "/proc/self/ns/time_for_children";
+ struct stat st1, st2;
+
+ parent_ns = open(path, O_RDONLY);
+ if (parent_ns < 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(parent_ns, &st1))
+ return pr_perror("Unable to stat the parent timens");
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("Can't unshare() timens");
+
+ child_ns = open(path, O_RDONLY);
+ if (child_ns < 0)
+ return pr_perror("Unable to open %s", path);
+
+ if (fstat(child_ns, &st2))
+ return pr_perror("Unable to stat the timens");
+
+ if (st1.st_ino == st2.st_ino)
+ return pr_err("The same child_ns after CLONE_NEWTIME");
+
+ if (_settime(CLOCK_BOOTTIME, TEN_DAYS_IN_SEC))
+ return -1;
+
+ return 0;
+}
+
+static int read_proc_uptime(struct timespec *uptime)
+{
+ unsigned long up_sec, up_nsec;
+ FILE *proc;
+
+ proc = fopen("/proc/uptime", "r");
+ if (proc == NULL) {
+ pr_perror("Unable to open /proc/uptime");
+ return -1;
+ }
+
+ if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
+ if (errno) {
+ pr_perror("fscanf");
+ return -errno;
+ }
+ pr_err("failed to parse /proc/uptime");
+ return -1;
+ }
+ fclose(proc);
+
+ uptime->tv_sec = up_sec;
+ uptime->tv_nsec = up_nsec * 10000000; /* parsed value is in 1/100 s */
+ return 0;
+}
+
+static int check_uptime(void)
+{
+ struct timespec uptime_new, uptime_old;
+ time_t uptime_expected;
+ double prec = MAX_TEST_TIME_SEC;
+
+ if (switch_ns(parent_ns))
+ return pr_err("switch_ns(%d)", parent_ns);
+
+ if (read_proc_uptime(&uptime_old))
+ return 1;
+
+ if (switch_ns(child_ns))
+ return pr_err("switch_ns(%d)", child_ns);
+
+ if (read_proc_uptime(&uptime_new))
+ return 1;
+
+ uptime_expected = uptime_old.tv_sec + TEN_DAYS_IN_SEC;
+ if (fabs(difftime(uptime_new.tv_sec, uptime_expected)) > prec) {
+ pr_fail("uptime in /proc/uptime: old %ld, new %ld [%ld]",
+ uptime_old.tv_sec, uptime_new.tv_sec,
+ uptime_old.tv_sec + TEN_DAYS_IN_SEC);
+ return 1;
+ }
+
+ ksft_test_result_pass("Passed for /proc/uptime");
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+
+ nscheck();
+
+ if (init_namespaces())
+ return 1;
+
+ ret |= check_uptime();
+
+ if (ret)
+ ksft_exit_fail();
+ ksft_exit_pass();
+ return ret;
+}
--
2.20.1


2019-02-06 00:13:28

by Dmitry Safonov

Subject: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp.

Those effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address those problems, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that subtracts offsets from the host's
time.

Unfortunately, to allow changing the VDSO VMA of a running process,
the entry points to the VDSO must have the same offsets (addresses) in
both images. That is needed because, e.g., an application that calls
setns() may have already resolved VDSO symbols in its GOT/PLT.

Provide two linker scripts:
- *-timens.lds for building the VDSO for processes inside a time namespace
(it has bigger functions and needs to be built first)
- *.lds for the host processes' VDSO
(it has smaller functions, and its entry addresses are adjusted
with linker script magic to match the entries from timens)
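
The invariant the two linker scripts are meant to provide can be
sketched as a simple comparison of symbol tables. The struct and the
function below are hypothetical and only model the requirement; they
are not how the build verifies it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one exported vDSO entry point. */
struct vdso_sym {
	const char *name;
	unsigned long offset;	/* offset from the start of the image */
};

/*
 * True if both images export the same entries at the same offsets, so
 * that addresses already resolved in GOT/PLT stay valid after the
 * image is switched.
 */
static bool entries_match(const struct vdso_sym *host,
			  const struct vdso_sym *timens, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(host[i].name, timens[i].name) != 0)
			return false;
		if (host[i].offset != timens[i].offset)
			return false;
	}
	return true;
}
```

If a single entry differs between the images, a process that resolved
the symbol before setns() would jump to a stale address afterwards,
which is why the host image is padded to match the timens one.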

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/.gitignore | 1 +
arch/x86/entry/vdso/Makefile | 18 ++++++++++++++----
arch/x86/entry/vdso/vdso-timens.lds.S | 7 +++++++
arch/x86/entry/vdso/vdso32/.gitignore | 1 +
arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S | 8 ++++++++
5 files changed, 31 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/entry/vdso/vdso-timens.lds.S
create mode 100644 arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S

diff --git a/arch/x86/entry/vdso/.gitignore b/arch/x86/entry/vdso/.gitignore
index 9ab4fa4c7e7b..aaddf8f2171c 100644
--- a/arch/x86/entry/vdso/.gitignore
+++ b/arch/x86/entry/vdso/.gitignore
@@ -1,4 +1,5 @@
vdso.lds
+vdso-timens.lds
vdsox32.lds
vdso32-syscall-syms.lds
vdso32-sysenter-syms.lds
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index ccb572831ea1..4e1659619e7e 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -44,11 +44,14 @@ vobjs := $(foreach F,$(vobjs-y),$(obj)/$F)
vobjs32 := $(foreach F,$(vobjs32-y),$(obj)/$F)
vobjs-timens := $(foreach F,$(vobjs-timens-y),$(obj)/$F)
vobjs32-timens := $(foreach F,$(vobjs32-timens-y),$(obj)/$F)
+dep-vdso.lds-$(CONFIG_TIME_NS) += $(obj)/vdso-image-64-timens.c
+dep-vdso32.lds-$(CONFIG_TIME_NS) += $(obj)/vdso-image-32-timens.c

$(obj)/vdso.o: $(obj)/vdso.so

targets += vdso.lds $(vobjs-y) $(vobjs-timens-y) vdso64.entries
targets += vdso32/vdso32.lds $(vobjs32-y) $(vobjs32-timens-y) vdso32.entries
+targets += vdso-timens.lds vdso32/vdso32-timens.lds

# Build the vDSO image C files and link them in.
vdso_img_objs := $(vdso_img-y:%=vdso-image-%.o)
@@ -59,11 +62,13 @@ targets += $(vdso_img_cfiles)
targets += $(vdso_img_sodbg) $(vdso_img-y:%=vdso%.so)

CPPFLAGS_vdso.lds += -P -C
+CPPFLAGS_vdso-timens.lds := $(CPPFLAGS_vdso.lds)

VDSO_LDFLAGS_vdso.lds = -m elf_x86_64 -soname linux-vdso.so.1 --no-undefined \
-z max-page-size=4096
+VDSO_LDFLAGS_vdso-timens.lds := $(VDSO_LDFLAGS_vdso.lds)

-$(obj)/vdso64-timens.so.dbg: $(obj)/vdso.lds $(vobjs-timens) FORCE
+$(obj)/vdso64-timens.so.dbg: $(obj)/vdso-timens.lds $(vobjs-timens) FORCE
$(call if_changed,vdso)

$(obj)/vdso64.so.dbg: $(obj)/vdso.lds $(vobjs) FORCE
@@ -79,6 +84,9 @@ quiet_cmd_vdso2c = VDSO2C $@
$(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)

+$(obj)/vdso.lds: $(dep-vdso.lds-y) $(obj)/vdso2c FORCE
+$(obj)/vdso32/vdso32.lds: $(dep-vdso32.lds-y) $(obj)/vdso2c FORCE
+
#
# Don't omit frame pointers for ease of userspace debugging, but do
# optimize sibling calls.
@@ -142,8 +150,10 @@ $(obj)/%.so: $(obj)/%.so.dbg
$(obj)/vdsox32.so.dbg: $(obj)/vdsox32.lds $(vobjx32s) FORCE
$(call if_changed,vdso)

-CPPFLAGS_vdso32.lds = $(CPPFLAGS_vdso.lds)
-VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -soname linux-gate.so.1
+CPPFLAGS_vdso32.lds := $(CPPFLAGS_vdso.lds)
+CPPFLAGS_vdso32-timens.lds := $(CPPFLAGS_vdso32.lds)
+VDSO_LDFLAGS_vdso32.lds := -m elf_i386 -soname linux-gate.so.1
+VDSO_LDFLAGS_vdso32-timens.lds := $(VDSO_LDFLAGS_vdso32.lds)

KBUILD_AFLAGS_32 := $(filter-out -m64,$(KBUILD_AFLAGS)) -DBUILD_VDSO
$(obj)/vdso32.so.dbg: KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
@@ -172,7 +182,7 @@ endif
$(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
$(obj)/vdso32-timens.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)

-$(obj)/vdso32-timens.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32-timens) FORCE
+$(obj)/vdso32-timens.so.dbg: $(obj)/vdso32/vdso32-timens.lds $(vobjs32-timens) FORCE
$(call if_changed,vdso)

$(obj)/vdso32.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32) FORCE
diff --git a/arch/x86/entry/vdso/vdso-timens.lds.S b/arch/x86/entry/vdso/vdso-timens.lds.S
new file mode 100644
index 000000000000..687aba3bc5f0
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso-timens.lds.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Linker script for 64-bit timens vDSO.
+ */
+
+#define UNALIGNED_ENTRIES
+#include "vdso.lds.S"
diff --git a/arch/x86/entry/vdso/vdso32/.gitignore b/arch/x86/entry/vdso/vdso32/.gitignore
index e45fba9d0ced..ce4afb6ffb62 100644
--- a/arch/x86/entry/vdso/vdso32/.gitignore
+++ b/arch/x86/entry/vdso/vdso32/.gitignore
@@ -1 +1,2 @@
vdso32.lds
+vdso32-timens.lds
diff --git a/arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S b/arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S
new file mode 100644
index 000000000000..1a3b3b1f0517
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vdso32-timens.lds.S
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Linker script for 32-bit timens vDSO.
+ */
+
+#define UNALIGNED_ENTRIES
+
+#include "vdso32.lds.S"
--
2.20.1


2019-02-06 00:13:35

by Dmitry Safonov

Subject: [PATCH 21/32] x86/vdso: Switch image on setns()/unshare()/clone()

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp.

Those effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address those problems, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that subtracts offsets from the host's
time.

Whenever a user does setns()/unshare() or clone() with CLONE_NEWTIME,
change the VDSO image in the mm and zap the existing VVAR/VDSO page
tables. They will be re-faulted with the corresponding image and VVAR
offsets.

Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 81 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/vdso.h | 1 +
kernel/time_namespace.c | 11 +++++
3 files changed, 93 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 56a62076a320..52c1e4c24455 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
#include <asm/cpufeature.h>
#include <asm/mshyperv.h>
#include <asm/page.h>
+#include <asm/tlb.h>

#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -150,6 +151,84 @@ static const struct vm_special_mapping vvar_mapping = {
.fault = vvar_fault,
};

+#ifdef CONFIG_TIME_NS
+static const struct vdso_image *timens_vdso(const struct vdso_image *old_img,
+ bool in_ns)
+{
+#ifdef CONFIG_X86_X32_ABI
+ if (old_img == &vdso_image_x32)
+ return NULL;
+#endif
+#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
+ if (old_img == &vdso_image_32 || old_img == &vdso_image_32_timens)
+ return in_ns ? &vdso_image_32_timens : &vdso_image_32;
+#endif
+#ifdef CONFIG_X86_64
+ if (old_img == &vdso_image_64 || old_img == &vdso_image_64_timens)
+ return in_ns ? &vdso_image_64_timens : &vdso_image_64;
+#endif
+ return NULL;
+}
+
+static const struct vdso_image *image_to_timens(const struct vdso_image *img)
+{
+ bool in_ns = (current->nsproxy->time_ns != &init_time_ns);
+ const struct vdso_image *ns;
+
+ ns = timens_vdso(img, in_ns);
+
+ return ns ?: img;
+}
+
+int vdso_join_timens(struct task_struct *task, bool inside_ns)
+{
+ const struct vdso_image *new_image, *old_image;
+ struct mm_struct *mm = task->mm;
+ struct vm_area_struct *vma;
+ int ret = 0;
+
+ if (down_write_killable(&mm->mmap_sem))
+ return -EINTR;
+
+ old_image = mm->context.vdso_image;
+ new_image = timens_vdso(old_image, inside_ns);
+ if (!new_image) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
+ /* Sanity checks, shouldn't happen */
+ if (unlikely(old_image->size != new_image->size)) {
+ ret = -ENXIO;
+ goto out;
+ }
+
+ mm->context.vdso_image = new_image;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ unsigned long size = vma->vm_end - vma->vm_start;
+
+ if (vma_is_special_mapping(vma, &vvar_mapping))
+ zap_page_range(vma, vma->vm_start, size);
+ if (vma_is_special_mapping(vma, &vdso_mapping))
+ zap_page_range(vma, vma->vm_start, size);
+ }
+
+out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
+#else /* CONFIG_TIME_NS */
+static const struct vdso_image *image_to_timens(const struct vdso_image *img)
+{
+ return img;
+}
+int vdso_join_timens(struct task_struct *task, bool inside_ns)
+{
+ return -ENXIO;
+}
+#endif
+
/*
* Add vdso and vvar mappings to current process.
* @image - blob to map
@@ -165,6 +244,8 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
if (down_write_killable(&mm->mmap_sem))
return -EINTR;

+ image = image_to_timens(image);
+
addr = get_unmapped_area(NULL, addr,
image->size - image->sym_vvar_start, 0, 0);
if (IS_ERR_VALUE(addr)) {
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index b6a1a028ac62..c8db853344a0 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -51,6 +51,7 @@ extern const struct vdso_image vdso_image_32_timens;
extern void __init init_vdso_image(const struct vdso_image *image);

extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vdso_join_timens(struct task_struct *task, bool inside_ns);

#endif /* __ASSEMBLER__ */

diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 36b31f234472..1d1d1c023ec1 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -14,6 +14,7 @@
#include <linux/proc_ns.h>
#include <linux/sched/task.h>
#include <linux/mm.h>
+#include <asm/vdso.h>

static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
{
@@ -155,11 +156,16 @@ static void timens_put(struct ns_common *ns)
static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
{
struct time_namespace *ns = to_time_ns(new);
+ int ret;

if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;

+ ret = vdso_join_timens(current, ns != &init_time_ns);
+ if (ret)
+ return ret;
+
get_time_ns(ns);
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
@@ -174,10 +180,15 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
{
struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
struct time_namespace *ns = to_time_ns(nsc);
+ int ret;

if (nsproxy->time_ns == nsproxy->time_ns_for_children)
return 0;

+ ret = vdso_join_timens(tsk, ns != &init_time_ns);
+ if (ret)
+ return ret;
+
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;
--
2.20.1


2019-02-06 00:13:41

by Dmitry Safonov

Subject: [PATCH 20/32] x86/vdso: Initialize timens 64-bit vdso

Initialize both 64-bit VDSOs: the host .so and the timens one, which has
code for adding timens offsets.

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vma.c | 4 ++++
arch/x86/include/asm/vdso.h | 6 ++++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index d1031db94093..56a62076a320 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -343,6 +343,10 @@ static int __init init_vdso(void)
{
init_vdso_image(&vdso_image_64);

+#ifdef CONFIG_TIME_NS
+ init_vdso_image(&vdso_image_64_timens);
+#endif
+
#ifdef CONFIG_X86_X32_ABI
init_vdso_image(&vdso_image_x32);
#endif
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 619322065b8e..b6a1a028ac62 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -32,6 +32,9 @@ struct vdso_image {

#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
+#ifdef CONFIG_TIME_NS
+extern const struct vdso_image vdso_image_64_timens;
+#endif
#endif

#ifdef CONFIG_X86_X32
@@ -40,6 +43,9 @@ extern const struct vdso_image vdso_image_x32;

#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32;
+#ifdef CONFIG_TIME_NS
+extern const struct vdso_image vdso_image_32_timens;
+#endif
#endif

extern void __init init_vdso_image(const struct vdso_image *image);
--
2.20.1


2019-02-06 00:13:43

by Dmitry Safonov

Subject: [PATCH 17/32] x86/vdso2c: Sort vdso entries by addresses for linker script

There are two linker scripts for the vdso .so(s):
- *-timens.lds for building the vdso for processes inside a time namespace
(it has bigger functions and needs to be built first)
- *.lds for the host processes' vdso
(it has smaller functions, and its entry addresses are adjusted
with linker script magic to match the entries from timens)

To adjust the entries of the host vdso, *.lds includes *.entries.
Those are generated by vdso2c while parsing the timens vdso.

The linker doesn't allow the location counter to move backwards, so sort
the timens VDSO entries by address before writing them to the .entries
file.
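
The sorting step can be sketched in plain C as below. This is a
simplified variant for illustration: the patch sorts in descending
order and walks the array backwards, while this sketch sorts ascending
and walks forwards, which yields the same output order. emit_entries()
is a hypothetical name; the output format is the one used in the diff.

```c
#include <stdio.h>
#include <stdlib.h>

struct vdso_entry {
	unsigned long addr;
	const char *name;
};

/* Ascending comparison by address, safe against overflow. */
static int entry_addr_cmp(const void *_a, const void *_b)
{
	const struct vdso_entry *a = _a;
	const struct vdso_entry *b = _b;

	return (a->addr > b->addr) - (a->addr < b->addr);
}

/*
 * Sort entries by address and write one location-counter assignment
 * plus one input-section pattern per entry.
 */
static void emit_entries(FILE *out, struct vdso_entry *entries, size_t nr)
{
	qsort(entries, nr, sizeof(*entries), entry_addr_cmp);

	for (size_t i = 0; i < nr; i++)
		fprintf(out, "\t\t. = ABSOLUTE(%#lx);\n\t\t*(.text.%s*)\n",
			entries[i].addr, entries[i].name);
}
```

Because the addresses are written in increasing order, every
". = ABSOLUTE(...)" assignment moves the location counter forwards.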

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vdso2c.c | 13 +++++++++++++
arch/x86/entry/vdso/vdso2c.h | 20 +++++++++++++++++---
2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 72731c4cfdce..4f91640398b2 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -119,6 +119,19 @@ static void fail(const char *format, ...)
va_end(ap);
}

+struct vdso_entry {
+ unsigned long addr;
+ const char *name;
+};
+
+static int entry_addr_cmp(const void *_a, const void *_b)
+{
+ const struct vdso_entry *a = _a;
+ const struct vdso_entry *b = _b;
+
+ return (a->addr < b->addr) - (a->addr > b->addr);
+}
+
/*
* Evil macros for little-endian reads and writes
*/
diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 065dac6c29c8..50566dd94451 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -21,6 +21,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
ELF(Dyn) *dyn = 0, *dyn_end = 0;
const char *secstrings;
INT_BITS syms[NSYMS] = {};
+ struct vdso_entry *entries, *next_entry;

ELF(Phdr) *pt = (ELF(Phdr) *)(raw_addr + GET_LE(&hdr->e_phoff));

@@ -88,6 +89,10 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
GET_LE(&hdr->e_shentsize) * GET_LE(&symtab_hdr->sh_link);

syms_nr = GET_LE(&symtab_hdr->sh_size) / GET_LE(&symtab_hdr->sh_entsize);
+ entries = calloc(syms_nr, sizeof(*entries));
+ if (!entries)
+ fail("malloc()\n");
+ next_entry = entries;
/* Walk the symbol table */
for (i = 0; i < syms_nr; i++) {
unsigned int k;
@@ -122,11 +127,20 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (ELF_FUNC(ST_TYPE, sym->st_info) != STT_FUNC)
continue;

- fprintf(out_entries_lds, "\t\t. = ABSOLUTE(%#lx);\n",
- (unsigned long)GET_LE(&sym->st_value));
- fprintf(out_entries_lds, "\t\t*(.text.%s*)\n", name);
+ next_entry->addr = GET_LE(&sym->st_value);
+ next_entry->name = name;
+ next_entry++;
}

+ qsort(entries, next_entry - entries, sizeof(*entries), entry_addr_cmp);
+
+ while (next_entry != entries && out_entries_lds) {
+ next_entry--;
+ fprintf(out_entries_lds, "\t\t. = ABSOLUTE(%#lx);\n\t\t*(.text.%s*)\n",
+ next_entry->addr, next_entry->name);
+ }
+ free(entries);
+
/* Validate mapping addresses. */
for (i = 0; i < sizeof(special_pages) / sizeof(special_pages[0]); i++) {
INT_BITS symval = syms[special_pages[i]];
--
2.20.1


2019-02-06 00:13:48

by Dmitry Safonov

Subject: [PATCH 19/32] x86/vdso2c: Align LOCAL symbols between vdso{-timens,}.so

Align not only the VDSO entries to their addresses in the timens VDSO,
but also the addresses of local functions. Otherwise, ld will put them
after everything else into *(.text*). That would make the common VDSO
bigger than the timens VDSO (sic!).

Unfortunately, filtering by STB_WEAK doesn't work for the ia32 VDSO:
for some reason gcc transforms weak symbols into local symbols in the
.so, e.g.:
27: 00000000 219 FUNC WEAK DEFAULT 12 clock_gettime
29: 00000000 95 FUNC WEAK DEFAULT 14 gettimeofday
32: 00000000 40 FUNC WEAK DEFAULT 16 time

become:
20: 000006e0 219 FUNC LOCAL DEFAULT 12 clock_gettime
31: 000007c0 95 FUNC LOCAL DEFAULT 12 gettimeofday
33: 00000820 40 FUNC LOCAL DEFAULT 12 time

which results in the same alignment for two functions in the .entries file:
. = ABSOLUTE(0x6e0);
*(.text.__vdso_clock_gettime*)
. = ABSOLUTE(0x6e0);
*(.text.clock_gettime*)

As a result, ld becomes a very sad animal and refuses to cooperate:
ld:arch/x86/entry/vdso/vdso32/vdso32.lds:339 cannot move location counter backwards (from 0000000000000762 to 00000000000006e0)

Align local functions in the VDSO to the timens VDSO and filter weak
functions out of the .lds script.

Signed-off-by: Dmitry Safonov <[email protected]>
---
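A minimal sketch of the duplicate filter, assuming the entries are already sorted by address. `struct entry` and `drop_duplicate_addrs()` are hypothetical names used only for this illustration; the patch itself skips duplicates inline while writing, using a `last_entry_addr` sentinel.

```c
#include <stddef.h>

/* Hypothetical stand-in for one vDSO function symbol. */
struct entry {
	unsigned long addr;
	const char *name;
};

/*
 * Compact a sorted array so that each address appears once, keeping the
 * first entry seen for every address. Returns the new element count.
 */
static size_t drop_duplicate_addrs(struct entry *e, size_t n)
{
	size_t in, out = 0;

	for (in = 0; in < n; in++) {
		if (out > 0 && e[out - 1].addr == e[in].addr)
			continue;	/* same address as the last kept entry */
		e[out++] = e[in];
	}
	return out;
}
```

With the 0x6e0 pair from the commit message, only one location-counter assignment for that address would remain, so ld never has to move backwards.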
arch/x86/entry/vdso/vdso2c.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 50566dd94451..7096710140fe 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -15,7 +15,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned long mapping_size;
ELF(Ehdr) *hdr = (ELF(Ehdr) *)raw_addr;
unsigned int i, syms_nr;
- unsigned long j;
+ unsigned long j, last_entry_addr;
ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
*alt_sec = NULL;
ELF(Dyn) *dyn = 0, *dyn_end = 0;
@@ -121,7 +121,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (!out_entries_lds)
continue;

- if (ELF_FUNC(ST_BIND, sym->st_info) != STB_GLOBAL)
+ if (ELF_FUNC(ST_BIND, sym->st_info) == STB_WEAK)
continue;

if (ELF_FUNC(ST_TYPE, sym->st_info) != STT_FUNC)
@@ -134,8 +134,19 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,

qsort(entries, next_entry - entries, sizeof(*entries), entry_addr_cmp);

+ last_entry_addr = -1UL;
while (next_entry != entries && out_entries_lds) {
next_entry--;
+
+ /*
+ * Unfortunately, WEAK symbols from objects are resolved
+ * into LOCAL symbols on ia32. Filter them here, as
+ * linker wouldn't like aligning the same symbol twice.
+ */
+ if (last_entry_addr == next_entry->addr)
+ continue;
+ last_entry_addr = next_entry->addr;
+
fprintf(out_entries_lds, "\t\t. = ABSOLUTE(%#lx);\n\t\t*(.text.%s*)\n",
next_entry->addr, next_entry->name);
}
--
2.20.1


2019-02-06 00:13:51

by Dmitry Safonov

Subject: [PATCH 25/32] selftest/timens: Add a test for timerfd

From: Andrei Vagin <[email protected]>

Check that timerfd_create() takes into account clock offsets.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
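The property under test can be sketched outside the selftest harness: arm a timerfd in absolute mode at "now + delta" on some clock, then read back the remaining time, which timerfd_gettime() always reports relative to now. If the absolute value is interpreted on the same clock the task sees, the remainder stays close to delta whatever offset the namespace applies. `remaining_after_abs_arm()` is a hypothetical helper, not part of the patch.

```c
#define _GNU_SOURCE
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/*
 * Arm a timerfd at "now + delta_sec" in absolute mode and return the number
 * of whole seconds timerfd_gettime() reports as remaining, or -1 on error.
 */
static long long remaining_after_abs_arm(int clockid, time_t delta_sec)
{
	struct itimerspec its = { 0 };
	struct timespec now;
	long long left = -1;
	int fd;

	if (clock_gettime(clockid, &now))
		return -1;

	fd = timerfd_create(clockid, 0);
	if (fd < 0)
		return -1;

	its.it_value.tv_sec = now.tv_sec + delta_sec;
	its.it_value.tv_nsec = now.tv_nsec;

	if (!timerfd_settime(fd, TFD_TIMER_ABSTIME, &its, NULL) &&
	    !timerfd_gettime(fd, &its))
		left = its.it_value.tv_sec;

	close(fd);	/* closed on every path, including errors */
	return left;
}
```

Calling it with CLOCK_MONOTONIC and 3600 should return a value just under 3600, which is the same tolerance check the selftest applies.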
tools/testing/selftests/timens/.gitignore | 1 +
tools/testing/selftests/timens/Makefile | 2 +-
tools/testing/selftests/timens/timerfd.c | 119 ++++++++++++++++++++++
3 files changed, 121 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/timens/timerfd.c

diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore
index 27a693229ce1..b609f6ee9fb9 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1 +1,2 @@
timens
+timerfd
diff --git a/tools/testing/selftests/timens/Makefile b/tools/testing/selftests/timens/Makefile
index b877efb78974..66b90cd28e5c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens
+TEST_GEN_PROGS := timens timerfd

CFLAGS := -Wall -Werror

diff --git a/tools/testing/selftests/timens/timerfd.c b/tools/testing/selftests/timens/timerfd.c
new file mode 100644
index 000000000000..8ec2604d26c9
--- /dev/null
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+
+#include <sys/timerfd.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include "log.h"
+#include "timens.h"
+
+int run_test(int clockid, struct timespec now)
+{
+ struct itimerspec new_value;
+ long long elapsed;
+ int fd, i;
+
+ if (clock_gettime(clockid, &now))
+ return pr_perror("clock_gettime");
+
+ for (i = 0; i < 2; i++) {
+ int flags = 0;
+
+ new_value.it_value.tv_sec = 3600;
+ new_value.it_value.tv_nsec = 0;
+ new_value.it_interval.tv_sec = 1;
+ new_value.it_interval.tv_nsec = 0;
+
+ if (i == 1) {
+ new_value.it_value.tv_sec += now.tv_sec;
+ new_value.it_value.tv_nsec += now.tv_nsec;
+ }
+
+ fd = timerfd_create(clockid, 0);
+ if (fd == -1)
+ return pr_perror("timerfd_create");
+
+ if (i == 1)
+ flags |= TFD_TIMER_ABSTIME;
+
+ if (timerfd_settime(fd, flags, &new_value, NULL))
+ return pr_perror("timerfd_settime");
+
+ if (timerfd_gettime(fd, &new_value))
+ return pr_perror("timerfd_gettime");
+
+ elapsed = new_value.it_value.tv_sec;
+ if (llabs(elapsed - 3600) > 60) {
+ ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+ clockid, elapsed);
+ return 1;
+ }
+
+ close(fd);
+ }
+
+ ksft_test_result_pass("clockid=%d\n", clockid);
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret, status, len, fd;
+ char buf[4096];
+ pid_t pid;
+ struct timespec btime_now, mtime_now;
+
+ nscheck();
+
+ clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+ clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+ if (unshare(CLONE_NEWTIME))
+ return pr_perror("unshare");
+
+ len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+ CLOCK_MONOTONIC, 70 * 24 * 3600,
+ CLOCK_BOOTTIME, 9 * 24 * 3600);
+ fd = open("/proc/self/timens_offsets", O_WRONLY);
+ if (fd < 0)
+ return pr_perror("/proc/self/timens_offsets");
+
+ if (write(fd, buf, len) != len)
+ return pr_perror("/proc/self/timens_offsets");
+
+ close(fd);
+ mtime_now.tv_sec += 70 * 24 * 3600;
+ btime_now.tv_sec += 9 * 24 * 3600;
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("Unable to fork");
+ if (pid == 0) {
+ ret = 0;
+ ret |= run_test(CLOCK_BOOTTIME, btime_now);
+ ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+
+ if (ret)
+ ksft_exit_fail();
+ ksft_exit_pass();
+ return ret;
+ }
+
+ if (waitpid(pid, &status, 0) != pid)
+ return pr_perror("Unable to wait the child process");
+
+ if (WIFEXITED(status))
+ return WEXITSTATUS(status);
+
+ return 1;
+}
+
--
2.20.1


2019-02-06 00:14:00

by Dmitry Safonov

Subject: [PATCH 18/32] x86/vdso.lds: Align !timens (host's) vdso.so entries

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp.

Those effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address those problems, there are two versions of the VDSO .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that applies the namespace offsets to
the host's time.

Unfortunately, to allow changing the VDSO VMA of a running process,
the entry points to the VDSO must have the same offsets (addresses) in
both images. That's needed because, e.g., an application that calls
setns() may have already resolved VDSO symbols in its GOT/PLT.

Align the host VDSO entries with the addresses generated from the
timens VDSO (which is bigger, as it has the code for adding offsets).

Signed-off-by: Dmitry Safonov <[email protected]>
---
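A simplified model of what placing the smaller host functions at the timens addresses means, ignoring alignment and every section other than .text. Each host function is put at the address its counterpart has in the timens image, the gap up to the next address is fill, and a host function that overruns the next address is an error because the location counter cannot go back. `struct func` and `layout_padding()` are hypothetical names for this sketch.

```c
#include <stddef.h>

/* Hypothetical description of one function in the host image. */
struct func {
	unsigned long target;	/* address of the same entry in the timens image */
	unsigned long size;	/* size of the host version, in bytes */
};

/*
 * Walk functions in ascending target order, mimicking ". = ABSOLUTE(target)".
 * Returns the total number of fill bytes, or -1 if a function would overlap
 * the next target (the location counter would have to move backwards).
 */
static long layout_padding(const struct func *f, size_t n, unsigned long start)
{
	unsigned long dot = start;
	long pad = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (f[i].target < dot)
			return -1;
		pad += (long)(f[i].target - dot);
		dot = f[i].target + f[i].size;
	}
	return pad;
}
```

The fill bytes correspond to the `=0x90909090` pattern of the .text output section in the hunk below, so the gaps end up as NOPs.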
arch/x86/entry/vdso/vdso-layout.lds.S | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index ba216527e59f..e529ee3ec9e8 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -70,7 +70,17 @@ SECTIONS
* stuff that isn't used at runtime in between.
*/

- .text : { *(.text*) } :text =0x90909090,
+ .text : {
+#if defined(CONFIG_TIME_NS) && !defined(UNALIGNED_ENTRIES)
+#ifdef BUILD_VDSO32
+# include "vdso32.entries"
+#endif
+#ifdef BUILD_VDSO64
+# include "vdso64.entries"
+#endif
+#endif
+ *(.text*)
+ } :text =0x90909090,

.altinstructions : { *(.altinstructions) } :text
.altinstr_replacement : { *(.altinstr_replacement) } :text
--
2.20.1


2019-02-06 00:14:05

by Dmitry Safonov

Subject: [PATCH 14/32] x86/VDSO: Build VDSO with -ffunction-sections

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp.

Those effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address those problems, there are two versions of the VDSO .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that applies the namespace offsets to
the host's time.

Unfortunately, to allow changing the VDSO VMA of a running process,
the entry points to the VDSO must have the same offsets (addresses) in
both images. That's needed because, e.g., an application that calls
setns() may have already resolved VDSO symbols in its GOT/PLT.

Compile the VDSO images with -ffunction-sections so that VDSO entries
can be aligned to the same addresses with linker script magic.
Put the ia32 functions that are written in assembly into their
corresponding sections.

Signed-off-by: Dmitry Safonov <[email protected]>
---
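For readers unfamiliar with the flag: with -ffunction-sections the compiler emits each function into its own input section named after the function, which is what lets a linker script place functions individually with patterns such as `*(.text.name*)`. The fragment below is a rough hand-written equivalent for a single function, assuming GCC or Clang on an ELF target; `demo_entry` is a made-up name.

```c
/*
 * Place this one function in its own ".text.demo_entry" input section,
 * similar to what -ffunction-sections does automatically for every function.
 */
__attribute__((section(".text.demo_entry")))
int demo_entry(int x)
{
	return x + 1;
}
```

The `.section .text.<name>, "ax"` directives added to the ia32 assembly files in this patch do the same thing by hand for functions the compiler never sees.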
arch/x86/entry/vdso/Makefile | 2 +-
arch/x86/entry/vdso/vdso32/sigreturn.S | 2 ++
arch/x86/entry/vdso/vdso32/system_call.S | 2 +-
3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 2433ed9342fd..55ba81d4415c 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -3,7 +3,7 @@
# Building vDSO images for x86.
#

-KBUILD_CFLAGS += $(DISABLE_LTO)
+KBUILD_CFLAGS += $(DISABLE_LTO) -ffunction-sections
KASAN_SANITIZE := n
UBSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y
diff --git a/arch/x86/entry/vdso/vdso32/sigreturn.S b/arch/x86/entry/vdso/vdso32/sigreturn.S
index c3233ee98a6b..b641ccf8d664 100644
--- a/arch/x86/entry/vdso/vdso32/sigreturn.S
+++ b/arch/x86/entry/vdso/vdso32/sigreturn.S
@@ -11,6 +11,7 @@
.globl __kernel_sigreturn
.type __kernel_sigreturn,@function
nop /* this guy is needed for .LSTARTFDEDLSI1 below (watch for HACK) */
+ .section .text.__kernel_sigreturn, "ax"
ALIGN
__kernel_sigreturn:
.LSTART_sigreturn:
@@ -21,6 +22,7 @@ __kernel_sigreturn:
nop
.size __kernel_sigreturn,.-.LSTART_sigreturn

+ .section .text.__kernel_rt_sigreturn, "ax"
.globl __kernel_rt_sigreturn
.type __kernel_rt_sigreturn,@function
ALIGN
diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index 263d7433dea8..13ec05287f63 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -8,7 +8,7 @@
#include <asm/cpufeatures.h>
#include <asm/alternative-asm.h>

- .text
+ .section .text.__kernel_vsyscall, "ax"
.globl __kernel_vsyscall
.type __kernel_vsyscall,@function
ALIGN
--
2.20.1


2019-02-06 00:14:05

by Dmitry Safonov

Subject: [PATCH 23/32] timens/fs/proc: Introduce /proc/pid/timens_offsets

From: Andrei Vagin <[email protected]>

API to set time namespace offsets for child processes, e.g.:
echo "clockid off_sec off_nsec" > /proc/self/timens_offsets

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
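A userspace sketch of building the payload for this file, following the "clockid off_sec off_nsec" line format given above as used by this revision of the series. `format_timens_offsets()` is a hypothetical helper; writing the buffer to /proc/self/timens_offsets is left to the caller. Per the code in this patch the write has to happen after unshare(CLONE_NEWTIME) and before the namespace is initialized, and needs CAP_SYS_TIME in the owning user namespace.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

/*
 * Build the two-line payload "clockid sec nsec\n" for CLOCK_MONOTONIC and
 * CLOCK_BOOTTIME in buf. Returns the payload length, or -1 if it does not fit.
 */
static int format_timens_offsets(char *buf, size_t size,
				 long long mono_sec, long long boot_sec)
{
	int len;

	len = snprintf(buf, size, "%d %lld 0\n%d %lld 0\n",
		       CLOCK_MONOTONIC, mono_sec,
		       CLOCK_BOOTTIME, boot_sec);
	if (len < 0 || (size_t)len >= size)
		return -1;
	return len;
}
```

Sending both lines in a single write() matters here, since the handler rejects writes at a non-zero file position.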
fs/proc/base.c | 101 +++++++++++++++++++++++++++++++++
include/linux/time_namespace.h | 10 ++++
kernel/time_namespace.c | 71 +++++++++++++++++++++++
3 files changed, 182 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 633a63462573..1ba31050dcb5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
#include <linux/sched/stat.h>
#include <linux/flex_array.h>
#include <linux/posix-timers.h>
+#include <linux/time_namespace.h>
#include <trace/events/oom.h>
#include "internal.h"
#include "fd.h"
@@ -1520,6 +1521,103 @@ static const struct file_operations proc_pid_sched_autogroup_operations = {

#endif /* CONFIG_SCHED_AUTOGROUP */

+#ifdef CONFIG_TIME_NS
+static int timens_offsets_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *p;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+ proc_timens_show_offsets(p, m);
+
+ put_task_struct(p);
+
+ return 0;
+}
+
+static ssize_t
+timens_offsets_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ struct proc_timens_offset offsets[2];
+ char *kbuf = NULL, *pos, *next_line;
+ struct task_struct *p;
+ int ret, noffsets;
+
+ /* Only allow < page size writes at the beginning of the file */
+ if ((*ppos != 0) || (count >= PAGE_SIZE))
+ return -EINVAL;
+
+ /* Slurp in the user data */
+ kbuf = memdup_user_nul(buf, count);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ /* Parse the user data */
+ ret = -EINVAL;
+ noffsets = 0;
+ pos = kbuf;
+ for (; pos; pos = next_line) {
+ struct proc_timens_offset *off = &offsets[noffsets];
+ int err;
+
+ /* Find the end of line and ensure I don't look past it */
+ next_line = strchr(pos, '\n');
+ if (next_line) {
+ *next_line = '\0';
+ next_line++;
+ if (*next_line == '\0')
+ next_line = NULL;
+ }
+
+ err = sscanf(pos, "%u %lld %lu", &off->clockid,
+ &off->val.tv_sec, &off->val.tv_nsec);
+ if (err != 3 || off->val.tv_nsec >= NSEC_PER_SEC)
+ goto out;
+ if (++noffsets == ARRAY_SIZE(offsets))
+ break;
+ }
+
+ ret = -ESRCH;
+ p = get_proc_task(inode);
+ if (!p)
+ goto out;
+ ret = proc_timens_set_offset(p, offsets, noffsets);
+ put_task_struct(p);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ kfree(kbuf);
+ return ret;
+}
+
+static int timens_offsets_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+
+ ret = single_open(filp, timens_offsets_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_timens_offsets_operations = {
+ .open = timens_offsets_open,
+ .read = seq_read,
+ .write = timens_offsets_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif /* CONFIG_TIME_NS */
+
static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
{
@@ -2953,6 +3051,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
+#ifdef CONFIG_TIME_NS
+ REG("timens_offsets", S_IRUGO|S_IWUSR, proc_timens_offsets_operations),
#endif
REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index f1807d7f524d..c9ba7366b3d6 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -41,6 +41,16 @@ static inline void put_time_ns(struct time_namespace *ns)
}


+extern void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m);
+
+struct proc_timens_offset {
+ int clockid;
+ struct timespec64 val;
+};
+
+extern int proc_timens_set_offset(struct task_struct *p,
+ struct proc_timens_offset *offsets, int n);
+
extern void timens_clock_to_host(int clockid, struct timespec64 *val);
extern void timens_clock_from_host(int clockid, struct timespec64 *val);

diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 1d1d1c023ec1..6e2e6629e1ba 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -13,6 +13,7 @@
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
#include <linux/sched/task.h>
+#include <linux/seq_file.h>
#include <linux/mm.h>
#include <asm/vdso.h>

@@ -202,6 +203,76 @@ static struct user_namespace *timens_owner(struct ns_common *ns)
return to_time_ns(ns)->user_ns;
}

+static void show_offset(struct seq_file *m, int clockid, struct timespec64 *ts)
+{
+ seq_printf(m, "%d %lld %ld\n", clockid, ts->tv_sec, ts->tv_nsec);
+}
+
+void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m)
+{
+ struct ns_common *ns;
+ struct time_namespace *time_ns;
+ struct timens_offsets *ns_offsets;
+
+ ns = timens_for_children_get(p);
+ if (!ns)
+ return;
+ time_ns = to_time_ns(ns);
+
+ if (!time_ns->offsets) {
+ put_time_ns(time_ns);
+ return;
+ }
+ ns_offsets = time_ns->offsets;
+
+ show_offset(m, CLOCK_MONOTONIC, &ns_offsets->monotonic_time_offset);
+ show_offset(m, CLOCK_BOOTTIME, &ns_offsets->monotonic_boottime_offset);
+ put_time_ns(time_ns);
+}
+
+int proc_timens_set_offset(struct task_struct *p,
+ struct proc_timens_offset *offsets, int noffsets)
+{
+ struct ns_common *ns;
+ struct time_namespace *time_ns;
+ struct timens_offsets *ns_offsets;
+ int i, err;
+
+ ns = timens_for_children_get(p);
+ if (!ns)
+ return -ESRCH;
+ time_ns = to_time_ns(ns);
+
+ if (!time_ns->offsets || time_ns->initialized ||
+ !ns_capable(time_ns->user_ns, CAP_SYS_TIME)) {
+ put_time_ns(time_ns);
+ return -EPERM;
+ }
+ ns_offsets = time_ns->offsets;
+
+ err = -EINVAL;
+ for (i = 0; i < noffsets; i++) {
+ struct proc_timens_offset *off = &offsets[i];
+
+ switch (off->clockid) {
+ case CLOCK_MONOTONIC:
+ ns_offsets->monotonic_time_offset = off->val;
+ break;
+ case CLOCK_BOOTTIME:
+ ns_offsets->monotonic_boottime_offset = off->val;
+ break;
+ default:
+ goto out;
+ }
+ }
+
+ err = 0;
+out:
+ put_time_ns(time_ns);
+
+ return err;
+}
+
static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
{
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
--
2.20.1


2019-02-06 00:14:17

by Dmitry Safonov

Subject: [PATCH 11/32] x86/vdso/Makefile: Add vobjs32

As with the 64-bit vdso objects, keep the ia32/i386 objects in a list
variable. This prepares the ground for avoiding code duplication when
the timens vdso is introduced.

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/Makefile | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 5bfe2243a08f..bc0bdbf49397 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -18,6 +18,8 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y

# files to link into the vdso
vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
+vobjs32-y += vdso32/vclock_gettime.o

# files to link into kernel
obj-y += vma.o
@@ -31,10 +33,12 @@ vdso_img-$(VDSO32-y) += 32
obj-$(VDSO32-y) += vdso32-setup.o

vobjs := $(foreach F,$(vobjs-y),$(obj)/$F)
+vobjs32 := $(foreach F,$(vobjs32-y),$(obj)/$F)

$(obj)/vdso.o: $(obj)/vdso.so

targets += vdso.lds $(vobjs-y)
+targets += vdso32/vdso32.lds $(vobjs32-y)

# Build the vDSO image C files and link them in.
vdso_img_objs := $(vdso_img-y:%=vdso-image-%.o)
@@ -125,10 +129,6 @@ $(obj)/vdsox32.so.dbg: $(obj)/vdsox32.lds $(vobjx32s) FORCE
CPPFLAGS_vdso32.lds = $(CPPFLAGS_vdso.lds)
VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -soname linux-gate.so.1

-targets += vdso32/vdso32.lds
-targets += vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
-targets += vdso32/vclock_gettime.o
-
KBUILD_AFLAGS_32 := $(filter-out -m64,$(KBUILD_AFLAGS)) -DBUILD_VDSO
$(obj)/vdso32.so.dbg: KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
$(obj)/vdso32.so.dbg: asflags-$(CONFIG_X86_64) += -m32
@@ -153,12 +153,7 @@ endif

$(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)

-$(obj)/vdso32.so.dbg: FORCE \
- $(obj)/vdso32/vdso32.lds \
- $(obj)/vdso32/vclock_gettime.o \
- $(obj)/vdso32/note.o \
- $(obj)/vdso32/system_call.o \
- $(obj)/vdso32/sigreturn.o
+$(obj)/vdso32.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32) FORCE
$(call if_changed,vdso)

#
--
2.20.1


2019-02-06 00:14:26

by Dmitry Safonov

Subject: [PATCH 10/32] x86/vdso2c: Convert iterator to unsigned

i and j are used everywhere with unsigned types.
Clean up and prettify the code a bit.

Introduce syms_nr for readability and as a preparation for allocating an
array of vDSO entries that will be needed for creating two vdso .so's:
one for host tasks and another for processes inside time namespace.

Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/vdso2c.h | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index fa847a620f40..61c8bb2e5af8 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -13,7 +13,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned long load_size = -1; /* Work around bogus warning */
unsigned long mapping_size;
ELF(Ehdr) *hdr = (ELF(Ehdr) *)raw_addr;
- int i;
+ unsigned int i, syms_nr;
unsigned long j;
ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
*alt_sec = NULL;
@@ -86,11 +86,10 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
strtab_hdr = raw_addr + GET_LE(&hdr->e_shoff) +
GET_LE(&hdr->e_shentsize) * GET_LE(&symtab_hdr->sh_link);

+ syms_nr = GET_LE(&symtab_hdr->sh_size) / GET_LE(&symtab_hdr->sh_entsize);
/* Walk the symbol table */
- for (i = 0;
- i < GET_LE(&symtab_hdr->sh_size) / GET_LE(&symtab_hdr->sh_entsize);
- i++) {
- int k;
+ for (i = 0; i < syms_nr; i++) {
+ unsigned int k;
ELF(Sym) *sym = raw_addr + GET_LE(&symtab_hdr->sh_offset) +
GET_LE(&symtab_hdr->sh_entsize) * i;
const char *name = raw_addr + GET_LE(&strtab_hdr->sh_offset) +
--
2.20.1


2019-02-06 00:14:35

by Dmitry Safonov

Subject: [PATCH 22/32] timens: Add align for timens_offsets

Align offsets so that time namespace will work for ia32 applications on
x86_64 host.

Signed-off-by: Dmitry Safonov <[email protected]>
---
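The layout problem can be reproduced with a stand-in structure. On ia32 a 64-bit integer inside a struct is only 4-byte aligned, so a { 64-bit seconds, long nanoseconds } pair is 12 bytes there and 16 bytes on x86_64; forcing 8-byte alignment on each member puts the second offset at byte 16 in both ABIs. `struct ts64` and `struct offsets_demo` are assumptions made for this sketch, modelled on timespec64 and timens_offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct timespec64 (assumed layout). */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Mirrors the shape of struct timens_offsets with explicit alignment. */
struct offsets_demo {
	struct ts64 monotonic __attribute__((aligned(8)));
	struct ts64 boottime __attribute__((aligned(8)));
};

/* Holds for both -m64 and -m32 builds because of aligned(8). */
_Static_assert(offsetof(struct offsets_demo, boottime) == 16,
	       "second offset must start at byte 16 in both ABIs");
```

The alignment only fixes where each offset starts; tv_nsec is still a 4-byte field in a 32-bit build, which is the conversion the header comment in this patch refers to.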
include/linux/timens_offsets.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 777530c46852..f2a03d4f7a91 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -2,9 +2,17 @@
#ifndef _LINUX_TIME_OFFSETS_H
#define _LINUX_TIME_OFFSETS_H

+/*
+ * Time offsets need to be aligned as they're placed on the VVAR page,
+ * which is used by both x86_64 and ia32 VDSO code.
+ * On ia32 offset::tv_sec (u64) has align(4), so re-align offsets
+ * to the same positions as 64-bit offsets.
+ * On 64-bit big-endian systems VDSO should convert timespec64
+ * to timespec because of the padding occurring between the fields.
+ */
struct timens_offsets {
- struct timespec64 monotonic_time_offset;
- struct timespec64 monotonic_boottime_offset;
+ struct timespec64 monotonic_time_offset __aligned(8);
+ struct timespec64 monotonic_boottime_offset __aligned(8);
};

#endif
--
2.20.1


2019-02-06 00:14:46

by Dmitry Safonov

Subject: [PATCH 15/32] x86/vdso2c: Optionally produce linker script for vdso entries

The two VDSO images (with and without the code for adding offsets inside
a timens) must have identical VDSO function offsets; this way the kernel
can remap the VDSO VMA for a task without fixing up the GOT/PLT.

Add an optional parameter to vdso2c to generate a .entries file from
vdso.so. As the timens VDSO by nature has a bigger .text than the VDSO
for tasks outside a namespace, this parameter will be used to generate
the .entries file from the timens VDSO and include those alignments in
the linker script used for building the !timens VDSO.

Signed-off-by: Dmitry Safonov <[email protected]>
---
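The symbol filter added to the table walk below can be tried on its own with the macros from <elf.h>. This sketch handles only the 64-bit case and reads st_info directly; vdso2c goes through its ELF_FUNC() macro so that one body serves both the 32-bit and 64-bit variants. `is_global_func()` is a hypothetical name.

```c
#include <elf.h>

/* Return nonzero for symbols that are global functions. */
static int is_global_func(const Elf64_Sym *sym)
{
	return ELF64_ST_BIND(sym->st_info) == STB_GLOBAL &&
	       ELF64_ST_TYPE(sym->st_info) == STT_FUNC;
}
```

Only symbols passing this check get a location-counter assignment in the generated .entries file at this point in the series.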
arch/x86/entry/vdso/.gitignore | 1 +
arch/x86/entry/vdso/Makefile | 7 ++++---
arch/x86/entry/vdso/vdso2c.c | 26 +++++++++++++++++++-------
arch/x86/entry/vdso/vdso2c.h | 16 +++++++++++++++-
4 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/vdso/.gitignore b/arch/x86/entry/vdso/.gitignore
index aae8ffdd5880..9ab4fa4c7e7b 100644
--- a/arch/x86/entry/vdso/.gitignore
+++ b/arch/x86/entry/vdso/.gitignore
@@ -5,3 +5,4 @@ vdso32-sysenter-syms.lds
vdso32-int80-syms.lds
vdso-image-*.c
vdso2c
+vdso*.entries
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 55ba81d4415c..ccb572831ea1 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -47,8 +47,8 @@ vobjs32-timens := $(foreach F,$(vobjs32-timens-y),$(obj)/$F)

$(obj)/vdso.o: $(obj)/vdso.so

-targets += vdso.lds $(vobjs-y) $(vobjs-timens-y)
-targets += vdso32/vdso32.lds $(vobjs32-y) $(vobjs32-timens-y)
+targets += vdso.lds $(vobjs-y) $(vobjs-timens-y) vdso64.entries
+targets += vdso32/vdso32.lds $(vobjs32-y) $(vobjs32-timens-y) vdso32.entries

# Build the vDSO image C files and link them in.
vdso_img_objs := $(vdso_img-y:%=vdso-image-%.o)
@@ -73,7 +73,8 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi -I$(src
hostprogs-y += vdso2c

quiet_cmd_vdso2c = VDSO2C $@
- cmd_vdso2c = $(obj)/vdso2c $< $(<:%.dbg=%) $@
+ cmd_vdso2c = $(obj)/vdso2c $< $(<:%.dbg=%) $@ \
+ $(filter %.entries,$(<:%-timens.so.dbg=%.entries))

$(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index ed66b023d4b9..72731c4cfdce 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -152,6 +152,7 @@ extern void bad_put_le(void);
#define BITSFUNC3(name, bits, suffix) name##bits##suffix
#define BITSFUNC2(name, bits, suffix) BITSFUNC3(name, bits, suffix)
#define BITSFUNC(name) BITSFUNC2(name, ELF_BITS, )
+#define ELF_FUNC(f, x) (BITSFUNC2(ELF, ELF_BITS, _##f)(x))

#define INT_BITS BITSFUNC2(int, ELF_BITS, _t)

@@ -169,16 +170,17 @@ extern void bad_put_le(void);

static void go(void *raw_addr, size_t raw_len,
void *stripped_addr, size_t stripped_len,
- FILE *outfile, const char *name)
+ FILE *outfile, const char *name,
+ FILE *out_entries_lds)
{
Elf64_Ehdr *hdr = (Elf64_Ehdr *)raw_addr;

if (hdr->e_ident[EI_CLASS] == ELFCLASS64) {
go64(raw_addr, raw_len, stripped_addr, stripped_len,
- outfile, name);
+ outfile, name, out_entries_lds);
} else if (hdr->e_ident[EI_CLASS] == ELFCLASS32) {
go32(raw_addr, raw_len, stripped_addr, stripped_len,
- outfile, name);
+ outfile, name, out_entries_lds);
} else {
fail("unknown ELF class\n");
}
@@ -208,12 +210,12 @@ int main(int argc, char **argv)
{
size_t raw_len, stripped_len;
void *raw_addr, *stripped_addr;
- FILE *outfile;
+ FILE *outfile, *entries_lds = NULL;
char *name, *tmp;
int namelen;

- if (argc != 4) {
- printf("Usage: vdso2c RAW_INPUT STRIPPED_INPUT OUTPUT\n");
+ if (argc < 4) {
+ printf("Usage: vdso2c RAW_INPUT STRIPPED_INPUT OUTPUT [OUTPUT_ENTRIES.LDS]\n");
return 1;
}

@@ -245,11 +247,21 @@ int main(int argc, char **argv)
if (!outfile)
err(1, "fopen(%s)", outfilename);

- go(raw_addr, raw_len, stripped_addr, stripped_len, outfile, name);
+ if (argc == 5) {
+ entries_lds = fopen(argv[4], "w");
+ if (!entries_lds) {
+ fclose(outfile);
+ err(1, "fopen(%s)", argv[4]);
+ }
+ }
+
+ go(raw_addr, raw_len, stripped_addr, stripped_len, outfile, name, entries_lds);

munmap(raw_addr, raw_len);
munmap(stripped_addr, stripped_len);
fclose(outfile);
+ if (entries_lds)
+ fclose(entries_lds);

return 0;
}
diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 61c8bb2e5af8..065dac6c29c8 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -7,7 +7,8 @@

static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
void *stripped_addr, size_t stripped_len,
- FILE *outfile, const char *name)
+ FILE *outfile, const char *name,
+ FILE *out_entries_lds)
{
int found_load = 0;
unsigned long load_size = -1; /* Work around bogus warning */
@@ -111,6 +112,19 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
syms[k] = GET_LE(&sym->st_value);
}
}
+
+ if (!out_entries_lds)
+ continue;
+
+ if (ELF_FUNC(ST_BIND, sym->st_info) != STB_GLOBAL)
+ continue;
+
+ if (ELF_FUNC(ST_TYPE, sym->st_info) != STT_FUNC)
+ continue;
+
+ fprintf(out_entries_lds, "\t\t. = ABSOLUTE(%#lx);\n",
+ (unsigned long)GET_LE(&sym->st_value));
+ fprintf(out_entries_lds, "\t\t*(.text.%s*)\n", name);
}

/* Validate mapping addresses. */
--
2.20.1


2019-02-06 00:14:48

by Dmitry Safonov

Subject: [PATCH 07/32] timens/kernel: Take into account timens clock offsets in clock_nanosleep

From: Andrei Vagin <[email protected]>

Wire up clock_nanosleep() to timens offsets.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
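The conversion applied to absolute sleeps can be sketched as plain timespec arithmetic. This assumes, as the cover letter describes, that namespace time is host time plus a per-namespace offset, so an absolute deadline given in namespace time maps to host time by subtracting that offset. timens_clock_to_host() itself is not shown in this part of the thread; `ns_to_host()` is a hypothetical userspace illustration and expects both inputs to be normalized.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Convert an absolute namespace time to host time by subtracting the
 * namespace offset, keeping tv_nsec in [0, NSEC_PER_SEC).
 */
static struct timespec ns_to_host(struct timespec ns_time, struct timespec offset)
{
	struct timespec host;

	host.tv_sec = ns_time.tv_sec - offset.tv_sec;
	host.tv_nsec = ns_time.tv_nsec - offset.tv_nsec;
	if (host.tv_nsec < 0) {
		host.tv_nsec += NSEC_PER_SEC;
		host.tv_sec--;
	}
	return host;
}
```

Relative sleeps need no such adjustment, which matches the HRTIMER_MODE_REL check in the hunk below.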
kernel/time/hrtimer.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f5cfa1b73d6f..47345aea406d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,6 +42,7 @@
#include <linux/timer.h>
#include <linux/freezer.h>
#include <linux/compat.h>
+#include <linux/time_namespace.h>

#include <linux/uaccess.h>

@@ -1721,9 +1722,16 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
{
struct restart_block *restart;
struct hrtimer_sleeper t;
+ struct timespec64 tp;
int ret = 0;
u64 slack;

+ if (!(mode & HRTIMER_MODE_REL)) {
+ tp = *rqtp;
+ rqtp = &tp;
+ timens_clock_to_host(clockid, &tp);
+ }
+
slack = current->timer_slack_ns;
if (dl_task(current) || rt_task(current))
slack = 0;
--
2.20.1


2019-02-06 00:14:51

by Dmitry Safonov

Subject: [PATCH 13/32] x86/vdso: Build timens .so(s)

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp.

Those effects of introducing the time namespace are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address those problems, build two VDSO images instead.

At this moment timens is unsupported for x32 binaries (only x86_64 and
ia32). This may be added on top afterwards.
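
The idea of building two specialised variants from one source can be
sketched in a single file. This patch does it by defining
BUILD_VDSO_TIME_NS and including vclock_gettime.c a second time; the
sketch below uses a macro instead so that it stays self-contained. The
names `DEFINE_GETTIME`, `gettime_host`, `gettime_timens` and
`demo_offset` are hypothetical:

```c
#include <time.h>

/* Hypothetical offset; the series reads it from a per-namespace page. */
static const struct timespec demo_offset = { .tv_sec = 10, .tv_nsec = 0 };

/*
 * Expand one function body into two specialised variants. WITH_TIMENS
 * is a literal 0 or 1, so the compiler can drop the dead branch and the
 * host variant carries no runtime check. Nanosecond carry is omitted
 * because demo_offset has tv_nsec == 0.
 */
#define DEFINE_GETTIME(name, WITH_TIMENS)				\
static struct timespec name(struct timespec host_now)			\
{									\
	if (WITH_TIMENS) {						\
		host_now.tv_sec += demo_offset.tv_sec;			\
		host_now.tv_nsec += demo_offset.tv_nsec;		\
	}								\
	return host_now;						\
}

DEFINE_GETTIME(gettime_host, 0)    /* variant for host tasks */
DEFINE_GETTIME(gettime_timens, 1)  /* variant for tasks in a timens */
```

The cost is a second copy of the code, which is the same trade-off this
patch makes by shipping two images.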

Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/x86/entry/vdso/Makefile | 28 +++++++++++++++++--
arch/x86/entry/vdso/vclock_gettime-timens.c | 6 ++++
.../entry/vdso/vdso32/vclock_gettime-timens.c | 6 ++++
3 files changed, 37 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/entry/vdso/vclock_gettime-timens.c
create mode 100644 arch/x86/entry/vdso/vdso32/vclock_gettime-timens.c

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index bc0bdbf49397..2433ed9342fd 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -17,8 +17,12 @@ VDSO32-$(CONFIG_X86_32) := y
VDSO32-$(CONFIG_IA32_EMULATION) := y

# files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs-y := vdso-note.o vgetcpu.o
+vobjs-timens-y := $(vobjs-y) vclock_gettime-timens.o
+vobjs-y += vclock_gettime.o
+
vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
+vobjs32-timens-y := $(vobjs32-y) vdso32/vclock_gettime-timens.o
vobjs32-y += vdso32/vclock_gettime.o

# files to link into kernel
@@ -30,15 +34,21 @@ vdso_img-$(VDSO64-y) += 64
vdso_img-$(VDSOX32-y) += x32
vdso_img-$(VDSO32-y) += 32

+vdso_timens_img-$(VDSO64-y) += 64-timens
+vdso_timens_img-$(VDSO32-y) += 32-timens
+vdso_img-$(CONFIG_TIME_NS) += $(vdso_timens_img-y)
+
obj-$(VDSO32-y) += vdso32-setup.o

vobjs := $(foreach F,$(vobjs-y),$(obj)/$F)
vobjs32 := $(foreach F,$(vobjs32-y),$(obj)/$F)
+vobjs-timens := $(foreach F,$(vobjs-timens-y),$(obj)/$F)
+vobjs32-timens := $(foreach F,$(vobjs32-timens-y),$(obj)/$F)

$(obj)/vdso.o: $(obj)/vdso.so

-targets += vdso.lds $(vobjs-y)
-targets += vdso32/vdso32.lds $(vobjs32-y)
+targets += vdso.lds $(vobjs-y) $(vobjs-timens-y)
+targets += vdso32/vdso32.lds $(vobjs32-y) $(vobjs32-timens-y)

# Build the vDSO image C files and link them in.
vdso_img_objs := $(vdso_img-y:%=vdso-image-%.o)
@@ -53,6 +63,9 @@ CPPFLAGS_vdso.lds += -P -C
VDSO_LDFLAGS_vdso.lds = -m elf_x86_64 -soname linux-vdso.so.1 --no-undefined \
-z max-page-size=4096

+$(obj)/vdso64-timens.so.dbg: $(obj)/vdso.lds $(vobjs-timens) FORCE
+ $(call if_changed,vdso)
+
$(obj)/vdso64.so.dbg: $(obj)/vdso.lds $(vobjs) FORCE
$(call if_changed,vdso)

@@ -81,12 +94,14 @@ endif
endif

$(vobjs): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs-timens): KBUILD_CFLAGS := $(filter-out $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)

#
# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
#
CFLAGS_REMOVE_vdso-note.o = -pg
CFLAGS_REMOVE_vclock_gettime.o = -pg
+CFLAGS_REMOVE_vclock_gettime-timens.o = -pg
CFLAGS_REMOVE_vgetcpu.o = -pg
CFLAGS_REMOVE_vvar.o = -pg

@@ -132,6 +147,8 @@ VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -soname linux-gate.so.1
KBUILD_AFLAGS_32 := $(filter-out -m64,$(KBUILD_AFLAGS)) -DBUILD_VDSO
$(obj)/vdso32.so.dbg: KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
$(obj)/vdso32.so.dbg: asflags-$(CONFIG_X86_64) += -m32
+$(obj)/vdso32-timens.so.dbg: KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
+$(obj)/vdso32-timens.so.dbg: asflags-$(CONFIG_X86_64) += -m32

KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
@@ -152,6 +169,10 @@ endif
endif

$(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
+$(obj)/vdso32-timens.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
+
+$(obj)/vdso32-timens.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32-timens) FORCE
+ $(call if_changed,vdso)

$(obj)/vdso32.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32) FORCE
$(call if_changed,vdso)
@@ -198,3 +219,4 @@ PHONY += vdso_install $(vdso_img_insttargets)
vdso_install: $(vdso_img_insttargets)

clean-files := vdso32.so vdso32.so.dbg vdso64* vdso-image-*.c vdsox32.so*
+clean-files += vdso32-timens.so vdso32-timens.so.dbg
diff --git a/arch/x86/entry/vdso/vclock_gettime-timens.c b/arch/x86/entry/vdso/vclock_gettime-timens.c
new file mode 100644
index 000000000000..9e92315f83db
--- /dev/null
+++ b/arch/x86/entry/vdso/vclock_gettime-timens.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define BUILD_VDSO_TIME_NS
+
+#include "vclock_gettime.c"
+
diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime-timens.c b/arch/x86/entry/vdso/vdso32/vclock_gettime-timens.c
new file mode 100644
index 000000000000..9e92315f83db
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime-timens.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define BUILD_VDSO_TIME_NS
+
+#include "vclock_gettime.c"
+
--
2.20.1


2019-02-06 00:14:56

by Dmitry Safonov

Subject: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

From: Andrei Vagin <[email protected]>

Make timerfd respect timens offsets.
Provide two helpers, timens_clock_to_host() and
timens_clock_from_host(), that are useful for wiring up timens to
different kernel subsystems. Following patches will use
timens_clock_from_host(); it is added here for completeness.
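
The two helpers are inverses of each other, and the validity check in the
hunk below runs after the translation. A rough userspace model, assuming
a namespace clock reads as host time plus the offset. `struct ts_model`
and the `model_*` functions are hypothetical stand-ins, and the validity
rule is a simplified one:

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for struct timespec64. */
struct ts_model {
	int64_t tv_sec;
	long tv_nsec;
};

/* Bring tv_nsec back into [0, NSEC_PER_SEC) after one add or subtract. */
static void ts_normalize(struct ts_model *ts)
{
	if (ts->tv_nsec >= NSEC_PER_SEC) {
		ts->tv_nsec -= NSEC_PER_SEC;
		ts->tv_sec++;
	} else if (ts->tv_nsec < 0) {
		ts->tv_nsec += NSEC_PER_SEC;
		ts->tv_sec--;
	}
}

/* Namespace time -> host time: subtract the namespace offset. */
static void model_clock_to_host(struct ts_model *val,
				const struct ts_model *off)
{
	val->tv_sec -= off->tv_sec;
	val->tv_nsec -= off->tv_nsec;
	ts_normalize(val);
}

/* Host time -> namespace time: add the namespace offset. */
static void model_clock_from_host(struct ts_model *val,
				  const struct ts_model *off)
{
	val->tv_sec += off->tv_sec;
	val->tv_nsec += off->tv_nsec;
	ts_normalize(val);
}

/* Simplified validity rule: non-negative seconds, nanoseconds in range. */
static bool model_deadline_valid(const struct ts_model *ts)
{
	return ts->tv_sec >= 0 &&
	       ts->tv_nsec >= 0 && ts->tv_nsec < NSEC_PER_SEC;
}
```

In this model, a deadline that lies before the namespace offset becomes
negative in host time and is rejected by the validity check.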

Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
fs/timerfd.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 803ca070d42e..c7ae1e371912 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
#include <linux/syscalls.h>
#include <linux/compat.h>
#include <linux/rcupdate.h>
+#include <linux/time_namespace.h>

struct timerfd_ctx {
union {
@@ -433,22 +434,27 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
}

static int do_timerfd_settime(int ufd, int flags,
- const struct itimerspec64 *new,
+ struct itimerspec64 *new,
struct itimerspec64 *old)
{
struct fd f;
struct timerfd_ctx *ctx;
int ret;

- if ((flags & ~TFD_SETTIME_FLAGS) ||
- !itimerspec64_valid(new))
- return -EINVAL;
-
ret = timerfd_fget(ufd, &f);
if (ret)
return ret;
ctx = f.file->private_data;

+ if (flags & TFD_TIMER_ABSTIME)
+ timens_clock_to_host(ctx->clockid, &new->it_value);
+
+ if ((flags & ~TFD_SETTIME_FLAGS) ||
+ !itimerspec64_valid(new)) {
+ fdput(f);
+ return -EINVAL;
+ }
+
if (isalarm(ctx) && !capable(CAP_WAKE_ALARM)) {
fdput(f);
return -EPERM;
--
2.20.1


2019-02-06 00:14:57

by Dmitry Safonov

Subject: [PATCH 01/32] ns: Introduce Time Namespace

From: Andrei Vagin <[email protected]>

Time Namespace isolates clock values.

The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.

CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.

CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.

For many users, the time namespace means the ability to change the date
and time in a container (CLOCK_REALTIME).

But in the context of checkpoint/restore functionality, the monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, we have to
guarantee that they never go backward. In the ideal case, the behavior of
these clocks should be the same as when the whole system is suspended.
All this means that we need to be able to set the CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be done by adding per-namespace
offsets for clocks.
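
The restore case comes down to one subtraction. A minimal sketch,
assuming clocks are flattened to signed nanoseconds and that the
restoring tool picks the offset as the dumped value minus the destination
host's current value. `restore_offset_ns()` and `ns_clock_read()` are
hypothetical names used only for illustration:

```c
#include <stdint.h>

/* Offset to install so the namespace clock resumes at @dumped_ns. */
static int64_t restore_offset_ns(int64_t dumped_ns, int64_t dest_host_ns)
{
	return dumped_ns - dest_host_ns;
}

/* What a task inside the namespace reads: host clock plus offset. */
static int64_t ns_clock_read(int64_t host_ns, int64_t offset_ns)
{
	return host_ns + offset_ns;
}
```

If the clock was dumped at 5000 s and the destination host's monotonic
clock reads 120 s, the offset is 4880 s and the restored tasks see 5000 s
again, moving forward from there. The offset is negative when the
destination host has been up longer than the source.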

A time namespace is similar to a pid namespace in the way it is
created: the unshare(CLONE_NEWTIME) system call creates a new time
namespace, but doesn't move the current process into it. All children
of the process will then be born in the new time namespace, or a
process can use the setns() system call to join a namespace.

This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
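
The time_ns / time_ns_for_children split can be modelled in a few lines.
This is a toy model, not kernel code: namespaces are plain integer ids
and `struct nsproxy_model` and the `model_*` functions are hypothetical
names. It follows the behaviour described above and implemented by
timens_install() and timens_on_fork() in this patch:

```c
/* Toy model: a namespace is just an integer id, 0 is the initial one. */
struct nsproxy_model {
	int time_ns;               /* namespace the task itself runs in */
	int time_ns_for_children;  /* namespace its future children get */
};

static int next_ns_id = 1;

/* unshare(CLONE_NEWTIME): only the for_children slot changes. */
static void model_unshare_newtime(struct nsproxy_model *task)
{
	task->time_ns_for_children = next_ns_id++;
}

/* fork(): the child starts in the parent's for_children namespace. */
static struct nsproxy_model model_fork(const struct nsproxy_model *parent)
{
	struct nsproxy_model child = {
		.time_ns = parent->time_ns_for_children,
		.time_ns_for_children = parent->time_ns_for_children,
	};
	return child;
}

/* setns(): the caller itself joins the target namespace. */
static void model_setns(struct nsproxy_model *task, int target)
{
	task->time_ns = target;
	task->time_ns_for_children = target;
}
```

Between model_unshare_newtime() and the first model_fork() the new
namespace has no members, which is the window in which its offsets can be
set.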

Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
MAINTAINERS | 2 +
fs/proc/namespaces.c | 4 +
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 2 +
include/linux/time_namespace.h | 69 +++++++++++
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 1 +
init/Kconfig | 7 ++
kernel/Makefile | 1 +
kernel/fork.c | 3 +-
kernel/nsproxy.c | 41 +++++--
kernel/time_namespace.c | 215 +++++++++++++++++++++++++++++++++
12 files changed, 340 insertions(+), 8 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 kernel/time_namespace.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8c68de3cfd80..e03f160012e3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12144,6 +12144,8 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
S: Maintained
F: fs/timerfd.c
F: include/linux/timer*
+F: include/linux/time_namespace.h
+F: kernel/time_namespace.c
F: kernel/time/*timer*

POWER MANAGEMENT CORE
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..8b5c720fe5d7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,10 @@ static const struct proc_ns_operations *ns_entries[] = {
#ifdef CONFIG_CGROUPS
&cgroupns_operations,
#endif
+#ifdef CONFIG_TIME_NS
+ &timens_operations,
+ &timens_for_children_operations,
+#endif
};

static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..074f395b9ad2 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,8 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct time_namespace *time_ns;
+ struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..3e6f332da465 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,8 @@ extern const struct proc_ns_operations pidns_for_children_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations timens_operations;
+extern const struct proc_ns_operations timens_for_children_operations;

/*
* We always define these enumerators
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
new file mode 100644
index 000000000000..9507ed7072fe
--- /dev/null
+++ b/include/linux/time_namespace.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIMENS_H
+#define _LINUX_TIMENS_H
+
+
+#include <linux/sched.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/ns_common.h>
+#include <linux/err.h>
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+struct time_namespace {
+ struct kref kref;
+ struct user_namespace *user_ns;
+ struct ucounts *ucounts;
+ struct ns_common ns;
+ struct timens_offsets *offsets;
+ bool initialized;
+} __randomize_layout;
+extern struct time_namespace init_time_ns;
+
+#ifdef CONFIG_TIME_NS
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+ kref_get(&ns->kref);
+ return ns;
+}
+
+extern struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns);
+extern void free_time_ns(struct kref *kref);
+extern int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+ kref_put(&ns->kref, free_time_ns);
+}
+
+
+#else
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+ return NULL;
+}
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+}
+
+static inline struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (flags & CLONE_NEWTIME)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+static inline int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+ return 0;
+}
+
+#endif
+
+#endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d6b74b91096b..bf84f93dc411 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -45,6 +45,7 @@ enum ucount_type {
UCOUNT_NET_NAMESPACES,
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
+ UCOUNT_TIME_NAMESPACES,
#ifdef CONFIG_INOTIFY_USER
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..adffac53c76e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_NEWTIME 0x00001000 /* New time namespace */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..03ed7b2694b5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -965,6 +965,13 @@ config UTS_NS
In this namespace tasks see different info provided with the
uname() system call

+config TIME_NS
+ bool "TIME namespace"
+ default y
+ help
+ In this namespace boottime and monotonic clocks can be set.
+ The time will keep going with the same pace.
+
config IPC_NS
bool "IPC namespace"
depends on (SYSVIPC || POSIX_MQUEUE)
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543bcdb2..62c83975937f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -67,6 +67,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
obj-$(CONFIG_UTS_NS) += utsname.o
+obj-$(CONFIG_TIME_NS) += time_namespace.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
obj-$(CONFIG_IKCONFIG) += configs.o
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..c653a8a62fec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2426,7 +2426,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
+ CLONE_NEWTIME))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..586c9e2017dc 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,7 @@
#include <linux/pid_namespace.h>
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
+#include <linux/time_namespace.h>
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
@@ -44,6 +45,10 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_CGROUPS
.cgroup_ns = &init_cgroup_ns,
#endif
+#ifdef CONFIG_TIME_NS
+ .time_ns = &init_time_ns,
+ .time_ns_for_children = &init_time_ns,
+#endif
};

static inline struct nsproxy *create_nsproxy(void)
@@ -110,8 +115,18 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_net;
}

+ new_nsp->time_ns_for_children = copy_time_ns(flags, user_ns,
+ tsk->nsproxy->time_ns_for_children);
+ if (IS_ERR(new_nsp->time_ns_for_children)) {
+ err = PTR_ERR(new_nsp->time_ns_for_children);
+ goto out_time;
+ }
+ new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns);
+
return new_nsp;

+out_time:
+ put_net(new_nsp->net_ns);
out_net:
put_cgroup_ns(new_nsp->cgroup_ns);
out_cgroup:
@@ -140,15 +155,16 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+ int ret;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
- CLONE_NEWCGROUP)))) {
- get_nsproxy(old_ns);
- return 0;
- }
-
- if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
+ if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
+ get_nsproxy(old_ns);
+ return 0;
+ }
+ } else if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;

/*
@@ -166,6 +182,12 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);

+ ret = timens_on_fork(new_ns, tsk);
+ if (ret) {
+ free_nsproxy(new_ns);
+ return ret;
+ }
+
tsk->nsproxy = new_ns;
return 0;
}
@@ -180,6 +202,10 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->time_ns)
+ put_time_ns(ns->time_ns);
+ if (ns->time_ns_for_children)
+ put_time_ns(ns->time_ns_for_children);
put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
@@ -196,7 +222,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
+ CLONE_NEWTIME)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
new file mode 100644
index 000000000000..8c600df9771d
--- /dev/null
+++ b/kernel/time_namespace.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Andrei Vagin <[email protected]>
+ * Author: Dmitry Safonov <[email protected]>
+ */
+
+#include <linux/export.h>
+#include <linux/time.h>
+#include <linux/time_namespace.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/user_namespace.h>
+#include <linux/proc_ns.h>
+#include <linux/sched/task.h>
+
+static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
+{
+ return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
+}
+
+static void dec_time_namespaces(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
+}
+
+static struct time_namespace *create_time_ns(void)
+{
+ struct time_namespace *time_ns;
+
+ time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
+ if (time_ns) {
+ kref_init(&time_ns->kref);
+ time_ns->initialized = false;
+ }
+ return time_ns;
+}
+
+/*
+ * Clone a new ns copying @old_ns, setting refcount to 1
+ * @old_ns: namespace to clone
+ * Return the new ns or ERR_PTR.
+ */
+static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
+ struct time_namespace *old_ns)
+{
+ struct time_namespace *ns;
+ struct ucounts *ucounts;
+ int err;
+
+ err = -ENOSPC;
+ ucounts = inc_time_namespaces(user_ns);
+ if (!ucounts)
+ goto fail;
+
+ err = -ENOMEM;
+ ns = create_time_ns();
+ if (!ns)
+ goto fail_dec;
+
+ err = ns_alloc_inum(&ns->ns);
+ if (err)
+ goto fail_free;
+
+ ns->ucounts = ucounts;
+ ns->ns.ops = &timens_operations;
+ ns->user_ns = get_user_ns(user_ns);
+ return ns;
+
+fail_free:
+ kfree(ns);
+fail_dec:
+ dec_time_namespaces(ucounts);
+fail:
+ return ERR_PTR(err);
+}
+
+/*
+ * Add a reference to old_ns, or clone it if @flags specify CLONE_NEWTIME.
+ * In latter case, changes to the time of this process won't be seen by parent,
+ * and vice versa.
+ */
+struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWTIME))
+ return get_time_ns(old_ns);
+
+ return clone_time_ns(user_ns, old_ns);
+}
+
+void free_time_ns(struct kref *kref)
+{
+ struct time_namespace *ns;
+
+ ns = container_of(kref, struct time_namespace, kref);
+ dec_time_namespaces(ns->ucounts);
+ put_user_ns(ns->user_ns);
+ ns_free_inum(&ns->ns);
+ kfree(ns);
+}
+
+static struct time_namespace *to_time_ns(struct ns_common *ns)
+{
+ return container_of(ns, struct time_namespace, ns);
+}
+
+static struct ns_common *timens_get(struct task_struct *task)
+{
+ struct time_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->time_ns;
+ get_time_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static struct ns_common *timens_for_children_get(struct task_struct *task)
+{
+ struct time_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->time_ns_for_children;
+ get_time_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static void timens_put(struct ns_common *ns)
+{
+ put_time_ns(to_time_ns(ns));
+}
+
+static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
+{
+ struct time_namespace *ns = to_time_ns(new);
+
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+ !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
+ get_time_ns(ns);
+ get_time_ns(ns);
+ put_time_ns(nsproxy->time_ns);
+ put_time_ns(nsproxy->time_ns_for_children);
+ nsproxy->time_ns = ns;
+ nsproxy->time_ns_for_children = ns;
+ ns->initialized = true;
+ return 0;
+}
+
+int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+ struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
+ struct time_namespace *ns = to_time_ns(nsc);
+
+ if (nsproxy->time_ns == nsproxy->time_ns_for_children)
+ return 0;
+
+ get_time_ns(ns);
+ put_time_ns(nsproxy->time_ns);
+ nsproxy->time_ns = ns;
+ ns->initialized = true;
+
+ return 0;
+}
+
+static struct user_namespace *timens_owner(struct ns_common *ns)
+{
+ return to_time_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations timens_operations = {
+ .name = "time",
+ .type = CLONE_NEWTIME,
+ .get = timens_get,
+ .put = timens_put,
+ .install = timens_install,
+ .owner = timens_owner,
+};
+
+const struct proc_ns_operations timens_for_children_operations = {
+ .name = "time_for_children",
+ .type = CLONE_NEWTIME,
+ .get = timens_for_children_get,
+ .put = timens_put,
+ .install = timens_install,
+ .owner = timens_owner,
+};
+
+struct time_namespace init_time_ns = {
+ .kref = KREF_INIT(3),
+ .user_ns = &init_user_ns,
+ .ns.inum = PROC_UTS_INIT_INO,
+#ifdef CONFIG_TIME_NS
+ .ns.ops = &timens_operations,
+#endif
+};
+
+static int __init time_ns_init(void)
+{
+ return 0;
+}
+subsys_initcall(time_ns_init);
--
2.20.1


2019-02-06 00:15:14

by Dmitry Safonov

Subject: [PATCH 04/32] timens: Introduce CLOCK_BOOTTIME offset

From: Andrei Vagin <[email protected]>

Add boottime virtualisation for the time namespace.
Introduce a timespec for the boottime clock into the timens offsets and
wire up the clock_gettime() syscall.
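
The per-clock selection extended by this patch can be sketched as a small
lookup. This is an illustrative model: `struct offsets_model` mirrors the
two fields struct timens_offsets has after this patch, while
`enum model_clock` and `offset_for_clock()` are hypothetical (real code
would switch on the CLOCK_* constants):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct timespec64. */
struct ts_model {
	int64_t tv_sec;
	long tv_nsec;
};

/* Mirrors the fields of struct timens_offsets after this patch. */
struct offsets_model {
	struct ts_model monotonic_time_offset;
	struct ts_model monotonic_boottime_offset;
};

/* Hypothetical clock ids used only in this sketch. */
enum model_clock {
	MODEL_REALTIME,
	MODEL_MONOTONIC,
	MODEL_MONOTONIC_COARSE,
	MODEL_BOOTTIME,
};

/* Return the offset that applies to @clk, or NULL if there is none. */
static const struct ts_model *
offset_for_clock(const struct offsets_model *o, enum model_clock clk)
{
	switch (clk) {
	case MODEL_MONOTONIC:
	case MODEL_MONOTONIC_COARSE:
		return &o->monotonic_time_offset;
	case MODEL_BOOTTIME:
		return &o->monotonic_boottime_offset;
	default:
		return NULL;
	}
}
```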

Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/timens_offsets.h | 1 +
kernel/time/posix-timers.c | 1 +
kernel/time_namespace.c | 3 +++
3 files changed, 5 insertions(+)

diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index 248b0c0bb92a..777530c46852 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -4,6 +4,7 @@

struct timens_offsets {
struct timespec64 monotonic_time_offset;
+ struct timespec64 monotonic_boottime_offset;
};

#endif
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index b6d5145858a3..782708054df2 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1314,6 +1314,7 @@ static const struct k_clock clock_tai = {
static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
.clock_get = posix_get_boottime,
+ .clock_timens_adjust = true,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 57694be9e9db..36b31f234472 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -208,6 +208,9 @@ static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
case CLOCK_MONOTONIC_COARSE:
offsets = &ns_offsets->monotonic_time_offset;
break;
+ case CLOCK_BOOTTIME:
+ offsets = &ns_offsets->monotonic_boottime_offset;
+ break;
}

if (!offsets)
--
2.20.1


2019-02-06 00:15:19

by Dmitry Safonov

Subject: [PATCH 02/32] timens: Add timens_offsets

From: Andrei Vagin <[email protected]>

Introduce offsets for the time namespace. They contain the adjustments
needed to convert clocks to and from the host's.

Allocate one page for each time namespace; it will be premapped into
userspace among the vvar pages.
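
A userspace analogue of the allocation added below, as a sketch only. It
assumes 4 KiB pages and a single offset field (struct timens_offsets is
still empty at this point in the series); `MODEL_PAGE_SIZE`,
`struct offsets_model` and `alloc_offsets_page()` are hypothetical names:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct timespec64. */
struct ts_model {
	int64_t tv_sec;
	long tv_nsec;
};

/* Assumed contents; fields are added by later patches in the series. */
struct offsets_model {
	struct ts_model monotonic_time_offset;
};

/* Compile-time guard, the userspace counterpart of BUILD_BUG_ON(). */
_Static_assert(sizeof(struct offsets_model) <= MODEL_PAGE_SIZE,
	       "offsets must fit in one page");

/* Allocate one zeroed, page-aligned block to hold the offsets. */
static struct offsets_model *alloc_offsets_page(void)
{
	void *page = aligned_alloc(MODEL_PAGE_SIZE, MODEL_PAGE_SIZE);

	if (!page)
		return NULL;
	memset(page, 0, MODEL_PAGE_SIZE);
	return page;
}
```

Giving the offsets a whole page, rather than embedding them in the
namespace structure, is what allows the same memory to be mapped into
userspace later. The caller releases the block with free().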

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
MAINTAINERS | 1 +
include/linux/time_namespace.h | 1 +
include/linux/timens_offsets.h | 8 ++++++++
kernel/time_namespace.c | 14 ++++++++++++--
4 files changed, 22 insertions(+), 2 deletions(-)
create mode 100644 include/linux/timens_offsets.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e03f160012e3..cc9054a74886 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12145,6 +12145,7 @@ S: Maintained
F: fs/timerfd.c
F: include/linux/timer*
F: include/linux/time_namespace.h
+F: include/linux/timens_offsets.h
F: kernel/time_namespace.c
F: kernel/time/*timer*

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 9507ed7072fe..b6985aa87479 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -8,6 +8,7 @@
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
#include <linux/err.h>
+#include <linux/timens_offsets.h>

struct user_namespace;
extern struct user_namespace init_user_ns;
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
new file mode 100644
index 000000000000..7d7cb68ea778
--- /dev/null
+++ b/include/linux/timens_offsets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIME_OFFSETS_H
+#define _LINUX_TIME_OFFSETS_H
+
+struct timens_offsets {
+};
+
+#endif
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 8c600df9771d..4828447721ec 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -13,6 +13,7 @@
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
#include <linux/sched/task.h>
+#include <linux/mm.h>

static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
{
@@ -46,6 +47,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
{
struct time_namespace *ns;
struct ucounts *ucounts;
+ struct page *page;
int err;

err = -ENOSPC;
@@ -58,15 +60,22 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
if (!ns)
goto fail_dec;

+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ goto fail_free;
+ ns->offsets = page_address(page);
+ BUILD_BUG_ON(sizeof(*ns->offsets) > PAGE_SIZE);
+
err = ns_alloc_inum(&ns->ns);
if (err)
- goto fail_free;
+ goto fail_page;

ns->ucounts = ucounts;
ns->ns.ops = &timens_operations;
ns->user_ns = get_user_ns(user_ns);
return ns;
-
+fail_page:
+ free_page((unsigned long)ns->offsets);
fail_free:
kfree(ns);
fail_dec:
@@ -94,6 +103,7 @@ void free_time_ns(struct kref *kref)
struct time_namespace *ns;

ns = container_of(kref, struct time_namespace, kref);
+ free_page((unsigned long)ns->offsets);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
--
2.20.1


2019-02-06 00:15:35

by Dmitry Safonov

Subject: [PATCH 12/32] x86/vdso/timens: Add offsets page in vvar

From: Andrei Vagin <[email protected]>

As modern applications fetch time from the VDSO without entering the
kernel, the offsets have to be provided to userspace code inside a time
namespace.

A page for timens offsets is allocated on time namespace construction.
Put that page into VVAR for tasks inside a timens, and the zero page for
host processes.

As VDSO code is already optimized as much as possible in terms of speed,
any new if-condition in VDSO code is undesirable; the goal is to provide
two .so(s), as was originally suggested by Andy and Thomas:
- for host tasks with optimized-out clk_to_ns() without any penalty
- for processes inside timens with clk_to_ns()
For this purpose, define clk_to_ns() under BUILD_VDSO_TIME_NS, which
will be enabled in the makefile for timens.so in following patches.

VDSO mappings are platform-specific, so add a Kconfig dependency for the
arch.
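
The vvar layout arithmetic from the linker-script hunk in this patch can
be worked through in plain C. This is a sketch under assumptions: it uses
absolute addresses and a 4 KiB page size, whereas the linker script works
with image-relative symbols. `struct vvar_layout` and `vvar_layout_for()`
are hypothetical names:

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

struct vvar_layout {
	uintptr_t vvar_start;
	uintptr_t pvclock_page;
	uintptr_t hvclock_page;
	uintptr_t timens_page;
};

/*
 * Compute where the special pages sit relative to the start of the vDSO
 * text. With time namespaces enabled, one extra page is reserved and
 * the timens page becomes the last page before the text.
 */
static struct vvar_layout vvar_layout_for(uintptr_t vdso_text_start,
					  int have_time_ns)
{
	uintptr_t timens_sz = have_time_ns ? MODEL_PAGE_SIZE : 0;
	struct vvar_layout l;

	l.vvar_start = vdso_text_start - (3 * MODEL_PAGE_SIZE + timens_sz);
	l.pvclock_page = l.vvar_start + MODEL_PAGE_SIZE;
	l.hvclock_page = l.vvar_start + 2 * MODEL_PAGE_SIZE;
	l.timens_page = l.vvar_start + 3 * MODEL_PAGE_SIZE;
	return l;
}
```

For a text start of 0x70000000 with time namespaces enabled, vvar_start
is 0x6fffc000 and timens_page is 0x6ffff000.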

Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
arch/Kconfig | 5 ++++
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vclock_gettime.c | 42 +++++++++++++++++++++++++++
arch/x86/entry/vdso/vdso-layout.lds.S | 9 +++++-
arch/x86/entry/vdso/vdso2c.c | 3 ++
arch/x86/entry/vdso/vma.c | 12 ++++++++
arch/x86/include/asm/vdso.h | 1 +
init/Kconfig | 1 +
8 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..fd2f96993db9 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -704,6 +704,11 @@ config HAVE_ARCH_HASH
config ISA_BUS_API
def_bool ISA

+config ARCH_HAS_VDSO_TIME_NS
+ bool
+ help
+ VDSO can add time-ns offsets without entering kernel.
+
#
# ABI hall of shame
#
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..b415519a293f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -70,6 +70,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
+ select ARCH_HAS_VDSO_TIME_NS
select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 007b3fe9d727..cb55bd994497 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
#include <linux/math64.h>
#include <linux/time.h>
#include <linux/kernel.h>
+#include <linux/timens_offsets.h>

#define gtod (&VVAR(vsyscall_gtod_data))

@@ -38,6 +39,11 @@ extern u8 hvclock_page
__attribute__((visibility("hidden")));
#endif

+#ifdef BUILD_VDSO_TIME_NS
+extern u8 timens_page
+ __attribute__((visibility("hidden")));
+#endif
+
#ifndef BUILD_VDSO32

notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -139,6 +145,38 @@ notrace static inline u64 vgetcyc(int mode)
return U64_MAX;
}

+notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts)
+{
+#ifdef BUILD_VDSO_TIME_NS
+ struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
+ struct timespec64 *offset64;
+
+ switch (clk) {
+ case CLOCK_MONOTONIC:
+ case CLOCK_MONOTONIC_COARSE:
+ case CLOCK_MONOTONIC_RAW:
+ offset64 = &timens->monotonic_time_offset;
+ break;
+ case CLOCK_BOOTTIME:
+ offset64 = &timens->monotonic_boottime_offset;
+ default:
+ return;
+ }
+
+ ts->tv_nsec += offset64->tv_nsec;
+ ts->tv_sec += offset64->tv_sec;
+ if (ts->tv_nsec >= NSEC_PER_SEC) {
+ ts->tv_nsec -= NSEC_PER_SEC;
+ ts->tv_sec++;
+ }
+ if (ts->tv_nsec < 0) {
+ ts->tv_nsec += NSEC_PER_SEC;
+ ts->tv_sec--;
+ }
+
+#endif
+}
+
notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
struct vgtod_ts *base = &gtod->basetime[clk];
@@ -165,6 +203,8 @@ notrace static int do_hres(clockid_t clk, struct timespec *ts)
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;

+ clk_to_ns(clk, ts);
+
return 0;
}

@@ -178,6 +218,8 @@ notrace static void do_coarse(clockid_t clk, struct timespec *ts)
ts->tv_sec = base->sec;
ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
+
+ clk_to_ns(clk, ts);
}

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index 93c6dc7812d0..ba216527e59f 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -7,6 +7,12 @@
* This script controls its layout.
*/

+#ifdef CONFIG_TIME_NS
+# define TIMENS_SZ PAGE_SIZE
+#else
+# define TIMENS_SZ 0
+#endif
+
SECTIONS
{
/*
@@ -16,7 +22,7 @@ SECTIONS
* segment.
*/

- vvar_start = . - 3 * PAGE_SIZE;
+ vvar_start = . - (3 * PAGE_SIZE + TIMENS_SZ);
vvar_page = vvar_start;

/* Place all vvars at the offsets in asm/vvar.h. */
@@ -28,6 +34,7 @@ SECTIONS

pvclock_page = vvar_start + PAGE_SIZE;
hvclock_page = vvar_start + 2 * PAGE_SIZE;
+ timens_page = vvar_start + 3 * PAGE_SIZE;

. = SIZEOF_HEADERS;

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 26d7177c119e..ed66b023d4b9 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -76,6 +76,7 @@ enum {
sym_hpet_page,
sym_pvclock_page,
sym_hvclock_page,
+ sym_timens_page,
};

const int special_pages[] = {
@@ -83,6 +84,7 @@ const int special_pages[] = {
sym_hpet_page,
sym_pvclock_page,
sym_hvclock_page,
+ sym_timens_page,
};

struct vdso_sym {
@@ -96,6 +98,7 @@ struct vdso_sym required_syms[] = {
[sym_hpet_page] = {"hpet_page", true},
[sym_pvclock_page] = {"pvclock_page", true},
[sym_hvclock_page] = {"hvclock_page", true},
+ [sym_timens_page] = {"timens_page", true},
{"VDSO32_NOTE_MASK", true},
{"__kernel_vsyscall", true},
{"__kernel_sigreturn", true},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index babc4e7a519c..d1031db94093 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -14,6 +14,7 @@
#include <linux/elf.h>
#include <linux/cpu.h>
#include <linux/ptrace.h>
+#include <linux/time_namespace.h>
#include <asm/pvclock.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
@@ -23,6 +24,7 @@
#include <asm/desc.h>
#include <asm/cpufeature.h>
#include <asm/mshyperv.h>
+#include <asm/page.h>

#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -123,6 +125,16 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
return vmf_insert_pfn(vma, vmf->address,
vmalloc_to_pfn(tsc_pg));
+ } else if (sym_offset == image->sym_timens_page) {
+ struct time_namespace *ns = current->nsproxy->time_ns;
+ unsigned long pfn;
+
+ if (!ns->offsets)
+ pfn = page_to_pfn(ZERO_PAGE(0));
+ else
+ pfn = page_to_pfn(virt_to_page(ns->offsets));
+
+ return vmf_insert_pfn(vma, vmf->address, pfn);
}

return VM_FAULT_SIGBUS;
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 27566e57e87d..619322065b8e 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -22,6 +22,7 @@ struct vdso_image {
long sym_hpet_page;
long sym_pvclock_page;
long sym_hvclock_page;
+ long sym_timens_page;
long sym_VDSO32_NOTE_MASK;
long sym___kernel_sigreturn;
long sym___kernel_rt_sigreturn;
diff --git a/init/Kconfig b/init/Kconfig
index 03ed7b2694b5..14e94a64064a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -967,6 +967,7 @@ config UTS_NS

config TIME_NS
bool "TIME namespace"
+ depends on ARCH_HAS_VDSO_TIME_NS
default y
help
In this namespace boottime and monotonic clocks can be set.
--
2.20.1
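
The offset handling that clk_to_ns() performs in the patch above can be
illustrated outside the kernel. What follows is a minimal userspace
sketch, not the vDSO code itself: struct ts_sketch, enum clk_sketch and
struct offsets_sketch are hypothetical stand-ins, and the nanosecond
part of an offset is assumed to be smaller than one second in magnitude.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/* Hypothetical stand-ins for struct timespec and the kernel clock ids. */
struct ts_sketch {
	int64_t sec;
	long nsec;
};

enum clk_sketch { CLK_REALTIME, CLK_MONOTONIC, CLK_BOOTTIME };

/* Hypothetical per-namespace offsets, one slot per virtualized clock. */
struct offsets_sketch {
	struct ts_sketch monotonic;
	struct ts_sketch boottime;
};

/* Add the namespace offset for clk to *ts; other clocks are left untouched. */
static void apply_offset_sketch(const struct offsets_sketch *ns,
				enum clk_sketch clk, struct ts_sketch *ts)
{
	const struct ts_sketch *off;

	switch (clk) {
	case CLK_MONOTONIC:
		off = &ns->monotonic;
		break;
	case CLK_BOOTTIME:
		off = &ns->boottime;
		break;
	default:
		return;
	}

	ts->sec += off->sec;
	ts->nsec += off->nsec;

	/* Assumes |off->nsec| < 1s, so a single carry or borrow is enough. */
	if (ts->nsec >= NSEC_PER_SEC_SKETCH) {
		ts->nsec -= NSEC_PER_SEC_SKETCH;
		ts->sec++;
	} else if (ts->nsec < 0) {
		ts->nsec += NSEC_PER_SEC_SKETCH;
		ts->sec--;
	}
}
```

A caller would read the host clock first and then pass the result
through apply_offset_sketch(); a clock without an offset slot comes
back unchanged.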


2019-02-06 00:15:48

by Dmitry Safonov

Subject: [PATCH 08/32] timens: Shift /proc/uptime

Respect boottime inside time namespace for /proc/uptime

Signed-off-by: Dmitry Safonov <[email protected]>
---
fs/proc/uptime.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index a4c2791ab70b..4421ec058472 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/time.h>
+#include <linux/time_namespace.h>
#include <linux/kernel_stat.h>

static int uptime_proc_show(struct seq_file *m, void *v)
@@ -20,6 +21,8 @@ static int uptime_proc_show(struct seq_file *m, void *v)
nsec += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE];

ktime_get_boottime_ts64(&uptime);
+ timens_clock_from_host(CLOCK_BOOTTIME, &uptime);
+
idle.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem);
idle.tv_nsec = rem;
seq_printf(m, "%lu.%02lu %lu.%02lu\n",
--
2.20.1


2019-02-06 00:16:17

by Dmitry Safonov

Subject: [PATCH 06/32] posix-timers/timens: Take into account clock offsets

From: Andrei Vagin <[email protected]>

Wire timer_settime() syscall into time namespace virtualization.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
---
kernel/time/posix-timers.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 782708054df2..d008dfd5b081 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -854,10 +854,6 @@ static int do_timer_settime(timer_t timer_id, int flags,
unsigned long flag;
int error = 0;

- if (!timespec64_valid(&new_spec64->it_interval) ||
- !timespec64_valid(&new_spec64->it_value))
- return -EINVAL;
-
if (old_spec64)
memset(old_spec64, 0, sizeof(*old_spec64));
retry:
@@ -865,6 +861,15 @@ static int do_timer_settime(timer_t timer_id, int flags,
if (!timr)
return -EINVAL;

+ if (flags & TIMER_ABSTIME)
+ timens_clock_to_host(timr->it_clock, &new_spec64->it_value);
+
+ if (!timespec64_valid(&new_spec64->it_interval) ||
+ !timespec64_valid(&new_spec64->it_value)) {
+ unlock_timer(timr, flag);
+ return -EINVAL;
+ }
+
kc = timr->kclock;
if (WARN_ON_ONCE(!kc || !kc->timer_set))
error = -EINVAL;
--
2.20.1


2019-02-06 08:53:49

by Cyrill Gorcunov

Subject: Re: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

On Wed, Feb 06, 2019 at 12:10:39AM +0000, Dmitry Safonov wrote:
> From: Andrei Vagin <[email protected]>
>
> Make timerfd respect timens offsets.
> Provide two helpers timens_clock_to_host() timens_clock_from_host() that
> are useful to wire up timens to different kernel subsystems.
> Following patches will use timens_clock_from_host(), added here for
> completeness.
>
> Signed-off-by: Andrei Vagin <[email protected]>
> Co-developed-by: Dmitry Safonov <[email protected]>
> Signed-off-by: Dmitry Safonov <[email protected]>
> ---
> fs/timerfd.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/fs/timerfd.c b/fs/timerfd.c
> index 803ca070d42e..c7ae1e371912 100644
> --- a/fs/timerfd.c
> +++ b/fs/timerfd.c
> @@ -26,6 +26,7 @@
> #include <linux/syscalls.h>
> #include <linux/compat.h>
> #include <linux/rcupdate.h>
> +#include <linux/time_namespace.h>
>
> struct timerfd_ctx {
> union {
> @@ -433,22 +434,27 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
> }
>
> static int do_timerfd_settime(int ufd, int flags,
> - const struct itimerspec64 *new,
> + struct itimerspec64 *new,
> struct itimerspec64 *old)
> {
> struct fd f;
> struct timerfd_ctx *ctx;
> int ret;
>
> - if ((flags & ~TFD_SETTIME_FLAGS) ||
> - !itimerspec64_valid(new))
> - return -EINVAL;

Please don't defer this early test of the @flags value. Otherwise,
if @flags is invalid, you still go through fget/put/clock-to-host even
though the result will be dropped afterwards.

2019-02-06 08:56:36

by Cyrill Gorcunov

Subject: Re: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

On Wed, Feb 06, 2019 at 11:52:03AM +0300, Cyrill Gorcunov wrote:
...
> >
> > - if ((flags & ~TFD_SETTIME_FLAGS) ||
> > - !itimerspec64_valid(new))
> > - return -EINVAL;
>
> Please don't defer this early test of a @flags value. Otherwise
> if @flags is invalid you continue fget/put/clock-to-host even
> if result will be dropped out then.

Just to clarify -- this could be done on top of the series to
not resend the whole bunch.

2019-02-07 06:38:58

by Andrei Vagin

Subject: Re: [PATCH 05/32] timerfd/timens: Take into account ns clock offsets

On Wed, Feb 06, 2019 at 11:52:03AM +0300, Cyrill Gorcunov wrote:
> On Wed, Feb 06, 2019 at 12:10:39AM +0000, Dmitry Safonov wrote:
> > From: Andrei Vagin <[email protected]>
> >
> > Make timerfd respect timens offsets.
> > Provide two helpers timens_clock_to_host() timens_clock_from_host() that
> > are useful to wire up timens to different kernel subsystems.
> > Following patches will use timens_clock_from_host(), added here for
> > completeness.
> >
> > Signed-off-by: Andrei Vagin <[email protected]>
> > Co-developed-by: Dmitry Safonov <[email protected]>
> > Signed-off-by: Dmitry Safonov <[email protected]>
> > ---
> > fs/timerfd.c | 16 +++++++++++-----
> > 1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/timerfd.c b/fs/timerfd.c
> > index 803ca070d42e..c7ae1e371912 100644
> > --- a/fs/timerfd.c
> > +++ b/fs/timerfd.c
> > @@ -26,6 +26,7 @@
> > #include <linux/syscalls.h>
> > #include <linux/compat.h>
> > #include <linux/rcupdate.h>
> > +#include <linux/time_namespace.h>
> >
> > struct timerfd_ctx {
> > union {
> > @@ -433,22 +434,27 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
> > }
> >
> > static int do_timerfd_settime(int ufd, int flags,
> > - const struct itimerspec64 *new,
> > + struct itimerspec64 *new,
> > struct itimerspec64 *old)
> > {
> > struct fd f;
> > struct timerfd_ctx *ctx;
> > int ret;
> >
> > - if ((flags & ~TFD_SETTIME_FLAGS) ||
> > - !itimerspec64_valid(new))
> > - return -EINVAL;
>
> Please don't defer this early test of a @flags value. Otherwise
> if @flags is invalid you continue fget/put/clock-to-host even
> if result will be dropped out then.

Cyrill, you are right. I moved this check together with
itimerspec64_valid(). The idea was to call itimerspec64_valid() after
applying clock offsets but for that, we need to know clockid.

Let's wait a bit for other comments on this patch set, and then we will
fix everything that has been found.

Thanks,
Andrei

2019-02-07 08:31:39

by Rasmus Villemoes

Subject: Re: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

On 06/02/2019 01.10, Dmitry Safonov wrote:
> As it has been discussed on timens RFC, adding a new conditional branch
> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> It will add a penalty for everybody as branch predictor may mispredict
> the jump. Also there are instruction cache lines wasted on cmp/jmp.
>
> Those effects of introducing time namespace are very much unwanted
> having in mind how much work have been spent on micro-optimisation
> vdso code.
>
> Addressing those problems, there are two versions of VDSO's .so:
> for host tasks (without any penalty) and for processes inside of time
> namespace with clk_to_ns() that subtracts offsets from host's time.
>
> Unfortunately, to allow changing VDSO VMA on a running process,
> the entry points to VDSO should have the same offsets (addresses).
> That's needed as i.e. application that calls setns() may have already
> resolved VDSO symbols in GOT/PLT.

These (14-19, if I'm reading them right) seem to add quite a lot of
complexity and fragility to the build, and other architectures would
probably have to add something similar to their vdso builds.

I'm wondering why not make the rule be that a timens takes effect on
next execve?

Rasmus


2019-02-07 16:14:27

by Dmitry Safonov

Subject: Re: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

Hi Rasmus,

On 2/7/19 8:31 AM, Rasmus Villemoes wrote:
> These (14-19, if I'm reading them right) seems to add quite a lot of
> complexity and fragility to the build, and other architectures would
> probably have to add something similar to their vdso builds.
>
> I'm wondering why not make the rule be that a timens takes effect on
> next execve?

I believe it would make the setns() syscall much trickier than wanted:
at this moment the only exception is pidns, which changes the ns of the
child and not of the calling process.
If exec() were required to join a timens, it might be a challenging
problem for container systems: in order to enter it, one needs to
exec("/proc/self/exe") and add some new arguments/options.
Furthermore, it seems to me that to enter a container with these
semantics, one needs to enter the timens before entering the mountns.

IOW, I believe this would move complexity from kernel build time to the
userspace ABI. And I guess it would require much more logic to
re-create a possibly nested namespace hierarchy.

Rather, I've considered using some kind of dynamic patching on vdso_init():
o static_branch - it would add some nops to the !timens vdso
o something new like static_retpoline, which would put a RET over the
call to clk_to_ns(); it shouldn't be rocket science.

But from my point of view, if something can be done at compile time
instead of patching code dynamically, then it reduces the complexity
(it depends less on what the compiler/toolchain does).

Thanks,
Dmitry

2019-02-07 21:42:01

by Thomas Gleixner

Subject: Re: [PATCH 03/32] timens: Introduce CLOCK_MONOTONIC offsets

On Wed, 6 Feb 2019, Dmitry Safonov wrote:
> #include "timekeeping.h"
> #include "posix-timers.h"
> @@ -1041,6 +1042,9 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
>
> error = kc->clock_get(which_clock, &kernel_tp);
>
> + if (!error && kc->clock_timens_adjust)
> + timens_clock_from_host(which_clock, &kernel_tp);

Why are you adding this conditional instead of sticking the offset
magic into the affected ->clock_get() implementations?

That spares you the switch() and the !offsets conditional.

> +static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
> +{
> + struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
> + struct timespec64 *offsets = NULL;
> +
> + if (!ns_offsets)
> + return;
> +
> + if (val->tv_sec == 0 && val->tv_nsec == 0)
> + return;

I have no idea why 0/0 is special.

> +
> + switch (clockid) {
> + case CLOCK_MONOTONIC:
> + case CLOCK_MONOTONIC_RAW:
> + case CLOCK_MONOTONIC_COARSE:
> + offsets = &ns_offsets->monotonic_time_offset;
> + break;
> + }
> +
> + if (!offsets)
> + return;
> +
> + if (to_ns)
> + *val = timespec64_add(*val, *offsets);
> + else
> + *val = timespec64_sub(*val, *offsets);
> +}
> +
> +void timens_clock_to_host(int clockid, struct timespec64 *val)

Does this really need to be an out of line call? If you stick this into the
clock_get() implementations then it boils down to:

static inline void timens_add_monotonic(struct timespec64 *ts)
{
struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;

if (ns_offsets)
*ts = timespec64_add(*ts, ns_offsets->monotonic_time_offset);
}

and

static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_ts64(tp);
timens_add_monotonic(tp);
return 0;
}

Hmm?

Thanks,

tglx


2019-02-08 07:57:08

by Thomas Gleixner

Subject: Re: [PATCH 07/32] timens/kernel: Take into account timens clock offsets in clock_nanosleep

On Wed, 6 Feb 2019, Dmitry Safonov wrote:
>
> @@ -1721,9 +1722,16 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
> {
> struct restart_block *restart;
> struct hrtimer_sleeper t;
> + struct timespec64 tp;
> int ret = 0;
> u64 slack;
>
> + if (!(mode & HRTIMER_MODE_REL)) {
> + tp = *rqtp;
> + rqtp = &tp;

So every invocation of hrtimer_nanosleep() gains a copy of the timespec64
even if the namespace muck is disabled.

The only relevant caller of this is common_nsleep(). So it might make sense
to have common_nsleep() separated for CLOCK_MONOTONIC/BOOTTIME and handle
the thing there. That again avoids the switch() and the out of line calls.

Thanks,

tglx



2019-02-08 09:04:21

by Andrei Vagin

Subject: Re: [PATCH 03/32] timens: Introduce CLOCK_MONOTONIC offsets

On Thu, Feb 07, 2019 at 10:40:55PM +0100, Thomas Gleixner wrote:
> On Wed, 6 Feb 2019, Dmitry Safonov wrote:
> > #include "timekeeping.h"
> > #include "posix-timers.h"
> > @@ -1041,6 +1042,9 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
> >
> > error = kc->clock_get(which_clock, &kernel_tp);
> >
> > + if (!error && kc->clock_timens_adjust)
> > + timens_clock_from_host(which_clock, &kernel_tp);
>
> Why are you adding this conditional instead of sticking the offset
> magic into the affected ->clock_get() implementations?
>
> That spares you the switch() and the !offsets conditional.
>
> > +static void clock_timens_fixup(int clockid, struct timespec64 *val, bool to_ns)
> > +{
> > + struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
> > + struct timespec64 *offsets = NULL;
> > +
> > + if (!ns_offsets)
> > + return;
> > +
> > + if (val->tv_sec == 0 && val->tv_nsec == 0)
> > + return;
>
> I have no idea why 0/0 is special.

Initially this function was introduced to apply timens offsets in
do_timer_settime, and there 0/0 is special: it means that the timer
should be disarmed.

Now this function is used in many other places and this check definitely
should not be here.


>
> > +
> > + switch (clockid) {
> > + case CLOCK_MONOTONIC:
> > + case CLOCK_MONOTONIC_RAW:
> > + case CLOCK_MONOTONIC_COARSE:
> > + offsets = &ns_offsets->monotonic_time_offset;
> > + break;
> > + }
> > +
> > + if (!offsets)
> > + return;
> > +
> > + if (to_ns)
> > + *val = timespec64_add(*val, *offsets);
> > + else
> > + *val = timespec64_sub(*val, *offsets);
> > +}
> > +
> > +void timens_clock_to_host(int clockid, struct timespec64 *val)
>
> Does this really need to be an out of line call?

The idea was to collect all the logic about timens offsets in one place.

clock_timens_fixup() is used in all places where we need to apply timens
offsets (clock_gettime, posix timers, clock_nanosleep, timerfd, uptime_proc_show).


> If you stick this into the
> clock_get() implementations then it boils down to:

clock_get() is called from clock_gettime and from common_timer_get(). In
common_timer_get(), we expect to get time in the root time namespace.

But I think we can handle this. For example, we can introduce a new flag,
CLOCK_TIMENS, so that

kc->clock_get(which_clock | CLOCK_TIMENS, &tp) will return time in the
current time namespace, and

kc->clock_get(which_clock, &tp) will return time in the root time
namespace.
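
As a rough illustration of the flag idea above, here is a userspace
model only. CLOCK_TIMENS_SKETCH and its value, struct ts_sketch, and
the fixed clock reading and offset are all hypothetical and chosen for
the example; the offset is assumed to be non-negative to keep the
normalization short.

```c
#include <stdint.h>

/* Hypothetical flag bit, picked to stay clear of the small clock ids. */
#define CLOCK_TIMENS_SKETCH 0x100

struct ts_sketch {
	int64_t sec;
	long nsec;
};

/* Stand-ins for the host clock reading and the namespace offset. */
static const struct ts_sketch host_monotonic = { 1000, 500000000L };
static const struct ts_sketch ns_offset = { 86400, 700000000L };

/*
 * Without the flag the caller gets the root-namespace time; with the
 * flag the offset of the current time namespace is added.
 */
static int clock_get_sketch(int which_clock, struct ts_sketch *tp)
{
	*tp = host_monotonic;

	if (!(which_clock & CLOCK_TIMENS_SKETCH))
		return 0;

	tp->sec += ns_offset.sec;
	tp->nsec += ns_offset.nsec;
	if (tp->nsec >= 1000000000L) {
		tp->nsec -= 1000000000L;
		tp->sec++;
	}
	return 0;
}
```

In this model a clock_gettime() style caller would pass the flag, while
a caller like common_timer_get() would not.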

>
> static inline void timens_add_monotonic(struct timespec64 *ts)
> {
> struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
>
> if (ns_offsets)
> *ts = timespec64_add(*ts, ns_offsets->monotonic_time_offset);
> }
>
> and
>
> static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
> {
> ktime_get_ts64(tp);
> timens_add_monotonic(tp);
> return 0;
> }
>
> Hmm?

Yes, we can do this. I like this idea. This will allow us to remove
timens_clock_from_host(), but I am not sure that we will be able to do
something similar with timens_clock_to_host(), which is used to apply
offsets for timers. I need to look at the timer code again.


Thanks,
Andrei

>
> Thanks,
>
> tglx
>

2019-02-08 09:48:10

by Thomas Gleixner

Subject: Re: [PATCH 03/32] timens: Introduce CLOCK_MONOTONIC offsets

On Thu, 7 Feb 2019, Thomas Gleixner wrote:
> Does this really need to be an out of line call? If you stick this into the
> clock_get() implementations then it boils down to:
>
> static inline void timens_add_monotonic(struct timespec64 *ts)
> {
> struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
>
> if (ns_offsets)
> *ts = timespec64_add(*ts, ns_offsets->monotonic_time_offset);

And this needs to be a special variant of
timespec64_add_safe(). timespec64_add_safe() itself is not sufficient,
because it assumes that both values are positive, which is not the case here.

In timer_set() implementations you move the timespec_valid() check after
the add. That's wrong because you really want to check the input value from
user space.

Assume that the caller supplied value is valid and the adjustment brings it
out of range; then how should the caller understand why it is rejected?

So timespec64_add_namespace() must check for under and overflow. But doing
this with timespecs is a pain. I rather suggest to rework the whole thing
so hrtimer_nanosleep() takes a ktime_t expiry value and move the conversion
to the call sites. Then the whole offset magic becomes:

expires = timespec64_to_ktime(rqtp);

if (abstime)
expires = timens_to_host_mono(expires);

and that function can nicely do the underflow and overflow detection and
cap the values to 0 on underflow and KTIME_MAX on overflow.

Hmm?

Thanks,

tglx
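
The ktime_t based conversion suggested in this message can be sketched
in plain C. This is a userspace model under stated assumptions:
ktime_sketch_t stands in for ktime_t (signed 64-bit nanoseconds),
KTIME_SKETCH_MAX for KTIME_MAX, all function names are hypothetical,
and namespace time is assumed to be host time plus the offset. The
user-supplied value is validated before any offset is applied, and the
converted expiry is clamped instead of being rejected.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000LL

typedef int64_t ktime_sketch_t;
#define KTIME_SKETCH_MAX INT64_MAX

/* Check the caller's value itself, before any namespace adjustment. */
static int timespec_valid_sketch(int64_t sec, long nsec)
{
	return sec >= 0 && nsec >= 0 && nsec < NSEC_PER_SEC_SKETCH;
}

/* Convert a validated timespec to nanoseconds, saturating at the maximum. */
static ktime_sketch_t timespec_to_ktime_sketch(int64_t sec, long nsec)
{
	if (sec >= KTIME_SKETCH_MAX / NSEC_PER_SEC_SKETCH)
		return KTIME_SKETCH_MAX;
	return sec * NSEC_PER_SEC_SKETCH + nsec;
}

/*
 * Translate an absolute expiry from namespace time to host time:
 * host = expires - offset, capped to 0 on underflow and to
 * KTIME_SKETCH_MAX on overflow.
 */
static ktime_sketch_t timens_to_host_mono_sketch(ktime_sketch_t expires,
						 ktime_sketch_t offset)
{
	ktime_sketch_t host;

	/* Negative offset: the subtraction adds, so guard the upper bound. */
	if (offset < 0 && expires > KTIME_SKETCH_MAX + offset)
		return KTIME_SKETCH_MAX;

	/* Positive offset larger than the expiry: the result would be < 0. */
	if (offset > 0 && expires < offset)
		return 0;

	host = expires - offset;
	return host < 0 ? 0 : host;
}
```

With this shape an invalid timespec from user space is still reported
as invalid, while a valid one that lands out of range after the
adjustment simply expires immediately or never.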

2019-02-08 09:58:49

by Thomas Gleixner

Subject: Re: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

On Thu, 7 Feb 2019, Rasmus Villemoes wrote:

Cc: + Vincenzo, Will

> On 06/02/2019 01.10, Dmitry Safonov wrote:
> > As it has been discussed on timens RFC, adding a new conditional branch
> > `if (inside_time_ns)` on VDSO for all processes is undesirable.
> > It will add a penalty for everybody as branch predictor may mispredict
> > the jump. Also there are instruction cache lines wasted on cmp/jmp.
> >
> > Those effects of introducing time namespace are very much unwanted
> > having in mind how much work have been spent on micro-optimisation
> > vdso code.
> >
> > Addressing those problems, there are two versions of VDSO's .so:
> > for host tasks (without any penalty) and for processes inside of time
> > namespace with clk_to_ns() that subtracts offsets from host's time.
> >
> > Unfortunately, to allow changing VDSO VMA on a running process,
> > the entry points to VDSO should have the same offsets (addresses).
> > That's needed as i.e. application that calls setns() may have already
> > resolved VDSO symbols in GOT/PLT.
>
> These (14-19, if I'm reading them right) seems to add quite a lot of
> complexity and fragility to the build, and other architectures would
> probably have to add something similar to their vdso builds.

Yes and we really want to avoid that. The VDSO implementations are
pointlessly different across the architectures and there is an effort
under way to consolidate them:

https://lkml.kernel.org/r/[email protected]

I talked to Vincenzo earlier this week and he's working on a new version of
that. The timens stuff wants to go on top of the consolidation otherwise we
end up with another set of pointlessly different and differently broken
VDSO variants.

Thanks,

tglx

2019-02-08 15:20:11

by Dmitry Safonov

Subject: Re: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

On 2/8/19 9:57 AM, Thomas Gleixner wrote:
> On Thu, 7 Feb 2019, Rasmus Villemoes wrote:
>
> Cc: + Vincenzo, Will
>
>> On 06/02/2019 01.10, Dmitry Safonov wrote:
>>> As it has been discussed on timens RFC, adding a new conditional branch
>>> `if (inside_time_ns)` on VDSO for all processes is undesirable.
>>> It will add a penalty for everybody as branch predictor may mispredict
>>> the jump. Also there are instruction cache lines wasted on cmp/jmp.
>>>
>>> Those effects of introducing time namespace are very much unwanted
>>> having in mind how much work have been spent on micro-optimisation
>>> vdso code.
>>>
>>> Addressing those problems, there are two versions of VDSO's .so:
>>> for host tasks (without any penalty) and for processes inside of time
>>> namespace with clk_to_ns() that subtracts offsets from host's time.
>>>
>>> Unfortunately, to allow changing VDSO VMA on a running process,
>>> the entry points to VDSO should have the same offsets (addresses).
>>> That's needed as i.e. application that calls setns() may have already
>>> resolved VDSO symbols in GOT/PLT.
>>
>> These (14-19, if I'm reading them right) seems to add quite a lot of
>> complexity and fragility to the build, and other architectures would
>> probably have to add something similar to their vdso builds.
>
> Yes and we really want to avoid that. The VDSO implementations are
> pointlessly different accross the architectures and there is effort on the
> way to consolidate them:
>
> https://lkml.kernel.org/r/[email protected]
>
> I talked to Vincenzo earlier this week and he's working on a new version of
> that. The timens stuff wants to go on top of the consolidation otherwise we
> end up with another set of pointlessly different and differently broken
> VDSO variants.

That looks awesome!
I've missed the thread about it, will catch up on the details.

Thanks much,
Dmitry

2019-03-27 18:00:59

by Andrei Vagin

Subject: Re: [PATCH 16/32] x86/vdso: Generate vdso{,32}-timens.lds

While the generic vdso patchset is in development, we decided to think
about other ways of generating two vdso libraries. In this
patchset, we use a linker script, but it looks too complicated, so we
decided to look at other options. Another obvious approach is the code
patching technique. The main idea was to reduce the amount of
arch-dependent code, and Dmitry came up with the idea of three labels.
Let’s look at this pseudo-code:

int vdso_clock_gettime(clockid_t clk, struct timespec *ts)
{
...
l_call:
clk_to_ns(clk, ts);
l_return:
return 0;
annotate_reachable();
l_out:
nop();
return 0;
}

Here we can see three labels. Without patching this code, the function
will apply timens offsets. But if we copy the code between the last two
labels to the first label, we will get a version which skips timens
offsets. The patch which implements this idea will be in replies to this
email. It was tested on x86_64 with gcc as the compiler, but we
suspect that there might be some issues on other architectures or with
other compilers. So we would like to ask the community for help in
understanding what we have to do to be sure that this code always works
correctly.

The second patch implements static_branch for the vdso code.
Here are only a few lines of arch-dependent code:

+static __always_inline bool timens_static_branch(void)
+{
+ asm_volatile_goto("1:\n\t"
+ ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
+ ".pushsection __retcall_table, \"aw\"\n\t"
+ "2: .word 1b - 2b, %l[l_yes] - 2b\n\t"
+ ".popsection\n\t"
+ : : : : l_yes);
+
+ return false;
+l_yes:
+ return true;
+}

This is a slightly modified version of the arch_static_branch()
function. The timens code in vdso looks like this:

if (timens_static_branch()) {
clk_to_ns(clk, ts);
}

The version of vdso which is compiled from sources will never execute
clk_to_ns(). And then we can patch the 'no-op' in the straight-line
codepath with a 'jump' instruction to the out-of-line true branch and
get the timens version of the vdso library.

Now we can compare these three versions. Our opinion is that the version
with three labels looks cleaner, and if it works with all compilers
on all architectures, we should probably choose it. Otherwise, we would
prefer the version with static_branches, because it is simpler than the
version with the linker script.

Thanks,
Andrei

On Fri, Feb 08, 2019 at 10:57:57AM +0100, Thomas Gleixner wrote:
> On Thu, 7 Feb 2019, Rasmus Villemoes wrote:
>
> Cc: + Vincenzo, Will
>
> > On 06/02/2019 01.10, Dmitry Safonov wrote:
> > > As it has been discussed on timens RFC, adding a new conditional branch
> > > `if (inside_time_ns)` on VDSO for all processes is undesirable.
> > > It will add a penalty for everybody as branch predictor may mispredict
> > > the jump. Also there are instruction cache lines wasted on cmp/jmp.
> > >
> > > Those effects of introducing time namespace are very much unwanted
> > > having in mind how much work have been spent on micro-optimisation
> > > vdso code.
> > >
> > > Addressing those problems, there are two versions of VDSO's .so:
> > > for host tasks (without any penalty) and for processes inside of time
> > > namespace with clk_to_ns() that subtracts offsets from host's time.
> > >
> > > Unfortunately, to allow changing VDSO VMA on a running process,
> > > the entry points to VDSO should have the same offsets (addresses).
> > > That's needed as i.e. application that calls setns() may have already
> > > resolved VDSO symbols in GOT/PLT.
> >
> > These (14-19, if I'm reading them right) seems to add quite a lot of
> > complexity and fragility to the build, and other architectures would
> > probably have to add something similar to their vdso builds.
>
> Yes and we really want to avoid that. The VDSO implementations are
> pointlessly different accross the architectures and there is effort on the
> way to consolidate them:
>
> https://lkml.kernel.org/r/[email protected]
>
> I talked to Vincenzo earlier this week and he's working on a new version of
> that. The timens stuff wants to go on top of the consolidation otherwise we
> end up with another set of pointlessly different and differently broken
> VDSO variants.
>
> Thanks,
>
> tglx

2019-03-27 19:23:11

by Andrei Vagin

Subject: [PATCH RFC] x86/asm: Introduce static_retcall(s)

From: Dmitry Safonov <[email protected]>

Provide a framework to overwrite a tail call in a function with a return.

XXX: split vdso/generic part

Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 19 ++++++----
arch/x86/entry/vdso/vdso-layout.lds.S | 1 +
arch/x86/entry/vdso/vdso2c.h | 11 +++++-
arch/x86/entry/vdso/vma.c | 22 +++++++++++
arch/x86/include/asm/static_retcall.h | 54 +++++++++++++++++++++++++++
arch/x86/include/asm/vdso.h | 1 +
6 files changed, 99 insertions(+), 9 deletions(-)
create mode 100644 arch/x86/include/asm/static_retcall.h

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index cb55bd994497..9416f1ee6b73 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -18,6 +18,7 @@
#include <asm/msr.h>
#include <asm/pvclock.h>
#include <asm/mshyperv.h>
+#include <asm/static_retcall.h>
#include <linux/math64.h>
#include <linux/time.h>
#include <linux/kernel.h>
@@ -39,7 +40,7 @@ extern u8 hvclock_page
__attribute__((visibility("hidden")));
#endif

-#ifdef BUILD_VDSO_TIME_NS
+#ifdef CONFIG_TIME_NS
extern u8 timens_page
__attribute__((visibility("hidden")));
#endif
@@ -145,9 +146,9 @@ notrace static inline u64 vgetcyc(int mode)
return U64_MAX;
}

+#ifdef CONFIG_TIME_NS
notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts)
{
-#ifdef BUILD_VDSO_TIME_NS
struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
struct timespec64 *offset64;

@@ -173,9 +174,13 @@ notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts
ts->tv_nsec += NSEC_PER_SEC;
ts->tv_sec--;
}
-
-#endif
}
+#define _static_retcall static_retcall
+#define _static_retcall_int static_retcall_int
+#else
+#define _static_retcall(...)
+#define _static_retcall_int(...)
+#endif

notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
@@ -203,9 +208,7 @@ notrace static int do_hres(clockid_t clk, struct timespec *ts)
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;

- clk_to_ns(clk, ts);
-
- return 0;
+ _static_retcall_int(0, clk_to_ns, clk, ts);
}

notrace static void do_coarse(clockid_t clk, struct timespec *ts)
@@ -219,7 +222,7 @@ notrace static void do_coarse(clockid_t clk, struct timespec *ts)
ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));

- clk_to_ns(clk, ts);
+ _static_retcall(clk_to_ns, clk, ts);
}

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index ba216527e59f..075cae6f33bf 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -45,6 +45,7 @@ SECTIONS
.gnu.version : { *(.gnu.version) }
.gnu.version_d : { *(.gnu.version_d) }
.gnu.version_r : { *(.gnu.version_r) }
+ __retcall_table : { *(__retcall_table) } :text

.dynamic : { *(.dynamic) } :text :dynamic

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 660f725a02c1..ae91567fd567 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -16,7 +16,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned int i, syms_nr;
unsigned long j;
ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
- *alt_sec = NULL;
+ *alt_sec = NULL, *retcall_sec = NULL;
ELF(Dyn) *dyn = 0, *dyn_end = 0;
const char *secstrings;
INT_BITS syms[NSYMS] = {};
@@ -78,6 +78,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (!strcmp(secstrings + GET_LE(&sh->sh_name),
".altinstructions"))
alt_sec = sh;
+ if (!strcmp(secstrings + GET_LE(&sh->sh_name),
+ "__retcall_table"))
+ retcall_sec = sh;
}

if (!symtab_hdr)
@@ -165,6 +168,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
fprintf(outfile, "\t.alt_len = %lu,\n",
(unsigned long)GET_LE(&alt_sec->sh_size));
}
+ if (retcall_sec) {
+ fprintf(outfile, "\t.retcall = %lu,\n",
+ (unsigned long)GET_LE(&retcall_sec->sh_offset));
+ fprintf(outfile, "\t.retcall_len = %lu,\n",
+ (unsigned long)GET_LE(&retcall_sec->sh_size));
+ }
for (i = 0; i < NSYMS; i++) {
if (required_syms[i].export && syms[i])
fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n",
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 0b8d9f6f0ce3..b4ea7a2ebfed 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
#include <asm/cpufeature.h>
#include <asm/mshyperv.h>
#include <asm/page.h>
+#include <asm/static_retcall.h>
#include <asm/tlb.h>

#if defined(CONFIG_X86_64)
@@ -38,6 +39,25 @@ static __init int vdso_setup(char *s)
__setup("vdso=", vdso_setup);
#endif

+static __init int apply_retcalls(struct retcall_entry *ent, unsigned long nr)
+{
+	while (nr--) {
+		void *call_addr = (void *)ent + ent->call;
+		void *ret_addr = (void *)ent + ent->ret;
+		size_t ret_sz = ent->out - ent->ret;
+
+		if (WARN_ON(ret_sz > PAGE_SIZE))
+			goto next;
+
+		memcpy(call_addr, ret_addr, ret_sz);
+
+next:
+		ent++;
+	}
+
+	return 0;
+}
+
void __init init_vdso_image(struct vdso_image *image)
{
BUG_ON(image->size % PAGE_SIZE != 0);
@@ -51,6 +71,8 @@ void __init init_vdso_image(struct vdso_image *image)
return;

memcpy(image->text_timens, image->text, image->size);
+ apply_retcalls((struct retcall_entry *)(image->text + image->retcall),
+ image->retcall_len / sizeof(struct retcall_entry));
#endif
}

diff --git a/arch/x86/include/asm/static_retcall.h b/arch/x86/include/asm/static_retcall.h
new file mode 100644
index 000000000000..fdb13795b74d
--- /dev/null
+++ b/arch/x86/include/asm/static_retcall.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Dmitry Safonov, Andrey Vagin
+ */
+
+#ifndef _ASM_X86_STATIC_RETCALL_H
+#define _ASM_X86_STATIC_RETCALL_H
+
+struct retcall_entry {
+	u16 call;
+	u16 ret;
+	u16 out;
+};
+
+#define static_retcall(func, ...) \
+	do { \
+		asm_volatile_goto( \
+			".pushsection __retcall_table, \"aw\" \n\t" \
+			"2: .word %l[l_call] - 2b\n\t" \
+			".word %l[l_return] - 2b\n\t" \
+			".word %l[l_out] - 2b\n\t" \
+			".popsection" \
+			: : : : l_call, l_return, l_out); \
+l_call: \
+		func(__VA_ARGS__); \
+l_return: \
+		return; \
+		annotate_reachable(); \
+l_out: \
+		nop(); \
+		return; \
+	} while(0)
+
+#define static_retcall_int(ret, func, ...) \
+	do { \
+		asm_volatile_goto( \
+			".pushsection __retcall_table, \"aw\" \n\t" \
+			_ASM_ALIGN "\n\t" \
+			"2: .word %l[l_call] - 2b\n\t" \
+			".word %l[l_return] - 2b\n\t" \
+			".word %l[l_out] - 2b\n\t" \
+			".popsection" \
+			: : : : l_call, l_return, l_out); \
+l_call: \
+		func(__VA_ARGS__); \
+l_return: \
+		return ret; \
+		annotate_reachable(); \
+l_out: \
+		nop(); \
+		return ret; \
+	} while(0)
+
+#endif
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 583133446874..acdf70bf814b 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -16,6 +16,7 @@ struct vdso_image {
unsigned long size; /* Always a multiple of PAGE_SIZE */

unsigned long alt, alt_len;
+ unsigned long retcall, retcall_len;

long sym_vvar_start; /* Negative offset to the vvar area */

--
2.20.1
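
The table walk in apply_retcalls() above can be tried out in user space. What follows is only a minimal sketch under stated assumptions, not the kernel code: the "image" is a plain byte buffer, apply_retcalls_sketch() is a hypothetical name, and the bounds checks are additions for the sake of a safe demo. As in the patch, each entry stores offsets relative to the entry itself, and the bytes between l_return and l_out (the return sequence) are copied over the call site.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct retcall_entry in the patch: three 16-bit
 * offsets, each relative to the address of the entry itself. */
struct retcall_entry {
	uint16_t call;	/* start of the call sequence */
	uint16_t ret;	/* start of the return sequence */
	uint16_t out;	/* first byte after the return sequence */
};

/*
 * Hypothetical user-space stand-in for apply_retcalls(): for every entry,
 * copy the return sequence over the beginning of the call sequence, so the
 * patched code returns before the call is ever made.
 * Returns 0 on success, -1 if an entry points outside the image.
 */
static int apply_retcalls_sketch(uint8_t *image, size_t image_len,
				 size_t table_off, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		size_t ent_off = table_off + i * sizeof(struct retcall_entry);
		struct retcall_entry ent;
		size_t ret_sz;

		if (ent_off + sizeof(ent) > image_len)
			return -1;
		memcpy(&ent, image + ent_off, sizeof(ent));

		if (ent.out < ent.ret)
			return -1;
		ret_sz = (size_t)(ent.out - ent.ret);

		if (ent_off + ent.out > image_len ||
		    ent_off + ent.call + ret_sz > image_len)
			return -1;

		/* memmove: the two ranges live in the same buffer */
		memmove(image + ent_off + ent.call,
			image + ent_off + ent.ret, ret_sz);
	}

	return 0;
}
```

With an entry { .call = 8, .ret = 13, .out = 14 } placed at offset 0 and a one-byte return instruction at offset 13, the byte at offset 8 ends up holding that return instruction; the rest of the old call sequence stays in place but is no longer reached.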


2019-03-27 19:23:20

by Andrei Vagin

Subject: [PATCH RFC] vdso: introduce timens_static_branch

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.

To address this, there are two versions of the VDSO's .so: one for host
tasks (without any penalty) and one for processes inside a time
namespace, with clk_to_ns() that applies the namespace offsets to the
host's time.

This patch introduces timens_static_branch(), which is similar to
static_branch_unlikely().

The timens code in vdso looks like this:

	if (timens_static_branch()) {
		clk_to_ns(clk, ts);
	}

The vdso as compiled from source never executes clk_to_ns(). We can
then patch the 'no-op' in the straight-line code path with a 'jump'
instruction to the out-of-line true branch, which gives the timens
version of the vdso library.

Cc: Dmitry Safonov <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 21 ++++++++++++++-------
arch/x86/entry/vdso/vdso-layout.lds.S | 1 +
arch/x86/entry/vdso/vdso2c.h | 11 ++++++++++-
arch/x86/entry/vdso/vma.c | 19 +++++++++++++++++++
arch/x86/include/asm/jump_label.h | 14 ++++++++++++++
arch/x86/include/asm/vdso.h | 1 +
include/linux/jump_label.h | 5 +++++
7 files changed, 64 insertions(+), 8 deletions(-)
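
The patching step described in the changelog amounts to overwriting a 5-byte NOP with `JMP rel32` (opcode 0xe9). A minimal user-space sketch of that encoding, assuming a 5-byte NOP slot; the helper names jmp_rel32_disp() and patch_nop_with_jmp() are hypothetical and do not appear in the patch. The displacement is counted from the end of the jump instruction and is written as exactly four little-endian bytes, since `JMP rel32` takes a 32-bit operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed slot size: the 5-byte NOP emitted at the branch site. */
#define NOP5_SIZE 5

/* Displacement for JMP rel32: target minus the address of the next
 * instruction, i.e. the end of the 5-byte slot. */
static int32_t jmp_rel32_disp(size_t nop_off, size_t target_off)
{
	return (int32_t)((int64_t)target_off - (int64_t)(nop_off + NOP5_SIZE));
}

/* Replace the NOP at nop_off with a jump to target_off. */
static void patch_nop_with_jmp(uint8_t *text, size_t nop_off,
			       size_t target_off)
{
	uint32_t disp = (uint32_t)jmp_rel32_disp(nop_off, target_off);

	text[nop_off] = 0xe9;			/* JMP rel32 */
	text[nop_off + 1] = disp & 0xff;	/* little-endian, 4 bytes only */
	text[nop_off + 2] = (disp >> 8) & 0xff;
	text[nop_off + 3] = (disp >> 16) & 0xff;
	text[nop_off + 4] = (disp >> 24) & 0xff;
}
```

For a NOP at offset 2 and a target at offset 12 the displacement is 12 - (2 + 5) = 5, so the slot becomes e9 05 00 00 00 and the byte following the slot is left untouched.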

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index cb55bd994497..74de42f1f7d8 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -18,6 +18,7 @@
#include <asm/msr.h>
#include <asm/pvclock.h>
#include <asm/mshyperv.h>
+#include <asm/jump_label.h>
#include <linux/math64.h>
#include <linux/time.h>
#include <linux/kernel.h>
@@ -39,7 +40,7 @@ extern u8 hvclock_page
__attribute__((visibility("hidden")));
#endif

-#ifdef BUILD_VDSO_TIME_NS
+#ifdef CONFIG_TIME_NS
extern u8 timens_page
__attribute__((visibility("hidden")));
#endif
@@ -145,9 +146,9 @@ notrace static inline u64 vgetcyc(int mode)
return U64_MAX;
}

+#ifdef CONFIG_TIME_NS
notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts)
{
-#ifdef BUILD_VDSO_TIME_NS
struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
struct timespec64 *offset64;

@@ -173,9 +174,12 @@ notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts
ts->tv_nsec += NSEC_PER_SEC;
ts->tv_sec--;
}
-
-#endif
}
+#define _timens_static_branch_unlikely timens_static_branch_unlikely
+#else
+notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec *ts) {}
+notrace static __always_inline bool _timens_static_branch_unlikely(void) { return false; }
+#endif

notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
@@ -203,8 +207,9 @@ notrace static int do_hres(clockid_t clk, struct timespec *ts)
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;

- clk_to_ns(clk, ts);
-
+ if (_timens_static_branch_unlikely()) {
+ clk_to_ns(clk, ts);
+ }
return 0;
}

@@ -219,7 +224,9 @@ notrace static void do_coarse(clockid_t clk, struct timespec *ts)
ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));

- clk_to_ns(clk, ts);
+ if (_timens_static_branch_unlikely()) {
+ clk_to_ns(clk, ts);
+ }
}

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index ba216527e59f..69dbe4821aa5 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -45,6 +45,7 @@ SECTIONS
.gnu.version : { *(.gnu.version) }
.gnu.version_d : { *(.gnu.version_d) }
.gnu.version_r : { *(.gnu.version_r) }
+ __jump_table : { *(__jump_table) } :text

.dynamic : { *(.dynamic) } :text :dynamic

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 660f725a02c1..e4eef5e1c6ac 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -16,7 +16,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned int i, syms_nr;
unsigned long j;
ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
- *alt_sec = NULL;
+ *alt_sec = NULL, *jump_table_sec = NULL;
ELF(Dyn) *dyn = 0, *dyn_end = 0;
const char *secstrings;
INT_BITS syms[NSYMS] = {};
@@ -78,6 +78,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (!strcmp(secstrings + GET_LE(&sh->sh_name),
".altinstructions"))
alt_sec = sh;
+ if (!strcmp(secstrings + GET_LE(&sh->sh_name),
+ "__jump_table"))
+ jump_table_sec = sh;
}

if (!symtab_hdr)
@@ -165,6 +168,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
fprintf(outfile, "\t.alt_len = %lu,\n",
(unsigned long)GET_LE(&alt_sec->sh_size));
}
+ if (jump_table_sec) {
+ fprintf(outfile, "\t.jump_table = %lu,\n",
+ (unsigned long)GET_LE(&jump_table_sec->sh_offset));
+ fprintf(outfile, "\t.jump_table_len = %lu,\n",
+ (unsigned long)GET_LE(&jump_table_sec->sh_size));
+ }
for (i = 0; i < NSYMS; i++) {
if (required_syms[i].export && syms[i])
fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n",
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 0b8d9f6f0ce3..5c0e6491aefb 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -15,6 +15,7 @@
#include <linux/cpu.h>
#include <linux/ptrace.h>
#include <linux/time_namespace.h>
+#include <linux/jump_label.h>
#include <asm/pvclock.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
@@ -38,6 +39,22 @@ static __init int vdso_setup(char *s)
__setup("vdso=", vdso_setup);
#endif

+#ifdef CONFIG_TIME_NS
+static __init int apply_jump_tables(struct vdso_jump_entry *ent, unsigned long nr)
+{
+	while (nr--) {
+		void *code_addr = (void *)ent + ent->code;
+		long target_addr = (long) ent->target - (ent->code + JUMP_LABEL_NOP_SIZE);
+		((char *)code_addr)[0] = 0xe9; /* JMP rel32 */
+		*((long *)(code_addr + 1)) = (long)target_addr;
+
+		ent++;
+	}
+
+	return 0;
+}
+#endif
+
void __init init_vdso_image(struct vdso_image *image)
{
BUG_ON(image->size % PAGE_SIZE != 0);
@@ -51,6 +68,8 @@ void __init init_vdso_image(struct vdso_image *image)
return;

memcpy(image->text_timens, image->text, image->size);
+ apply_jump_tables((struct vdso_jump_entry *)(image->text_timens + image->jump_table),
+ image->jump_table_len / sizeof(struct vdso_jump_entry));
#endif
}

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..1784aa49cc82 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -51,6 +51,20 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
return true;
}

+static __always_inline bool timens_static_branch_unlikely(void)
+{
+	asm_volatile_goto("1:\n\t"
+		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
+		".pushsection __jump_table, \"aw\"\n\t"
+		"2: .word 1b - 2b, %l[l_yes] - 2b\n\t"
+		".popsection\n\t"
+		: : : : l_yes);
+
+	return false;
+l_yes:
+	return true;
+}
+
#else /* __ASSEMBLY__ */

.macro STATIC_JUMP_IF_TRUE target, key, def
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 583133446874..883151c3a032 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -16,6 +16,7 @@ struct vdso_image {
unsigned long size; /* Always a multiple of PAGE_SIZE */

unsigned long alt, alt_len;
+ unsigned long jump_table, jump_table_len;

long sym_vvar_start; /* Negative offset to the vvar area */

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3e113a1fa0f1..69854a05d2f2 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -125,6 +125,11 @@ struct jump_entry {
long key; // key may be far away from the core kernel under KASLR
};

+struct vdso_jump_entry {
+	u16 code;
+	u16 target;
+};
+
static inline unsigned long jump_entry_code(const struct jump_entry *entry)
{
return (unsigned long)&entry->code + entry->code;
--
2.20.1
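
The overall flow in init_vdso_image() (copy the image, then patch only the timens copy from its jump table) can be illustrated in user space. This is a sketch under assumptions rather than the kernel implementation: the image is a plain byte buffer, apply_jump_tables_sketch() is a hypothetical name, and the 5-byte slot size is assumed. Entry fields are offsets relative to the entry itself, mirroring struct vdso_jump_entry, and the displacement is stored as four bytes only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vdso_jump_entry in the patch: both offsets are
 * relative to the address of the entry itself. */
struct vdso_jump_entry {
	uint16_t code;		/* location of the NOP slot */
	uint16_t target;	/* location of the out-of-line branch */
};

/* Assumed size of the NOP slot at each branch site. */
#define NOP_SLOT_SIZE 5

/*
 * Hypothetical user-space stand-in for apply_jump_tables(): turn every
 * NOP slot listed in the table into "JMP rel32" to its target.
 * No bounds checking: the caller must pass a table that is valid for
 * the given image.
 */
static void apply_jump_tables_sketch(uint8_t *image, size_t table_off,
				     size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		size_t ent_off = table_off + i * sizeof(struct vdso_jump_entry);
		struct vdso_jump_entry ent;
		uint8_t *slot;
		uint32_t disp;

		memcpy(&ent, image + ent_off, sizeof(ent));

		/* rel32 is counted from the end of the jump instruction */
		disp = (uint32_t)((int32_t)ent.target -
				  (int32_t)(ent.code + NOP_SLOT_SIZE));

		slot = image + ent_off + ent.code;
		slot[0] = 0xe9;			/* JMP rel32 */
		slot[1] = disp & 0xff;		/* four displacement bytes */
		slot[2] = (disp >> 8) & 0xff;
		slot[3] = (disp >> 16) & 0xff;
		slot[4] = (disp >> 24) & 0xff;
	}
}
```

Usage follows the patch: memcpy() the host image into a second buffer and call apply_jump_tables_sketch() on the copy only, so the host image keeps its NOPs and never takes the clk_to_ns() branch.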