2014-12-08 03:03:54

by Shaohua Li

Subject: [PATCH 1/3] X86: make VDSO data support multiple pages

Currently the vdso data occupies a single page. Subsequent patches will add
per-cpu data to the vdso, which requires several pages when the CPU count is
large. This patch makes the VDSO data area support multiple pages.

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/include/asm/vvar.h | 6 +++++-
arch/x86/kernel/asm-offsets.c | 5 +++++
arch/x86/kernel/vmlinux.lds.S | 4 +---
arch/x86/vdso/vdso-layout.lds.S | 5 +++--
arch/x86/vdso/vma.c | 3 ++-
5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 5d2b9ad..fcbe621 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -47,7 +47,11 @@ extern char __vvar_page;
DECLARE_VVAR(0, volatile unsigned long, jiffies)
DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
-
+/*
+ * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
+ * stuffing into the vvar area. Don't change any of the above without
+ * also changing this math of VVAR_TOTAL_SIZE
+ */
#undef DECLARE_VVAR

#endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9f6b934..0ab31a9 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -16,6 +16,7 @@
#include <asm/sigframe.h>
#include <asm/bootparam.h>
#include <asm/suspend.h>
+#include <asm/vgtod.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -71,4 +72,8 @@ void common(void) {

BLANK();
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+ BLANK();
+ DEFINE(VVAR_TOTAL_SIZE,
+ ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
}
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 49edf2d..8b11307 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -168,11 +168,9 @@ SECTIONS
* Pad the rest of the page with zeros. Otherwise the loader
* can leave garbage here.
*/
- . = __vvar_beginning_hack + PAGE_SIZE;
+ . = __vvar_beginning_hack + VVAR_TOTAL_SIZE;
} :data

- . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
-
/* Init code and data - will be freed after init */
. = ALIGN(PAGE_SIZE);
.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
diff --git a/arch/x86/vdso/vdso-layout.lds.S b/arch/x86/vdso/vdso-layout.lds.S
index de2c921..acaf8ce 100644
--- a/arch/x86/vdso/vdso-layout.lds.S
+++ b/arch/x86/vdso/vdso-layout.lds.S
@@ -1,4 +1,5 @@
#include <asm/vdso.h>
+#include <asm/asm-offsets.h>

/*
* Linker script for vDSO. This is an ELF shared object prelinked to
@@ -25,7 +26,7 @@ SECTIONS
* segment.
*/

- vvar_start = . - 2 * PAGE_SIZE;
+ vvar_start = . - (VVAR_TOTAL_SIZE + PAGE_SIZE);
vvar_page = vvar_start;

/* Place all vvars at the offsets in asm/vvar.h. */
@@ -35,7 +36,7 @@ SECTIONS
#undef __VVAR_KERNEL_LDS
#undef EMIT_VVAR

- hpet_page = vvar_start + PAGE_SIZE;
+ hpet_page = vvar_start + VVAR_TOTAL_SIZE;

. = SIZEOF_HEADERS;

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 970463b..fc37067 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -16,6 +16,7 @@
#include <asm/vdso.h>
#include <asm/page.h>
#include <asm/hpet.h>
+#include <asm/asm-offsets.h>

#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -150,7 +151,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
ret = remap_pfn_range(vma,
text_start + image->sym_vvar_page,
__pa_symbol(&__vvar_page) >> PAGE_SHIFT,
- PAGE_SIZE,
+ VVAR_TOTAL_SIZE,
PAGE_READONLY);

if (ret)
--
1.8.1


2014-12-08 03:03:56

by Shaohua Li

Subject: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

vdso code can't disable preemption, so it can be preempted at any time.
This makes it challenging to implement certain features. This patch adds
a generic API that lets vdso code detect a context switch.

With this patch, every cpu maintains a context switch count. The upper
bits of the count hold the logical cpu id, so the count can never be
identical for two different cpus. The low bits are incremented on each
context switch. On an x86_64 system with up to 4096 cpus, the counter
wraps only after 2^(64 - 12) context switches, which takes long enough to
be ignored. A change in the count over a given interval can therefore be
used to detect whether a context switch occurred.
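
As an illustration (not part of the patch), a vdso-side consumer of this
API would snapshot the count around a racy read and retry; a minimal
sketch using __vdso_get_context_switch_count() added below:

        unsigned long cs;

        do {
                /* snapshot the per-cpu context switch count */
                cs = __vdso_get_context_switch_count();

                /* read whatever per-cpu data is needed here */

                /* retry if we were preempted or migrated meanwhile */
        } while (cs != __vdso_get_context_switch_count());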

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/vdso.h | 34 ++++++++++++++++++++++++++++++++++
arch/x86/include/asm/vvar.h | 6 ++++++
arch/x86/kernel/asm-offsets.c | 6 ++++++
arch/x86/vdso/vclock_gettime.c | 12 ++++++++++++
arch/x86/vdso/vma.c | 6 ++++++
kernel/sched/core.c | 5 +++++
7 files changed, 73 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 41a503c..4978e31 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1920,6 +1920,10 @@ config COMPAT_VDSO
If unsure, say N: if you are compiling your own kernel, you
are unlikely to be using a buggy version of glibc.

+config VDSO_CS_DETECT
+ def_bool y
+ depends on X86_64
+
config CMDLINE_BOOL
bool "Built-in kernel command line"
---help---
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 8021bd2..42d6d2c 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -4,6 +4,7 @@
#include <asm/page_types.h>
#include <linux/linkage.h>
#include <linux/init.h>
+#include <generated/bounds.h>

#ifndef __ASSEMBLER__

@@ -49,6 +50,39 @@ extern const struct vdso_image *selected_vdso32;

extern void __init init_vdso_image(const struct vdso_image *image);

+#ifdef CONFIG_VDSO_CS_DETECT
+struct vdso_percpu_data {
+ /* layout: | cpu ID | context switch count | */
+ unsigned long cpu_cs_count;
+} ____cacheline_aligned;
+
+struct vdso_data {
+ int dummy;
+ struct vdso_percpu_data vpercpu[0];
+};
+extern struct vdso_data vdso_data;
+
+#ifdef CONFIG_SMP
+#define VDSO_CS_COUNT_BITS \
+ (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
+static inline void vdso_inc_cpu_cs_count(int cpu)
+{
+ unsigned long cs = vdso_data.vpercpu[cpu].cpu_cs_count;
+ cs++;
+ cs &= (1L << VDSO_CS_COUNT_BITS) - 1;
+ vdso_data.vpercpu[cpu].cpu_cs_count = cs |
+ (((unsigned long)cpu) << VDSO_CS_COUNT_BITS);
+ smp_wmb();
+}
+#else
+static inline void vdso_inc_cpu_cs_count(int cpu)
+{
+ vdso_data.vpercpu[cpu].cpu_cs_count++;
+ smp_wmb();
+}
+#endif
+#endif
+
#endif /* __ASSEMBLER__ */

#endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index fcbe621..394ab65 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -47,6 +47,12 @@ extern char __vvar_page;
DECLARE_VVAR(0, volatile unsigned long, jiffies)
DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+/*
+ * this one needs to be last because it ends with a per-cpu array.
+ */
+DECLARE_VVAR(320, struct vdso_data, vdso_data)
+#endif
/*
* you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
* stuffing into the vvar area. Don't change any of the above without
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 0ab31a9..7321cdc 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -17,6 +17,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/vgtod.h>
+#include <asm/vdso.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -74,6 +75,11 @@ void common(void) {
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));

BLANK();
+#ifdef CONFIG_VDSO_CS_DETECT
+ DEFINE(VVAR_TOTAL_SIZE, ALIGN(320 + sizeof(struct vdso_data)
+ + sizeof(struct vdso_percpu_data) * CONFIG_NR_CPUS, PAGE_SIZE));
+#else
DEFINE(VVAR_TOTAL_SIZE,
ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
+#endif
}
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322..438b3be 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -17,6 +17,8 @@
#include <asm/vvar.h>
#include <asm/unistd.h>
#include <asm/msr.h>
+#include <asm/vdso.h>
+#include <asm/vsyscall.h>
#include <linux/math64.h>
#include <linux/time.h>

@@ -289,6 +291,16 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
}

+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+notrace unsigned long __vdso_get_context_switch_count(void)
+{
+ int cpu = __getcpu() & VGETCPU_CPU_MASK;
+
+ smp_rmb();
+ return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
+}
+#endif
+
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
switch (clock) {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index fc37067..400e454 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -22,6 +22,8 @@
unsigned int __read_mostly vdso64_enabled = 1;

extern unsigned short vdso_sync_cpuid;
+
+DEFINE_VVAR(struct vdso_data, vdso_data);
#endif

void __init init_vdso_image(const struct vdso_image *image)
@@ -42,6 +44,10 @@ void __init init_vdso_image(const struct vdso_image *image)
#if defined(CONFIG_X86_64)
static int __init init_vdso(void)
{
+ int cpu;
+ for (cpu =0; cpu < CONFIG_NR_CPUS; cpu++)
+ vdso_inc_cpu_cs_count(cpu);
+
init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..5e4100a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2238,6 +2238,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
struct mm_struct *mm = rq->prev_mm;
long prev_state;
+#ifdef CONFIG_VDSO_CS_DETECT
+ int cpu = smp_processor_id();
+
+ vdso_inc_cpu_cs_count(cpu);
+#endif

rq->prev_mm = NULL;

--
1.8.1

2014-12-08 03:04:28

by Shaohua Li

Subject: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
use the following method to compute the thread cpu time:

t0 = process start
t1 = most recent context switch time
t2 = time at which the vsyscall is invoked

thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
= current->se.sum_exec_runtime + now - sched_clock()

At context switch time we stash away

adj_sched_time = sum_exec_runtime - sched_clock()

in a per-cpu struct in the VVAR page and then compute

thread_cpu_time = adj_sched_time + now

All computations are done in nanoseconds on systems where the TSC is
stable. If the TSC is unstable, we fall back to a regular syscall.

Benchmark data:

        for (i = 0; i < 100000000; i++) {
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
                sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
        }

Baseline:
real 1m3.428s
user 0m5.190s
sys 0m58.220s

patched:
real 0m4.912s
user 0m4.910s
sys 0m0.000s

This should speed up profilers that need to query thread cpu time
frequently in order to produce fine-grained timestamps.
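
For reference, a self-contained version of the benchmark loop above might
look like the following (the iteration count matches the loop above;
NSECS_PER_SEC is assumed to be 10^9, and the original test program is not
part of this patch):

        #include <stdio.h>
        #include <time.h>

        #define NSECS_PER_SEC 1000000000ULL

        int main(void)
        {
                struct timespec ts;
                unsigned long long sum = 0;
                long i;

                /* repeatedly query per-thread cpu time, as in the numbers above */
                for (i = 0; i < 100000000; i++) {
                        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
                        sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
                }
                printf("%llu\n", sum);
                return 0;
        }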

No statistically significant regression was detected on x86_64 context
switch code. Most archs that don't support vsyscalls will have this code
disabled via jump labels.

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Kumar Sundararajan <[email protected]>
Signed-off-by: Arun Sharma <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/include/asm/vdso.h | 9 +++++++-
arch/x86/kernel/tsc.c | 14 ++++++++++++
arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/vdso/vma.c | 4 ++++
kernel/sched/core.c | 3 +++
5 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 42d6d2c..cbbab3a 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
struct vdso_percpu_data {
/* layout: | cpu ID | context switch count | */
unsigned long cpu_cs_count;
+
+ unsigned long long adj_sched_time;
+ unsigned long long cyc2ns_offset;
+ unsigned long cyc2ns;
} ____cacheline_aligned;

struct vdso_data {
- int dummy;
+ unsigned int thread_cputime_disabled;
struct vdso_percpu_data vpercpu[0];
};
extern struct vdso_data vdso_data;

+struct static_key;
+extern struct static_key vcpu_thread_cputime_enabled;
+
#ifdef CONFIG_SMP
#define VDSO_CS_COUNT_BITS \
(sizeof(unsigned long) * 8 - NR_CPUS_BITS)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b7e50bb..83f8091 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -21,6 +21,7 @@
#include <asm/hypervisor.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/vdso.h>

unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
data->cyc2ns_offset = ns_now -
mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);

+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
+ vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
+#endif
+
cyc2ns_write_end(cpu, data);

done:
@@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
tsc_unstable = 1;
clear_sched_clock_stable();
disable_sched_clock_irqtime();
+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.thread_cputime_disabled = 1;
+ static_key_slow_dec(&vcpu_thread_cputime_enabled);
+#endif
pr_info("Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
if (clocksource_tsc.mult)
@@ -1202,6 +1212,10 @@ void __init tsc_init(void)

tsc_disabled = 0;
static_key_slow_inc(&__use_tsc);
+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
+ X86_FEATURE_RDTSCP);
+#endif

if (!no_sched_irq_time)
enable_sched_clock_irqtime();
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 438b3be..46c03af2 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
smp_rmb();
return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
}
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
+ unsigned long long scale,
+ unsigned long long offset)
+{
+ unsigned long long ns = offset;
+ ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
+ return ns;
+}
+
+notrace static unsigned long do_thread_cpu_time(void)
+{
+ unsigned int p;
+ u_int64_t tscval;
+ unsigned long long adj_sched_time;
+ unsigned long long scale;
+ unsigned long long offset;
+ const struct vdso_data *vp = &VVAR(vdso_data);
+ int cpu;
+ unsigned long cs;
+
+ do {
+ cs = __vdso_get_context_switch_count();
+
+ rdtscpll(tscval, p);
+ cpu = p & VGETCPU_CPU_MASK;
+ adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
+ scale = vp->vpercpu[cpu].cyc2ns;
+ offset = vp->vpercpu[cpu].cyc2ns_offset;
+
+ } while (unlikely(cs != __vdso_get_context_switch_count()));
+
+ return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
+}
+
+notrace static inline void ns_to_ts(u64 ns, struct timespec *ts)
+{
+ u32 rem;
+ ts->tv_sec = div_u64_rem(ns, NSEC_PER_SEC, &rem);
+ ts->tv_nsec = rem;
+}
#endif

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
@@ -318,6 +360,13 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
case CLOCK_MONOTONIC_COARSE:
do_monotonic_coarse(ts);
break;
+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+ case CLOCK_THREAD_CPUTIME_ID:
+ if (VVAR(vdso_data).thread_cputime_disabled)
+ goto fallback;
+ ns_to_ts(do_thread_cpu_time(), ts);
+ break;
+#endif
default:
goto fallback;
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 400e454..aefdd93 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -10,6 +10,7 @@
#include <linux/init.h>
#include <linux/random.h>
#include <linux/elf.h>
+#include <linux/jump_label.h>
#include <asm/vsyscall.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
@@ -24,6 +25,7 @@ unsigned int __read_mostly vdso64_enabled = 1;
extern unsigned short vdso_sync_cpuid;

DEFINE_VVAR(struct vdso_data, vdso_data);
+struct static_key vcpu_thread_cputime_enabled;
#endif

void __init init_vdso_image(const struct vdso_image *image)
@@ -54,6 +56,8 @@ static int __init init_vdso(void)
init_vdso_image(&vdso_image_x32);
#endif

+ if (!vdso_data.thread_cputime_disabled)
+ static_key_slow_inc(&vcpu_thread_cputime_enabled);
return 0;
}
subsys_initcall(init_vdso);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e4100a..e0882fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2242,6 +2242,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
int cpu = smp_processor_id();

vdso_inc_cpu_cs_count(cpu);
+ if (static_key_false(&vcpu_thread_cputime_enabled))
+ vdso_data.vpercpu[cpu].adj_sched_time =
+ current->se.sum_exec_runtime - sched_clock();
#endif

rq->prev_mm = NULL;
--
1.8.1

2014-12-10 18:37:04

by Andy Lutomirski

Subject: Re: [PATCH 1/3] X86: make VDSO data support multiple pages

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> Currently vdso data is one page. Next patches will add per-cpu data to
> vdso, which requires several pages if CPU number is big. This makes VDSO
> data support multiple pages.

Can you rename __vvar_page to __vvar_pages?

>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/include/asm/vvar.h | 6 +++++-
> arch/x86/kernel/asm-offsets.c | 5 +++++
> arch/x86/kernel/vmlinux.lds.S | 4 +---
> arch/x86/vdso/vdso-layout.lds.S | 5 +++--
> arch/x86/vdso/vma.c | 3 ++-
> 5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index 5d2b9ad..fcbe621 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -47,7 +47,11 @@ extern char __vvar_page;
> DECLARE_VVAR(0, volatile unsigned long, jiffies)
> DECLARE_VVAR(16, int, vgetcpu_mode)
> DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> -
> +/*
> + * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
> + * stuffing into the vvar area. Don't change any of the above without
> + * also changing this math of VVAR_TOTAL_SIZE
> + */
> #undef DECLARE_VVAR
>
> #endif
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 9f6b934..0ab31a9 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -16,6 +16,7 @@
> #include <asm/sigframe.h>
> #include <asm/bootparam.h>
> #include <asm/suspend.h>
> +#include <asm/vgtod.h>
>
> #ifdef CONFIG_XEN
> #include <xen/interface/xen.h>
> @@ -71,4 +72,8 @@ void common(void) {
>
> BLANK();
> DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
> +
> + BLANK();
> + DEFINE(VVAR_TOTAL_SIZE,
> + ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));

Perhaps add:

BUILD_BUG_ON(VVAR_TOTAL_SIZE % PAGE_SIZE != 0);

or just keep the alignment stuff that you removed.

Although, TBH, this is still rather ugly IMO. Maybe we should just
have struct vvar somewhere and make everything use it. We couldn't do
that before because of jiffies and such, but those are all gone now.

--Andy

> }
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index 49edf2d..8b11307 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -168,11 +168,9 @@ SECTIONS
> * Pad the rest of the page with zeros. Otherwise the loader
> * can leave garbage here.
> */
> - . = __vvar_beginning_hack + PAGE_SIZE;
> + . = __vvar_beginning_hack + VVAR_TOTAL_SIZE;
> } :data
>
> - . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
> -
> /* Init code and data - will be freed after init */
> . = ALIGN(PAGE_SIZE);
> .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
> diff --git a/arch/x86/vdso/vdso-layout.lds.S b/arch/x86/vdso/vdso-layout.lds.S
> index de2c921..acaf8ce 100644
> --- a/arch/x86/vdso/vdso-layout.lds.S
> +++ b/arch/x86/vdso/vdso-layout.lds.S
> @@ -1,4 +1,5 @@
> #include <asm/vdso.h>
> +#include <asm/asm-offsets.h>
>
> /*
> * Linker script for vDSO. This is an ELF shared object prelinked to
> @@ -25,7 +26,7 @@ SECTIONS
> * segment.
> */
>
> - vvar_start = . - 2 * PAGE_SIZE;
> + vvar_start = . - (VVAR_TOTAL_SIZE + PAGE_SIZE);
> vvar_page = vvar_start;
>
> /* Place all vvars at the offsets in asm/vvar.h. */
> @@ -35,7 +36,7 @@ SECTIONS
> #undef __VVAR_KERNEL_LDS
> #undef EMIT_VVAR
>
> - hpet_page = vvar_start + PAGE_SIZE;
> + hpet_page = vvar_start + VVAR_TOTAL_SIZE;
>
> . = SIZEOF_HEADERS;
>
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 970463b..fc37067 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -16,6 +16,7 @@
> #include <asm/vdso.h>
> #include <asm/page.h>
> #include <asm/hpet.h>
> +#include <asm/asm-offsets.h>
>
> #if defined(CONFIG_X86_64)
> unsigned int __read_mostly vdso64_enabled = 1;
> @@ -150,7 +151,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
> ret = remap_pfn_range(vma,
> text_start + image->sym_vvar_page,
> __pa_symbol(&__vvar_page) >> PAGE_SHIFT,
> - PAGE_SIZE,
> + VVAR_TOTAL_SIZE,
> PAGE_READONLY);
>
> if (ret)
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 18:39:06

by Andy Lutomirski

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> vdso code can't disable preempt, so it can be preempted at any time.
> This makes a challenge to implement specific features. This patch adds a
> generic API to let vdso code detect context switch.
>
> With this patch, every cpu maintains a context switch count. The upper
> bits of the count is the logical cpu id, so the count can't be identical
> for any cpu. The low bits of the count will be increased for each
> context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> ignored. The change of the count in giving time can be used to detect if
> context switch occurs.

Why do you need those high bits? I don't understand how you could
possibly confuse one cpu's count with another's unless you fail to
make sure that you're reading the same address both times.

That being said, I don't like this patch. I'm not sure I have a much
better idea, though. More thoughts in the 0/0 email to follow.

--Andy

>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/Kconfig | 4 ++++
> arch/x86/include/asm/vdso.h | 34 ++++++++++++++++++++++++++++++++++
> arch/x86/include/asm/vvar.h | 6 ++++++
> arch/x86/kernel/asm-offsets.c | 6 ++++++
> arch/x86/vdso/vclock_gettime.c | 12 ++++++++++++
> arch/x86/vdso/vma.c | 6 ++++++
> kernel/sched/core.c | 5 +++++
> 7 files changed, 73 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 41a503c..4978e31 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1920,6 +1920,10 @@ config COMPAT_VDSO
> If unsure, say N: if you are compiling your own kernel, you
> are unlikely to be using a buggy version of glibc.
>
> +config VDSO_CS_DETECT
> + def_bool y
> + depends on X86_64
> +
> config CMDLINE_BOOL
> bool "Built-in kernel command line"
> ---help---
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 8021bd2..42d6d2c 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -4,6 +4,7 @@
> #include <asm/page_types.h>
> #include <linux/linkage.h>
> #include <linux/init.h>
> +#include <generated/bounds.h>
>
> #ifndef __ASSEMBLER__
>
> @@ -49,6 +50,39 @@ extern const struct vdso_image *selected_vdso32;
>
> extern void __init init_vdso_image(const struct vdso_image *image);
>
> +#ifdef CONFIG_VDSO_CS_DETECT
> +struct vdso_percpu_data {
> + /* layout: | cpu ID | context switch count | */
> + unsigned long cpu_cs_count;
> +} ____cacheline_aligned;
> +
> +struct vdso_data {
> + int dummy;
> + struct vdso_percpu_data vpercpu[0];
> +};
> +extern struct vdso_data vdso_data;
> +
> +#ifdef CONFIG_SMP
> +#define VDSO_CS_COUNT_BITS \
> + (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> +static inline void vdso_inc_cpu_cs_count(int cpu)
> +{
> + unsigned long cs = vdso_data.vpercpu[cpu].cpu_cs_count;
> + cs++;
> + cs &= (1L << VDSO_CS_COUNT_BITS) - 1;
> + vdso_data.vpercpu[cpu].cpu_cs_count = cs |
> + (((unsigned long)cpu) << VDSO_CS_COUNT_BITS);
> + smp_wmb();
> +}
> +#else
> +static inline void vdso_inc_cpu_cs_count(int cpu)
> +{
> + vdso_data.vpercpu[cpu].cpu_cs_count++;
> + smp_wmb();
> +}
> +#endif
> +#endif
> +
> #endif /* __ASSEMBLER__ */
>
> #endif /* _ASM_X86_VDSO_H */
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index fcbe621..394ab65 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -47,6 +47,12 @@ extern char __vvar_page;
> DECLARE_VVAR(0, volatile unsigned long, jiffies)
> DECLARE_VVAR(16, int, vgetcpu_mode)
> DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> +/*
> + * this one needs to be last because it ends with a per-cpu array.
> + */
> +DECLARE_VVAR(320, struct vdso_data, vdso_data)
> +#endif
> /*
> * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
> * stuffing into the vvar area. Don't change any of the above without
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 0ab31a9..7321cdc 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -17,6 +17,7 @@
> #include <asm/bootparam.h>
> #include <asm/suspend.h>
> #include <asm/vgtod.h>
> +#include <asm/vdso.h>
>
> #ifdef CONFIG_XEN
> #include <xen/interface/xen.h>
> @@ -74,6 +75,11 @@ void common(void) {
> DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
>
> BLANK();
> +#ifdef CONFIG_VDSO_CS_DETECT
> + DEFINE(VVAR_TOTAL_SIZE, ALIGN(320 + sizeof(struct vdso_data)
> + + sizeof(struct vdso_percpu_data) * CONFIG_NR_CPUS, PAGE_SIZE));
> +#else
> DEFINE(VVAR_TOTAL_SIZE,
> ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
> +#endif
> }
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322..438b3be 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -17,6 +17,8 @@
> #include <asm/vvar.h>
> #include <asm/unistd.h>
> #include <asm/msr.h>
> +#include <asm/vdso.h>
> +#include <asm/vsyscall.h>
> #include <linux/math64.h>
> #include <linux/time.h>
>
> @@ -289,6 +291,16 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
> } while (unlikely(gtod_read_retry(gtod, seq)));
> }
>
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> +notrace unsigned long __vdso_get_context_switch_count(void)
> +{
> + int cpu = __getcpu() & VGETCPU_CPU_MASK;
> +
> + smp_rmb();
> + return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> +}
> +#endif
> +
> notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> {
> switch (clock) {
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index fc37067..400e454 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -22,6 +22,8 @@
> unsigned int __read_mostly vdso64_enabled = 1;
>
> extern unsigned short vdso_sync_cpuid;
> +
> +DEFINE_VVAR(struct vdso_data, vdso_data);
> #endif
>
> void __init init_vdso_image(const struct vdso_image *image)
> @@ -42,6 +44,10 @@ void __init init_vdso_image(const struct vdso_image *image)
> #if defined(CONFIG_X86_64)
> static int __init init_vdso(void)
> {
> + int cpu;
> + for (cpu =0; cpu < CONFIG_NR_CPUS; cpu++)
> + vdso_inc_cpu_cs_count(cpu);
> +
> init_vdso_image(&vdso_image_64);
>
> #ifdef CONFIG_X86_X32_ABI
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 89e7283..5e4100a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2238,6 +2238,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> {
> struct mm_struct *mm = rq->prev_mm;
> long prev_state;
> +#ifdef CONFIG_VDSO_CS_DETECT
> + int cpu = smp_processor_id();
> +
> + vdso_inc_cpu_cs_count(cpu);
> +#endif
>
> rq->prev_mm = NULL;
>
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 18:51:33

by Shaohua Li

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> > vdso code can't disable preempt, so it can be preempted at any time.
> > This makes a challenge to implement specific features. This patch adds a
> > generic API to let vdso code detect context switch.
> >
> > With this patch, every cpu maintains a context switch count. The upper
> > bits of the count is the logical cpu id, so the count can't be identical
> > for any cpu. The low bits of the count will be increased for each
> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> > ignored. The change of the count in giving time can be used to detect if
> > context switch occurs.
>
> Why do you need those high bits? I don't understand how you could
> possibly confuse one cpu's count with another's unless you fail to
> make sure that you're reading the same address both times.
>
> That being said, I don't like this patch. I'm not sure I have a much
> better idea, though. More thoughts in the 0/0 email to follow.
the vdso code doesn't disable preemption, so it can be migrated between
cpus at any time. The usage is:

get_context_switch_count (on cpu A)
do_something (on cpu B)
get_context_switch_count (on cpu C)

The cpus A, B and C could be completely different. We want to make sure
there is no preemption here, and we use the context switch count to judge
this. If the high bits are ignored, the context switch count could be
identical even when A != C, and then our judgement based on the switch
count would be wrong.

Thanks,
Shaohua

2014-12-10 19:11:18

by Andy Lutomirski

Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> use the following method to compute the thread cpu time:

I like the idea, and I like making this type of profiling fast. I
don't love the implementation because it's an information leak (maybe
we don't care) and it's ugly.

The info leak could be fixed completely by having a per-process array
instead of a global array. That's currently tricky without wasting
memory, but it could be created on demand if we wanted to do that,
once my vvar .fault patches go in (assuming they do -- I need to ping
the linux-mm people).

As for ugliness, it seems to me that there really ought to be a better
way to do this. What we really want is a per-thread vvar-like thing.
Any code that can use TLS can do this trivially, but the vdso,
unfortunately, has no straightforward way to use TLS.

If we could rely on something like TSX (ha!), then this becomes straightforward:

xbegin()
rdtscp
read params
xcommit()

But we can't do that for the next decade or so, obviously.

If we were willing to limit ourselves to systems with rdwrgsbase, then
we could do:

save gs and gsbase
mov $__VDSO_CPU,%gs
mov %gs:(base our array),%whatever
restore gs and gsbase

(We actually could do this if we make arch_prctl disable this feature
the first time a process tries to set the gs base.)

If we were willing to limit ourselves to at most 2^20 threads per
process, then we could assign each thread a 20-bit index within its
process with some LSL trickery. Then we could have the vdso look up
the thread in a per-process array with one element per thread. This
would IMO be the winning approach if we think we'd ever extend the set
of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
just use a far pointer.)

If we had per-cpu fixmap entries, we'd use those. Unfortunately, that
would be a giant mess to implement, and it would be a slowdown for
some workloads (and a speedup for others - hmm).

Grumble. Thoughts?

--Andy

>
> t0 = process start
> t1 = most recent context switch time
> t2 = time at which the vsyscall is invoked
>
> thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> = current->se.sum_exec_runtime + now - sched_clock()
>
> At context switch time We stash away
>
> adj_sched_time = sum_exec_runtime - sched_clock()
>
> in a per-cpu struct in the VVAR page and then compute
>
> thread_cpu_time = adj_sched_time + now
>
> All computations are done in nanosecs on systems where TSC is stable. If
> TSC is unstable, we fallback to a regular syscall.
> Benchmark data:
>
> for (i = 0; i < 100000000; i++) {
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> }
>
> Baseline:
> real 1m3.428s
> user 0m5.190s
> sys 0m58.220s
>
> patched:
> real 0m4.912s
> user 0m4.910s
> sys 0m0.000s
>
> This should speed up profilers that need to query thread cpu time a lot
> to do fine-grained timestamps.
>
> No statistically significant regression was detected on x86_64 context
> switch code. Most archs that don't support vsyscalls will have this code
> disabled via jump labels.
>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Kumar Sundararajan <[email protected]>
> Signed-off-by: Arun Sharma <[email protected]>
> Signed-off-by: Chris Mason <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/include/asm/vdso.h | 9 +++++++-
> arch/x86/kernel/tsc.c | 14 ++++++++++++
> arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/vdso/vma.c | 4 ++++
> kernel/sched/core.c | 3 +++
> 5 files changed, 78 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 42d6d2c..cbbab3a 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> struct vdso_percpu_data {
> /* layout: | cpu ID | context switch count | */
> unsigned long cpu_cs_count;
> +
> + unsigned long long adj_sched_time;
> + unsigned long long cyc2ns_offset;
> + unsigned long cyc2ns;

Having two offsets seems unnecessary.

> } ____cacheline_aligned;
>
> struct vdso_data {
> - int dummy;
> + unsigned int thread_cputime_disabled;
> struct vdso_percpu_data vpercpu[0];
> };
> extern struct vdso_data vdso_data;
>
> +struct static_key;
> +extern struct static_key vcpu_thread_cputime_enabled;
> +
> #ifdef CONFIG_SMP
> #define VDSO_CS_COUNT_BITS \
> (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index b7e50bb..83f8091 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -21,6 +21,7 @@
> #include <asm/hypervisor.h>
> #include <asm/nmi.h>
> #include <asm/x86_init.h>
> +#include <asm/vdso.h>
>
> unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> EXPORT_SYMBOL(cpu_khz);
> @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> data->cyc2ns_offset = ns_now -
> mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
>
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> +#endif
> +
> cyc2ns_write_end(cpu, data);
>
> done:
> @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> tsc_unstable = 1;
> clear_sched_clock_stable();
> disable_sched_clock_irqtime();
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.thread_cputime_disabled = 1;
> + static_key_slow_dec(&vcpu_thread_cputime_enabled);
> +#endif
> pr_info("Marking TSC unstable due to %s\n", reason);
> /* Change only the rating, when not registered */
> if (clocksource_tsc.mult)
> @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
>
> tsc_disabled = 0;
> static_key_slow_inc(&__use_tsc);
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> + X86_FEATURE_RDTSCP);
> +#endif

This is backwards IMO. Even if this function never runs, the flag
should be set to disable the feature. Then you can enable it here.

>
> if (!no_sched_irq_time)
> enable_sched_clock_irqtime();
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 438b3be..46c03af2 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> smp_rmb();
> return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> }
> +
> +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> + unsigned long long scale,
> + unsigned long long offset)
> +{
> + unsigned long long ns = offset;
> + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> + return ns;
> +}
> +
> +notrace static unsigned long do_thread_cpu_time(void)
> +{
> + unsigned int p;
> + u_int64_t tscval;
> + unsigned long long adj_sched_time;
> + unsigned long long scale;
> + unsigned long long offset;
> + const struct vdso_data *vp = &VVAR(vdso_data);
> + int cpu;
> + unsigned long cs;
> +
> + do {
> + cs = __vdso_get_context_switch_count();
> +
> + rdtscpll(tscval, p);
> + cpu = p & VGETCPU_CPU_MASK;
> + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> + scale = vp->vpercpu[cpu].cyc2ns;
> + offset = vp->vpercpu[cpu].cyc2ns_offset;
> +
> + } while (unlikely(cs != __vdso_get_context_switch_count()));
> +
> + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;

You're literally adding two offsets together here, although it's
obscured by the abstraction of __cycles_2_ns.

Also, if you actually expand this function out, it's not optimized
well. You're doing, roughly:

getcpu
read cs count
rdtscp
read clock params
getcpu
read cs count
repeat if necessary

You should need at most two operations that read the cpu number. You could do:

getcpu
read cs count
rdtscp
read clock params
barrier()
check that both cpu numbers agree and that the cs count hasn't changed
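
A minimal sketch of that suggested structure, reusing identifiers from the
patch (untested, purely illustrative):

        do {
                cpu = __getcpu() & VGETCPU_CPU_MASK;
                cs = vp->vpercpu[cpu].cpu_cs_count;

                rdtscpll(tscval, p);
                cpu1 = p & VGETCPU_CPU_MASK;
                adj_sched_time = vp->vpercpu[cpu1].adj_sched_time;
                scale = vp->vpercpu[cpu1].cyc2ns;
                offset = vp->vpercpu[cpu1].cyc2ns_offset;

                barrier();
                /* accept only if we stayed on one cpu and no switch happened */
        } while (cpu != cpu1 ||
                 cs != vp->vpercpu[cpu].cpu_cs_count);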

All those smp barriers should be unnecessary, too. There's no SMP here.

Also, if you do this, then the high bits of the cs count can be removed.

It may also make sense to bail and use the fallback after a couple
failures. Otherwise, in some debuggers, you will literally never
finish the loop.

I am curious, though: why are you using sched clock instead of
CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
scheduler?

--Andy

> +}
> +
> +notrace static inline void ns_to_ts(u64 ns, struct timespec *ts)
> +{
> + u32 rem;
> + ts->tv_sec = div_u64_rem(ns, NSEC_PER_SEC, &rem);
> + ts->tv_nsec = rem;
> +}
> #endif
>
> notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> @@ -318,6 +360,13 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> case CLOCK_MONOTONIC_COARSE:
> do_monotonic_coarse(ts);
> break;
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> + case CLOCK_THREAD_CPUTIME_ID:
> + if (VVAR(vdso_data).thread_cputime_disabled)
> + goto fallback;
> + ns_to_ts(do_thread_cpu_time(), ts);
> + break;
> +#endif
> default:
> goto fallback;
> }
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 400e454..aefdd93 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -10,6 +10,7 @@
> #include <linux/init.h>
> #include <linux/random.h>
> #include <linux/elf.h>
> +#include <linux/jump_label.h>
> #include <asm/vsyscall.h>
> #include <asm/vgtod.h>
> #include <asm/proto.h>
> @@ -24,6 +25,7 @@ unsigned int __read_mostly vdso64_enabled = 1;
> extern unsigned short vdso_sync_cpuid;
>
> DEFINE_VVAR(struct vdso_data, vdso_data);
> +struct static_key vcpu_thread_cputime_enabled;
> #endif
>
> void __init init_vdso_image(const struct vdso_image *image)
> @@ -54,6 +56,8 @@ static int __init init_vdso(void)
> init_vdso_image(&vdso_image_x32);
> #endif
>
> + if (!vdso_data.thread_cputime_disabled)
> + static_key_slow_inc(&vcpu_thread_cputime_enabled);
> return 0;
> }
> subsys_initcall(init_vdso);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5e4100a..e0882fb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2242,6 +2242,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> int cpu = smp_processor_id();
>
> vdso_inc_cpu_cs_count(cpu);
> + if (static_key_false(&vcpu_thread_cputime_enabled))
> + vdso_data.vpercpu[cpu].adj_sched_time =
> + current->se.sum_exec_runtime - sched_clock();
> #endif
>
> rq->prev_mm = NULL;
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 19:11:51

by Andy Lutomirski

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 10:51 AM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
>> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> > vdso code can't disable preempt, so it can be preempted at any time.
>> > This makes a challenge to implement specific features. This patch adds a
>> > generic API to let vdso code detect context switch.
>> >
>> > With this patch, every cpu maintains a context switch count. The upper
>> > bits of the count is the logical cpu id, so the count can't be identical
>> > for any cpu. The low bits of the count will be increased for each
>> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
>> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
>> > ignored. The change of the count in giving time can be used to detect if
>> > context switch occurs.
>>
>> Why do you need those high bits? I don't understand how you could
>> possibly confuse one cpu's count with another's unless you fail to
>> make sure that you're reading the same address both times.
>>
>> That being said, I don't like this patch. I'm not sure I have a much
>> better idea, though. More thoughts in the 0/0 email to follow.
> the vdso code doesn't disable preemption, so it can be migrated between
> cpus at any time, the usage is:
>
> get_countext_switch_count (in cpu A)
> do_something (in cpu B)
> get_countext_switch_count (in cpu C)
>
> The cpu A, B, C could be completely different. We want to make sure
> there is no preemption here and we use the context switch count to judge
> this. If the high bits is ignored, the context switch count could be
> identical even A != C, then our judgement using the switch count is
> wrong.

Sure, but you could compare the cpu numbers, too.

--Andy

>
> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 19:42:16

by Shaohua Li

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 11:11:27AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 10:51 AM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> > vdso code can't disable preempt, so it can be preempted at any time.
> >> > This makes a challenge to implement specific features. This patch adds a
> >> > generic API to let vdso code detect context switch.
> >> >
> >> > With this patch, every cpu maintains a context switch count. The upper
> >> > bits of the count is the logical cpu id, so the count can't be identical
> >> > for any cpu. The low bits of the count will be increased for each
> >> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> >> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> >> > ignored. The change of the count in giving time can be used to detect if
> >> > context switch occurs.
> >>
> >> Why do you need those high bits? I don't understand how you could
> >> possibly confuse one cpu's count with another's unless you fail to
> >> make sure that you're reading the same address both times.
> >>
> >> That being said, I don't like this patch. I'm not sure I have a much
> >> better idea, though. More thoughts in the 0/0 email to follow.
> > the vdso code doesn't disable preemption, so it can be migrated between
> > cpus at any time, the usage is:
> >
> > get_countext_switch_count (in cpu A)
> > do_something (in cpu B)
> > get_countext_switch_count (in cpu C)
> >
> > The cpu A, B, C could be completely different. We want to make sure
> > there is no preemption here and we use the context switch count to judge
> > this. If the high bits is ignored, the context switch count could be
> > identical even A != C, then our judgement using the switch count is
> > wrong.
>
> Sure, but you could compare the cpu numbers, too.

Aha, makes sense.

Thanks,
Shaohua

2014-12-10 21:57:30

by Shaohua Li

Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> > use the following method to compute the thread cpu time:
>
> I like the idea, and I like making this type of profiling fast. I
> don't love the implementation because it's an information leak (maybe
> we don't care) and it's ugly.
>
> The info leak could be fixed completely by having a per-process array
> instead of a global array. That's currently tricky without wasting
> memory, but it could be created on demand if we wanted to do that,
> once my vvar .fault patches go in (assuming they do -- I need to ping
> the linux-mm people).

That info leak really doesn't matter. But we need the global array
anyway. The context switch detection data has to be per-cpu and has to
be accessible from remote cpus.

> As for ugliness, it seems to me that there really ought to be a better
> way to do this. What we really want is a per-thread vvar-like thing.
> Any code that can use TLS can do this trivially, but the vdso,
> unfortunately, has no straightforward way to use TLS.
>
> If we could rely on something like TSX (ha!), then this becomes straightforward:
>
> xbegin()
> rdtscp
> read params
> xcommit()
>
> But we can't do that for the next decade or so, obviously.
>
> If we were willing to limit ourselves to systems with rdwrgsbase, then
> we could do:
>
> save gs and gsbase
> mov $__VDSO_CPU,%gs
> mov %gs:(base our array),%whatever
> restore gs and gsbase
>
> (We actually could do this if we make arch_prctl disable this feature
> the first time a process tries to set the gs base.)
>
> If we were willing to limit ourselves to at most 2^20 threads per
> process, then we could assign each thread a 20-bit index within its
> process with some LSL trickery. Then we could have the vdso look up
> the thread in a per-process array with one element per thread. This
> would IMO be the winning approach if we think we'd ever extend the set
> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
> just use a far pointer.)
>
> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
> would be a giant mess to implement, and it would be a slowdown for
> some workloads (and a speedup for others - hmm).
>
> Grumble. Thoughts?
>
> --Andy
>
> >
> > t0 = process start
> > t1 = most recent context switch time
> > t2 = time at which the vsyscall is invoked
> >
> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> > = current->se.sum_exec_runtime + now - sched_clock()
> >
> > At context switch time We stash away
> >
> > adj_sched_time = sum_exec_runtime - sched_clock()
> >
> > in a per-cpu struct in the VVAR page and then compute
> >
> > thread_cpu_time = adj_sched_time + now
> >
> > All computations are done in nanosecs on systems where TSC is stable. If
> > TSC is unstable, we fallback to a regular syscall.
> > Benchmark data:
> >
> > for (i = 0; i < 100000000; i++) {
> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> > }
> >
> > Baseline:
> > real 1m3.428s
> > user 0m5.190s
> > sys 0m58.220s
> >
> > patched:
> > real 0m4.912s
> > user 0m4.910s
> > sys 0m0.000s
> >
> > This should speed up profilers that need to query thread cpu time a lot
> > to do fine-grained timestamps.
> >
> > No statistically significant regression was detected on x86_64 context
> > switch code. Most archs that don't support vsyscalls will have this code
> > disabled via jump labels.
> >
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Signed-off-by: Kumar Sundararajan <[email protected]>
> > Signed-off-by: Arun Sharma <[email protected]>
> > Signed-off-by: Chris Mason <[email protected]>
> > Signed-off-by: Shaohua Li <[email protected]>
> > ---
> > arch/x86/include/asm/vdso.h | 9 +++++++-
> > arch/x86/kernel/tsc.c | 14 ++++++++++++
> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> > arch/x86/vdso/vma.c | 4 ++++
> > kernel/sched/core.c | 3 +++
> > 5 files changed, 78 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> > index 42d6d2c..cbbab3a 100644
> > --- a/arch/x86/include/asm/vdso.h
> > +++ b/arch/x86/include/asm/vdso.h
> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> > struct vdso_percpu_data {
> > /* layout: | cpu ID | context switch count | */
> > unsigned long cpu_cs_count;
> > +
> > + unsigned long long adj_sched_time;
> > + unsigned long long cyc2ns_offset;
> > + unsigned long cyc2ns;
>
> Having two offsets seems unnecessary.
>
> > } ____cacheline_aligned;
> >
> > struct vdso_data {
> > - int dummy;
> > + unsigned int thread_cputime_disabled;
> > struct vdso_percpu_data vpercpu[0];
> > };
> > extern struct vdso_data vdso_data;
> >
> > +struct static_key;
> > +extern struct static_key vcpu_thread_cputime_enabled;
> > +
> > #ifdef CONFIG_SMP
> > #define VDSO_CS_COUNT_BITS \
> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index b7e50bb..83f8091 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -21,6 +21,7 @@
> > #include <asm/hypervisor.h>
> > #include <asm/nmi.h>
> > #include <asm/x86_init.h>
> > +#include <asm/vdso.h>
> >
> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> > EXPORT_SYMBOL(cpu_khz);
> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > data->cyc2ns_offset = ns_now -
> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
> >
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> > +#endif
> > +
> > cyc2ns_write_end(cpu, data);
> >
> > done:
> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> > tsc_unstable = 1;
> > clear_sched_clock_stable();
> > disable_sched_clock_irqtime();
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = 1;
> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
> > +#endif
> > pr_info("Marking TSC unstable due to %s\n", reason);
> > /* Change only the rating, when not registered */
> > if (clocksource_tsc.mult)
> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
> >
> > tsc_disabled = 0;
> > static_key_slow_inc(&__use_tsc);
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> > + X86_FEATURE_RDTSCP);
> > +#endif
>
> This is backwards IMO. Even if this function never runs, the flag
> should be set to disable the feature. Then you can enable it here.
>
> >
> > if (!no_sched_irq_time)
> > enable_sched_clock_irqtime();
> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> > index 438b3be..46c03af2 100644
> > --- a/arch/x86/vdso/vclock_gettime.c
> > +++ b/arch/x86/vdso/vclock_gettime.c
> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> > smp_rmb();
> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> > }
> > +
> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> > + unsigned long long scale,
> > + unsigned long long offset)
> > +{
> > + unsigned long long ns = offset;
> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> > + return ns;
> > +}
> > +
> > +notrace static unsigned long do_thread_cpu_time(void)
> > +{
> > + unsigned int p;
> > + u_int64_t tscval;
> > + unsigned long long adj_sched_time;
> > + unsigned long long scale;
> > + unsigned long long offset;
> > + const struct vdso_data *vp = &VVAR(vdso_data);
> > + int cpu;
> > + unsigned long cs;
> > +
> > + do {
> > + cs = __vdso_get_context_switch_count();
> > +
> > + rdtscpll(tscval, p);
> > + cpu = p & VGETCPU_CPU_MASK;
> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> > + scale = vp->vpercpu[cpu].cyc2ns;
> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
> > +
> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
> > +
> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>
> You're literally adding two offsets together here, although it's
> obscured by the abstraction of __cycles_2_ns.

I don't understand the 'two offsets' point; could you please give more
details?

> Also, if you actually expand this function out, it's not optimized
> well. You're doing, roughly:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> getcpu
> read cs count
> repeat if necessary
>
> You should need at most two operations that read the cpu number. You could do:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> barrier()
> check that both cpu numbers agree and that the cs count hasn't changed
>
> All those smp barriers should be unnecessary, too. There's no SMP here.
>
> Also, if you do this, then the high bits of the cs count can be removed.
> It may also make sense to bail and use the fallback after a couple
> failures. Otherwise, in some debuggers, you will literally never
> finish the loop.

ok, makes sense.

> I am curious, though: why are you using sched clock instead of
> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
> scheduler?

The sched clock is faster, I guess.

Thanks,
Shaohua

2014-12-10 22:13:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> > use the following method to compute the thread cpu time:
>>
>> I like the idea, and I like making this type of profiling fast. I
>> don't love the implementation because it's an information leak (maybe
>> we don't care) and it's ugly.
>>
>> The info leak could be fixed completely by having a per-process array
>> instead of a global array. That's currently tricky without wasting
>> memory, but it could be created on demand if we wanted to do that,
>> once my vvar .fault patches go in (assuming they do -- I need to ping
>> the linux-mm people).
>
> those info leak really doesn't matter.

Why not?

> But we need the global array
> anyway. The context switch detection should be per-cpu data and should
> be able to access in remote cpus.

Right, but the whole array could be per process instead of global.

I'm not saying I'm sure that would be better, but I think it's worth
considering.

>
>> As for ugliness, it seems to me that there really ought to be a better
>> way to do this. What we really want is a per-thread vvar-like thing.
>> Any code that can use TLS can do this trivially, but the vdso,
>> unfortunately, has no straightforward way to use TLS.
>>
>> If we could rely on something like TSX (ha!), then this becomes straightforward:
>>
>> xbegin()
>> rdtscp
>> read params
>> xcommit()
>>
>> But we can't do that for the next decade or so, obviously.
>>
>> If we were willing to limit ourselves to systems with rdwrgsbase, then
>> we could do:
>>
>> save gs and gsbase
>> mov $__VDSO_CPU,%gs
>> mov %gs:(base our array),%whatever
>> restore gs and gsbase
>>
>> (We actually could do this if we make arch_prctl disable this feature
>> the first time a process tries to set the gs base.)
>>
>> If we were willing to limit ourselves to at most 2^20 threads per
>> process, then we could assign each thread a 20-bit index within its
>> process with some LSL trickery. Then we could have the vdso look up
>> the thread in a per-process array with one element per thread. This
>> would IMO be the winning approach if we think we'd ever extend the set
>> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
>> just use a far pointer.)
>>
>> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
>> would be a giant mess to implement, and it would be a slowdown for
>> some workloads (and a speedup for others - hmm).
>>
>> Grumble. Thoughts?
>>
>> --Andy
>>
>> >
>> > t0 = process start
>> > t1 = most recent context switch time
>> > t2 = time at which the vsyscall is invoked
>> >
>> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
>> > = current->se.sum_exec_runtime + now - sched_clock()
>> >
>> > At context switch time We stash away
>> >
>> > adj_sched_time = sum_exec_runtime - sched_clock()
>> >
>> > in a per-cpu struct in the VVAR page and then compute
>> >
>> > thread_cpu_time = adj_sched_time + now
>> >
>> > All computations are done in nanosecs on systems where TSC is stable. If
>> > TSC is unstable, we fallback to a regular syscall.
>> > Benchmark data:
>> >
>> > for (i = 0; i < 100000000; i++) {
>> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
>> > }
>> >
>> > Baseline:
>> > real 1m3.428s
>> > user 0m5.190s
>> > sys 0m58.220s
>> >
>> > patched:
>> > real 0m4.912s
>> > user 0m4.910s
>> > sys 0m0.000s
>> >
>> > This should speed up profilers that need to query thread cpu time a lot
>> > to do fine-grained timestamps.
>> >
>> > No statistically significant regression was detected on x86_64 context
>> > switch code. Most archs that don't support vsyscalls will have this code
>> > disabled via jump labels.
>> >
>> > Cc: Andy Lutomirski <[email protected]>
>> > Cc: H. Peter Anvin <[email protected]>
>> > Cc: Ingo Molnar <[email protected]>
>> > Signed-off-by: Kumar Sundararajan <[email protected]>
>> > Signed-off-by: Arun Sharma <[email protected]>
>> > Signed-off-by: Chris Mason <[email protected]>
>> > Signed-off-by: Shaohua Li <[email protected]>
>> > ---
>> > arch/x86/include/asm/vdso.h | 9 +++++++-
>> > arch/x86/kernel/tsc.c | 14 ++++++++++++
>> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
>> > arch/x86/vdso/vma.c | 4 ++++
>> > kernel/sched/core.c | 3 +++
>> > 5 files changed, 78 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
>> > index 42d6d2c..cbbab3a 100644
>> > --- a/arch/x86/include/asm/vdso.h
>> > +++ b/arch/x86/include/asm/vdso.h
>> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
>> > struct vdso_percpu_data {
>> > /* layout: | cpu ID | context switch count | */
>> > unsigned long cpu_cs_count;
>> > +
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long cyc2ns_offset;
>> > + unsigned long cyc2ns;
>>
>> Having two offsets seems unnecessary.
>>
>> > } ____cacheline_aligned;
>> >
>> > struct vdso_data {
>> > - int dummy;
>> > + unsigned int thread_cputime_disabled;
>> > struct vdso_percpu_data vpercpu[0];
>> > };
>> > extern struct vdso_data vdso_data;
>> >
>> > +struct static_key;
>> > +extern struct static_key vcpu_thread_cputime_enabled;
>> > +
>> > #ifdef CONFIG_SMP
>> > #define VDSO_CS_COUNT_BITS \
>> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
>> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> > index b7e50bb..83f8091 100644
>> > --- a/arch/x86/kernel/tsc.c
>> > +++ b/arch/x86/kernel/tsc.c
>> > @@ -21,6 +21,7 @@
>> > #include <asm/hypervisor.h>
>> > #include <asm/nmi.h>
>> > #include <asm/x86_init.h>
>> > +#include <asm/vdso.h>
>> >
>> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
>> > EXPORT_SYMBOL(cpu_khz);
>> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>> > data->cyc2ns_offset = ns_now -
>> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
>> >
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
>> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
>> > +#endif
>> > +
>> > cyc2ns_write_end(cpu, data);
>> >
>> > done:
>> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
>> > tsc_unstable = 1;
>> > clear_sched_clock_stable();
>> > disable_sched_clock_irqtime();
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = 1;
>> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
>> > +#endif
>> > pr_info("Marking TSC unstable due to %s\n", reason);
>> > /* Change only the rating, when not registered */
>> > if (clocksource_tsc.mult)
>> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
>> >
>> > tsc_disabled = 0;
>> > static_key_slow_inc(&__use_tsc);
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
>> > + X86_FEATURE_RDTSCP);
>> > +#endif
>>
>> This is backwards IMO. Even if this function never runs, the flag
>> should be set to disable the feature. Then you can enable it here.
>>
>> >
>> > if (!no_sched_irq_time)
>> > enable_sched_clock_irqtime();
>> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> > index 438b3be..46c03af2 100644
>> > --- a/arch/x86/vdso/vclock_gettime.c
>> > +++ b/arch/x86/vdso/vclock_gettime.c
>> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
>> > smp_rmb();
>> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
>> > }
>> > +
>> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
>> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
>> > + unsigned long long scale,
>> > + unsigned long long offset)
>> > +{
>> > + unsigned long long ns = offset;
>> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
>> > + return ns;
>> > +}
>> > +
>> > +notrace static unsigned long do_thread_cpu_time(void)
>> > +{
>> > + unsigned int p;
>> > + u_int64_t tscval;
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long scale;
>> > + unsigned long long offset;
>> > + const struct vdso_data *vp = &VVAR(vdso_data);
>> > + int cpu;
>> > + unsigned long cs;
>> > +
>> > + do {
>> > + cs = __vdso_get_context_switch_count();
>> > +
>> > + rdtscpll(tscval, p);
>> > + cpu = p & VGETCPU_CPU_MASK;
>> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
>> > + scale = vp->vpercpu[cpu].cyc2ns;
>> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
>> > +
>> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
>> > +
>> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>>
>> You're literally adding two offsets together here, although it's
>> obscured by the abstraction of __cycles_2_ns.
>
> I don't understand 'two offsets' stuff, could you please give more
> details?

I think you're only really using cyc2ns_offset + adj_sched_time, so it
seems unnecessary to store both.
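
Put differently, a sketch of folding the two into one value (the
cyc2ns_plus_sched_offset field is made up for illustration; the kernel
would have to refresh it both when the cyc2ns parameters change and at
context switch):

/* kernel side */
static void vdso_update_combined_offset(int cpu, unsigned long mul,
                                        unsigned long long cyc2ns_offset,
                                        unsigned long long adj_sched_time)
{
        vdso_data.vpercpu[cpu].cyc2ns = mul;
        vdso_data.vpercpu[cpu].cyc2ns_plus_sched_offset =
                cyc2ns_offset + adj_sched_time;
}

/* vdso side: one multiply, one shift, one add */
return mul_u64_u32_shr(tscval, vp->vpercpu[cpu].cyc2ns,
                       CYC2NS_SCALE_FACTOR) +
       vp->vpercpu[cpu].cyc2ns_plus_sched_offset;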

--Andy

>
>> Also, if you actually expand this function out, it's not optimized
>> well. You're doing, roughly:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> getcpu
>> read cs count
>> repeat if necessary
>>
>> You should need at most two operations that read the cpu number. You could do:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> barrier()
>> check that both cpu numbers agree and that the cs count hasn't changed
>>
>> All those smp barriers should be unnecessary, too. There's no SMP here.
>>
>> Also, if you do this, then the high bits of the cs count can be removed.
>> It may also make sense to bail and use the fallback after a couple
>> failures. Otherwise, in some debuggers, you will literally never
>> finish the loop.
>
> ok, makes sense.
>
>> I am curious, though: why are you using sched clock instead of
>> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
>> scheduler?
>
> The sched clock is faster, I guess.
>
> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 22:56:42

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> >> > use the following method to compute the thread cpu time:
> >>
> >> I like the idea, and I like making this type of profiling fast. I
> >> don't love the implementation because it's an information leak (maybe
> >> we don't care) and it's ugly.
> >>
> >> The info leak could be fixed completely by having a per-process array
> >> instead of a global array. That's currently tricky without wasting
> >> memory, but it could be created on demand if we wanted to do that,
> >> once my vvar .fault patches go in (assuming they do -- I need to ping
> >> the linux-mm people).
> >
> > those info leak really doesn't matter.
>
> Why not?

Of course I can't be completely sure, but how could this info be used
in an attack?

> > But we need the global array
> > anyway. The context switch detection should be per-cpu data and should
> > be able to access in remote cpus.
>
> Right, but the whole array could be per process instead of global.
>
> I'm not saying I'm sure that would be better, but I think it's worth
> considering.

Right, it's possible to make it per process. As you said, this would
waste a lot of memory. And you can't even do it on demand, as the
context switch path will write the count to the per-process/per-thread
vvar. Or you can maintain the count in the kernel and let the .fault
handler copy the count to the vvar page (if the vvar page is absent).
But this still wastes memory if applications use the vdso. I'm also
wondering how you handle a page fault in the context switch path if you
don't pin the vdso pages.

Thanks,
Shaohua

2014-12-10 23:06:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> >> > use the following method to compute the thread cpu time:
>> >>
>> >> I like the idea, and I like making this type of profiling fast. I
>> >> don't love the implementation because it's an information leak (maybe
>> >> we don't care) and it's ugly.
>> >>
>> >> The info leak could be fixed completely by having a per-process array
>> >> instead of a global array. That's currently tricky without wasting
>> >> memory, but it could be created on demand if we wanted to do that,
>> >> once my vvar .fault patches go in (assuming they do -- I need to ping
>> >> the linux-mm people).
>> >
>> > those info leak really doesn't matter.
>>
>> Why not?
>
> Ofcourse I can't make sure completely, but how could this info be used
> as attack?

It may leak interesting timing info, even from cpus that are outside
your affinity mask / cpuset. I don't know how much anyone actually
cares.

>
>> > But we need the global array
>> > anyway. The context switch detection should be per-cpu data and should
>> > be able to access in remote cpus.
>>
>> Right, but the whole array could be per process instead of global.
>>
>> I'm not saying I'm sure that would be better, but I think it's worth
>> considering.
>
> right, it's possible to be per process. As you said, this will waster a
> lot of memory. and you can't even do on-demand, as the context switch
> path will write the count to the per-process/per-thread vvar. Or you can
> maintain the count in kernel and let the .fault copy the count to vvar
> page (if the vvar page is absent). But this still wastes memory if
> applications use the vdso. I'm wondering how you handle page fault in
> context switch too if you don't pin the vdso pages.
>

You need to pin them, but at least you don't need to create them at
all until they're needed the first time.

The totally per-thread approach has all kinds of nice properties,
including allowing the whole thing to work without a loop, at least on
64-bit machines (if you detect that you had a context switch, just
return the most recent sum_exec_runtime).
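
A sketch of what the per-thread variant could look like, assuming an
entirely hypothetical per-thread slot that the kernel writes only while
the thread is off the CPU (so the 64-bit loads below cannot race with an
update for the thread doing the read):

struct vdso_thread_slot {
        u64 sum_exec_runtime;   /* ns accumulated up to the last switch-in */
        u64 tsc_at_switch_in;   /* TSC sampled at the last switch-in */
        u32 cyc2ns_mul;         /* cyc2ns multiplier of the cpu we woke on */
};

notrace static u64 per_thread_cpu_time(const struct vdso_thread_slot *slot)
{
        u64 t_in, runtime, tsc;
        unsigned int p;

        t_in = slot->tsc_at_switch_in;
        runtime = slot->sum_exec_runtime;
        barrier();
        rdtscpll(tsc, p);
        (void)p;                /* cpu number is not needed here */
        barrier();

        /*
         * If we were scheduled out and back in meanwhile, the slot has
         * been rewritten and sum_exec_runtime is fresh enough to return
         * directly -- no loop, as noted above.
         */
        if (slot->tsc_at_switch_in != t_in)
                return slot->sum_exec_runtime;

        return runtime + mul_u64_u32_shr(tsc - t_in, slot->cyc2ns_mul,
                                         CYC2NS_SCALE_FACTOR);
}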

Anyway, there's no need to achieve perfection here -- we can always
reimplement this if whatever implementation happens first turns out to
be problematic.

--Andy

> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-11 06:36:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> >> >> > use the following method to compute the thread cpu time:
> >> >>
> >> >> I like the idea, and I like making this type of profiling fast. I
> >> >> don't love the implementation because it's an information leak (maybe
> >> >> we don't care) and it's ugly.
> >> >>
> >> >> The info leak could be fixed completely by having a per-process array
> >> >> instead of a global array. That's currently tricky without wasting
> >> >> memory, but it could be created on demand if we wanted to do that,
> >> >> once my vvar .fault patches go in (assuming they do -- I need to ping
> >> >> the linux-mm people).
> >> >
> >> > those info leak really doesn't matter.
> >>
> >> Why not?
> >
> > Ofcourse I can't make sure completely, but how could this
> > info be used as attack?
>
> It may leak interesting timing info, even from cpus that are
> outside your affinity mask / cpuset. I don't know how much
> anyone actually cares.

Fine-grained timing information has been successfully used to
recover secret keys (and sometimes even coarse timing
information), so it can be a security issue in certain setups.

Thanks,

Ingo

2014-12-15 18:37:13

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO



On Thu, Dec 11, 2014 at 1:36 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
>> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski
>> wrote:
>> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]>
>> wrote:
>> >> >> > This primarily speeds up
>> clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> >> >> > use the following method to compute the thread cpu time:
>> >> >>
>> >> >> I like the idea, and I like making this type of profiling
>> fast. I
>> >> >> don't love the implementation because it's an information
>> leak (maybe
>> >> >> we don't care) and it's ugly.
>> >> >>
>> >> >> The info leak could be fixed completely by having a
>> per-process array
>> >> >> instead of a global array. That's currently tricky without
>> wasting
>> >> >> memory, but it could be created on demand if we wanted to do
>> that,
>> >> >> once my vvar .fault patches go in (assuming they do -- I need
>> to ping
>> >> >> the linux-mm people).
>> >> >
>> >> > those info leak really doesn't matter.
>> >>
>> >> Why not?
>> >
>> > Ofcourse I can't make sure completely, but how could this
>> > info be used as attack?
>>
>> It may leak interesting timing info, even from cpus that are
>> outside your affinity mask / cpuset. I don't know how much
>> anyone actually cares.
>
> Finegraned timing information has been successfully used to
> recover secret keys (and sometimes even coarse timing
> information), so it can be a security issue in certain setups.

Trying to nail this down a little more clearly. Are you worried about
the context switch count being exported or the clock_gettime data?

-chris


2014-12-15 18:56:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Mon, Dec 15, 2014 at 10:36 AM, Chris Mason <[email protected]> wrote:
>
>
> On Thu, Dec 11, 2014 at 1:36 AM, Ingo Molnar <[email protected]> wrote:
>>
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>>> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
>>> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>>> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>>> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>>> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>>> >> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID,
>>> ..). We
>>> >> >> > use the following method to compute the thread cpu time:
>>> >> >>
>>> >> >> I like the idea, and I like making this type of profiling fast. I
>>> >> >> don't love the implementation because it's an information leak
>>> (maybe
>>> >> >> we don't care) and it's ugly.
>>> >> >>
>>> >> >> The info leak could be fixed completely by having a per-process
>>> array
>>> >> >> instead of a global array. That's currently tricky without
>>> wasting
>>> >> >> memory, but it could be created on demand if we wanted to do that,
>>> >> >> once my vvar .fault patches go in (assuming they do -- I need to
>>> ping
>>> >> >> the linux-mm people).
>>> >> >
>>> >> > those info leak really doesn't matter.
>>> >>
>>> >> Why not?
>>> >
>>> > Ofcourse I can't make sure completely, but how could this
>>> > info be used as attack?
>>>
>>> It may leak interesting timing info, even from cpus that are
>>> outside your affinity mask / cpuset. I don't know how much
>>> anyone actually cares.
>>
>>
>> Finegraned timing information has been successfully used to
>> recover secret keys (and sometimes even coarse timing
>> information), so it can be a security issue in certain setups.
>
>
> Trying to nail this down a little more clearly. Are you worried about the
> context switch count being exported or the clock_gettime data?
>

The context switch count is unnecessary. Here's an IMO better
algorithm that relies on the context switch code storing the TSC at
last context switch for each CPU in a user-accessible location:

clock_gettime does

cpu, tsc = rdtscp; /* NB: HW orders rdtscp as a load */
barrier();
read scale_factor[cpu], sum_exec_runtime[cpu], etc;
barrier();
if (last_context_switch_tsc[cpu] >= tsc)
repeat (or fall back to a syscall)
return tsc * whatever + whatever_else;

This should be faster, as it avoids two uses of LSL, which take 7-10
cycles each, IIRC.
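
A concrete (purely illustrative) rendering of that pseudocode in the
style of the patch; last_switch_tsc is a hypothetical per-cpu field, and
returning 0 so the caller can fall back to the syscall after a few
failures is likewise an assumption:

notrace static u64 thread_cpu_time_sketch(void)
{
        const struct vdso_data *vp = &VVAR(vdso_data);
        u64 tsc, last_switch, scale, offset, adj;
        unsigned int p, tries = 0;
        int cpu;

        do {
                if (tries++ > 2)
                        return 0;       /* use the fallback syscall */

                rdtscpll(tsc, p);       /* TSC and cpu number in one go */
                cpu = p & VGETCPU_CPU_MASK;

                barrier();
                scale = vp->vpercpu[cpu].cyc2ns;
                offset = vp->vpercpu[cpu].cyc2ns_offset;
                adj = vp->vpercpu[cpu].adj_sched_time;
                last_switch = vp->vpercpu[cpu].last_switch_tsc;
                barrier();

                /* any switch on this cpu after our rdtscp invalidates the
                 * sample, so go around again */
        } while (unlikely(last_switch >= tsc));

        return mul_u64_u32_shr(tsc, scale, CYC2NS_SCALE_FACTOR) +
               offset + adj;
}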

With that improvement, the context switch count becomes unavailable.
That leaves sum_exec_runtime (which ought to be totally uninteresting
for attackers) and the time of the last context switch.

The time of the last context switch *on the current cpu* is mostly
available to attackers anyway -- just call clock_gettime in a loop and
watch for jumps. Some of those are interrupts, but the long ones are
likely to be context switches.
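
As an aside, the userspace probe being described here is trivial to
write; a minimal, standalone example (the 50us threshold is arbitrary)
that flags long gaps between consecutive CLOCK_MONOTONIC reads might
look like:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        uint64_t prev = now_ns();
        long i;

        for (i = 0; i < 100000000; i++) {
                uint64_t cur = now_ns();

                /* long gaps are likely context switches; short ones are
                 * usually interrupts or cache misses */
                if (cur - prev > 50000)
                        printf("gap of %llu ns ending at %llu\n",
                               (unsigned long long)(cur - prev),
                               (unsigned long long)cur);
                prev = cur;
        }
        return 0;
}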

The time of the last context switch on other CPUs is potentially
important. We could fix that by having a per-process array of per-cpu
values instead of a global array of per-cpu values. We could
alternatively have a per-process array of per-thread values.

If we did that, we'd probably want to create the array on demand,
which will be a mess without my special_mapping fault rework that no
one has commented on. I'll ping people after the merge window.

We could also shove all the timing parameters into magic segment
limits in the GDT. The problem with that is that we only get 20 bits
per descriptor, which isn't enough to make it palatable.

For 32-bit programs, we could do evil things with segmentation to make
this all work cleanly and without leaks, but for 64-bit programs this
is a non-starter.

I *really really* want per-cpu memory mappings for this and other
things. Pretty please, Intel or AMD? (We can have them today by
using per-cpu pgds, giving a speedup for system calls (no more swapgs)
and faults but a possibly unacceptable slowdown for switch_mm. The
slowdown would only be there for programs that either use enormous
amounts of virtual memory or use their address space very sparsely,
though. It could be done for free on an EPT guest, I think, as long
as the hypervisor were willing to play along and give us a per-vcpu
guest physical "page".)

--Andy