2015-02-03 09:21:05

by Pawel Moll

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On Tue, 2015-02-03 at 08:30 +0000, ajh mls wrote:
> There is still
>
> http://marc.info/?l=linux-kernel&m=142141223902303

Uh. I have no idea why, but I haven't got this mail at all :-(

Thanks for pointing it out!

Pawel


2015-02-11 16:13:09

by Peter Zijlstra

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps


How about something like the below? I _think_ it should mostly work for
x86, where the tsc is a 64bit wide cycle counter.

I suppose we should extend the perf userpage time data with
time_last_cycle and time_mask if/when we want to make this work on
something with a short counter.

Of course, at that time we also need to somehow deal with that counter
wrapping; it's hardly practical to go iterate all possible userpg
instances from a timer handler.
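
For reference, the intended user-space side would then be something along
these lines -- only a sketch, assuming cap_user_time_zero is set and a
64bit wide counter; tsc_to_perf_time() is a made-up helper name:

	/*
	 * Sketch: convert a raw TSC value to a perf timestamp using the
	 * mmap'ed userpage fields. The quot/rem split avoids overflowing
	 * the 64x32 multiply, same as mul_u64_u32_shr().
	 */
	static u64 tsc_to_perf_time(u64 cyc, struct perf_event_mmap_page *pc)
	{
		u64 quot = cyc >> pc->time_shift;
		u64 rem  = cyc & (((u64)1 << pc->time_shift) - 1);

		return pc->time_zero + quot * pc->time_mult +
		       ((rem * pc->time_mult) >> pc->time_shift);
	}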


---
Documentation/kernel-parameters.txt | 9 +++++++
arch/x86/kernel/cpu/perf_event.c | 44 ++++++++++++++++++++++++---------
include/linux/perf_event.h | 6 +++++
kernel/events/core.c | 49 ++++++++++++++++++++++++++++++++++---
kernel/time/timekeeping.c | 30 +++++++++++++++++++++++
5 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 176d4fe4f076..52255676b6e2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ the beginning of each description states the restrictions within which a
NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled.
OSS OSS sound support is enabled.
+ PERF Performance events and counters support is enabled.
PV_OPS A paravirtualized kernel is enabled.
PARIDE The ParIDE (parallel port IDE) subsystem is enabled.
PARISC The PA-RISC architecture is enabled.
@@ -2796,6 +2797,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
allocator. This parameter is primarily for debugging
and performance comparison.

+ perf_use_local_clock
+ [PERF]
+ Use local_clock() as the source for perf timestamp
+ generation. This used to be the default behaviour and
+ this parameter can be used to maintain backward
+ compatibility or on older hardware with an expensive
+ monotonic clock source.
+
pf. [PARIDE]
See Documentation/blockdev/paride.txt.

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b71a7f86d68a..436a66632f76 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1952,6 +1952,35 @@ static struct pmu pmu = {
.flush_branch_stack = x86_pmu_flush_branch_stack,
};

+static void local_clock_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+ struct cyc2ns_data *data;
+
+ data = cyc2ns_read_begin();
+
+ userpg->cap_user_time = 1;
+ userpg->time_mult = data->cyc2ns_mul;
+ userpg->time_shift = data->cyc2ns_shift;
+ userpg->time_offset = data->cyc2ns_offset - now;
+
+ userpg->cap_user_time_zero = 1;
+ userpg->time_zero = data->cyc2ns_offset;
+
+ cyc2ns_read_end(data);
+}
+
+extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+
+static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+ userpg->cap_user_time = 1;
+ userpg->cap_user_time_zero = 1;
+
+ __ktime_get_mono_fast(&userpg->time_zero,
+ &userpg->time_mult,
+ &userpg->time_shift);
+
+ userpg->time_offset = userpg->time_zero - now;
+}
+
void arch_perf_update_userpage(struct perf_event *event,
struct perf_event_mmap_page *userpg, u64 now)
{
@@ -1966,17 +1995,10 @@ void arch_perf_update_userpage(struct perf_event *event,
if (!sched_clock_stable())
return;

- data = cyc2ns_read_begin();
-
- userpg->cap_user_time = 1;
- userpg->time_mult = data->cyc2ns_mul;
- userpg->time_shift = data->cyc2ns_shift;
- userpg->time_offset = data->cyc2ns_offset - now;
-
- userpg->cap_user_time_zero = 1;
- userpg->time_zero = data->cyc2ns_offset;
-
- cyc2ns_read_end(data);
+ if (static_key_false(&perf_use_local_clock_key))
+ local_clock_user_time(userpg, now);
+ else
+ ktime_fast_mono_user_time(userpg, now);
}

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 33262004c310..1d61f968113a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -561,6 +561,12 @@ extern void perf_pmu_enable(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
+
+extern struct static_key perf_use_local_clock_key;
+extern void
+arch_perf_update_userpage(struct perf_event *event,
+ struct perf_event_mmap_page *userpg, u64 now);
+
extern void perf_event_update_userpage(struct perf_event *event);
extern int perf_event_release_kernel(struct perf_event *event);
extern struct perf_event *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 13209a90b751..7bad385103ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
#include <linux/module.h>
#include <linux/mman.h>
#include <linux/compat.h>
+#include <linux/sysctl.h>
+#include <linux/jump_label.h>

#include "internal.h"

@@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
return "pmu";
}

+struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+static bool perf_use_local_clock_param __initdata;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+ perf_use_local_clock_param = true;
+ return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+ {
+ .procname = "perf_sample_time_clk_id",
+ .data = &sysctl_perf_sample_time_clk_id,
+ .maxlen = sizeof(int),
+ .mode = 0444,
+ .proc_handler = proc_dointvec,
+ },
+ {}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+ {
+ .procname = "kernel",
+ .mode = 0555,
+ .child = perf_sample_time_kern_table,
+ },
+ {}
+};
+
static inline u64 perf_clock(void)
{
- return local_clock();
+ if (static_key_false(&perf_use_local_clock_key))
+ return local_clock();
+ else
+ return ktime_get_mono_fast_ns();
}

static inline struct perf_cpu_context *
@@ -4101,8 +4137,8 @@ static void perf_event_init_userpage(struct perf_event *event)
rcu_read_unlock();
}

-void __weak arch_perf_update_userpage(
- struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
+void __weak arch_perf_update_userpage(struct perf_event *event,
+ struct perf_event_mmap_page *userpg, u64 now)
{
}

@@ -4487,7 +4523,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;

- rb = rb_alloc(nr_pages,
+ rb = rb_alloc(nr_pages,
event->attr.watermark ? event->attr.wakeup_watermark : 0,
event->cpu, flags);

@@ -8516,6 +8552,11 @@ void __init perf_event_init(void)
*/
BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
!= 1024);
+
+ if (perf_use_local_clock_param)
+ static_key_slow_inc(&perf_use_local_clock_key);
+ else
+ register_sysctl_table(perf_sample_time_root_table);
}

static int __init perf_event_sysfs_init(void)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b124af259800..37bed5931a91 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,6 +334,36 @@ u64 notrace ktime_get_mono_fast_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

+void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+{
+ struct tk_read_base *tkr;
+ unsigned int seq;
+ cycle_t cycle_now, delta;
+ u64 nsec, now;
+
+ do {
+ seq = raw_read_seqcount(&tk_fast_mono.seq);
+ tkr = tk_fast_mono.base + (seq & 0x01);
+
+ cycle_now = tkr->read(tkr->clock);
+ delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
+
+ nsec = delta * tkr->mult + tkr->xtime_nsec;
+ nsec >>= tkr->shift;
+ nsec += arch_gettimeoffset();
+
+ now = ktime_to_ns(tkr->base_mono) + nsec;
+
+ *mult = tkr->mult;
+ *shift = tkr->shift;
+
+ nsec = mul_u64_u32_shr(cycle_now, tkr->mult, tkr->shift);
+
+ *offset = now - nsec;
+
+ } while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+}
+
#ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD

static inline void update_vsyscall(struct timekeeper *tk)

2015-02-12 10:06:53

by Adrian Hunter

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On 11/02/15 18:12, Peter Zijlstra wrote:
>
> How about something like the below? I _think_ it should mostly work for
> x86, where the tsc is a 64bit wide cycle counter.

It would have to be based on CLOCK_MONOTONIC_RAW, not CLOCK_MONOTONIC, and
you would have to check that the clocksource is TSC.

Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
adjustment would skew performance timings?

> [rest of quoted text and patch snipped]

2015-02-12 10:28:27

by Peter Zijlstra

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
> On 11/02/15 18:12, Peter Zijlstra wrote:
> >
> > How about something like the below? I _think_ it should mostly work for
> > x86, where the tsc is a 64bit wide cycle counter.
>
> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC

Why?

> and you would have to check the clocksource is TSC.

It implicitly does that; it has that sched_clock_stable() thing, but
yeah I suppose someone could change the clocksource even though the tsc
is stable.

Not using TSC when it's available is quite crazy though.. but sure.

> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
> adjustment would skew performance timings?

Because you can do inter-machine stuff with MONOTONIC and that's
entirely impossible with MONO_RAW.

2015-02-12 15:38:22

by Peter Zijlstra

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
> > and you would have to check the clocksource is TSC.
>
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
>
> Not using TSC when it's available is quite crazy though.. but sure.

Something like this on top then.. it might have a few header issues, the
whole asm/tsc.h vs clocksource.h thing looks like pain.

I haven't tried to compile it, maybe we can move cycle_t into types and
fwd declare struct clocksource or whatnot.

Of course, all this is quite horrible on the timekeeping side; it might
be tglx and/or jstultz are having spasms just reading it :-)

---
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
cyc2ns_read_end(data);
}

-extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
+ u64 *offset, u32 *mult, u16 *shift);

static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
{
+ if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
+ &userpg->time_mult,
+ &userpg->time_shift))
+ return;
+
userpg->cap_user_time = 1;
userpg->cap_user_time_zero = 1;

- __ktime_get_mono_fast(&userpg->time_zero,
- &userpg->time_mult,
- &userpg->time_shift);
-
userpg->time_offset = userpg->time_zero - now;
}

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,7 +334,8 @@ u64 notrace ktime_get_mono_fast_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

-void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *),
+ u64 *offset, u32 *mult, u16 *shift)
{
struct tk_read_base *tkr;
unsigned int seq;
@@ -345,6 +346,9 @@ void notrace __ktime_get_mono_fast(u64 *
seq = raw_read_seqcount(&tk_fast_mono.seq);
tkr = tk_fast_mono.base + (seq & 0x01);

+ if (tkr->read != read)
+ return false;
+
cycle_now = tkr->read(tkr->clock);
delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);

@@ -362,6 +366,8 @@ void notrace __ktime_get_mono_fast(u64 *
*offset = now - nsec;

} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+
+ return true;
}

#ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..68e4039a58ea 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -32,6 +32,8 @@ static inline cycles_t get_cycles(void)
return ret;
}

+extern cycle_t read_tsc(struct clocksource *cs);
+
static __always_inline cycles_t vget_cycles(void)
{
/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 505449700e0c..c580998f0160 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -965,7 +965,7 @@ static struct clocksource clocksource_tsc;
* checking the result of read_tsc() - cycle_last for being negative.
* That works because CLOCKSOURCE_MASK(64) does not mask out any bit.
*/
-static cycle_t read_tsc(struct clocksource *cs)
+cycle_t read_tsc(struct clocksource *cs)
{
return (cycle_t)get_cycles();
}

2015-02-13 00:25:17

by John Stultz

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On Thu, Feb 12, 2015 at 11:38 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
>> > and you would have to check the clocksource is TSC.
>>
>> It implicitly does that; it has that sched_clock_stable() thing, but
>> yeah I suppose someone could change the clocksource even though the tsc
>> is stable.
>>
>> Not using TSC when its available is quite crazy though.. but sure.
>
> Something like this on top then.. it might have a few header issues, the
> whole asm/tsc.h vs clocksource.h thing looks like pain.
>
> I haven't tried to compile it, maybe we can move cycle_t into types and
> fwd declare struct clocksource or whatnot.
>
> Of course, all this is quite horrible on the timekeeping side; it might
> be tglx and/or jstultz are having spasms just reading it :-)


Oof.. Yea, this exposes all sorts of timekeeping internals out to the
rest of the kernel that I'd rather not have out there.

> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
> cyc2ns_read_end(data);
> }
>
> -extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
> +extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
> + u64 *offset, u32 *mult, u16 *shift);
>
> static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
> {
> + if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
> + &userpg->time_mult,
> + &userpg->time_shift))

Soo.. instead of hard-coding read_tsc here, can we instead use a
clocksource flag that we can check, to ensure the current clocksource is
valid for this sort of use?
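
Something like the below, perhaps -- an untested sketch on top of your
patch, where CLOCK_SOURCE_VALID_FOR_PERF_USERPAGE is a made-up name and
the bit value would need to avoid the existing CLOCK_SOURCE_* flags;
clocksource_tsc would then also set that bit in its .flags:

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@
+/* clocksource can be used for userpage time conversion (e.g. perf) */
+#define CLOCK_SOURCE_VALID_FOR_PERF_USERPAGE	0x200
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@
 		seq = raw_read_seqcount(&tk_fast_mono.seq);
 		tkr = tk_fast_mono.base + (seq & 0x01);

+		if (!(tkr->clock->flags & CLOCK_SOURCE_VALID_FOR_PERF_USERPAGE))
+			return false;
+
 		cycle_now = tkr->read(tkr->clock);

That would replace the read-function-pointer comparison and the exported
read_tsc, and __ktime_get_mono_fast() could keep its original signature.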

thanks
-john

2015-02-13 07:09:33

by Adrian Hunter

Subject: Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps

On 12/02/15 12:28, Peter Zijlstra wrote:
> On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
>> On 11/02/15 18:12, Peter Zijlstra wrote:
>>>
>>> How about something like the below? I _think_ it should mostly work for
>>> x86, where the tsc is a 64bit wide cycle counter.
>>
>> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC
>
> Why?

In the CLOCK_MONOTONIC case, the components of the calculation (mult,
shift etc.) are subject to change (NTP adjustments, for example), so the
calculation becomes increasingly inaccurate the greater the time between
reading those values and reading the TSC or capturing perf events.

Accuracy is important for matching sideband events against Intel PT,
e.g. did an mmap event happen before or after a given TSC timestamp?

Adding another sample value (Pawel's patch 3) is more accurate and simpler
to understand. It just needs to be extended to allow TSC.

>
>> and you would have to check the clocksource is TSC.
>
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
>
> Not using TSC when it's available is quite crazy though.. but sure.
>
>> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
>> adjustment would skew performance timings?
>
> Because you can do inter-machine stuff with MONOTONIC and that's
> entirely impossible with MONO_RAW.

Ok, the man page does not make it sound as bad as that.