From: Kan Liang <[email protected]>
Compared with V1, this patch series provides a different solution to
address the conversion issue according to the feedback from Thomas and
John.
- Support the monotonic raw clock rather than the monotonic clock.
The monotonic raw clock is not affected by NTP/PTP correction.
The conversion information can be used to calculate the time for
large PEBS and do post-processing in perf tool.
- Support post-processing. Move the conversion to the user space
perf tool.
Link to V1:
https://lore.kernel.org/lkml/[email protected]/
Motivation:
A Processor Event Based Sampling (PEBS) record includes a field that
provides the time stamp counter value when the counter overflowed
and the PEBS record was generated. The accurate time stamp can be used
to reconcile user samples. However, the current PEBS code can only
convert the time stamp to sched_clock, which is not available from user
space. A solution to convert a given TSC to the user-visible monotonic raw
clock is required.
Solution:
Currently, the conversion of any clock id is done in the kernel. The
patch series extends the existing ABI to dump both the raw HW time
and the conversion information into the user space. The conversion will
be done in the perf tool.
The extended ABI is shared among different ARCHs. But the patch series
only implements the post-processing conversion on X86 platforms. For the
other ARCHs, nothing is changed. The post-processing conversion
can be added later separately.
Only support the post-processing conversion for monotonic raw clock,
since it is not affected by NTP/PTP correction.
With the patch series, on X86, the post-processing conversion is the
default setting of perf tool for monotonic raw clock.
The patch series is on top of Peter's perf/core branch.
Kan Liang (9):
timekeeping: Expose the conversion information of monotonic raw
perf: Extend ABI to support post-processing monotonic raw conversion
perf/x86: Factor out x86_pmu_sample_preload()
perf/x86: Enable post-processing monotonic raw conversion
perf/x86/intel: Enable large PEBS for monotonic raw
tools headers UAPI: Sync linux/perf_event.h with the kernel sources
perf session: Support the monotonic raw clock conversion information
perf evsel, tsc: Support the monotonic raw clock conversion
perf evsel: Enable post-processing monotonic raw conversion by default
arch/x86/events/amd/core.c | 3 +-
arch/x86/events/core.c | 15 +++++++---
arch/x86/events/intel/core.c | 6 ++--
arch/x86/events/intel/ds.c | 15 +++++++---
arch/x86/events/perf_event.h | 20 +++++++++++++
include/linux/timekeeping.h | 18 ++++++++++++
include/uapi/linux/perf_event.h | 21 ++++++++++++--
kernel/events/core.c | 7 +++++
kernel/time/timekeeping.c | 24 ++++++++++++++++
tools/include/uapi/linux/perf_event.h | 21 ++++++++++++--
tools/lib/perf/include/perf/event.h | 8 +++++-
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 28 +++++++++++++++++--
tools/perf/util/evsel.h | 8 ++++++
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/session.c | 9 ++++++
tools/perf/util/tsc.c | 34 ++++++++++++++++++++++-
tools/perf/util/tsc.h | 8 ++++++
18 files changed, 223 insertions(+), 24 deletions(-)
--
2.35.1
From: Kan Liang <[email protected]>
The conversion information of monotonic raw is not affected by NTP/PTP
correction. The perf tool can utilize the information to correctly
calculate the monotonic raw via a TSC in each PEBS record in the
post-processing stage.
The current conversion information is hidden in the internal
struct tk_read_base. Add a new external struct ktime_conv to store and
share the conversion information with other subsystems.
Add a new interface ktime_get_fast_mono_raw_conv() to expose the
conversion information of monotonic raw. The function may be
invoked in an NMI. Use the NMI-safe tk_fast_raw to retrieve the
conversion information.
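For readers unfamiliar with the latch scheme behind tk_fast_raw, the
following is a minimal user-space sketch of the idea, not the kernel
implementation. The names conv_snapshot, latch, latch_write() and
latch_read() are hypothetical, and C11 seq_cst atomics stand in for the
kernel's seqcount latch primitives and barriers. The writer keeps two
copies and flips a sequence counter, so a reader that interrupts the
writer (as an NMI can) always finds one stable copy.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields copied out of tk_read_base. */
struct conv_snapshot {
	uint64_t cycle_last;
	uint32_t mult;
	uint32_t shift;
	uint64_t xtime_nsec;
	uint64_t base;
};

/* Hypothetical stand-in for tk_fast: a counter plus two data copies. */
struct latch {
	atomic_uint seq;
	struct conv_snapshot copy[2];
};

/*
 * Writer side. While seq is odd, readers are steered to copy[1] and
 * copy[0] may be modified; while seq is even, the roles are swapped.
 * Illustration only: this models a reader that interrupts the writer
 * on the same CPU, not a full multi-threaded implementation.
 */
void latch_write(struct latch *l, const struct conv_snapshot *val)
{
	atomic_fetch_add(&l->seq, 1);	/* odd: readers use copy[1] */
	l->copy[0] = *val;
	atomic_fetch_add(&l->seq, 1);	/* even: readers use copy[0] */
	l->copy[1] = *val;
}

/* Reader side: pick the copy selected by seq, retry if seq moved. */
void latch_read(struct latch *l, struct conv_snapshot *out)
{
	unsigned int seq;

	do {
		seq = atomic_load(&l->seq);
		*out = l->copy[seq & 1];
	} while (atomic_load(&l->seq) != seq);
}
```

The reader never blocks and never takes a lock, which is the property
that makes this pattern usable from NMI context.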
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/timekeeping.h | 18 ++++++++++++++++++
kernel/time/timekeeping.c | 24 ++++++++++++++++++++++++
2 files changed, 42 insertions(+)
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..94ba02e7eb13 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -253,6 +253,21 @@ struct system_time_snapshot {
u8 cs_was_changed_seq;
};
+/**
+ * struct ktime_conv - Timestamp conversion information
+ * @mult: Multiplier for scaled math conversion
+ * @shift: Shift value for scaled math conversion
+ * @xtime_nsec: Shifted (fractional) nano seconds offset for readout
+ * @base: (nanoseconds) base time for readout
+ */
+struct ktime_conv {
+ u64 cycle_last;
+ u32 mult;
+ u32 shift;
+ u64 xtime_nsec;
+ u64 base;
+};
+
/**
* struct system_device_crosststamp - system/device cross-timestamp
* (synchronized capture)
@@ -297,6 +312,9 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
/* NMI safe mono/boot/realtime timestamps */
extern void ktime_get_fast_timestamps(struct ktime_timestamps *snap);
+/* NMI safe mono raw conv information */
+extern void ktime_get_fast_mono_raw_conv(struct ktime_conv *conv);
+
/*
* Persistent clock related interfaces
*/
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5579ead449f2..a202b7a0a249 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -505,6 +505,30 @@ u64 notrace ktime_get_raw_fast_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_raw_fast_ns);
+/**
+ * ktime_get_fast_mono_raw_conv - NMI safe access to get the conversion
+ * information of clock monotonic raw
+ *
+ * The conversion information is not affected by NTP/PTP correction.
+ */
+void ktime_get_fast_mono_raw_conv(struct ktime_conv *conv)
+{
+ struct tk_fast *tkf = &tk_fast_raw;
+ struct tk_read_base *tkr;
+ unsigned int seq;
+
+ do {
+ seq = raw_read_seqcount_latch(&tkf->seq);
+ tkr = tkf->base + (seq & 0x01);
+ conv->cycle_last = tkr->cycle_last;
+ conv->mult = tkr->mult;
+ conv->shift = tkr->shift;
+ conv->xtime_nsec = tkr->xtime_nsec;
+ conv->base = tkr->base;
+ } while (read_seqcount_latch_retry(&tkf->seq, seq));
+}
+EXPORT_SYMBOL_GPL(ktime_get_fast_mono_raw_conv);
+
/**
* ktime_get_boot_fast_ns - NMI safe and fast access to boot clock.
*
--
2.35.1
From: Kan Liang <[email protected]>
The monotonic raw clock is not affected by NTP/PTP correction. The
calculation of the monotonic raw clock can be done in the
post-processing, which can reduce the kernel overhead.
Add hw_time in the struct perf_event_attr to tell the kernel to dump the
raw HW time to user space. The perf tool will convert the HW time
in post-processing.
Currently, only the monotonic raw conversion is supported.
Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
HW time can only be provided in a sample by HW. For other types of
records, the user requested clock should be returned as usual. Nothing
is changed.
Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
conversion information. The cap_user_time_mono_raw also indicates
whether the monotonic raw conversion information is available.
If yes, the clock monotonic raw can be calculated as
mono_raw = base + (((cyc - last) * mult + nsec) >> shift)
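As a worked illustration of the formula above, here is a small sketch in
C. The struct mono_raw_conv and the function cyc_to_mono_raw() are
hypothetical names that only mirror the proposed time_mono_* fields.
Note that in C '+' binds tighter than '>>', so the shift has to be
applied to the scaled delta before the base is added, which is what
tsc_to_monotonic_raw() later in this series does.

```c
#include <stdint.h>

/* Hypothetical container mirroring the proposed time_mono_* fields. */
struct mono_raw_conv {
	uint64_t last;	/* cycle count at the time of the snapshot */
	uint32_t mult;	/* multiplier for scaled math */
	uint32_t shift;	/* shift for scaled math */
	uint64_t nsec;	/* shifted (fractional) nanoseconds */
	uint64_t base;	/* nanoseconds at the time of the snapshot */
};

/* Convert a raw cycle count to monotonic raw nanoseconds. */
uint64_t cyc_to_mono_raw(const struct mono_raw_conv *c, uint64_t cyc)
{
	/* Scale the cycle delta, add the fractional part, then shift. */
	uint64_t delta = (cyc - c->last) * c->mult + c->nsec;

	return c->base + (delta >> c->shift);
}
```

With last = 1000, mult = 2^22, shift = 24 (0.25 ns per cycle), nsec = 0
and base = 5000, a sample at cyc = 1100 lands 25 ns after the base, at
5025.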
Signed-off-by: Kan Liang <[email protected]>
---
include/uapi/linux/perf_event.h | 21 ++++++++++++++++++---
kernel/events/core.c | 7 +++++++
2 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ccb7f5dad59b..9d56fe027f6c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -455,7 +455,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ hw_time : 1, /* generate raw HW time for samples */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -615,7 +616,8 @@ struct perf_event_mmap_page {
cap_user_time : 1, /* The time_{shift,mult,offset} fields are used */
cap_user_time_zero : 1, /* The time_zero field is used */
cap_user_time_short : 1, /* the time_{cycle,mask} fields are used */
- cap_____res : 58;
+ cap_user_time_mono_raw : 1, /* The time_mono_* fields are used */
+ cap_____res : 57;
};
};
@@ -692,11 +694,24 @@ struct perf_event_mmap_page {
__u64 time_cycles;
__u64 time_mask;
+ /*
+ * If cap_user_time_mono_raw, the monotonic raw clock can be calculated
+ * from the hardware clock (e.g. TSC) 'cyc'.
+ *
+ * mono_raw = base + (((cyc - last) * mult + nsec) >> shift)
+ *
+ */
+ __u64 time_mono_last;
+ __u32 time_mono_mult;
+ __u32 time_mono_shift;
+ __u64 time_mono_nsec;
+ __u64 time_mono_base;
+
/*
* Hole for extension of the self monitor capabilities
*/
- __u8 __reserved[116*8]; /* align to 1k. */
+ __u8 __reserved[112*8]; /* align to 1k. */
/*
* Control data for the mmap() data buffer.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 380476a934e8..f062cce2dafc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12135,6 +12135,13 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (attr->sigtrap && !attr->remove_on_exec)
return -EINVAL;
+ if (attr->use_clockid) {
+ /*
+ * Only support post-processing for the monotonic raw clock
+ */
+ if (attr->hw_time && (attr->clockid != CLOCK_MONOTONIC_RAW))
+ return -EINVAL;
+ }
out:
return ret;
--
2.35.1
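To make the accepted attribute combinations explicit, here is a small
sketch of the check added to perf_copy_attr(), next to the predicate
that perf_event_hw_time() introduces later in the series. The struct
mock_attr and both function names are hypothetical stand-ins; only the
three fields relevant to this patch are modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

#ifndef CLOCK_MONOTONIC_RAW
#define CLOCK_MONOTONIC_RAW 4	/* Linux value, used if libc hides it */
#endif

/* Hypothetical subset of struct perf_event_attr. */
struct mock_attr {
	unsigned int use_clockid : 1;
	unsigned int hw_time : 1;
	int clockid;
};

/* Mirrors the new check: hw_time with any other clockid is rejected. */
int check_hw_time(const struct mock_attr *attr)
{
	if (attr->use_clockid && attr->hw_time &&
	    attr->clockid != CLOCK_MONOTONIC_RAW)
		return -EINVAL;
	return 0;
}

/* Mirrors perf_event_hw_time(): all three conditions must hold. */
bool hw_time_active(const struct mock_attr *attr)
{
	return attr->hw_time && attr->use_clockid &&
	       attr->clockid == CLOCK_MONOTONIC_RAW;
}
```

One consequence visible in the sketch: hw_time without use_clockid
passes the check, but the raw HW time is not dumped in that case.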
From: Kan Liang <[email protected]>
Some common sample data are preloaded on X86 platforms before the sample
output. For example, the branch stack information.
Factor out a generic x86_pmu_sample_preload().
It will also be used later to preload the common HW time, TSC.
Signed-off-by: Kan Liang <[email protected]>
Cc: Ravi Bangoria <[email protected]>
---
arch/x86/events/amd/core.c | 3 +--
arch/x86/events/core.c | 5 +----
arch/x86/events/intel/core.c | 3 +--
arch/x86/events/intel/ds.c | 3 +--
arch/x86/events/perf_event.h | 8 ++++++++
5 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 8c45b198b62f..af7b3977efa8 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -928,8 +928,7 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
if (!x86_perf_event_set_period(event))
continue;
- if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
+ x86_pmu_sample_preload(&data, event, cpuc);
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 85a63a41c471..b19ac54ebeea 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1703,10 +1703,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
perf_sample_data_init(&data, 0, event->hw.last_period);
- if (has_branch_stack(event)) {
- data.br_stack = &cpuc->lbr_stack;
- data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
- }
+ x86_pmu_sample_preload(&data, event, cpuc);
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 14f0a746257d..d9be5701e60a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3036,8 +3036,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
perf_sample_data_init(&data, 0, event->hw.last_period);
- if (has_branch_stack(event))
- perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
+ x86_pmu_sample_preload(&data, event, cpuc);
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7980e92dec64..2f59573ed463 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1742,8 +1742,7 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
if (x86_pmu.intel_cap.pebs_format >= 3)
setup_pebs_time(event, data, pebs->tsc);
- if (has_branch_stack(event))
- perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+ x86_pmu_sample_preload(data, event, cpuc);
}
static void adaptive_pebs_save_regs(struct pt_regs *regs,
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index d6de4487348c..ae6ec58fde14 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1185,6 +1185,14 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
u64 intel_ctrl);
+static inline void x86_pmu_sample_preload(struct perf_sample_data *data,
+ struct perf_event *event,
+ struct cpu_hw_events *cpuc)
+{
+ if (has_branch_stack(event))
+ perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+}
+
extern struct event_constraint emptyconstraint;
extern struct event_constraint unconstrained;
--
2.35.1
From: Kan Liang <[email protected]>
The raw HW time is from TSC on X86. Preload the HW time for each sample
when the new perf tool sets hw_time together with the monotonic raw
clock. Also, dump the conversion information into mmap_page.
For a legacy perf tool that does not know about hw_time, nothing is
changed.
Move the x86_pmu_sample_preload() before setup_pebs_time() to utilize
the TSC from a PEBS record.
Signed-off-by: Kan Liang <[email protected]>
Cc: Ravi Bangoria <[email protected]>
---
arch/x86/events/core.c | 10 ++++++++++
arch/x86/events/intel/ds.c | 14 +++++++++++---
arch/x86/events/perf_event.h | 12 ++++++++++++
3 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b19ac54ebeea..7c1dfb8c763d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2740,6 +2740,16 @@ void arch_perf_update_userpage(struct perf_event *event,
if (!event->attr.use_clockid) {
userpg->cap_user_time_zero = 1;
userpg->time_zero = offset;
+ } else if (perf_event_hw_time(event)) {
+ struct ktime_conv mono;
+
+ userpg->cap_user_time_mono_raw = 1;
+ ktime_get_fast_mono_raw_conv(&mono);
+ userpg->time_mono_last = mono.cycle_last;
+ userpg->time_mono_mult = mono.mult;
+ userpg->time_mono_shift = mono.shift;
+ userpg->time_mono_nsec = mono.xtime_nsec;
+ userpg->time_mono_base = mono.base;
}
cyc2ns_read_end();
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2f59573ed463..10d4b63c891f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1574,6 +1574,12 @@ static void setup_pebs_time(struct perf_event *event,
struct perf_sample_data *data,
u64 tsc)
{
+ u64 time = tsc;
+
+ /* Perf tool does the conversion. No conversion here. */
+ if (perf_event_hw_time(event))
+ goto done;
+
/* Converting to a user-defined clock is not supported yet. */
if (event->attr.use_clockid != 0)
return;
@@ -1588,7 +1594,9 @@ static void setup_pebs_time(struct perf_event *event,
if (!using_native_sched_clock() || !sched_clock_stable())
return;
- data->time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+ time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+done:
+ data->time = time;
data->sample_flags |= PERF_SAMPLE_TIME;
}
@@ -1733,6 +1741,8 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
}
}
+ x86_pmu_sample_preload(data, event, cpuc);
+
/*
* v3 supplies an accurate time stamp, so we use that
* for the time stamp.
@@ -1741,8 +1751,6 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
*/
if (x86_pmu.intel_cap.pebs_format >= 3)
setup_pebs_time(event, data, pebs->tsc);
-
- x86_pmu_sample_preload(data, event, cpuc);
}
static void adaptive_pebs_save_regs(struct pt_regs *regs,
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ae6ec58fde14..0486ee6a7605 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1185,12 +1185,24 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
u64 intel_ctrl);
+static inline bool perf_event_hw_time(struct perf_event *event)
+{
+ return (event->attr.hw_time &&
+ event->attr.use_clockid &&
+ (event->attr.clockid == CLOCK_MONOTONIC_RAW));
+}
+
static inline void x86_pmu_sample_preload(struct perf_sample_data *data,
struct perf_event *event,
struct cpu_hw_events *cpuc)
{
if (has_branch_stack(event))
perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
+
+ if (perf_event_hw_time(event)) {
+ data->time = rdtsc();
+ data->sample_flags |= PERF_SAMPLE_TIME;
+ }
}
extern struct event_constraint emptyconstraint;
--
2.35.1
From: Kan Liang <[email protected]>
The monotonic raw clock is not affected by NTP/PTP correction. The
monotonic raw clock can be calculated from the TSC of each PEBS record
by the same conversion information.
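The resulting rule for keeping PERF_SAMPLE_TIME in the large PEBS flags
can be sketched as below. This is only an illustration of the condition
in the diff: large_pebs_flags() and its boolean parameters are
hypothetical, and SAMPLE_TIME is a local copy of the PERF_SAMPLE_TIME
bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_TIME (1u << 2)	/* same bit as PERF_SAMPLE_TIME */

/*
 * A user-selected clockid normally disqualifies large PEBS for time
 * stamps. The exception: PEBS format >= 3 supplies a TSC per record
 * and the event uses hw_time with the monotonic raw clock.
 */
uint64_t large_pebs_flags(uint64_t flags, bool use_clockid,
			  bool hw_time_mono_raw, int pebs_format)
{
	if (use_clockid && !(pebs_format >= 3 && hw_time_mono_raw))
		flags &= ~(uint64_t)SAMPLE_TIME;
	return flags;
}
```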
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d9be5701e60a..eac389e1f44c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3776,7 +3776,8 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
{
unsigned long flags = x86_pmu.large_pebs_flags;
- if (event->attr.use_clockid)
+ if (event->attr.use_clockid &&
+ !((x86_pmu.intel_cap.pebs_format >= 3) && perf_event_hw_time(event)))
flags &= ~PERF_SAMPLE_TIME;
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
--
2.35.1
From: Kan Liang <[email protected]>
The kernel ABI has been extended to support tool based monotonic raw
conversion.
This thus partially addresses this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h'
differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Signed-off-by: Kan Liang <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index ccb7f5dad59b..9d56fe027f6c 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -455,7 +455,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ hw_time : 1, /* generate raw HW time for samples */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -615,7 +616,8 @@ struct perf_event_mmap_page {
cap_user_time : 1, /* The time_{shift,mult,offset} fields are used */
cap_user_time_zero : 1, /* The time_zero field is used */
cap_user_time_short : 1, /* the time_{cycle,mask} fields are used */
- cap_____res : 58;
+ cap_user_time_mono_raw : 1, /* The time_mono_* fields are used */
+ cap_____res : 57;
};
};
@@ -692,11 +694,24 @@ struct perf_event_mmap_page {
__u64 time_cycles;
__u64 time_mask;
+ /*
+ * If cap_user_time_mono_raw, the monotonic raw clock can be calculated
+ * from the hardware clock (e.g. TSC) 'cyc'.
+ *
+ * mono_raw = base + (((cyc - last) * mult + nsec) >> shift)
+ *
+ */
+ __u64 time_mono_last;
+ __u32 time_mono_mult;
+ __u32 time_mono_shift;
+ __u64 time_mono_nsec;
+ __u64 time_mono_base;
+
/*
* Hole for extension of the self monitor capabilities
*/
- __u8 __reserved[116*8]; /* align to 1k. */
+ __u8 __reserved[112*8]; /* align to 1k. */
/*
* Control data for the mmap() data buffer.
--
2.35.1
From: Kan Liang <[email protected]>
The monotonic raw clock conversion information can be retrieved from
the perf_event_mmap_page::cap_user_time_mono_raw ABI. Store the
information in the struct perf_record_time_conv for later usage.
Dump the information in TIME_CONV event as well.
Signed-off-by: Kan Liang <[email protected]>
---
tools/lib/perf/include/perf/event.h | 8 +++++++-
tools/perf/util/session.c | 8 ++++++++
tools/perf/util/tsc.c | 22 +++++++++++++++++++++-
tools/perf/util/tsc.h | 6 ++++++
4 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index ad47d7b31046..20187a3d84c8 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -422,9 +422,15 @@ struct perf_record_time_conv {
__u64 time_zero;
__u64 time_cycles;
__u64 time_mask;
+ __u64 time_mono_last;
+ __u32 time_mono_mult;
+ __u32 time_mono_shift;
+ __u64 time_mono_nsec;
+ __u64 time_mono_base;
__u8 cap_user_time_zero;
__u8 cap_user_time_short;
- __u8 reserved[6]; /* For alignment */
+ __u8 cap_user_time_mono_raw;
+ __u8 reserved[5]; /* For alignment */
};
struct perf_record_header_feature {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7c021c6cedb9..189149a7012f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -995,6 +995,14 @@ static void perf_event__time_conv_swap(union perf_event *event,
event->time_conv.time_cycles = bswap_64(event->time_conv.time_cycles);
event->time_conv.time_mask = bswap_64(event->time_conv.time_mask);
}
+
+ if (event_contains(event->time_conv, time_mono_last)) {
+ event->time_conv.time_mono_last = bswap_64(event->time_conv.time_mono_last);
+ event->time_conv.time_mono_mult = bswap_32(event->time_conv.time_mono_mult);
+ event->time_conv.time_mono_shift = bswap_32(event->time_conv.time_mono_shift);
+ event->time_conv.time_mono_nsec = bswap_64(event->time_conv.time_mono_nsec);
+ event->time_conv.time_mono_base = bswap_64(event->time_conv.time_mono_base);
+ }
}
typedef void (*perf_event__swap_op)(union perf_event *event,
diff --git a/tools/perf/util/tsc.c b/tools/perf/util/tsc.c
index f19791d46e99..0b59c0f815f9 100644
--- a/tools/perf/util/tsc.c
+++ b/tools/perf/util/tsc.c
@@ -54,8 +54,14 @@ int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
tc->time_zero = pc->time_zero;
tc->time_cycles = pc->time_cycles;
tc->time_mask = pc->time_mask;
+ tc->time_mono_last = pc->time_mono_last;
+ tc->time_mono_mult = pc->time_mono_mult;
+ tc->time_mono_shift = pc->time_mono_shift;
+ tc->time_mono_nsec = pc->time_mono_nsec;
+ tc->time_mono_base = pc->time_mono_base;
tc->cap_user_time_zero = pc->cap_user_time_zero;
tc->cap_user_time_short = pc->cap_user_time_short;
+ tc->cap_user_time_mono_raw = pc->cap_user_time_mono_raw;
rmb();
if (pc->lock == seq && !(seq & 1))
break;
@@ -65,7 +71,7 @@ int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
}
}
- if (!tc->cap_user_time_zero)
+ if (!tc->cap_user_time_zero && !tc->cap_user_time_mono_raw)
return -EOPNOTSUPP;
return 0;
@@ -102,8 +108,14 @@ int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc,
event.time_conv.time_zero = tc.time_zero;
event.time_conv.time_cycles = tc.time_cycles;
event.time_conv.time_mask = tc.time_mask;
+ event.time_conv.time_mono_last = tc.time_mono_last;
+ event.time_conv.time_mono_mult = tc.time_mono_mult;
+ event.time_conv.time_mono_shift = tc.time_mono_shift;
+ event.time_conv.time_mono_nsec = tc.time_mono_nsec;
+ event.time_conv.time_mono_base = tc.time_mono_base;
event.time_conv.cap_user_time_zero = tc.cap_user_time_zero;
event.time_conv.cap_user_time_short = tc.cap_user_time_short;
+ event.time_conv.cap_user_time_mono_raw = tc.cap_user_time_mono_raw;
return process(tool, &event, NULL, machine);
}
@@ -138,5 +150,13 @@ size_t perf_event__fprintf_time_conv(union perf_event *event, FILE *fp)
tc->cap_user_time_short);
}
+ ret += fprintf(fp, "... Cap Time Monotonic Raw %" PRId32 "\n",
+ tc->cap_user_time_mono_raw);
+ ret += fprintf(fp, "... Time Last %" PRI_lu64 "\n", tc->time_mono_last);
+ ret += fprintf(fp, "... Time Multiplier %" PRId32 "\n", tc->time_mono_mult);
+ ret += fprintf(fp, "... Time Shift %" PRId32 "\n", tc->time_mono_shift);
+ ret += fprintf(fp, "... Time Nsec %" PRI_lu64 "\n", tc->time_mono_nsec);
+ ret += fprintf(fp, "... Time Base %" PRI_lu64 "\n", tc->time_mono_base);
+
return ret;
}
diff --git a/tools/perf/util/tsc.h b/tools/perf/util/tsc.h
index 88fd1c4c1cb8..6bacc450a14d 100644
--- a/tools/perf/util/tsc.h
+++ b/tools/perf/util/tsc.h
@@ -12,9 +12,15 @@ struct perf_tsc_conversion {
u64 time_zero;
u64 time_cycles;
u64 time_mask;
+ u64 time_mono_last;
+ u32 time_mono_mult;
+ u32 time_mono_shift;
+ u64 time_mono_nsec;
+ u64 time_mono_base;
bool cap_user_time_zero;
bool cap_user_time_short;
+ bool cap_user_time_mono_raw;
};
struct perf_event_mmap_page;
--
2.35.1
From: Kan Liang <[email protected]>
The cap_user_time_mono_raw indicates that the kernel relies on the perf
tool to convert the HW time to the monotonic raw clock.
Add tsc_to_monotonic_raw() to do the conversion.
The conversion information is stored in the session, which cannot be
read in evsel parsing. Add a pointer in the evlist to point to the
conversion information.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 17 +++++++++++++++--
tools/perf/util/evsel.h | 7 +++++++
tools/perf/util/session.c | 1 +
tools/perf/util/tsc.c | 12 ++++++++++++
tools/perf/util/tsc.h | 2 ++
6 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 01fa9d592c5a..d860dc94009c 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -82,6 +82,7 @@ struct evlist {
int pos; /* index at evlist core object to check signals */
} ctl_fd;
struct event_enable_timer *eet;
+ struct perf_record_time_conv *time_conv;
};
struct evsel_str_handler {
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 999dd1700502..5e27ac2b9f9b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -50,6 +50,7 @@
#include "off_cpu.h"
#include "../perf-sys.h"
#include "util/parse-branch-options.h"
+#include "tsc.h"
#include <internal/xyarray.h>
#include <internal/lib.h>
#include <internal/threadmap.h>
@@ -2349,6 +2350,18 @@ u64 evsel__bitfield_swap_branch_flags(u64 value)
return new_val;
}
+static u64 perf_evsel_parse_time(struct evsel *evsel, u64 time)
+{
+ /*
+ * The HW time can only be generated by HW events.
+ */
+ if ((evsel->core.attr.clockid == CLOCK_MONOTONIC_RAW) &&
+ evsel->evlist->time_conv && evsel__is_hw_event(evsel))
+ return tsc_to_monotonic_raw(evsel->evlist->time_conv, time);
+
+ return time;
+}
+
int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
struct perf_sample *data)
{
@@ -2411,7 +2424,7 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
}
if (type & PERF_SAMPLE_TIME) {
- data->time = *array;
+ data->time = perf_evsel_parse_time(evsel, *array);
array++;
}
@@ -2734,7 +2747,7 @@ int evsel__parse_sample_timestamp(struct evsel *evsel, union perf_event *event,
array++;
if (type & PERF_SAMPLE_TIME)
- *timestamp = *array;
+ *timestamp = perf_evsel_parse_time(evsel, *array);
return 0;
}
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index d572be41b960..d1ef67852bda 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -269,6 +269,13 @@ static inline bool evsel__is_bpf(struct evsel *evsel)
return evsel->bpf_counter_ops != NULL;
}
+static inline bool evsel__is_hw_event(struct evsel *evsel)
+{
+ return (evsel->core.attr.type == PERF_TYPE_HARDWARE) ||
+ (evsel->core.attr.type == PERF_TYPE_HW_CACHE) ||
+ (evsel->core.attr.type == PERF_TYPE_RAW);
+}
+
#define EVSEL__MAX_ALIASES 8
extern const char *const evsel__hw_cache[PERF_COUNT_HW_CACHE_MAX][EVSEL__MAX_ALIASES];
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 189149a7012f..d80d0c4e46da 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1725,6 +1725,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return tool->stat_round(session, event);
case PERF_RECORD_TIME_CONV:
session->time_conv = event->time_conv;
+ session->evlist->time_conv = &session->time_conv;
return tool->time_conv(session, event);
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
diff --git a/tools/perf/util/tsc.c b/tools/perf/util/tsc.c
index 0b59c0f815f9..5264f9d54be4 100644
--- a/tools/perf/util/tsc.c
+++ b/tools/perf/util/tsc.c
@@ -160,3 +160,15 @@ size_t perf_event__fprintf_time_conv(union perf_event *event, FILE *fp)
return ret;
}
+
+u64 tsc_to_monotonic_raw(struct perf_record_time_conv *tc, u64 cyc)
+{
+ u64 delta;
+
+ if (!tc->cap_user_time_mono_raw)
+ return cyc;
+
+ delta = (cyc - tc->time_mono_last) * tc->time_mono_mult + tc->time_mono_nsec;
+ delta >>= tc->time_mono_shift;
+ return tc->time_mono_base + delta;
+}
diff --git a/tools/perf/util/tsc.h b/tools/perf/util/tsc.h
index 6bacc450a14d..2611d3de94b1 100644
--- a/tools/perf/util/tsc.h
+++ b/tools/perf/util/tsc.h
@@ -35,4 +35,6 @@ double arch_get_tsc_freq(void);
size_t perf_event__fprintf_time_conv(union perf_event *event, FILE *fp);
+u64 tsc_to_monotonic_raw(struct perf_record_time_conv *tc, u64 cyc);
+
#endif // __PERF_TSC_H
--
2.35.1
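One property of tsc_to_monotonic_raw() that may be worth checking: the
product (cyc - last) * mult is computed in 64 bits, so it wraps once the
sample is far enough from the snapshot. The variant below is a sketch
only, not part of the patch. It assumes a compiler that provides
unsigned __int128 (a GCC/Clang extension), and the names mono_conv and
tsc_to_mono_raw_wide() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container mirroring the time_mono_* fields. */
struct mono_conv {
	uint64_t last;
	uint32_t mult;
	uint32_t shift;
	uint64_t nsec;
	uint64_t base;
};

/* Same formula, with a 128-bit intermediate for the scaled delta. */
uint64_t tsc_to_mono_raw_wide(const struct mono_conv *c, uint64_t cyc)
{
	unsigned __int128 delta;

	delta = (unsigned __int128)(cyc - c->last) * c->mult + c->nsec;
	return c->base + (uint64_t)(delta >> c->shift);
}
```

For example, with mult = 2^22 and shift = 24, a delta of 2^44 cycles
gives a product of 2^66, which does not fit in 64 bits. The wide
variant returns base + 2^42 for that input.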
From: Kan Liang <[email protected]>
The HW time is more accurate than the time recorded in the NMI handler.
Set the hw_time by default for the monotonic raw clock, and convert the
HW time to the monotonic raw clock in perf tool.
For a legacy kernel that doesn't support the attr, nothing is
changed. The monotonic raw clock is still from the time recorded in the
NMI handler.
Print the hw_time in perf_event_attr__fprintf().
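The fallback for legacy kernels follows the usual probe-and-retry
pattern of the perf tool. The sketch below illustrates the idea with
mock types only. mock_attr, missing_features, mock_open_legacy() and
open_with_fallback() are hypothetical names, and the mock open function
simply pretends to be a kernel that rejects hw_time.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical subset of perf_event_attr and perf_missing_features. */
struct mock_attr {
	unsigned int hw_time : 1;
};

struct missing_features {
	bool hw_time;
};

/* Pretend kernel without hw_time support: reject the unknown bit. */
int mock_open_legacy(const struct mock_attr *attr)
{
	return attr->hw_time ? -EINVAL : 3;	/* 3: pretend fd */
}

/* Try with hw_time first; on -EINVAL mark it missing and retry. */
int open_with_fallback(struct mock_attr *attr, struct missing_features *mf,
		       int (*open_fn)(const struct mock_attr *))
{
	for (;;) {
		int fd;

		if (mf->hw_time)
			attr->hw_time = 0;

		fd = open_fn(attr);
		if (fd != -EINVAL)
			return fd;

		/* Only retry if there is still a feature to switch off. */
		if (!mf->hw_time && attr->hw_time) {
			mf->hw_time = true;
			continue;
		}
		return fd;
	}
}
```

After the retry the event is opened without hw_time, so the time stamp
comes from the NMI handler as before.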
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/evsel.c | 11 ++++++++++-
tools/perf/util/evsel.h | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5e27ac2b9f9b..d182c12fd291 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1349,6 +1349,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
if (opts->use_clockid) {
attr->use_clockid = 1;
attr->clockid = opts->clockid;
+ if (opts->clockid == CLOCK_MONOTONIC_RAW)
+ attr->hw_time = 1;
}
if (evsel->precise_max)
@@ -1853,6 +1855,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.hw_time)
+ evsel->core.attr.hw_time = 0;
if (perf_missing_features.read_lost)
evsel->core.attr.read_format &= ~PERF_FORMAT_LOST;
if (perf_missing_features.weight_struct) {
@@ -1906,7 +1910,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.read_lost &&
+ if (!perf_missing_features.hw_time &&
+ evsel->core.attr.hw_time) {
+ perf_missing_features.hw_time = true;
+ pr_debug2("switching off hw_time support\n");
+ return true;
+ } else if (!perf_missing_features.read_lost &&
(evsel->core.attr.read_format & PERF_FORMAT_LOST)) {
perf_missing_features.read_lost = true;
pr_debug2("switching off PERF_FORMAT_LOST support\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index d1ef67852bda..c1d6fd40ea39 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -189,6 +189,7 @@ struct perf_missing_features {
bool code_page_size;
bool weight_struct;
bool read_lost;
+ bool hw_time;
};
extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 7e5e7b30510d..7b3669430b87 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -154,6 +154,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(sample_max_stack, p_unsigned);
PRINT_ATTRf(aux_sample_size, p_unsigned);
PRINT_ATTRf(sig_data, p_unsigned);
+ PRINT_ATTRf(hw_time, p_unsigned);
return ret;
}
--
2.35.1
On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> The conversion information of monotonic raw is not affected by NTP/PTP
> correction. The perf tool can utilize the information to correctly
> calculate the monotonic raw via a TSC in each PEBS record in the
> post-processing stage.
>
> The current conversion information is hidden in the internal
> struct tk_read_base. Add a new external struct ktime_conv to store and
> share the conversion information with other subsystems.
>
> Add a new interface ktime_get_fast_mono_raw_conv() to expose the
> conversion information of monotonic raw. The function probably be
> invoked in a NMI. Use NMI safe tk_fast_raw to retrieve the conversion
> information.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/timekeeping.h | 18 ++++++++++++++++++
> kernel/time/timekeeping.c | 24 ++++++++++++++++++++++++
> 2 files changed, 42 insertions(+)
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fe1e467ba046..94ba02e7eb13 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -253,6 +253,21 @@ struct system_time_snapshot {
> u8 cs_was_changed_seq;
> };
>
> +/**
> + * struct ktime_conv - Timestamp conversion information
> + * @mult: Multiplier for scaled math conversion
> + * @shift: Shift value for scaled math conversion
> + * @xtime_nsec: Shifted (fractional) nano seconds offset for readout
> + * @base: (nanoseconds) base time for readout
> + */
> +struct ktime_conv {
> + u64 cycle_last;
> + u32 mult;
> + u32 shift;
> + u64 xtime_nsec;
> + u64 base;
> +};
> +
> /**
> * struct system_device_crosststamp - system/device cross-timestamp
> * (synchronized capture)
> @@ -297,6 +312,9 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
> /* NMI safe mono/boot/realtime timestamps */
> extern void ktime_get_fast_timestamps(struct ktime_timestamps *snap);
>
> +/* NMI safe mono raw conv information */
> +extern void ktime_get_fast_mono_raw_conv(struct ktime_conv *conv);
> +
> /*
> * Persistent clock related interfaces
> */
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 5579ead449f2..a202b7a0a249 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -505,6 +505,30 @@ u64 notrace ktime_get_raw_fast_ns(void)
> }
> EXPORT_SYMBOL_GPL(ktime_get_raw_fast_ns);
>
> +/**
> + * ktime_get_fast_mono_raw_conv - NMI safe access to get the conversion
> + * information of clock monotonic raw
> + *
> + * The conversion information is not affected by NTP/PTP correction.
> + */
> +void ktime_get_fast_mono_raw_conv(struct ktime_conv *conv)
> +{
> + struct tk_fast *tkf = &tk_fast_raw;
> + struct tk_read_base *tkr;
> + unsigned int seq;
> +
> + do {
> + seq = raw_read_seqcount_latch(&tkf->seq);
> + tkr = tkf->base + (seq & 0x01);
> + conv->cycle_last = tkr->cycle_last;
> + conv->mult = tkr->mult;
> + conv->shift = tkr->shift;
> + conv->xtime_nsec = tkr->xtime_nsec;
> + conv->base = tkr->base;
> + } while (read_seqcount_latch_retry(&tkf->seq, seq));
> +}
> +EXPORT_SYMBOL_GPL(ktime_get_fast_mono_raw_conv);
Thanks for taking another pass at this! Using CLOCK_MONOTONIC_RAW
removes a lot of the issues around time inconsistencies.
Though, I'm not super excited about exporting a lot of timekeeping
state out to drivers to have drivers then duplicate timekeeping logic.
Would it make more sense to have the timekeeping core export an
interface like: ktime_get_mono_raw_from_timestamp(struct clocksource
*cs, cycle_t timestamp)?
The complexity is that the timestamp may be pretty far in the past, so
special handling will be needed to do the mult/shift conversion for a
large negative delta.
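To illustrate the "large negative delta" concern, here is a minimal user-space sketch of a mult/shift conversion that accepts timestamps on either side of cycle_last. It is not code from the series: struct conv_snapshot is a hypothetical mirror of the fields in the proposed struct ktime_conv, cycles_to_mono_raw() is an invented name, and the 128-bit type is a GCC/Clang extension used here only so the multiplication cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the fields in the proposed struct ktime_conv. */
struct conv_snapshot {
	uint64_t cycle_last;	/* counter value at the last update */
	uint32_t mult;		/* scaled-math multiplier */
	uint32_t shift;		/* scaled-math shift */
	uint64_t xtime_nsec;	/* shifted (fractional) nanoseconds */
	uint64_t base;		/* base time in nanoseconds */
};

/*
 * Convert a counter value to nanoseconds. Timestamps older than
 * cycle_last are handled by subtracting, rounding toward minus
 * infinity so both directions use the same floor() convention.
 */
static uint64_t cycles_to_mono_raw(const struct conv_snapshot *c, uint64_t cyc)
{
	unsigned __int128 scaled;

	assert(c->shift < 64);

	if (cyc >= c->cycle_last) {
		scaled = (unsigned __int128)(cyc - c->cycle_last) * c->mult;
		scaled += c->xtime_nsec;
		return c->base + (uint64_t)(scaled >> c->shift);
	}

	/* The timestamp lies in the past relative to the snapshot. */
	scaled = (unsigned __int128)(c->cycle_last - cyc) * c->mult;
	if (scaled <= c->xtime_nsec)
		return c->base + (uint64_t)((c->xtime_nsec - scaled) >> c->shift);

	/* Round the deficit up so the result is the floor of the exact value. */
	scaled -= c->xtime_nsec;
	scaled += ((unsigned __int128)1 << c->shift) - 1;
	return c->base - (uint64_t)(scaled >> c->shift);
}
```

With mult = 512 and shift = 10 (half a nanosecond per cycle), a timestamp 100 cycles after cycle_last lands 50 ns after base, and one 100 cycles before it lands 50 ns before base. The sketch does not address the other concern raised here, namely checking that the snapshot and the timestamp come from the same clocksource.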
Also we need some way of checking that the current clocksource
(because it can change) matches the timestamp source?
Maybe some get_mono_raw_timestamp(&cs) accessor that captures both the
current clocksource and the timestamp?
I've not thought this out fully, but curious if something like that
might work for you and also encapsulate the timekeeping logic better
so we don't have to have that logic leak out to various driver
implementations.
thanks
-john
On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> The monotonic raw clock is not affected by NTP/PTP correction. The
> calculation of the monotonic raw clock can be done in the
> post-processing, which can reduce the kernel overhead.
>
> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> raw HW time to user space. The perf tool will calculate the HW time
> in post-processing.
> Currently, only supports the monotonic raw conversion.
> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> HW time can only be provided in a sample by HW. For other type of
> records, the user requested clock should be returned as usual. Nothing
> is changed.
>
> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> conversion information. The cap_user_time_mono_raw also indicates
> whether the monotonic raw conversion information is available.
> If yes, the clock monotonic raw can be calculated as
> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
Again, I appreciate you reworking and resending this series out, I
know it took some effort.
But oof, I'd really like to make sure we're not exporting timekeeping
internals to userland.
I think Thomas' suggestion of doing the timestamp conversion in
post-processing was more about interpolating collected system times
with the counter (tsc) values captured.
I get that the interpolation can be difficult, as the counter value and
system time can't currently be collected atomically, so potentially there
may be a need for a way to tie the two together (see my previous email's
thought of ktime_get_mono_raw_from_timestamp()), but we'd
probably want a clear understanding of the benefit (quantitative
reduction in interpolation error, and what real benefit that brings),
and would also want the driver to generate and share those pairs
rather than having userland have access.
thanks
-john
On 2023-02-13 2:37 p.m., John Stultz wrote:
> On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> The monotonic raw clock is not affected by NTP/PTP correction. The
>> calculation of the monotonic raw clock can be done in the
>> post-processing, which can reduce the kernel overhead.
>>
>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>> raw HW time to user space. The perf tool will calculate the HW time
>> in post-processing.
>> Currently, only supports the monotonic raw conversion.
>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>> HW time can only be provided in a sample by HW. For other type of
>> records, the user requested clock should be returned as usual. Nothing
>> is changed.
>>
>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>> conversion information. The cap_user_time_mono_raw also indicates
>> whether the monotonic raw conversion information is available.
>> If yes, the clock monotonic raw can be calculated as
>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>
> Again, I appreciate you reworking and resending this series out, I
> know it took some effort.
>
> But oof, I'd really like to make sure we're not exporting timekeeping
> internals to userland.
>
> I think Thomas' suggestion of doing the timestamp conversion in
> post-processing was more about interpolating collected system times
> with the counter (tsc) values captured.
>
Thomas, could you please clarify your suggestion regarding "the relevant
conversion information" provided by the kernel?
https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
Is it only the interpolation information or the entire conversion
information (mult, shift, etc.)?
If it's only the interpolation information, user space will lack the
information needed to handle all the cases. If I understand John's
comments correctly, it could also introduce some interpolation error
which can only be addressed by the mult/shift conversion.
If the suggestion is to dump the entire conversion information into the
user space, we have to expose the timekeeping internals.
Considering the above difficulties, could we use the kernel conversion?
(The current perf already uses the kernel conversion for monotonic raw.
It should not bring extra overhead.)
Thanks,
Kan
> I get the interpolation can be difficult as the counter value and
> system time can't currently atomically collected, so potentially there
> may be a need for a way to tie two together (see my previous email's
> thought of ktime_get_raw_monotonic_from_timestamp()), but we'd
> probably want a clear understanding of the benefit (quantitative
> reduction in interpolation error, and what real benefit that brings),
> and would also want the driver to generate and share those pairs
> rather than having userland have access.
>
> thanks
> -john
On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <[email protected]> wrote:
> On 2023-02-13 2:37 p.m., John Stultz wrote:
> > On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
> >>
> >> From: Kan Liang <[email protected]>
> >>
> >> The monotonic raw clock is not affected by NTP/PTP correction. The
> >> calculation of the monotonic raw clock can be done in the
> >> post-processing, which can reduce the kernel overhead.
> >>
> >> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> >> raw HW time to user space. The perf tool will calculate the HW time
> >> in post-processing.
> >> Currently, only supports the monotonic raw conversion.
> >> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> >> HW time can only be provided in a sample by HW. For other type of
> >> records, the user requested clock should be returned as usual. Nothing
> >> is changed.
> >>
> >> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> >> conversion information. The cap_user_time_mono_raw also indicates
> >> whether the monotonic raw conversion information is available.
> >> If yes, the clock monotonic raw can be calculated as
> >> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
> >
> > Again, I appreciate you reworking and resending this series out, I
> > know it took some effort.
> >
> > But oof, I'd really like to make sure we're not exporting timekeeping
> > internals to userland.
> >
> > I think Thomas' suggestion of doing the timestamp conversion in
> > post-processing was more about interpolating collected system times
> > with the counter (tsc) values captured.
> >
>
> Thomas, could you please clarify your suggestion regarding "the relevant
> conversion information" provided by the kernel?
> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>
> Is it only the interpolation information or the entire conversion
> information (Mult, shift etc.)?
>
> If it's only the interpolation information, the user space will be lack
> of information to handle all the cases. If I understand John's comments
> correctly, it could also bring some interpolation error which can only
> be addressed by the mult/shift conversion.
"Only" is maybe too strong a word. I think having the driver use
kernel timekeeping accessors to get CLOCK_MONOTONIC_RAW time along with
counter values will minimize the error.
But again, it's not yet established that any interpolation error using
existing interfaces is great enough to be problematic here.
The interpolation is pretty easy to do:
do {
	start = readtsc();
	clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	end = readtsc();
	delta = end - start;
} while (delta > THRESHOLD); // make sure the reads were not preempted
mid = start + (delta + 1) / 2; // round to closest
and that should get you a fairly close matching of TSC to
CLOCK_MONOTONIC_RAW value.
Once you have that mapping you can take a few samples and establish
the linear function.
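A self-contained sketch of that sampling loop might look like the following. The names are illustrative assumptions rather than anything from the series: read_counter_fn stands in for readtsc() (on x86 it could wrap __rdtsc() from <x86intrin.h>), max_tries bounds the retries, and fake_counter() is a deterministic stand-in used only to exercise the logic.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback that reads the hardware counter. */
typedef uint64_t (*read_counter_fn)(void);

struct clock_pair {
	uint64_t cycles;	/* counter value at the midpoint of the read */
	uint64_t ns;		/* CLOCK_MONOTONIC_RAW in nanoseconds */
};

/*
 * Bracket one clock_gettime() call with two counter reads and accept
 * the sample only if the window is at most 'threshold' cycles wide.
 * Returns 0 on success, -1 if no attempt was tight enough.
 */
static int sample_clock_pair(read_counter_fn read_counter, uint64_t threshold,
			     unsigned int max_tries, struct clock_pair *out)
{
	assert(read_counter != NULL && out != NULL);

	for (unsigned int i = 0; i < max_tries; i++) {
		struct timespec ts;
		uint64_t start, end, delta;

		start = read_counter();
		if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts))
			return -1;
		end = read_counter();

		delta = end - start;
		if (delta > threshold)
			continue;	/* probably preempted or interrupted */

		out->cycles = start + (delta + 1) / 2;	/* rounded midpoint */
		out->ns = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
		return 0;
	}
	return -1;
}

/* Demonstration stub only: a counter that advances 10 ticks per read. */
static uint64_t fake_counter(void)
{
	static uint64_t ticks;

	ticks += 10;
	return ticks;
}
```

The midpoint is only an estimate of when the clock was actually read, so the width of the accepted window bounds the pairing error of each sample.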
But that will have some error, so quantifying that error helps
establish why being able to get an atomic mapping of TSC ->
CLOCK_MONOTONIC_RAW would help.
So I really don't think we need to expose the kernel internal values
to userland, but I'm willing to guess the atomic mapping (which the
driver will have access to, not userland) may be helpful for the fine
granularity you want in the trace.
thanks
-john
On 2023-02-13 5:22 p.m., John Stultz wrote:
> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <[email protected]> wrote:
>> On 2023-02-13 2:37 p.m., John Stultz wrote:
>>> On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
>>>>
>>>> From: Kan Liang <[email protected]>
>>>>
>>>> The monotonic raw clock is not affected by NTP/PTP correction. The
>>>> calculation of the monotonic raw clock can be done in the
>>>> post-processing, which can reduce the kernel overhead.
>>>>
>>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>>>> raw HW time to user space. The perf tool will calculate the HW time
>>>> in post-processing.
>>>> Currently, only supports the monotonic raw conversion.
>>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>>>> HW time can only be provided in a sample by HW. For other type of
>>>> records, the user requested clock should be returned as usual. Nothing
>>>> is changed.
>>>>
>>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>>>> conversion information. The cap_user_time_mono_raw also indicates
>>>> whether the monotonic raw conversion information is available.
>>>> If yes, the clock monotonic raw can be calculated as
>>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>>>
>>> Again, I appreciate you reworking and resending this series out, I
>>> know it took some effort.
>>>
>>> But oof, I'd really like to make sure we're not exporting timekeeping
>>> internals to userland.
>>>
>>> I think Thomas' suggestion of doing the timestamp conversion in
>>> post-processing was more about interpolating collected system times
>>> with the counter (tsc) values captured.
>>>
>>
>> Thomas, could you please clarify your suggestion regarding "the relevant
>> conversion information" provided by the kernel?
>> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>>
>> Is it only the interpolation information or the entire conversion
>> information (Mult, shift etc.)?
>>
>> If it's only the interpolation information, the user space will be lack
>> of information to handle all the cases. If I understand John's comments
>> correctly, it could also bring some interpolation error which can only
>> be addressed by the mult/shift conversion.
>
Thanks for the details John.
> "Only" is maybe too strong a word. I think having the driver use
> kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
> counter values will minimize the error.
>
The key motivation of using the TSC in the PEBS record is to get an
accurate timestamp of each record. We definitely want the conversion to
have minimal error.
> But again, it's not yet established that any interpolation error using
> existing interfaces is great enough to be problematic here.
>
> The interpoloation is pretty easy to do:
>
> do {
> start= readtsc();
> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> end = readtsc();
> delta = end-start;
> } while (delta > THRESHOLD) // make sure the reads were not preempted
> mid = start + (delta +(delta/2))/2; //round-closest
>
How should the THRESHOLD be chosen? It seems the THRESHOLD value also
impacts the accuracy.
> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
>
> Once you have that mapping you can take a few samples and establish
> the linear function.
>
> But that will have some error, so quantifying that error helps
> establish why being able to get an atomic mapping of TSC ->
> CLOCK_MONOTONIC_RAW would help.
>
> So I really don't think we need to expose the kernel internal values
> to userland, but I'm willing to guess the atomic mapping (which the
> driver will have access to, not userland) may be helpful for the fine
> granularity you want in the trace.
>
If I understand correctly, the idea is to let the user space tool run
the above interpolation algorithm several times to 'guess' the atomic
mapping, and then use that mapping information to convert the TSC from
the PEBS record. Is my understanding correct?
If so, to be honest, I doubt we can get the accuracy we want.
Thanks,
Kan
On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> The interpoloation is pretty easy to do:
>
> do {
> start= readtsc();
> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> end = readtsc();
> delta = end-start;
> } while (delta > THRESHOLD) // make sure the reads were not preempted
> mid = start + (delta +(delta/2))/2; //round-closest
>
> and be able to get you a fairly close matching of TSC to
> CLOCK_MONOTONIC_RAW value.
>
> Once you have that mapping you can take a few samples and establish
> the linear function.
Right, this is how we do the TSC calibration in the first place, and if
NTP can achieve high correctness over a network, then surely we can do
better locally.
That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>
>
> On 2023-02-13 5:22 p.m., John Stultz wrote:
>> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <[email protected]> wrote:
>>> On 2023-02-13 2:37 p.m., John Stultz wrote:
>>>> On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
>>>>>
>>>>> From: Kan Liang <[email protected]>
>>>>>
>>>>> The monotonic raw clock is not affected by NTP/PTP correction. The
>>>>> calculation of the monotonic raw clock can be done in the
>>>>> post-processing, which can reduce the kernel overhead.
>>>>>
>>>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
>>>>> raw HW time to user space. The perf tool will calculate the HW time
>>>>> in post-processing.
>>>>> Currently, only supports the monotonic raw conversion.
>>>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>>>>> HW time can only be provided in a sample by HW. For other type of
>>>>> records, the user requested clock should be returned as usual. Nothing
>>>>> is changed.
>>>>>
>>>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>>>>> conversion information. The cap_user_time_mono_raw also indicates
>>>>> whether the monotonic raw conversion information is available.
>>>>> If yes, the clock monotonic raw can be calculated as
>>>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>>>>
>>>> Again, I appreciate you reworking and resending this series out, I
>>>> know it took some effort.
>>>>
>>>> But oof, I'd really like to make sure we're not exporting timekeeping
>>>> internals to userland.
>>>>
>>>> I think Thomas' suggestion of doing the timestamp conversion in
>>>> post-processing was more about interpolating collected system times
>>>> with the counter (tsc) values captured.
>>>>
>>>
>>> Thomas, could you please clarify your suggestion regarding "the relevant
>>> conversion information" provided by the kernel?
>>> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>>>
>>> Is it only the interpolation information or the entire conversion
>>> information (Mult, shift etc.)?
>>>
>>> If it's only the interpolation information, the user space will be lack
>>> of information to handle all the cases. If I understand John's comments
>>> correctly, it could also bring some interpolation error which can only
>>> be addressed by the mult/shift conversion.
>>
>
>
> Thanks for the details John.
>
>> "Only" is maybe too strong a word. I think having the driver use
>> kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
>> counter values will minimize the error.
>>
>
> The key motivation of using the TSC in the PEBS record is to get an
> accurate timestamp of each record. We definitely want the conversion has
> minimized error.
>
>
>> But again, it's not yet established that any interpolation error using
>> existing interfaces is great enough to be problematic here.
>>
>> The interpoloation is pretty easy to do:
>>
>> do {
>> start= readtsc();
>> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>> end = readtsc();
>> delta = end-start;
>> } while (delta > THRESHOLD) // make sure the reads were not preempted
>> mid = start + (delta +(delta/2))/2; //round-closest
>>
>
> How to choose the THRESHOLD? It seems the THRESHOLD value also impacts
> the accuracy.
>
>
>> and be able to get you a fairly close matching of TSC to
>> CLOCK_MONOTONIC_RAW value.
>>
>> Once you have that mapping you can take a few samples and establish
>> the linear function.
>>
>> But that will have some error, so quantifying that error helps
>> establish why being able to get an atomic mapping of TSC ->
>> CLOCK_MONOTONIC_RAW would help.
>>
>> So I really don't think we need to expose the kernel internal values
>> to userland, but I'm willing to guess the atomic mapping (which the
>> driver will have access to, not userland) may be helpful for the fine
>> granularity you want in the trace.
>>
>
> If I understand correctly, the idea is to let the user space tool run
> the above interpoloation algorithm several times to 'guess' the atomic
> mapping. Using the mapping information to covert the TSC from the PEBS
> record. Is my understanding correct?
>
> If so, to be honest, I doubt we can get the accuracy we want.
>
I implemented a simple test to evaluate the error.
I collected TSC -> CLOCK_MONOTONIC_RAW mappings using the above algorithm
at the start and end of the perf command.
         MONO_RAW         TSC
start    89553516545645   223619715214239
end      89562251233830   223641517000376
Here is what I get via the mult/shift conversion from this patch.
         MONO_RAW         TSC
PEBS     89555942691466   223625770878571
Then I use the time information from start and end to create a linear
function and 'guess' the MONO_RAW of the PEBS record from its TSC. I get
89555942692721.
That is a 1255 ns difference.
I tried several different PEBS records. The error is ~1000ns.
I think that is an observable error.
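For reference, the two-point estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic, not the test program that was actually run: struct tsc_sample and interpolate_ns() are names made up here, and unsigned __int128 (a GCC/Clang extension) is used because the intermediate product exceeds 64 bits for these values.

```c
#include <assert.h>
#include <stdint.h>

/* One (TSC, CLOCK_MONOTONIC_RAW) pair; the name is illustrative only. */
struct tsc_sample {
	uint64_t cycles;
	uint64_t ns;
};

/*
 * Linear interpolation between two samples:
 *   ns = a.ns + (cyc - a.cycles) * (b.ns - a.ns) / (b.cycles - a.cycles)
 * Assumes a.cycles <= cyc and a.cycles < b.cycles.
 */
static uint64_t interpolate_ns(struct tsc_sample a, struct tsc_sample b,
			       uint64_t cyc)
{
	unsigned __int128 num;

	assert(b.cycles > a.cycles && cyc >= a.cycles && b.ns >= a.ns);

	num = (unsigned __int128)(cyc - a.cycles) * (b.ns - a.ns);
	return a.ns + (uint64_t)(num / (b.cycles - a.cycles));
}
```

Feeding in the start and end rows together with the PEBS TSC of 223625770878571 yields 89555942692721, which is 1255 ns away from the mult/shift result of 89555942691466.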
Thanks,
Kan
On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
>> The interpoloation is pretty easy to do:
>>
>> do {
>> start= readtsc();
>> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>> end = readtsc();
>> delta = end-start;
>> } while (delta > THRESHOLD) // make sure the reads were not preempted
>> mid = start + (delta +(delta/2))/2; //round-closest
>>
>> and be able to get you a fairly close matching of TSC to
>> CLOCK_MONOTONIC_RAW value.
>>
>> Once you have that mapping you can take a few samples and establish
>> the linear function.
>
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
>
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
If I understand correctly, the TSC calibration is done in the kernel.
The kernel keeps updating the mul/shift. We dump the mul/shift into the
perf mmap page for the user tools.
But for the CLOCKs, the mul/shift are kernel-internal values which we
don't want to expose to user space.
If we only apply the scheme in user space, it introduces some observable
error, based on my test mentioned in the other thread.
Thanks,
Kan
On Tue, Feb 14, 2023 at 7:56 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> > The interpoloation is pretty easy to do:
> >
> > do {
> > start= readtsc();
> > clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> > end = readtsc();
> > delta = end-start;
> > } while (delta > THRESHOLD) // make sure the reads were not preempted
> > mid = start + (delta +(delta/2))/2; //round-closest
> >
> > and be able to get you a fairly close matching of TSC to
> > CLOCK_MONOTONIC_RAW value.
> >
> > Once you have that mapping you can take a few samples and establish
> > the linear function.
>
> Right, this is how we do the TSC calibration in the first place, and if
> NTP can achieve high correctness over a network, then surely we can do
> better locally.
>
> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
Well, CLOCK_MONOTONIC_RAW is at least a fixed function, we don't
change its frequency. Whereas other clocks will likely be adjusted
over their lifetime, so deriving the frequency has to be continually
re-calculated, so they aren't ideal for this sort of interpolation.
thanks
-john
On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <[email protected]> wrote:
> On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
> > On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
> >> The interpoloation is pretty easy to do:
> >>
> >> do {
> >> start= readtsc();
> >> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> >> end = readtsc();
> >> delta = end-start;
> >> } while (delta > THRESHOLD) // make sure the reads were not preempted
> >> mid = start + (delta +(delta/2))/2; //round-closest
> >>
> >> and be able to get you a fairly close matching of TSC to
> >> CLOCK_MONOTONIC_RAW value.
> >>
> >> Once you have that mapping you can take a few samples and establish
> >> the linear function.
> >
> > Right, this is how we do the TSC calibration in the first place, and if
> > NTP can achieve high correctness over a network, then surely we can do
> > better locally.
> >
> > That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
>
> If I understand correctly, the TSC calibration is done in the kernel.
> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> perf mmap page for the user tools.
Where is that done in the perf mmap? I wasn't aware.
thanks
-john
On Tue, Feb 14, 2023 at 6:51 AM Liang, Kan <[email protected]> wrote:
> On 2023-02-13 5:22 p.m., John Stultz wrote:
> > On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <[email protected]> wrote:
> >> On 2023-02-13 2:37 p.m., John Stultz wrote:
> >>> On Mon, Feb 13, 2023 at 11:08 AM <[email protected]> wrote:
> >>>>
> >>>> From: Kan Liang <[email protected]>
> >>>>
> >>>> The monotonic raw clock is not affected by NTP/PTP correction. The
> >>>> calculation of the monotonic raw clock can be done in the
> >>>> post-processing, which can reduce the kernel overhead.
> >>>>
> >>>> Add hw_time in the struct perf_event_attr to tell the kernel dump the
> >>>> raw HW time to user space. The perf tool will calculate the HW time
> >>>> in post-processing.
> >>>> Currently, only supports the monotonic raw conversion.
> >>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
> >>>> HW time can only be provided in a sample by HW. For other type of
> >>>> records, the user requested clock should be returned as usual. Nothing
> >>>> is changed.
> >>>>
> >>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
> >>>> conversion information. The cap_user_time_mono_raw also indicates
> >>>> whether the monotonic raw conversion information is available.
> >>>> If yes, the clock monotonic raw can be calculated as
> >>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
> >>>
> >>> Again, I appreciate you reworking and resending this series out, I
> >>> know it took some effort.
> >>>
> >>> But oof, I'd really like to make sure we're not exporting timekeeping
> >>> internals to userland.
> >>>
> >>> I think Thomas' suggestion of doing the timestamp conversion in
> >>> post-processing was more about interpolating collected system times
> >>> with the counter (tsc) values captured.
> >>>
> >>
> >> Thomas, could you please clarify your suggestion regarding "the relevant
> >> conversion information" provided by the kernel?
> >> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
> >>
> >> Is it only the interpolation information or the entire conversion
> >> information (Mult, shift etc.)?
> >>
> >> If it's only the interpolation information, the user space will be lack
> >> of information to handle all the cases. If I understand John's comments
> >> correctly, it could also bring some interpolation error which can only
> >> be addressed by the mult/shift conversion.
> >
>
>
> Thanks for the details John.
>
> > "Only" is maybe too strong a word. I think having the driver use
> > kernel timekeeping accessors to CLOCK_MONONOTONIC_RAW time with
> > counter values will minimize the error.
> >
>
> The key motivation of using the TSC in the PEBS record is to get an
> accurate timestamp of each record. We definitely want the conversion has
> minimized error.
Yep.
> > But again, it's not yet established that any interpolation error using
> > existing interfaces is great enough to be problematic here.
> >
> > The interpoloation is pretty easy to do:
> >
> > do {
> > start= readtsc();
> > clock_gett(CLOCK_MONOTONIC_RAW, &ts);
> > end = readtsc();
> > delta = end-start;
> > } while (delta > THRESHOLD) // make sure the reads were not preempted
> > mid = start + (delta +(delta/2))/2; //round-closest
> >
>
> How to choose the THRESHOLD? It seems the THRESHOLD value also impacts
> the accuracy.
Maybe by running a number of these reads and collecting the deltas,
then setting THRESHOLD to a standard deviation of the results?
(I'm sure there are more sound methods, but I'd have to do some digging
to find them)
Alternatively you could always take 10 samples and then only do the
mapping with the smallest delta value.
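The "smallest delta" alternative needs no threshold at all. A minimal sketch, assuming the raw reads have already been collected into an array (struct raw_sample and best_sample() are hypothetical names, not part of any existing interface):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bracketed clock read: counter before, counter after, clock value. */
struct raw_sample {
	uint64_t start;	/* counter read before clock_gettime() */
	uint64_t end;	/* counter read after clock_gettime() */
	uint64_t ns;	/* CLOCK_MONOTONIC_RAW value in nanoseconds */
};

/*
 * Return the index of the sample with the narrowest read window,
 * or -1 if the array is empty. Ties keep the earliest sample.
 */
static int best_sample(const struct raw_sample *s, size_t n)
{
	int best = -1;

	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t delta = s[i].end - s[i].start;

		if (best < 0 || delta < s[best].end - s[best].start)
			best = (int)i;
	}
	return best;
}
```

The chosen sample's midpoint, start + (end - start) / 2, would then serve as the counter value paired with ns.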
> > and be able to get you a fairly close matching of TSC to
> > CLOCK_MONOTONIC_RAW value.
> >
> > Once you have that mapping you can take a few samples and establish
> > the linear function.
> >
> > But that will have some error, so quantifying that error helps
> > establish why being able to get an atomic mapping of TSC ->
> > CLOCK_MONOTONIC_RAW would help.
> >
> > So I really don't think we need to expose the kernel internal values
> > to userland, but I'm willing to guess the atomic mapping (which the
> > driver will have access to, not userland) may be helpful for the fine
> > granularity you want in the trace.
> >
>
> If I understand correctly, the idea is to let the user space tool run
> the above interpoloation algorithm several times to 'guess' the atomic
> mapping. Using the mapping information to covert the TSC from the PEBS
> record. Is my understanding correct?
So I think that's what Thomas was suggesting.
The next step would probably be to provide a way for the driver to
provide atomic TSC->CLOCK_MONOTONIC_RAW samples, so userland can
calculate the function itself.
So then the problem becomes if X1 and Y1 are exactly mapped, and X2
and Y2 are exactly mapped, then given X3, find Y3.
And if that doesn't work, then we would have to see about having the
driver do all the conversions.
> If so, to be honest, I doubt we can get the accuracy we want.
Sure. I just want to make sure it's quantified that the pure userland
interpolation approach won't work before we go adding in extra
in-kernel logic.
(We'd obviously rather do the logic that can be done in userland in userland)
thanks
-john
Kan!
On Mon, Feb 13 2023 at 11:07, kan liang wrote:
> From: Kan Liang <[email protected]>
> + } else if (perf_event_hw_time(event)) {
> + struct ktime_conv mono;
> +
> + userpg->cap_user_time_mono_raw = 1;
> + ktime_get_fast_mono_raw_conv(&mono);
What guarantees that the clocksource used by the timekeeping core is
actually TSC? Nothing at all. You cannot make assumptions here.
Thanks,
tglx
On 2023-02-14 2:37 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <[email protected]> wrote:
>> On 2023-02-14 5:43 a.m., Peter Zijlstra wrote:
>>> On Mon, Feb 13, 2023 at 02:22:39PM -0800, John Stultz wrote:
>>>> The interpoloation is pretty easy to do:
>>>>
>>>> do {
>>>> start= readtsc();
>>>> clock_gett(CLOCK_MONOTONIC_RAW, &ts);
>>>> end = readtsc();
>>>> delta = end-start;
>>>> } while (delta > THRESHOLD) // make sure the reads were not preempted
>>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>>
>>>> and be able to get you a fairly close matching of TSC to
>>>> CLOCK_MONOTONIC_RAW value.
>>>>
>>>> Once you have that mapping you can take a few samples and establish
>>>> the linear function.
>>>
>>> Right, this is how we do the TSC calibration in the first place, and if
>>> NTP can achieve high correctness over a network, then surely we can do
>>> better locally.
>>>
>>> That is, this scheme should work for all CLOCKs, not only MONOTONIC_RAW.
>>
>> If I understand correctly, the TSC calibration is done in the kernel.
>> The kernel keeps updating the mul/shift. We dump the mul/shift into the
>> perf mmap page for the user tools.
>
> Where is that done in the perf mmap? I wasn't aware.
The updating of the mul/shift for sched_clock should be done in
set_cyc2ns_scale() in tsc.c.
The perf user space tool mmaps a page to retrieve the enabling
time/running time from the kernel. On X86 and Arm, the conversion
information from HW time (TSC) to sched_clock/perf_time is also stored
in the page. Please see arch_perf_update_userpage(). In the perf
mmap, it only retrieves the current mul/shift information and writes it
into the page for the user space tool.
This V2 patch series tries to do the same thing for the monotonic raw
conversion. So the kernel internal mul/shift information has to be exposed.
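For reference, the mult/shift arithmetic that a user space tool applies to a raw cycle count can be sketched roughly as below. The struct cyc_conv and the function cyc_to_ns() are hypothetical names used only for illustration; the quotient/remainder split follows the recipe given in the comments of the perf mmap page definition for time_zero, time_mult and time_shift, and is not taken from this patch series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the conversion values read from the mmap page. */
struct cyc_conv {
	uint64_t time_zero;   /* base time in ns */
	uint32_t time_mult;   /* multiplier */
	uint16_t time_shift;  /* right shift applied after the multiply */
};

/*
 * Convert a raw cycle count to nanoseconds. The cycle count is split
 * into quotient and remainder around the shift so that the multiply
 * is less likely to overflow 64 bits.
 */
static uint64_t cyc_to_ns(const struct cyc_conv *c, uint64_t cyc)
{
	uint64_t quot, rem;

	assert(c->time_shift < 64);
	quot = cyc >> c->time_shift;
	rem = cyc & (((uint64_t)1 << c->time_shift) - 1);

	return c->time_zero + quot * c->time_mult +
	       ((rem * c->time_mult) >> c->time_shift);
}
```

With made-up values time_mult = 512 and time_shift = 10 (0.5 ns per cycle), a count of 2048 cycles maps to time_zero + 1024 ns.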
Thanks,
Kan
On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> > If I understand correctly, the idea is to let the user space tool run
> > the above interpoloation algorithm several times to 'guess' the atomic
> > mapping. Using the mapping information to covert the TSC from the PEBS
> > record. Is my understanding correct?
> >
> > If so, to be honest, I doubt we can get the accuracy we want.
> >
>
> I implemented a simple test to evaluate the error.
Very cool!
> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> at the start and end of perf cmd.
> MONO_RAW TSC
> start 89553516545645 223619715214239
> end 89562251233830 223641517000376
>
> Here is what I get via mult/shift conversion from this patch.
> MONO_RAW TSC
> PEBS 89555942691466 223625770878571
>
> Then I use the time information from start and end to create a linear
> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> 89555942692721.
> There is a 1255 ns difference.
> I tried several different PEBS records. The error is ~1000ns.
> I think it should be an observable error.
Interesting. That's a good bit higher than I'd expect, as I'd expect a
clock_gettime() call to take somewhere in the double-digit nanosecond
range on average, so the error should be within that.
Can you share your logic?
thanks
-john
On Tue, Feb 14, 2023 at 12:09 PM Liang, Kan <[email protected]> wrote:
> On 2023-02-14 2:37 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 9:46 AM Liang, Kan <[email protected]> wrote:
> >> If I understand correctly, the TSC calibration is done in the kernel.
> >> The kernel keeps updating the mul/shift. We dump the mul/shift into the
> >> perf mmap page for the user tools.
> >
> > Where is that done in the perf mmap? I wasn't aware.
>
> The updating of the mul/shift for sched_clock should be done in the
> set_cyc2ns_scale() in tsc.c
Thanks for the pointer!
> The perf user space tool mmap a page to retrieve the enabling
> time/running time from the kernel. On X86 and Arm, the conversion
> information from HW time (TSC) to sched_clock/perf_time is also stored
> in the page. Please see the arch_perf_update_userpage(). In the perf
> mmap, it only retrieve the current mul/shift information and write them
> into the page for the user space tool.
>
> This V2 patch series try to do the same thing for the monotonic raw
> conversion. So the kernel internal mul/shift information has to be exposed.
Ugh. Well, I think perf may have made a bad API choice here, so I'm
still going to push back on exposing timekeeping internals to
userland.
But I do suspect that with ways to provide paired TSC/CLOCK_MONOTONIC
values, you should be able to get the same functionality in userland
as if the underlying data was shared.
thanks
-john
On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
> Kan!
>
> On Mon, Feb 13 2023 at 11:07, kan liang wrote:
>> From: Kan Liang <[email protected]>
>> + } else if (perf_event_hw_time(event)) {
>> + struct ktime_conv mono;
>> +
>> + userpg->cap_user_time_mono_raw = 1;
>> + ktime_get_fast_mono_raw_conv(&mono);
>
> What guarantees that the clocksource used by the timekeeping core is
> actually TSC? Nothing at all. You cannot make assumptions here.
>
Yes, you are right.
I will add a check to make sure the clocksource is TSC when perf does
the conversion.
Could you please comment on whether the patch is in the right direction?
This V2 patch series exposes the kernel internal conversion information
to user space. Is that OK with you?
Thanks,
Kan
On 2023-02-14 3:11 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>> If I understand correctly, the idea is to let the user space tool run
>>> the above interpoloation algorithm several times to 'guess' the atomic
>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>> record. Is my understanding correct?
>>>
>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>
>>
>> I implemented a simple test to evaluate the error.
>
> Very cool!
>
>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>> at the start and end of perf cmd.
>> MONO_RAW TSC
>> start 89553516545645 223619715214239
>> end 89562251233830 223641517000376
>>
>> Here is what I get via mult/shift conversion from this patch.
>> MONO_RAW TSC
>> PEBS 89555942691466 223625770878571
>>
>> Then I use the time information from start and end to create a linear
>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>> 89555942692721.
>> There is a 1255 ns difference.
>> I tried several different PEBS records. The error is ~1000ns.
>> I think it should be an observable error.
>
> Interesting. That's a good bit higher than I'd expect as I'd expect a
> clock_gettime() call to take ~ double digit nanoseconds range on
> average, so the error should be within that.
>
> Can you share your logic?
>
I ran the algorithm right before and after the perf command as below.
(The source code of time is attached.)
$./time
$perf record -e cycles:upp --clockid monotonic_raw $some_workaround
$./time
The time program dumps both MONO_RAW and TSC. That's where "start" and
"end" come from.
The perf command prints out both the TSC and the converted MONO_RAW (using
the mul/shift from this patch series). That's where the "PEBS" value comes
from.
Then I use the formula below to calculate the guessed MONO_RAW of the PEBS TSC.
Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
(end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
The guessed_MONO_RAW is 89555942692721.
The PEBS_MONO_RAW is 89555942691466.
The difference is 1255 ns.
Is the calculation correct?
Thanks,
Kan
On Tue, Feb 14 2023 at 15:21, Kan Liang wrote:
> On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
>>
>> What guarantees that the clocksource used by the timekeeping core is
>> actually TSC? Nothing at all. You cannot make assumptions here.
>>
>
> Yes, you are right.
> I will add a check to make sure the clocksource is TSC when perf does
> the conversion.
>
> Could you please comment on whether the patch is in the right direction?
> This V2 patch series expose the kernel internal conversion information
> into the user space. Is it OK for you?
Making the conversion info an ABI is suboptimal at best. I'm still
trying to wrap my brain around all of this. Will reply over there once
my confusion subsides.
Thanks,
tglx
On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <[email protected]> wrote:
> On 2023-02-14 3:11 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
> >> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> >>> If I understand correctly, the idea is to let the user space tool run
> >>> the above interpoloation algorithm several times to 'guess' the atomic
> >>> mapping. Using the mapping information to covert the TSC from the PEBS
> >>> record. Is my understanding correct?
> >>>
> >>> If so, to be honest, I doubt we can get the accuracy we want.
> >>>
> >>
> >> I implemented a simple test to evaluate the error.
> >
> > Very cool!
> >
> >> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> >> at the start and end of perf cmd.
> >> MONO_RAW TSC
> >> start 89553516545645 223619715214239
> >> end 89562251233830 223641517000376
> >>
> >> Here is what I get via mult/shift conversion from this patch.
> >> MONO_RAW TSC
> >> PEBS 89555942691466 223625770878571
> >>
> >> Then I use the time information from start and end to create a linear
> >> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> >> 89555942692721.
> >> There is a 1255 ns difference.
> >> I tried several different PEBS records. The error is ~1000ns.
> >> I think it should be an observable error.
> >
> > Interesting. That's a good bit higher than I'd expect as I'd expect a
> > clock_gettime() call to take ~ double digit nanoseconds range on
> > average, so the error should be within that.
> >
> > Can you share your logic?
> >
>
> I run the algorithm right before and after the perf command as below.
> (The source code of time is attached.)
>
> $./time
> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> $./time
>
> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
> from.
> The perf command print out both TSC and converted MONO_RAW (using the
> mul/shift from this patch series). That's where "PEBS" value from.
>
> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>
> The guessed_MONO_RAW is 89555942692721.
> The PEBS_MONO_RAW is 89555942691466.
> The difference is 1255.
>
> Is the calculation correct?
Thanks for sharing it. The equation you have there looks ok at a high
level for the values you captured (there are small tweaks like doing the
mult before the div to make sure you don't hit integer precision
issues, but I didn't see that with your results).
I've got a todo to try to see how the calculation changes if we do
provide atomic TSC/RAW stamps here, but I got a little busy with other
work and haven't gotten to it.
So my apologies, but I'll try to get back to this soon.
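As a minimal sketch of that tweak, the interpolation can multiply first and divide last using a 128-bit intermediate. The names tsc_sample and interp_ns() are hypothetical, and unsigned __int128 is a GCC/Clang extension rather than standard C, so treat this as an illustration under those assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* One paired sample: a TSC value and the MONOTONIC_RAW time (ns) mapped to it. */
struct tsc_sample {
	uint64_t tsc;
	uint64_t ns;
};

/*
 * Linearly interpolate the time for tsc from samples a and b.
 * Multiply before dividing, in 128 bits, so the integer division does
 * not throw away the fractional part of the slope.
 * unsigned __int128 is a GCC/Clang extension.
 */
static uint64_t interp_ns(struct tsc_sample a, struct tsc_sample b, uint64_t tsc)
{
	unsigned __int128 scaled;

	assert(b.tsc > a.tsc && b.ns >= a.ns);
	assert(tsc >= a.tsc);

	scaled = (unsigned __int128)(tsc - a.tsc) * (b.ns - a.ns);
	return a.ns + (uint64_t)(scaled / (b.tsc - a.tsc));
}
```

Feeding it the "start" and "end" pairs and the PEBS TSC quoted above should land within a few ns of the 89555942692721 guess, so integer precision does not appear to explain the ~1255 ns gap.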
thanks
-john
Hi John,
On 2023-02-17 6:11 p.m., John Stultz wrote:
> On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <[email protected]> wrote:
>> On 2023-02-14 3:11 p.m., John Stultz wrote:
>>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
>>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>>>> If I understand correctly, the idea is to let the user space tool run
>>>>> the above interpoloation algorithm several times to 'guess' the atomic
>>>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>>>> record. Is my understanding correct?
>>>>>
>>>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>>>
>>>>
>>>> I implemented a simple test to evaluate the error.
>>>
>>> Very cool!
>>>
>>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>>>> at the start and end of perf cmd.
>>>> MONO_RAW TSC
>>>> start 89553516545645 223619715214239
>>>> end 89562251233830 223641517000376
>>>>
>>>> Here is what I get via mult/shift conversion from this patch.
>>>> MONO_RAW TSC
>>>> PEBS 89555942691466 223625770878571
>>>>
>>>> Then I use the time information from start and end to create a linear
>>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>>>> 89555942692721.
>>>> There is a 1255 ns difference.
>>>> I tried several different PEBS records. The error is ~1000ns.
>>>> I think it should be an observable error.
>>>
>>> Interesting. That's a good bit higher than I'd expect as I'd expect a
>>> clock_gettime() call to take ~ double digit nanoseconds range on
>>> average, so the error should be within that.
>>>
>>> Can you share your logic?
>>>
>>
>> I run the algorithm right before and after the perf command as below.
>> (The source code of time is attached.)
>>
>> $./time
>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>> $./time
>>
>> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
>> from.
>> The perf command print out both TSC and converted MONO_RAW (using the
>> mul/shift from this patch series). That's where "PEBS" value from.
>>
>> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
>> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
>> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>>
>> The guessed_MONO_RAW is 89555942692721.
>> The PEBS_MONO_RAW is 89555942691466.
>> The difference is 1255.
>>
>> Is the calculation correct?
>
> Thanks for sharing it. The equation you have there looks ok at a high
> level for the values you captured (there's small tweaks like doing the
> mult before the div to make sure you don't hit integer precision
> issues, but I didn't see that with your results).
>
> I've got a todo to try to see how the calculation changes if we do
> provide atomic TSC/RAW stamps, here but I got a little busy with other
> work and haven't gotten to it.
> So my apologies, but I'll try to get back to this soon.
>
Have you had a chance to try the idea?
I just want to check whether the userspace interpolation approach works.
Should I prepare V3 and go back to the kernel solution?
Thanks,
Kan
On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <[email protected]> wrote:
> On 2023-02-17 6:11 p.m., John Stultz wrote:
> > On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <[email protected]> wrote:
> >> On 2023-02-14 3:11 p.m., John Stultz wrote:
> >>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
> >>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
> >>>>> If I understand correctly, the idea is to let the user space tool run
> >>>>> the above interpoloation algorithm several times to 'guess' the atomic
> >>>>> mapping. Using the mapping information to covert the TSC from the PEBS
> >>>>> record. Is my understanding correct?
> >>>>>
> >>>>> If so, to be honest, I doubt we can get the accuracy we want.
> >>>>>
> >>>>
> >>>> I implemented a simple test to evaluate the error.
> >>>
> >>> Very cool!
> >>>
> >>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
> >>>> at the start and end of perf cmd.
> >>>> MONO_RAW TSC
> >>>> start 89553516545645 223619715214239
> >>>> end 89562251233830 223641517000376
> >>>>
> >>>> Here is what I get via mult/shift conversion from this patch.
> >>>> MONO_RAW TSC
> >>>> PEBS 89555942691466 223625770878571
> >>>>
> >>>> Then I use the time information from start and end to create a linear
> >>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
> >>>> 89555942692721.
> >>>> There is a 1255 ns difference.
> >>>> I tried several different PEBS records. The error is ~1000ns.
> >>>> I think it should be an observable error.
> >>>
> >>> Interesting. That's a good bit higher than I'd expect as I'd expect a
> >>> clock_gettime() call to take ~ double digit nanoseconds range on
> >>> average, so the error should be within that.
> >>>
> >>> Can you share your logic?
> >>>
> >>
> >> I run the algorithm right before and after the perf command as below.
> >> (The source code of time is attached.)
> >>
> >> $./time
> >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> >> $./time
> >>
> >> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
> >> from.
> >> The perf command print out both TSC and converted MONO_RAW (using the
> >> mul/shift from this patch series). That's where "PEBS" value from.
> >>
> >> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
> >> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
> >> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
> >>
> >> The guessed_MONO_RAW is 89555942692721.
> >> The PEBS_MONO_RAW is 89555942691466.
> >> The difference is 1255.
> >>
> >> Is the calculation correct?
> >
> > Thanks for sharing it. The equation you have there looks ok at a high
> > level for the values you captured (there's small tweaks like doing the
> > mult before the div to make sure you don't hit integer precision
> > issues, but I didn't see that with your results).
> >
> > I've got a todo to try to see how the calculation changes if we do
> > provide atomic TSC/RAW stamps, here but I got a little busy with other
> > work and haven't gotten to it.
> > So my apologies, but I'll try to get back to this soon.
> >
>
> Have you got a chance to try the idea?
>
> I just want to check whether the userspace interpolation approach works.
> Should I prepare V3 and go back to the kernel solution?
Oh, my apologies. I had some other work come up and this fell off my plate.
So I spent a little bit of time today adding some trace_printks to the
timekeeping code so I could record the actual TSC and timestamps being
calculated from CLOCK_MONOTONIC_RAW.
I did catch one error in the test code, which unfortunately I'm to blame for:
mid = start + (delta +(delta/2))/2; //round-closest
That should be
mid = start + (delta +(2/2))/2; //round-closest
or more simply
mid = start + (delta +1)/2; //round-closest
Generalized rounding should be: (value + (DIV/2))/DIV, but I'm
guessing with two as the divisor, my brain mixed it up and typed
"delta". My apologies!
With that fix, I'm seeing closer to ~500ns of error in the
interpolation, just using the userland sampling. Now, I've also
disabled vsyscalls for this (otherwise I wouldn't be able to
trace_printk), so the error likely would be higher than with
vsyscalls.
Now, part of the error is that:
start= rdtsc();
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
end = rdtsc();
Ends up looking like
start= rdtsc();
clock_gettime() {
    now = rdtsc();
    delta = now - last;
    ns = (delta * mult) >> shift;
[~midpoint~]
    ts->nsec = base_ns + ns;
    ts->sec = base_sec;
    normalize_ts(ts);
}
end = rdtsc();
And so by taking the mid-point we're always a little skewed from where
the tsc was actually read. Looking at the data for my case the tsc
read seems to be ~12% in, so you could instead try:
delta = end - start;
p12 = start + ((delta * 12) + (100/2))/100;
With that adjustment, I'm seeing error around ~40ns.
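To make the rounding and the skewed estimate concrete, here is a small sketch. The helper names div_round_closest() and tsc_at_percent() are hypothetical, and the 12% figure is only what was observed on one machine, so it should be treated as a tunable assumption rather than a constant.

```c
#include <assert.h>
#include <stdint.h>

/* Divide rounding to the closest integer: (value + div/2) / div. */
static uint64_t div_round_closest(uint64_t value, uint64_t div)
{
	assert(div != 0);
	return (value + div / 2) / div;
}

/*
 * Estimate the TSC value that clock_gettime() used, placed pct percent
 * of the way into the [start, end] window. pct = 50 is the midpoint;
 * pct = 12 is the skewed estimate suggested above.
 */
static uint64_t tsc_at_percent(uint64_t start, uint64_t end, unsigned int pct)
{
	assert(end >= start && pct <= 100);
	return start + div_round_closest((end - start) * pct, 100);
}
```

Calling tsc_at_percent(start, end, 50) gives the same value as start + (delta + 1)/2, so the midpoint and the p12 estimate can share one code path.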
Mind giving that a try?
Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to
calculate it (maybe the driver accesses this via a special internal
timekeeping interface), in my testing interpolating will give you
sub-ns error. So I think this is workable without exposing quite so
much to userland.
thanks
-john
On 2023-03-08 8:17 p.m., John Stultz wrote:
> On Wed, Mar 8, 2023 at 10:44 AM Liang, Kan <[email protected]> wrote:
>> On 2023-02-17 6:11 p.m., John Stultz wrote:
>>> On Tue, Feb 14, 2023 at 12:38 PM Liang, Kan <[email protected]> wrote:
>>>> On 2023-02-14 3:11 p.m., John Stultz wrote:
>>>>> On Tue, Feb 14, 2023 at 9:00 AM Liang, Kan <[email protected]> wrote:
>>>>>> On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>>>>>>> If I understand correctly, the idea is to let the user space tool run
>>>>>>> the above interpoloation algorithm several times to 'guess' the atomic
>>>>>>> mapping. Using the mapping information to covert the TSC from the PEBS
>>>>>>> record. Is my understanding correct?
>>>>>>>
>>>>>>> If so, to be honest, I doubt we can get the accuracy we want.
>>>>>>>
>>>>>>
>>>>>> I implemented a simple test to evaluate the error.
>>>>>
>>>>> Very cool!
>>>>>
>>>>>> I collected TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
>>>>>> at the start and end of perf cmd.
>>>>>> MONO_RAW TSC
>>>>>> start 89553516545645 223619715214239
>>>>>> end 89562251233830 223641517000376
>>>>>>
>>>>>> Here is what I get via mult/shift conversion from this patch.
>>>>>> MONO_RAW TSC
>>>>>> PEBS 89555942691466 223625770878571
>>>>>>
>>>>>> Then I use the time information from start and end to create a linear
>>>>>> function and 'guess' the MONO_RAW of PEBS from the TSC. I get
>>>>>> 89555942692721.
>>>>>> There is a 1255 ns difference.
>>>>>> I tried several different PEBS records. The error is ~1000ns.
>>>>>> I think it should be an observable error.
>>>>>
>>>>> Interesting. That's a good bit higher than I'd expect as I'd expect a
>>>>> clock_gettime() call to take ~ double digit nanoseconds range on
>>>>> average, so the error should be within that.
>>>>>
>>>>> Can you share your logic?
>>>>>
>>>>
>>>> I run the algorithm right before and after the perf command as below.
>>>> (The source code of time is attached.)
>>>>
>>>> $./time
>>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>>>> $./time
>>>>
>>>> The time will dump both MONO_RAW and TSC. That's where "start" and "end"
>>>> from.
>>>> The perf command print out both TSC and converted MONO_RAW (using the
>>>> mul/shift from this patch series). That's where "PEBS" value from.
>>>>
>>>> Than I use the below formula to calculate the guessed MONO_RAW of PEBS TSC.
>>>> Guessed_MONO_RAW = (PEBS_TSC - start_TSC) / (end_TSC - start_TSC) *
>>>> (end_MONO_RAW - start_MONO_RAW) + start_MONO_RAW.
>>>>
>>>> The guessed_MONO_RAW is 89555942692721.
>>>> The PEBS_MONO_RAW is 89555942691466.
>>>> The difference is 1255.
>>>>
>>>> Is the calculation correct?
>>>
>>> Thanks for sharing it. The equation you have there looks ok at a high
>>> level for the values you captured (there's small tweaks like doing the
>>> mult before the div to make sure you don't hit integer precision
>>> issues, but I didn't see that with your results).
>>>
>>> I've got a todo to try to see how the calculation changes if we do
>>> provide atomic TSC/RAW stamps, here but I got a little busy with other
>>> work and haven't gotten to it.
>>> So my apologies, but I'll try to get back to this soon.
>>>
>>
>> Have you got a chance to try the idea?
>>
>> I just want to check whether the userspace interpolation approach works.
>> Should I prepare V3 and go back to the kernel solution?
>
> Oh, my apologies. I had some other work come up and this fell off my plate.
>
> So I spent a little bit of time today adding some trace_printks to the
> timekeeping code so I could record the actual TSC and timestamps being
> calculated from CLOCK_MONOTONIC_RAW.
>
> I did catch one error in the test code, which unfortunately I'm to blame for:
> mid = start + (delta +(delta/2))/2; //round-closest
>
> That should be
> mid = start + (delta +(2/2))/2 //round-closest
> or more simply
> mid = start + (delta +1)/2; //round-closest
>
> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> guessing with two as the divisor, my brain mixed it up and typed
> "delta". My apologies!
>
> With that fix, I'm seeing closer to ~500ns of error in the
> interpolation, just using the userland sampling. Now, I've also
> disabled vsyscalls for this (otherwise I wouldn't be able to
> trace_printk), so the error likely would be higher than with
> vsyscalls.
>
> Now, part of the error is that:
> start= rdtsc();
> clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
> end = rdtsc();
>
> Ends up looking like
> start= rdtsc();
> clock_gettime() {
> now = rdtsc();
> delta = now - last;
> ns = (delta * mult) >> shift
> [~midpoint~]
> ts->nsec = base_ns + ns;
> ts->sec = base_sec;
> normalize_ts(ts)
> }
> end = rdtsc();
>
> And so by taking the mid-point we're always a little skewed from where
> the tsc was actually read. Looking at the data for my case the tsc
> read seems to be ~12% in, so you could instead try:
>
> delta = end - start;
> p12 = start + ((delta * 12) + (100/2))/100;
>
> With that adjustment, I'm seeing error around ~40ns.
>
> Mind giving that a try?
I tried both the new mid and p12. The error becomes even larger.
With the new mid (start + (delta +1)/2), the error is now ~3800ns.
With the p12 adjustment, the error is ~6700ns.
Here is how I run the test.
$./time
$perf record -e cycles:upp --clockid monotonic_raw $some_workaround
$./time
Here is some raw data.
For the first ./time,
start: 961886196018
end: 961886215603
MONO_RAW: 341485848531
For the second ./time,
start: 986870117783
end: 986870136152
MONO_RAW: 351495432044
Here is the time generated from one PEBS record.
TSC: 968210217271
PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
Using the new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
344019506897. The error is 3825ns.
Using the p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
The error is 6759ns.
Thanks,
Kan
>
> Now, if you had two snapshots of MONOTONIC_RAW + the TSC value used to
> calculate it(maybe the driver access this via a special internal
> timekeeping interface), in my testing interpolating will give you
> sub-ns error. So I think this is workable without exposing quite so
> much to userland.
>
> thanks
> -john
On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <[email protected]> wrote:
> On 2023-03-08 8:17 p.m., John Stultz wrote:
> > So I spent a little bit of time today adding some trace_printks to the
> > timekeeping code so I could record the actual TSC and timestamps being
> > calculated from CLOCK_MONOTONIC_RAW.
> >
> > I did catch one error in the test code, which unfortunately I'm to blame for:
> > mid = start + (delta +(delta/2))/2; //round-closest
> >
> > That should be
> > mid = start + (delta +(2/2))/2 //round-closest
> > or more simply
> > mid = start + (delta +1)/2; //round-closest
> >
> > Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> > guessing with two as the divisor, my brain mixed it up and typed
> > "delta". My apologies!
> >
> > With that fix, I'm seeing closer to ~500ns of error in the
> > interpolation, just using the userland sampling. Now, I've also
> > disabled vsyscalls for this (otherwise I wouldn't be able to
> > trace_printk), so the error likely would be higher than with
> > vsyscalls.
> >
> > Now, part of the error is that:
> > start= rdtsc();
> > clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
> > end = rdtsc();
> >
> > Ends up looking like
> > start= rdtsc();
> > clock_gettime() {
> > now = rdtsc();
> > delta = now - last;
> > ns = (delta * mult) >> shift
> > [~midpoint~]
> > ts->nsec = base_ns + ns;
> > ts->sec = base_sec;
> > normalize_ts(ts)
> > }
> > end = rdtsc();
> >
> > And so by taking the mid-point we're always a little skewed from where
> > the tsc was actually read. Looking at the data for my case the tsc
> > read seems to be ~12% in, so you could instead try:
> >
> > delta = end - start;
> > p12 = start + ((delta * 12) + (100/2))/100;
> >
> > With that adjustment, I'm seeing error around ~40ns.
> >
> > Mind giving that a try?
>
> I tried both the new mid and p12. The error becomes even larger.
>
> With new mid (start + (delta +1)/2), the error is now ~3800ns
> With p12 adjustment, the error is ~6700ns.
>
>
> Here is how I run the test.
> $./time
> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> $./time
>
> Here are some raw data.
>
> For the first ./time,
> start: 961886196018
> end: 961886215603
> MONO_RAW: 341485848531
>
> For the second ./time,
> start: 986870117783
> end: 986870136152
> MONO_RAW: 351495432044
>
> Here is the time generated from one PEBS record.
> TSC: 968210217271
> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>
> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
> 344019506897. The error is 3825ns.
> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
> The error is 6759ns
Huh. I dunno. That seems wild that the error increased.
Just in case something is going astray with the PEBS_MONO_RAW logic,
can you apply the hack patch I was using to display the MONOTONIC_RAW
values the kernel calculates?
https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
to get the output.
thanks
-john
> > conversion. So the kernel internal mul/shift information has to be exposed.
> Ugh. Well, I think perf may have made a bad API choice here, so I'm
> still going to push back on exposting timekeeping internals to
> userland.
It's not about the perf ABI.
The perf mmap mult/offset is for PT, which always has raw TSCs.
Without it the PT decoder couldn't supply wall clock time.
-Andi
On 2023-03-11 12:55 a.m., John Stultz wrote:
> On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <[email protected]> wrote:
>> On 2023-03-08 8:17 p.m., John Stultz wrote:
>>> So I spent a little bit of time today adding some trace_printks to the
>>> timekeeping code so I could record the actual TSC and timestamps being
>>> calculated from CLOCK_MONOTONIC_RAW.
>>>
>>> I did catch one error in the test code, which unfortunately I'm to blame for:
>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>
>>> That should be
>>> mid = start + (delta +(2/2))/2 //round-closest
>>> or more simply
>>> mid = start + (delta +1)/2; //round-closest
>>>
>>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
>>> guessing with two as the divisor, my brain mixed it up and typed
>>> "delta". My apologies!
>>>
>>> With that fix, I'm seeing closer to ~500ns of error in the
>>> interpolation, just using the userland sampling. Now, I've also
>>> disabled vsyscalls for this (otherwise I wouldn't be able to
>>> trace_printk), so the error likely would be higher than with
>>> vsyscalls.
>>>
>>> Now, part of the error is that:
>>> start= rdtsc();
>>> clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>> end = rdtsc();
>>>
>>> Ends up looking like
>>> start= rdtsc();
>>> clock_gettime() {
>>> now = rdtsc();
>>> delta = now - last;
>>> ns = (delta * mult) >> shift
>>> [~midpoint~]
>>> ts->nsec = base_ns + ns;
>>> ts->sec = base_sec;
>>> normalize_ts(ts)
>>> }
>>> end = rdtsc();
>>>
>>> And so by taking the mid-point we're always a little skewed from where
>>> the tsc was actually read. Looking at the data for my case the tsc
>>> read seems to be ~12% in, so you could instead try:
>>>
>>> delta = end - start;
>>> p12 = start + ((delta * 12) + (100/2))/100;
>>>
>>> With that adjustment, I'm seeing error around ~40ns.
>>>
>>> Mind giving that a try?
>>
>> I tried both the new mid and p12. The error becomes even larger.
>>
>> With new mid (start + (delta +1)/2), the error is now ~3800ns
>> With p12 adjustment, the error is ~6700ns.
>>
>>
>> Here is how I run the test.
>> $./time
>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>> $./time
>>
>> Here are some raw data.
>>
>> For the first ./time,
>> start: 961886196018
>> end: 961886215603
>> MONO_RAW: 341485848531
>>
>> For the second ./time,
>> start: 986870117783
>> end: 986870136152
>> MONO_RAW: 351495432044
>>
>> Here is the time generated from one PEBS record.
>> TSC: 968210217271
>> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>>
>> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
>> 344019506897. The error is 3825ns.
>> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
>> The error is 6759ns
>
> Huh. I dunno. That seems wild that the error increased.
>
> Just in case something is going astray with the PEBS_MONO_RAW logic,
> can you apply the hack patch I was using to display the MONOTONIC_RAW
> values the kernel calculates?
> https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
>
> It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
> to get the output.
>
$ ./time_3
start: 7358368893806 end: 7358368902944 delta: 9138
MONO_RAW: 2899739790738
MID: 7358368898375
P12: 7358368894903
$ sudo cat /sys/kernel/tracing/trace | grep time_3
time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64:
JDB: timekeeping_get_delta cycle_now: 7358368897679
time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64:
JDB: ktime_get_raw_ts64: 2899739790738
The error between MID and cycle_now is -696 TSC cycles (cycle_now - MID).
The error between P12 and cycle_now is 2776 TSC cycles (cycle_now - P12).
time_3.c is attached.
Thanks,
Kan
On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2023-03-11 12:55 a.m., John Stultz wrote:
> > On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <[email protected]> wrote:
> >> On 2023-03-08 8:17 p.m., John Stultz wrote:
> >>> So I spent a little bit of time today adding some trace_printks to the
> >>> timekeeping code so I could record the actual TSC and timestamps being
> >>> calculated from CLOCK_MONOTONIC_RAW.
> >>>
> >>> I did catch one error in the test code, which unfortunately I'm to blame for:
> >>> mid = start + (delta +(delta/2))/2; //round-closest
> >>>
> >>> That should be
> >>> mid = start + (delta +(2/2))/2 //round-closest
> >>> or more simply
> >>> mid = start + (delta +1)/2; //round-closest
> >>>
> >>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
> >>> guessing with two as the divisor, my brain mixed it up and typed
> >>> "delta". My apologies!
> >>>
> >>> With that fix, I'm seeing closer to ~500ns of error in the
> >>> interpolation, just using the userland sampling. Now, I've also
> >>> disabled vsyscalls for this (otherwise I wouldn't be able to
> >>> trace_printk), so the error likely would be higher than with
> >>> vsyscalls.
> >>>
> >>> Now, part of the error is that:
> >>> start= rdtsc();
> >>> clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
> >>> end = rdtsc();
> >>>
> >>> Ends up looking like
> >>> start= rdtsc();
> >>> clock_gettime() {
> >>> now = rdtsc();
> >>> delta = now - last;
> >>> ns = (delta * mult) >> shift
> >>> [~midpoint~]
> >>> ts->nsec = base_ns + ns;
> >>> ts->sec = base_sec;
> >>> normalize_ts(ts)
> >>> }
> >>> end = rdtsc();
> >>>
> >>> And so by taking the mid-point we're always a little skewed from where
> >>> the tsc was actually read. Looking at the data for my case the tsc
> >>> read seems to be ~12% in, so you could instead try:
> >>>
> >>> delta = end - start;
> >>> p12 = start + ((delta * 12) + (100/2))/100;
> >>>
> >>> With that adjustment, I'm seeing error around ~40ns.
> >>>
> >>> Mind giving that a try?
> >>
> >> I tried both the new mid and p12. The error becomes even larger.
> >>
> >> With new mid (start + (delta +1)/2), the error is now ~3800ns
> >> With p12 adjustment, the error is ~6700ns.
> >>
> >>
> >> Here is how I run the test.
> >> $./time
> >> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
> >> $./time
> >>
> >> Here are some raw data.
> >>
> >> For the first ./time,
> >> start: 961886196018
> >> end: 961886215603
> >> MONO_RAW: 341485848531
> >>
> >> For the second ./time,
> >> start: 986870117783
> >> end: 986870136152
> >> MONO_RAW: 351495432044
> >>
> >> Here is the time generated from one PEBS record.
> >> TSC: 968210217271
> >> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
> >>
> >> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
> >> 344019506897. The error is 3825ns.
> >> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
> >> The error is 6759ns
> >
> > Huh. I dunno. That seems wild that the error increased.
> >
> > Just in case something is going astray with the PEBS_MONO_RAW logic,
> > can you apply the hack patch I was using to display the MONOTONIC_RAW
> > values the kernel calculates?
> > https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
> >
> > It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
> > to get the output.
> >
>
>
> $ ./time_3
> start: 7358368893806 end: 7358368902944 delta: 9138
> MONO_RAW: 2899739790738
> MID: 7358368898375
> P12: 7358368894903
> $ sudo cat /sys/kernel/tracing/trace | grep time_3
> time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64:
> JDB: timekeeping_get_delta cycle_now: 7358368897679
> time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64:
> JDB: ktime_get_raw_ts64: 2899739790738
>
> The error between MID and cycle_now is -696ns
> The error between P12 and cycle_now is 2776ns
Hey Kan,
So I'm terribly sorry, I'm a bit underwater right now and haven't
had time to look deeper at this. The MID case you have above looks
closer to what I was seeing but I can't explain why the 12% case is
worse.
Since I feel it's not really fair to object to your patch but not have
the time to work through an alternative with you, I'm going to
withdraw my objection (though others may persist!).
I'd still really prefer if we avoided exposing internal timekeeping
state directly to userland, and it would be good to see some further
exploration in other directions, but there is the existing perf mmap
precedent (even if I dislike it). Sorry I can't be of more help to
find a better approach here. :(
thanks
-john
Hi John,
On 2023-03-18 2:02 a.m., John Stultz wrote:
> On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2023-03-11 12:55 a.m., John Stultz wrote:
>>> On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <[email protected]> wrote:
>>>> On 2023-03-08 8:17 p.m., John Stultz wrote:
>>>>> So I spent a little bit of time today adding some trace_printks to the
>>>>> timekeeping code so I could record the actual TSC and timestamps being
>>>>> calculated from CLOCK_MONOTONIC_RAW.
>>>>>
>>>>> I did catch one error in the test code, which unfortunately I'm to blame for:
>>>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>>>
>>>>> That should be
>>>>> mid = start + (delta +(2/2))/2 //round-closest
>>>>> or more simply
>>>>> mid = start + (delta +1)/2; //round-closest
>>>>>
>>>>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
>>>>> guessing with two as the divisor, my brain mixed it up and typed
>>>>> "delta". My apologies!
>>>>>
>>>>> With that fix, I'm seeing closer to ~500ns of error in the
>>>>> interpolation, just using the userland sampling. Now, I've also
>>>>> disabled vsyscalls for this (otherwise I wouldn't be able to
>>>>> trace_printk), so the error likely would be higher than with
>>>>> vsyscalls.
>>>>>
>>>>> Now, part of the error is that:
>>>>> start= rdtsc();
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>>>> end = rdtsc();
>>>>>
>>>>> Ends up looking like
>>>>> start= rdtsc();
>>>>> clock_gettime() {
>>>>> now = rdtsc();
>>>>> delta = now - last;
>>>>> ns = (delta * mult) >> shift
>>>>> [~midpoint~]
>>>>> ts->nsec = base_ns + ns;
>>>>> ts->sec = base_sec;
>>>>> normalize_ts(ts)
>>>>> }
>>>>> end = rdtsc();
>>>>>
>>>>> And so by taking the mid-point we're always a little skewed from where
>>>>> the tsc was actually read. Looking at the data for my case the tsc
>>>>> read seems to be ~12% in, so you could instead try:
>>>>>
>>>>> delta = end - start;
>>>>> p12 = start + ((delta * 12) + (100/2))/100;
>>>>>
>>>>> With that adjustment, I'm seeing error around ~40ns.
>>>>>
>>>>> Mind giving that a try?
>>>>
>>>> I tried both the new mid and p12. The error becomes even larger.
>>>>
>>>> With new mid (start + (delta +1)/2), the error is now ~3800ns
>>>> With p12 adjustment, the error is ~6700ns.
>>>>
>>>>
>>>> Here is how I run the test.
>>>> $./time
>>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>>>> $./time
>>>>
>>>> Here are some raw data.
>>>>
>>>> For the first ./time,
>>>> start: 961886196018
>>>> end: 961886215603
>>>> MONO_RAW: 341485848531
>>>>
>>>> For the second ./time,
>>>> start: 986870117783
>>>> end: 986870136152
>>>> MONO_RAW: 351495432044
>>>>
>>>> Here is the time generated from one PEBS record.
>>>> TSC: 968210217271
>>>> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>>>>
>>>> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
>>>> 344019506897. The error is 3825ns.
>>>> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
>>>> The error is 6759ns
>>>
>>> Huh. I dunno. That seems wild that the error increased.
>>>
>>> Just in case something is going astray with the PEBS_MONO_RAW logic,
>>> can you apply the hack patch I was using to display the MONOTONIC_RAW
>>> values the kernel calculates?
>>> https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
>>>
>>> It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
>>> to get the output.
>>>
>>
>>
>> $ ./time_3
>> start: 7358368893806 end: 7358368902944 delta: 9138
>> MONO_RAW: 2899739790738
>> MID: 7358368898375
>> P12: 7358368894903
>> $ sudo cat /sys/kernel/tracing/trace | grep time_3
>> time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64:
>> JDB: timekeeping_get_delta cycle_now: 7358368897679
>> time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64:
>> JDB: ktime_get_raw_ts64: 2899739790738
>>
>> The error between MID and cycle_now is -696ns
>> The error between P12 and cycle_now is 2776ns
>
> Hey Kan,
> So I'm terribly sorry, I'm a bit underwater right now and haven't
> had time to look deeper at this. The MID case you have above looks
> closer to what I was seeing but I can't explain why the 12% case is
> worse.
>
> Since I feel it's not really fair to object to your patch but not have
> the time to work through an alternative with you, I'm going to
> withdraw my objection (though others may persist!).
> I'd still really prefer if we avoided exposing internal timekeeping
> state directly to userland, and it would be good to see some further
> exploration in other directions, but there is the existing perf mmap
> precedent (even if I dislike it). Sorry I can't be of more help to
> find a better approach here. :(
>
Thank you all the same. I think we learned that the pure user-space
solution needs more work. For now, it is not a viable way to do the
monotonic raw conversion.
I have no idea how to do the post-processing conversion without the
internal conversion information.
So, for now, there seem to be only two candidate solutions.
- Pure kernel solution (similar to V1).
- Expose the internal conversion information to user space and do the
post-processing conversion there (V2).
I will ping Thomas in the other thread and see if he has any suggestions.
Thanks,
Kan
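For reference, the pure user-space approach discussed above can be sketched as a linear interpolation between two (TSC, CLOCK_MONOTONIC_RAW) samples taken before and after the perf run. The thread does not spell out the exact formula, so the interpolation below is an assumption, and the names tsc_sample and interp_mono_raw are hypothetical. It relies on the unsigned __int128 extension of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* One userland sample: an estimated TSC value paired with the
 * CLOCK_MONOTONIC_RAW reading (in ns) taken around it. */
struct tsc_sample {
	uint64_t tsc;
	uint64_t mono_raw_ns;
};

/*
 * Hypothetical helper: linearly interpolate the monotonic raw time of
 * `tsc` between samples a and b (a taken before b). The 128-bit
 * intermediate avoids overflow in the multiply; the result is rounded
 * to the closest nanosecond.
 */
static uint64_t interp_mono_raw(struct tsc_sample a, struct tsc_sample b,
                                uint64_t tsc)
{
	assert(b.tsc > a.tsc && tsc >= a.tsc);
	assert(b.mono_raw_ns >= a.mono_raw_ns);

	uint64_t span = b.tsc - a.tsc;
	unsigned __int128 num = (unsigned __int128)(tsc - a.tsc) *
				(b.mono_raw_ns - a.mono_raw_ns);

	return a.mono_raw_ns + (uint64_t)((num + span / 2) / span);
}
```

Fed the two ./time samples quoted earlier (using the corrected midpoints 961886205811 and 986870126968) and the PEBS TSC 968210217271, this lands within a nanosecond of the 344019506897 guess reported in the thread, which is about 3825ns away from the value derived from the kernel conversion information.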
Hi Thomas,
On 2023-02-14 3:55 p.m., Thomas Gleixner wrote:
> On Tue, Feb 14 2023 at 15:21, Kan Liang wrote:
>> On 2023-02-14 3:02 p.m., Thomas Gleixner wrote:
>>>
>>> What guarantees that the clocksource used by the timekeeping core is
>>> actually TSC? Nothing at all. You cannot make assumptions here.
>>>
>>
>> Yes, you are right.
>> I will add a check to make sure the clocksource is TSC when perf does
>> the conversion.
>>
>> Could you please comment on whether the patch is in the right direction?
>> This V2 patch series expose the kernel internal conversion information
>> into the user space. Is it OK for you?
>
> Making the conversion info an ABI is suboptimal at best. I'm still
> trying to wrap my brain around all of this. Will reply over there once
> my confusion subsides.
>
John and I have tried a pure user-space solution (which avoids exposing
the internal conversion info) to convert a given TSC to monotonic raw,
but it doesn't work well.
https://lore.kernel.org/lkml/CANDhNComKRDdZJ8SJECNdoAzQhmR3vu9yKAtp7NKDmECxff=fg@mail.gmail.com/
So, for now, there seem to be only two solutions.
Solution 1: Do the conversion in the kernel (similar to V1).
https://lore.kernel.org/lkml/[email protected]/
Solution 2: Expose the internal conversion information to user space
via perf mmap and do the post-processing conversion there (implemented in this V2).
Personally, I lean toward solution 1, because
- The current monotonic raw is calculated in the kernel as well.
Solution 1 just follows the existing method. It doesn't introduce
extra overhead.
- It avoids exposing the internal timekeeping state directly to userspace.
- It avoids exposing the internal timekeeping state directly to userspace.
What do you think? Are there other directions I can explore?
Thanks,
Kan
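To make solution 2 concrete, here is a minimal sketch of what the post-processing conversion could look like once the conversion information is available in user space. It follows the "ns = (delta * mult) >> shift" step quoted earlier in the thread. The struct raw_conv layout and its field names are illustrative assumptions, not the ABI proposed by the series, and the sketch ignores any sub-nanosecond remainder the kernel may carry. It also assumes the timekeeping clocksource is the TSC, which, as Thomas points out, is not guaranteed and would need to be checked. It relies on the unsigned __int128 extension of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical snapshot of the conversion information; the field names
 * are illustrative and are not the ABI proposed by the series.
 */
struct raw_conv {
	uint64_t cycle_last; /* TSC value at the last timekeeping update */
	uint64_t base_ns;    /* monotonic raw time at cycle_last, in ns */
	uint32_t mult;       /* cycles-to-ns multiplier */
	uint32_t shift;      /* cycles-to-ns shift */
};

/* ns = base_ns + (((tsc - cycle_last) * mult) >> shift) */
static uint64_t tsc_to_mono_raw(const struct raw_conv *c, uint64_t tsc)
{
	/* Only valid for TSC values at or after the snapshot. */
	assert(tsc >= c->cycle_last);

	unsigned __int128 scaled =
		(unsigned __int128)(tsc - c->cycle_last) * c->mult;

	return c->base_ns + (uint64_t)(scaled >> c->shift);
}
```

A tool would have to pick the snapshot that was current when the sample was taken; how snapshots are versioned and matched to samples is outside this sketch.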