2023-01-23 18:29:22

by Liang, Kan

Subject: [PATCH 0/3] Convert TSC to monotonic clock for PEBS

From: Kan Liang <[email protected]>

A Processor Event Based Sampling (PEBS) record includes a field that
provides the time stamp counter (TSC) value captured when the counter
overflowed and the PEBS record was generated. The accurate time stamp
can be used to reconcile user samples. However, the current PEBS code
can only convert the time stamp to sched_clock, which is not available
from user space. A way to convert a given TSC value to the user-visible
monotonic clock is required.

The perf_event subsystem only converts the TSC in an NMI handler. The
conversion function must therefore be fast and NMI safe.

The below two existing functions were considered, but neither fulfills
the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the
current monotonic time, not the monotonic time corresponding to a
given timestamp.
- get_device_system_crosststamp() can calculate the system time from
a given device time, but it is neither fast nor NMI safe.

Introduce a new generic interface, get_mono_fast_from_given_time(), to
convert a given timestamp to clock monotonic.

Kan Liang (3):
timekeeping: NMI safe converter from a given time to monotonic
x86/tsc: Add set_tsc_system_counterval
perf/x86/intel/ds: Support monotonic clock for PEBS

arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 30 +++++++++++++---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 6 ++++
include/linux/timekeeping.h | 9 +++++
kernel/time/timekeeping.c | 68 ++++++++++++++++++++++++++++++++++--
6 files changed, 108 insertions(+), 8 deletions(-)

--
2.35.1



2023-01-23 18:29:43

by Liang, Kan

Subject: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

From: Kan Liang <[email protected]>

It's useful to provide an NMI-safe function to convert a given time to
monotonic. For example, the perf_event subsystem wants to convert the
TSC of a PEBS record to a monotonic clock in an NMI handler.

The below two existing functions were considered, but neither fulfills
the above requirements.
- ktime_get_mono_fast_ns() is NMI safe, but it can only return the
current monotonic time, not the monotonic time corresponding to a
given timestamp.
- get_device_system_crosststamp() can calculate the system time from
a given device time, but it is neither fast nor NMI safe.

The new function is a combination of the two existing ones. It uses the
tk_fast_mono timekeeper to be fast and NMI safe, and the get_time_fn
callback to retrieve the given timestamp and its clocksource. History
is not supported, since the perf case doesn't really need it; it can be
added later once there is a use case.

Signed-off-by: Kan Liang <[email protected]>
---
include/linux/timekeeping.h | 9 +++++
kernel/time/timekeeping.c | 68 +++++++++++++++++++++++++++++++++++--
2 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..234fa87a846b 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -289,6 +289,15 @@ extern int get_device_system_crosststamp(
struct system_time_snapshot *history,
struct system_device_crosststamp *xtstamp);

+/*
+ * Fast NMI safe way to convert a given timestamp to clock monotonic
+ */
+extern int get_mono_fast_from_given_time(int (*get_time_fn)
+ (struct system_counterval_t *sys_counterval,
+ void *ctx),
+ void *ctx,
+ u64 *mono_ns);
+
/*
* Simultaneously snapshot realtime and monotonic raw clocks
*/
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5579ead449f2..5bd32b2981fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -431,14 +431,19 @@ static void update_fast_timekeeper(const struct tk_read_base *tkr,
memcpy(base + 1, base, sizeof(*base));
}

-static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+static __always_inline u64 fast_tk_get_delta_ns_from_cycles(struct tk_read_base *tkr,
+ u64 cycles)
{
- u64 delta, cycles = tk_clock_read(tkr);
+ u64 delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);

- delta = clocksource_delta(cycles, tkr->cycle_last, tkr->mask);
return timekeeping_delta_to_ns(tkr, delta);
}

+static __always_inline u64 fast_tk_get_delta_ns(struct tk_read_base *tkr)
+{
+ return fast_tk_get_delta_ns_from_cycles(tkr, tk_clock_read(tkr));
+}
+
static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
{
struct tk_read_base *tkr;
@@ -1303,6 +1308,63 @@ int get_device_system_crosststamp(int (*get_time_fn)
}
EXPORT_SYMBOL_GPL(get_device_system_crosststamp);

+/**
+ * get_mono_fast_from_given_time - Fast NMI safe access to convert a given
+ * timestamp to clock monotonic
+ * @get_time_fn: Callback to get the given time and its clocksource
+ * @ctx: Context passed to get_time_fn()
+ * @mono_ns: The monotonic time of the given time
+ */
+int notrace get_mono_fast_from_given_time(int (*get_time_fn)
+ (struct system_counterval_t *sys_counterval,
+ void *ctx),
+ void *ctx,
+ u64 *mono_ns)
+{
+ struct system_counterval_t system_counterval;
+ struct tk_fast *tkf = &tk_fast_mono;
+ u64 cycles, now, interval_start;
+ struct tk_read_base *tkr;
+ unsigned int seq;
+ int ret;
+
+ do {
+ seq = raw_read_seqcount_latch(&tkf->seq);
+ tkr = tkf->base + (seq & 0x01);
+
+ ret = get_time_fn(&system_counterval, ctx);
+ if (ret)
+ return ret;
+
+ /*
+ * Verify that the clocksource associated with the given
+ * timestamp is the same as the currently installed
+ * timekeeper clocksource
+ */
+ if (tkr->clock != system_counterval.cs)
+ return -EOPNOTSUPP;
+ cycles = system_counterval.cycles;
+
+ /*
+ * Check whether the given timestamp is on the current
+ * timekeeping interval.
+ */
+ now = tk_clock_read(tkr);
+ interval_start = tkr->cycle_last;
+ if (!cycle_between(interval_start, cycles, now))
+ return -EOPNOTSUPP;
+
+ now = ktime_to_ns(tkr->base);
+ now += fast_tk_get_delta_ns_from_cycles(tkr, cycles);
+
+ } while (read_seqcount_latch_retry(&tkf->seq, seq));
+
+ *mono_ns = now;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(get_mono_fast_from_given_time);
+
/**
* do_settimeofday64 - Sets the time of day.
* @ts: pointer to the timespec64 variable containing the new time
--
2.35.1


2023-01-23 18:30:16

by Liang, Kan

Subject: [PATCH 2/3] x86/tsc: Add set_tsc_system_counterval

From: Kan Liang <[email protected]>

The perf_event subsystem wants to convert the TSC of a PEBS record to a
monotonic clock. Introduce a new function that provides both the TSC
value and the TSC clocksource information in the format required by the
callback function from the timekeeping code.

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index fbdc3d951494..45aac88b5ce4 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -29,6 +29,7 @@ static inline cycles_t get_cycles(void)

extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
+extern struct system_counterval_t set_tsc_system_counterval(u64 tsc);

extern void tsc_early_init(void);
extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74..45803e65630e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1311,6 +1311,12 @@ struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns)
}
EXPORT_SYMBOL(convert_art_ns_to_tsc);

+struct system_counterval_t set_tsc_system_counterval(u64 tsc)
+{
+ return (struct system_counterval_t) { .cs = &clocksource_tsc,
+ .cycles = tsc};
+}
+EXPORT_SYMBOL(set_tsc_system_counterval);

static void tsc_refine_calibration_work(struct work_struct *work);
static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);
--
2.35.1


2023-01-23 18:30:50

by Liang, Kan

Subject: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS

From: Kan Liang <[email protected]>

Users try to reconcile user samples with PEBS samples and require a
common clock source. However, the current PEBS code only converts to
sched_clock, which is not available from user space.

Support converting only to clock monotonic. Having one common clock
source is good enough to fulfill the requirement.

Enable the large PEBS for the monotonic clock to reduce the PEBS
overhead.

There are a few rare cases that may make the conversion fail, e.g., a
TSC overflow, or the cycle_last value changing between samples. In
those cases the time falls back to the less accurate SW timestamp, but
they are extremely unlikely to happen.

Signed-off-by: Kan Liang <[email protected]>
---

This patch has to be applied on top of the patch below:
https://lore.kernel.org/all/[email protected]/

arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 14f0a746257d..ea194556cc73 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
{
unsigned long flags = x86_pmu.large_pebs_flags;

- if (event->attr.use_clockid)
+ if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
flags &= ~PERF_SAMPLE_TIME;
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7980e92dec64..d7f0eaf4405c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
return val;
}

+static int pebs_get_synctime(struct system_counterval_t *system,
+ void *ctx)
+{
+ *system = set_tsc_system_counterval(*(u64 *)ctx);
+ return 0;
+}
+
+static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
+{
+ /* Only support converting to clock monotonic */
+ if (clk_id != CLOCK_MONOTONIC)
+ return -EINVAL;
+
+ return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
+}
+
static void setup_pebs_time(struct perf_event *event,
struct perf_sample_data *data,
u64 tsc)
{
- /* Converting to a user-defined clock is not supported yet. */
- if (event->attr.use_clockid != 0)
- return;
+ u64 time;
+
+ if (event->attr.use_clockid != 0) {
+ if (pebs_clockid_time(event->attr.clockid, tsc, &time))
+ return;
+ goto done;
+ }

/*
* Converting the TSC to perf time is only supported,
@@ -1587,8 +1607,10 @@ static void setup_pebs_time(struct perf_event *event,
*/
if (!using_native_sched_clock() || !sched_clock_stable())
return;
+ time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;

- data->time = native_sched_clock_from_tsc(tsc) + __sched_clock_offset;
+done:
+ data->time = time;
data->sample_flags |= PERF_SAMPLE_TIME;
}

--
2.35.1


2023-01-24 06:14:02

by John Stultz

Subject: Re: [PATCH 0/3] Convert TSC to monotonic clock for PEBS

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> A Processor Event Based Sampling (PEBS) record includes a field that
> provide the time stamp counter value when the counter was overflowed
> and the PEBS record was generated. The accurate time stamp can be used
> to reconcile user samples. However, the current PEBS codes only can
> convert the time stamp to sched_clock, which is not available from user
> space. A solution to convert a given TSC to user visible monotonic
> clock is required.
>
> The perf_event subsystem only converts the TSC in a NMI handler. The
> converter function must be fast and NMI safe.
>
> Considered the below two existing functions, but none of them fulfill
> the above requirements.
> - The ktime_get_mono_fast_ns() is NMI safe, but it can only return the
> current clock monotonic rather than a given time's monotonic.
> - The get_device_system_crosststamp() can calculate the system time from
> a given device time. But it's not fast and NMI safe.

So, apologies if this is a silly question (my brain quickly evicts the
details on get_device_system_crosststamp every time I look at it), but
rather than introducing a new interface, what would it take to rework
the existing get_device_system_crosststamp() logic to be usable for
both use cases?

thanks
-john

2023-01-24 06:56:28

by John Stultz

Subject: Re: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Users try to reconcile user samples with PEBS samples and require a
> common clock source. However, the current PEBS codes only convert to
> sched_clock, which is not available from the user space.
>
> Only support converting to clock monotonic. Having one common clock
> source is good enough to fulfill the requirement.
>
> Enable the large PEBS for the monotonic clock to reduce the PEBS
> overhead.
>
> There are a few rare cases that may make the conversion fails. For
> example, TSC overflows. The cycle_last may be changed between samples.
> The time will fallback to the inaccurate SW times. But the cases are
> extremely unlikely to happen.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---

Thanks for sending this out!
A few minor style issues below and a warning.

> The patch has to be on top of the below patch
> https://lore.kernel.org/all/[email protected]/
>
> arch/x86/events/intel/core.c | 2 +-
> arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
> 2 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 14f0a746257d..ea194556cc73 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
> {
> unsigned long flags = x86_pmu.large_pebs_flags;
>
> - if (event->attr.use_clockid)
> + if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
> flags &= ~PERF_SAMPLE_TIME;
> if (!event->attr.exclude_kernel)
> flags &= ~PERF_SAMPLE_REGS_USER;
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 7980e92dec64..d7f0eaf4405c 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
> return val;
> }
>
> +static int pebs_get_synctime(struct system_counterval_t *system,
> + void *ctx)

Just because the abstract function type taken by
get_mono_fast_from_given_time is vague, doesn't mean the
implementation needs to be.
ctx is really a tsc value, right? So let's call it that to make this a
bit more readable.

> +{
> + *system = set_tsc_system_counterval(*(u64 *)ctx);
> + return 0;
> +}
> +
> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)

clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
right? Let's keep that explicit here.

> +{
> + /* Only support converting to clock monotonic */
> + if (clk_id != CLOCK_MONOTONIC)
> + return -EINVAL;
> +
> + return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
> +}
> +
> static void setup_pebs_time(struct perf_event *event,
> struct perf_sample_data *data,
> u64 tsc)
> {
> - /* Converting to a user-defined clock is not supported yet. */
> - if (event->attr.use_clockid != 0)
> - return;
> + u64 time;

Again, "time" is too generic a term without any context here.
mono_nsec or something would be more clear.

> +
> + if (event->attr.use_clockid != 0) {
> + if (pebs_clockid_time(event->attr.clockid, tsc, &time))
> + return;
> + goto done;
> + }

Apologies for this warning/rant:

So, I do get that the NMI safety of the "fast" time accessors (along
with the "high performance" sounding name!) is attractive, but as their
use expands I worry the downsides of these interfaces aren't made clear
enough.

The fast accessors *can* see time discontinuities! Because the logic
is done without holding the tk_core.seq lock, if you are reading in
the middle of an ntp adjustment, you may find the current value to be
larger than the next time you read the time. These discontinuities
are likely to be very small, but a negative delta will look very large
as a u64. So part of using these "fast *and unsafe*" interfaces is
that you get to keep both pieces when it breaks. Make sure the code
here that is using these interfaces guards against this (zeroing out
negative deltas).

thanks
-john

2023-01-24 07:01:37

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
> + (struct system_counterval_t *sys_counterval,
> + void *ctx),
> + void *ctx,
> + u64 *mono_ns)
> +{
> + struct system_counterval_t system_counterval;
> + struct tk_fast *tkf = &tk_fast_mono;
> + u64 cycles, now, interval_start;
> + struct tk_read_base *tkr;
> + unsigned int seq;
> + int ret;
> +
> + do {
> + seq = raw_read_seqcount_latch(&tkf->seq);
> + tkr = tkf->base + (seq & 0x01);
> +
> + ret = get_time_fn(&system_counterval, ctx);
> + if (ret)
> + return ret;
> +
> + /*
> + * Verify that the clocksource associated with the given
> + * timestamp is the same as the currently installed
> + * timekeeper clocksource
> + */
> + if (tkr->clock != system_counterval.cs)
> + return -EOPNOTSUPP;
> + cycles = system_counterval.cycles;
> +
> + /*
> + * Check whether the given timestamp is on the current
> + * timekeeping interval.
> + */
> + now = tk_clock_read(tkr);
> + interval_start = tkr->cycle_last;
> + if (!cycle_between(interval_start, cycles, now))
> + return -EOPNOTSUPP;

So. I've not fully thought this out, but it seems like it would be
quite likely that you'd run into the case where the cycle_last value
is updated and your earlier TSC timestamp isn't valid for the current
interval. The get_device_system_crosststamp() logic has a big chunk of
complex code to try to handle this case by interpolating the cycle
value back in time. How well does just failing in this case work out?

thanks
-john

2023-01-24 08:52:05

by Thomas Gleixner

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> From: Kan Liang <[email protected]>
>
> It's useful to provide a NMI safe function to convert a given time to
> monotonic. For example, the perf_event subsystem wants to convert a TSC
> of a PEBS record to a monotonic clock in a NMI handler.

Why? That's a postprocessing problem, really.

Thanks,

tglx

2023-01-24 09:10:17

by Stephane Eranian

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
> > From: Kan Liang <[email protected]>
> >
> > It's useful to provide a NMI safe function to convert a given time to
> > monotonic. For example, the perf_event subsystem wants to convert a TSC
> > of a PEBS record to a monotonic clock in a NMI handler.
>
> Why? That's a postprocessing problem, really.
>
Because you want to correlate samples captured by PEBS with samples
from applications timestamped with a user-available clock such as
CLOCK_MONOTONIC, for instance.
When I create a perf_event event and stipulate that
event_attr.clockid=MONOTONIC, I expect all the samples from
that event to be timestamped using the same clock source, regardless
of PEBS or IBS.



> Thanks,
>
> tglx

2023-01-24 15:09:21

by Liang, Kan

Subject: Re: [PATCH 0/3] Convert TSC to monotonic clock for PEBS



On 2023-01-24 1:13 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> A Processor Event Based Sampling (PEBS) record includes a field that
>> provide the time stamp counter value when the counter was overflowed
>> and the PEBS record was generated. The accurate time stamp can be used
>> to reconcile user samples. However, the current PEBS codes only can
>> convert the time stamp to sched_clock, which is not available from user
>> space. A solution to convert a given TSC to user visible monotonic
>> clock is required.
>>
>> The perf_event subsystem only converts the TSC in a NMI handler. The
>> converter function must be fast and NMI safe.
>>
>> Considered the below two existing functions, but none of them fulfill
>> the above requirements.
>> - The ktime_get_mono_fast_ns() is NMI safe, but it can only return the
>> current clock monotonic rather than a given time's monotonic.
>> - The get_device_system_crosststamp() can calculate the system time from
>> a given device time. But it's not fast and NMI safe.
>
> So, apologies if this is a silly question (my brain quickly evicts the
> details on get_device_system_crosststamp every time I look at it), but
> rather then introducing a new interface, what would it take to rework
> the existing get_device_system_crosststamp() logic to be usable for
> both use cases?
>

I once tried to rework the existing get_device_system_crosststamp(),
but finally gave up, because:
- The existing function is already very complex. Adding a new case
would make it even more complex and harder to maintain.
- Perf doesn't need all the logic of the existing function. For
example, the history is not required. (I think there is no problem for
perf if we cannot get values for some corner cases. The worst case for
perf is to fall back to the time captured in the NMI handler. It's not
very accurate, but it should be acceptable.) Performance is priority
one. We want a function with much simpler logic.
- If I understand correctly, we already introduced several dedicated
functions for fast NMI access, e.g., ktime_get_mono_fast_ns(). I think
we can follow the same idea.


Thanks,
Kan

2023-01-24 15:09:24

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 2:01 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>> +int notrace get_mono_fast_from_given_time(int (*get_time_fn)
>> + (struct system_counterval_t *sys_counterval,
>> + void *ctx),
>> + void *ctx,
>> + u64 *mono_ns)
>> +{
>> + struct system_counterval_t system_counterval;
>> + struct tk_fast *tkf = &tk_fast_mono;
>> + u64 cycles, now, interval_start;
>> + struct tk_read_base *tkr;
>> + unsigned int seq;
>> + int ret;
>> +
>> + do {
>> + seq = raw_read_seqcount_latch(&tkf->seq);
>> + tkr = tkf->base + (seq & 0x01);
>> +
>> + ret = get_time_fn(&system_counterval, ctx);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Verify that the clocksource associated with the given
>> + * timestamp is the same as the currently installed
>> + * timekeeper clocksource
>> + */
>> + if (tkr->clock != system_counterval.cs)
>> + return -EOPNOTSUPP;
>> + cycles = system_counterval.cycles;
>> +
>> + /*
>> + * Check whether the given timestamp is on the current
>> + * timekeeping interval.
>> + */
>> + now = tk_clock_read(tkr);
>> + interval_start = tkr->cycle_last;
>> + if (!cycle_between(interval_start, cycles, now))
>> + return -EOPNOTSUPP;
>
> So. I've not fully thought this out, but it seems like it would be
> quite likely that you'd run into the case where the cycle_last value
> is updated and your earlier TSC timestamp isn't valid for the current
> interval. The get_device_system_crosststamp() logic has a big chunk of
> complex code to try to handle this case by interpolating the cycle
> value back in time. How well does just failing in this case work out?
>

For that case, perf falls back to the time captured in the NMI handler,
via ktime_get_mono_fast_ns().

The TSC in PEBS is captured by HW when the sample is generated. There
should be only a small delta compared with the time captured in the NMI
handler, and I think that delta is acceptable as a backup solution for
most analysis cases. Also, I don't think the case (the cycle_last value
being updated during the monitoring) should occur very often either, so
I dropped the history support to simplify the function.

Thanks,
Kan

2023-01-24 15:17:36

by Liang, Kan

Subject: Re: [PATCH 3/3] perf/x86/intel/ds: Support monotonic clock for PEBS



On 2023-01-24 1:56 a.m., John Stultz wrote:
> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>
>> From: Kan Liang <[email protected]>
>>
>> Users try to reconcile user samples with PEBS samples and require a
>> common clock source. However, the current PEBS codes only convert to
>> sched_clock, which is not available from the user space.
>>
>> Only support converting to clock monotonic. Having one common clock
>> source is good enough to fulfill the requirement.
>>
>> Enable the large PEBS for the monotonic clock to reduce the PEBS
>> overhead.
>>
>> There are a few rare cases that may make the conversion fails. For
>> example, TSC overflows. The cycle_last may be changed between samples.
>> The time will fallback to the inaccurate SW times. But the cases are
>> extremely unlikely to happen.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>
> Thanks for sending this out!
> A few minor style issues below and a warning.

Thanks.

>
>> The patch has to be on top of the below patch
>> https://lore.kernel.org/all/[email protected]/
>>
>> arch/x86/events/intel/core.c | 2 +-
>> arch/x86/events/intel/ds.c | 30 ++++++++++++++++++++++++++----
>> 2 files changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 14f0a746257d..ea194556cc73 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3777,7 +3777,7 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
>> {
>> unsigned long flags = x86_pmu.large_pebs_flags;
>>
>> - if (event->attr.use_clockid)
>> + if (event->attr.use_clockid && (event->attr.clockid != CLOCK_MONOTONIC))
>> flags &= ~PERF_SAMPLE_TIME;
>> if (!event->attr.exclude_kernel)
>> flags &= ~PERF_SAMPLE_REGS_USER;
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 7980e92dec64..d7f0eaf4405c 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1570,13 +1570,33 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
>> return val;
>> }
>>
>> +static int pebs_get_synctime(struct system_counterval_t *system,
>> + void *ctx)
>
> Just because the abstract function type taken by
> get_mono_fast_from_given_time is vague, doesn't mean the
> implementation needs to be.
> ctx is really a tsc value, right? So let's call it that to make this a
> bit more readable.

Sure, I will use the tsc to replace ctx.

>
>> +{
>> + *system = set_tsc_system_counterval(*(u64 *)ctx);
>> + return 0;
>> +}
>> +
>> +static inline int pebs_clockid_time(clockid_t clk_id, u64 tsc, u64 *clk_id_time)
>
> clk_id_time is maybe a bit too fuzzy. It is really a mono_ns value,
> right? Let's keep that explicit here.

Yes. Will make it explicit.

>
>> +{
>> + /* Only support converting to clock monotonic */
>> + if (clk_id != CLOCK_MONOTONIC)
>> + return -EINVAL;
>> +
>> + return get_mono_fast_from_given_time(pebs_get_synctime, &tsc, clk_id_time);
>> +}
>> +
>> static void setup_pebs_time(struct perf_event *event,
>> struct perf_sample_data *data,
>> u64 tsc)
>> {
>> - /* Converting to a user-defined clock is not supported yet. */
>> - if (event->attr.use_clockid != 0)
>> - return;
>> + u64 time;
>
> Again, "time" is too generic a term without any context here.
> mono_nsec or something would be more clear.

Sure.

>
>> +
>> + if (event->attr.use_clockid != 0) {
>> + if (pebs_clockid_time(event->attr.clockid, tsc, &time))
>> + return;
>> + goto done;
>> + }
>
> Apologies for this warning/rant:
>
> So, I do get the NMI safety of the "fast" time accessors (along with
> the "high performance" sounding name!) is attractive, but as its use
> expands I worry the downsides of this interface isn't made clear
> enough.
>
> The fast accessors *can* see time discontinuities! Because the logic
> is done without holding the tk_core.seq lock, If you are reading in
> the middle of a ntp adjustment, you may find the current value to be
> larger than the next time you read the time. These discontinuities
> are likely to be very small, but a negative delta will look very large
> as a u64. So part of using these "fast *and unsafe*" interfaces is
> you get to keep both pieces when it breaks. Make sure the code here
> that is using these interfaces guards against this (zeroing out
> negative deltas).
>

Thanks for the warning.
I will add more comments and specially handle it here.

Thanks,
Kan

2023-01-24 16:07:05

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 4:10 a.m., Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>>
>> On Mon, Jan 23 2023 at 10:27, kan liang wrote:
>>> From: Kan Liang <[email protected]>
>>>
>>> It's useful to provide a NMI safe function to convert a given time to
>>> monotonic. For example, the perf_event subsystem wants to convert a TSC
>>> of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>

Besides the reason Stephane mentioned, the current perf tool assumes
that the kernel always returns the time in the clock it asked for. For
example, without this patch, the kernel uses ktime_get_mono_fast_ns()
to get the MONOTONIC time and dumps it to user space.

If we want to break that assumption, I think we have to modify the
current interfaces and dump extra data to indicate the clock source.
That makes the existing interface more complex. Preparing and dumping
the extra data also brings overhead, and we would also force users to
update their tools.


Thanks,
Kan

> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.
> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from
> that event to be timestamped using the same clock source, regardless
> of PEBS or IBS.
>
>
>
>> Thanks,
>>
>> tglx

2023-01-24 18:43:45

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 2:01 a.m., John Stultz wrote:
> > On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> >> + /*
> >> + * Check whether the given timestamp is on the current
> >> + * timekeeping interval.
> >> + */
> >> + now = tk_clock_read(tkr);
> >> + interval_start = tkr->cycle_last;
> >> + if (!cycle_between(interval_start, cycles, now))
> >> + return -EOPNOTSUPP;
> >
> > So. I've not fully thought this out, but it seems like it would be
> > quite likely that you'd run into the case where the cycle_last value
> > is updated and your earlier TSC timestamp isn't valid for the current
> > interval. The get_device_system_crosststamp() logic has a big chunk of
> > complex code to try to handle this case by interpolating the cycle
> > value back in time. How well does just failing in this case work out?
> >
>
> For the case, perf fallback to the time captured in the NMI handler, via
> ktime_get_mono_fast_ns().

This feels like *very* subtle behavior. Maybe I'm misunderstanding,
but the goal seems to be to have more accurate timestamps on the hw
events, and using the captured tsc timestamp avoids the measurement
latency of reading the time again. But if every timekeeping update
interval (~tick) you transparently get a delayed value due to the
fallback, it makes it hard to understand which timestamps are better
or worse. The latency between two reads may be real or may just be
bad luck. This doesn't intuitively seem like a great benefit over the
more consistent latency of just using ktime_get_mono_fast()
timestamping.

> The TSC in PEBS is captured by HW when the sample was generated. There
> should be a small delta compared with the time captured in the NMI
> handler. But I think the delta should be acceptable as a backup solution
> for most analysis cases. Also, I don't think the case (the
> cycle_last value being updated during the monitoring) should occur very
> often either. So I dropped the history support to simplify the function.

So the reads and this function are *always* used in NMI context? Has
this been stressed with things like SMIs to see how it does if
interrupted in those cases?

My worry is that (as I bored everyone earlier), the
ktime_get_*_fast_ns() interfaces already have some sharp edges and
need a fair amount of thought as to when they should be used. This is
sort of compounding that: adding an interface that has further special
cases where it can fail, making it difficult to fully understand and
easier to accidentally misuse.

My other concern is that interfaces always get stretched and used
beyond anything they were initially planned for (see the
ktime_get_*fast_ns() interfaces here as an example! :), and in this
case the logic seems to have lots of implicit dependencies on the
facts of your specific use case, so it seems a bit fragile should
folks on other architectures with other constraints try to use it.

So I just want to push a bit to think how you might be able to
extend/generalize the existing get_system_crosststamp for your
purposes, or alternatively find a way to simplify the logic's behavior
so its less tied to specific constraints ("this works most of the time
from NMI, but otherwise no promises"). Or at least some better
documentation around the code, its uses and its constraints? ( "NMI
safe" is not the same as "Only safe to use from NMI" :)

And apologies if I've misunderstood things here.

thanks
-john

2023-01-24 20:13:01

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 1:43 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>>> + /*
>>>> + * Check whether the given timestamp is on the current
>>>> + * timekeeping interval.
>>>> + */
>>>> + now = tk_clock_read(tkr);
>>>> + interval_start = tkr->cycle_last;
>>>> + if (!cycle_between(interval_start, cycles, now))
>>>> + return -EOPNOTSUPP;
>>>
>>> So. I've not fully thought this out, but it seems like it would be
>>> quite likely that you'd run into the case where the cycle_last value
>>> is updated and your earlier TSC timestamp isn't valid for the current
>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>> complex code to try to handle this case by interpolating the cycle
>>> value back in time. How well does just failing in this case work out?
>>>
>>
>> For that case, perf falls back to the time captured in the NMI handler, via
>> ktime_get_mono_fast_ns().
>
> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> but the goal seems to be to have more accurate timestamps on the hw
> events, and using the captured tsc timestamp avoids the latency of
> reading the time again. But if every timekeeping update
> interval (~tick) you transparently get a delayed value due to the
> fallback, it makes it hard to understand which timestamps are better
> or worse. The latency between two reads may be real or it may be just
> bad luck. This doesn't intuitively seem like a great benefit over more
> consistent latency of just using the ktime_get_mono_fast()
> timestamping.

Your understanding is correct. We want a more accurate timestamp for the
analysis work.

As I understand it, the timekeeping update should not happen very often. If
I read the code correctly, it should happen only when adjusting NTP or
suspending/resuming. If so, I think the drawback should not impact the
normal analysis work. I will call out the drawbacks in the comments
where the function is used.

>
>> The TSC in PEBS is captured by HW when the sample was generated. There
>> should be a small delta compared with the time captured in the NMI
>> handler. But I think the delta should be acceptable as a backup solution
>> for the most analysis cases. Also, I don't think the case (the
>> cycle_last value is updated during the monitoring) should occur very
>> often either. So I drop the history support to simplify the function.
>
> So the reads and this function are *always* used in NMI context? Has
> this been stressed with things like SMIs to see how it does if
> interrupted in those cases?

Yes, it's *always* and only used in NMI context.

>
> My worry is that (as I bored everyone earlier), the
> ktime_get_*_fast_ns() interfaces already have some sharp edges and
> need a fair amount of thought as to when they should be used. This is
> sort of compounding that adding an interface that has further special
> cases where it can fail, making it difficult to fully understand and
> easier to accidentally misuse.
>
> My other concern is that interfaces always get stretched and used
> beyond anything they were initially planned for (see the
> ktime_get_*fast_ns() interfaces here as an example! :), and in this
> case the logic seems to have lots of implicit dependencies on the
> facts of your specific use case, so it seems a bit fragile should
> folks on other architectures with other constraints try to use it.
>
> So I just want to push a bit to think how you might be able to
> extend/generalize the existing get_system_crosststamp for your
> purposes, or alternatively find a way to simplify the logic's behavior
> so it's less tied to specific constraints ("this works most of the time
> from NMI, but otherwise no promises"). Or at least some better
> documentation around the code, its uses and its constraints? ( "NMI
> safe" is not the same as "Only safe to use from NMI" :)

Since our usage is fixed (only in NMI), I prefer the latter. I think
extending/generalizing the existing function would only make it
extremely complex and inefficient. The new function should have the
same constraints as the existing ktime_get_mono_fast_ns(). Since perf
can live with ktime_get_mono_fast_ns(), there should be no problem
with the new function's constraints. I will add more comments to
clarify the usage and constraints to avoid abuse of the new function.

Thanks,
Kan


2023-01-24 20:33:58

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 1:43 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
> >> On 2023-01-24 2:01 a.m., John Stultz wrote:
> >>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
> >>>> + /*
> >>>> + * Check whether the given timestamp is on the current
> >>>> + * timekeeping interval.
> >>>> + */
> >>>> + now = tk_clock_read(tkr);
> >>>> + interval_start = tkr->cycle_last;
> >>>> + if (!cycle_between(interval_start, cycles, now))
> >>>> + return -EOPNOTSUPP;
> >>>
> >>> So. I've not fully thought this out, but it seems like it would be
> >>> quite likely that you'd run into the case where the cycle_last value
> >>> is updated and your earlier TSC timestamp isn't valid for the current
> >>> interval. The get_device_system_crosststamp() logic has a big chunk of
> >>> complex code to try to handle this case by interpolating the cycle
> >>> value back in time. How well does just failing in this case work out?
> >>>
> >>
> >> For that case, perf falls back to the time captured in the NMI handler, via
> >> ktime_get_mono_fast_ns().
> >
> > This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> > but the goal seems to be to have more accurate timestamps on the hw
> > events, and using the captured tsc timestamp avoids the latency of
> > reading the time again. But if every timekeeping update
> > interval (~tick) you transparently get a delayed value due to the
> > fallback, it makes it hard to understand which timestamps are better
> > or worse. The latency between two reads may be real or it may be just
> > bad luck. This doesn't intuitively seem like a great benefit over more
> > consistent latency of just using the ktime_get_mono_fast()
> > timestamping.
>
> Your understanding is correct. We want a more accurate timestamp for the
> analysis work.
>
> As I understand it, the timekeeping update should not happen very often. If

"Often" depends on your timescale.

> I read the code correctly, it should happen only when adjusting NTP or
> suspending/resuming. If so, I think the drawback should not impact the
> normal analysis work. I will call out the drawbacks in the comments
> where the function is used.

So the adjustments are done at tick time depending on the current NTP
"error" (basically what the kernel tracks as the delta from its sense
of what NTP has told us).

Not just at the time when ntp makes an adjustment.

So the window for it to happen is every timekeeping update (which is ~HZ).


> >> The TSC in PEBS is captured by HW when the sample was generated. There
> >> should be a small delta compared with the time captured in the NMI
> >> handler. But I think the delta should be acceptable as a backup solution
> >> for the most analysis cases. Also, I don't think the case (the
> >> cycle_last value is updated during the monitoring) should occur very
> >> often either. So I drop the history support to simplify the function.
> >
> > So the reads and this function are *always* used in NMI context? Has
> > this been stressed with things like SMIs to see how it does if
> > interrupted in those cases?
>
> Yes, it's *always* and only used in NMI context.

Thanks, that is helpful to clarify.

> > My worry is that (as I bored everyone earlier), the
> > ktime_get_*_fast_ns() interfaces already have some sharp edges and
> > need a fair amount of thought as to when they should be used. This is
> > sort of compounding that adding an interface that has further special
> > cases where it can fail, making it difficult to fully understand and
> > easier to accidentally misuse.
> >
> > My other concern is that interfaces always get stretched and used
> > beyond anything they were initially planned for (see the
> > ktime_get_*fast_ns() interfaces here as an example! :), and in this
> > case the logic seems to have lots of implicit dependencies on the
> > facts of your specific use case, so it seems a bit fragile should
> > folks on other architectures with other constraints try to use it.
> >
> > So I just want to push a bit to think how you might be able to
> > extend/generalize the existing get_system_crosststamp for your
> > purposes, or alternatively find a way to simplify the logic's behavior
> > so it's less tied to specific constraints ("this works most of the time
> > from NMI, but otherwise no promises"). Or at least some better
> > documentation around the code, its uses and its constraints? ( "NMI
> > safe" is not the same as "Only safe to use from NMI" :)
>
> Since our usage is fixed (only in NMI), I prefer the latter. I think
> extending/generalizing the existing function would only make it
> extremely complex and inefficient. The new function should have the
> same constraints as the existing ktime_get_mono_fast_ns(). Since perf
> can live with ktime_get_mono_fast_ns(), there should be no problem
> with the new function's constraints. I will add more comments to
> clarify the usage and constraints to avoid abuse of the new function.

I agree the existing function is complex, so adding more complexity
isn't ideal, but potentially breaking it up or reworking it might be
better. Having two similar but different implementations is also a
complexity. So I just want to make sure this is well considered. But
clearer documentation as a first step will help.

Thanks
-john

2023-01-24 22:08:11

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 3:33 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <[email protected]> wrote:
>>>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>>>> On Mon, Jan 23, 2023 at 10:27 AM <[email protected]> wrote:
>>>>>> + /*
>>>>>> + * Check whether the given timestamp is on the current
>>>>>> + * timekeeping interval.
>>>>>> + */
>>>>>> + now = tk_clock_read(tkr);
>>>>>> + interval_start = tkr->cycle_last;
>>>>>> + if (!cycle_between(interval_start, cycles, now))
>>>>>> + return -EOPNOTSUPP;
>>>>> So. I've not fully thought this out, but it seems like it would be
>>>>> quite likely that you'd run into the case where the cycle_last value
>>>>> is updated and your earlier TSC timestamp isn't valid for the current
>>>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>>>> complex code to try to handle this case by interpolating the cycle
>>>>> value back in time. How well does just failing in this case work out?
>>>>>
>>>> For that case, perf falls back to the time captured in the NMI handler, via
>>>> ktime_get_mono_fast_ns().
>>> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
>>> but the goal seems to be to have more accurate timestamps on the hw
>>> events, and using the captured tsc timestamp avoids the latency of
>>> reading the time again. But if every timekeeping update
>>> interval (~tick) you transparently get a delayed value due to the
>>> fallback, it makes it hard to understand which timestamps are better
>>> or worse. The latency between two reads may be real or it may be just
>>> bad luck. This doesn't intuitively seem like a great benefit over more
>>> consistent latency of just using the ktime_get_mono_fast()
>>> timestamping.
>> Your understanding is correct. We want a more accurate timestamp for the
>> analysis work.
>>
>> As I understand it, the timekeeping update should not happen very often. If
> "Often" depends on your timescale.
>
>> I read the code correctly, it should happen only when adjusting NTP or
>> suspending/resuming. If so, I think the drawback should not impact the
>> normal analysis work. I will call out the drawbacks in the comments
>> where the function is used.
> So the adjustments are done at tick time depending on the current NTP
> "error" (basically what the kernel tracks as the delta from its sense
> of what NTP has told us).
>
> Not just at the time when ntp makes an adjustment.
>
> So the window for it to happen is every timekeeping update (which is ~HZ).

You mean the tkf->base is updated in each tick?
If so, that would be a problem.

Does the NTP "error" consistently happen on all the platforms? Or just
on some platforms which we can check for and limit the usage to?

There are two configurations for PEBS: single PEBS and large PEBS.
In single PEBS mode, the HW triggers an NMI for each record. The TSC is
the time at which the record was generated, and we convert it in the same
NMI. I don't think the NTP "error" can impact it.

But in large PEBS mode, the HW triggers an NMI only when a fixed number
of records has been generated. It's very likely that the base has been
updated between the records. That could bring trouble, since we can
only fall back to the NMI handler time of the last record. I'm afraid we
have to disable the large PEBS mode.

Stephane,

What do you think?
Is it still useful for you if we only support the single PEBS?


Thanks,
Kan

2023-01-24 22:40:57

by John Stultz

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <[email protected]> wrote:
> On 2023-01-24 3:33 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
> >> On 2023-01-24 1:43 p.m., John Stultz wrote:
> >> I read the code correctly, it should happen only when adjusting NTP or
> >> suspending/resuming. If so, I think the drawback should not impact the
> >> normal analysis work. I will call out the drawbacks in the comments
> >> where the function is used.
> > So the adjustments are done at tick time depending on the current NTP
> > "error" (basically what the kernel tracks as the delta from its sense
> > of what NTP has told us).
> >
> > Not just at the time when ntp makes an adjustment.
> >
> > So the window for it to happen is every timekeeping update (which is ~HZ).
>
> You mean the tkf->base is updated in each tick?

Right, every call to timekeeping_update(), which is called for every
major state change in time along with regular updates via
timekeeping_advance(), which is called from update_wall_time(), which
gets called from the various tick handlers.

> If so, that should be a problem.
>
> Does the NTP "error" consistently happen on all the platforms? Or just
> on some platforms which we can check and limit the usage?

Basically any platform where adjtimex was ever called. Both offset
and freq adjustments will trigger this. The trouble is that the
adjustments requested are finer grained than what the actual
clocksource's mult values can be adjusted for. So the kernel tracks the
error internally and will oscillate the clocksource multiplier value
to keep the long-term accuracy close to what was requested.

> There are two configurations for PEBS, single PEBS, and large PEBS.
> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
> the time which the record is generated, and we convert it in the same
> NMI. I don't think the NTP "error" can impact it.

If the clocksource is being adjusted in between reads (imagine the
freq is dropped), when using the fast accessors, time may seem to go
backwards between those reads.

In fact, if the same TSC value is used for two calls in a row, you
could see time go backwards or forwards between them if the adjustment
happened in between.
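To make this concrete: the fast-path conversion is just arithmetic on a snapshot, ns = (delta * mult) >> shift, with nothing guarding against mult changing underneath it. A toy illustration (the mult/shift values here are made up):

```c
#include <stdint.h>

/* ns contribution of a cycle delta under a given mult/shift pair. */
static uint64_t delta_to_ns(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;
}

/*
 * If the multiplier is lowered from 11 to 10 between two conversions of
 * the very same delta, the second result is smaller than the first:
 * time appears to step backwards between the two reads.
 */
```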

The regular (non-fast) clocksource accessors use the seqcount to
prevent reads taken while an adjustment is in progress from being
used (you'll note we always re-read the clocksource within the
seqcount loop), but this makes them not NMI safe.
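For readers unfamiliar with it, the seqcount pattern referred to here looks roughly like this minimal single-threaded model (simplified names, not the kernel's seqlock.h; the real reader also spins while the count is odd):

```c
/* Minimal model of a sequence counter: the count is bumped to an odd
 * value while a writer is mid-update, and bumped again when it's done. */
struct seqcount { unsigned seq; };

static unsigned read_begin(const struct seqcount *s)
{
	return s->seq & ~1u;
}

static int read_retry(const struct seqcount *s, unsigned start)
{
	return s->seq != start; /* a writer ran (or is running): retry */
}

static void write_begin(struct seqcount *s) { s->seq++; /* odd: in update */ }
static void write_end(struct seqcount *s)   { s->seq++; /* even: stable */ }
```

A reader does: do { seq = read_begin(&s); /* copy the data */ } while (read_retry(&s, seq)); — which is exactly what an NMI cannot safely do: if it interrupts the writer while the count is odd, it would retry forever, hence the separate fast accessors.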


> But for the large PEBS mode, HW triggers a NMI only when a fixed number
> of records are generated. It's very likely that the base has been
> updated between the records. That could bring troubles, since we can
> only fall back to the NMI handler time of the last record. I'm afraid we
> have to disable the large PEBS mode.
>

Right, if I recall, this is why the interpolation is done in the
existing get_device_system_crosststamp(). It's obviously never going
to be perfect as we don't keep the timekeeper state for every tick,
but it might be better than the bailout you have if you didn't luck
into the current interval.

My apologies, as my intention isn't to dissuade you here, but just to
make sure the complexities (and unfortunately there are a lot) are
well worked out and well tested.

thanks
-john

2023-01-25 16:44:14

by Liang, Kan

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic



On 2023-01-24 5:40 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 2:08 PM Liang, Kan <[email protected]> wrote:
>> On 2023-01-24 3:33 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <[email protected]> wrote:
>>>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>>> I read the code correctly, it should happen only when adjusting NTP or
>>>> suspending/resuming. If so, I think the drawback should not impact the
>>>> normal analysis work. I will call out the drawbacks in the comments
>>>> where the function is used.
>>> So the adjustments are done at tick time depending on the current NTP
>>> "error" (basically what the kernel tracks as the delta from its sense
>>> of what NTP has told us).
>>>
>>> Not just at the time when ntp makes an adjustment.
>>>
>>> So the window for it to happen is every timekeeping update (which is ~HZ).
>>
>> You mean the tkf->base is updated in each tick?
>
> Right, every call to timekeeping_update(), which is called for every
> major state change in time along with regular updates via
> timekeeping_advance() which is called from update_wall_time() which
> gets called from the various tick handlers.
>
>> If so, that should be a problem.
>>
>> Does the NTP "error" consistently happen on all the platforms? Or just
>> on some platforms which we can check and limit the usage?
>
> Basically any platform where adjtimex was ever called. Both offset
> and freq adjustments will trigger this. The trouble is that the
> adjustments requested are finer grained than what the actual
> clocksources mult values can be adjusted for. So the kernel tracks the
> error internally and will oscillate the clocksource multiplier value
> to keep the long-term accuracy close to what was requested.
>
>> There are two configurations for PEBS, single PEBS, and large PEBS.
>> For the single PEBS mode, HW triggers a NMI for each record. The TSC is
>> the time which the record is generated, and we convert it in the same
>> NMI. I don't think the NTP "error" can impact it.
>
> If the clocksource is being adjusted in between reads (imagine the
> freq is dropped), when using the fast accessors, time may seem to go
> backwards between those reads.
>
> In fact, if the same TSC value is used for two calls in a row, you
> could see time go backwards or forwards between them if the adjustment
> happened inbetween.
>
> The regular (non-fast) clocksource accessors use the seqcount to
> prevent reads taken while an adjustment is in progress from being
> used (you'll note we always re-read the clocksource within the
> seqcount loop), but this makes them not NMI safe.
>
>
>> But for the large PEBS mode, HW triggers a NMI only when a fixed number
>> of records are generated. It's very likely that the base has been
>> updated between the records. That could bring troubles, since we can
>> only fall back to the NMI handler time of the last record. I'm afraid we
>> have to disable the large PEBS mode.
>>
>
> Right, if I recall, this is why the interpolation is done in the
> existing get_device_system_crosststamp(). It's obviously never going
> to be perfect as we don't keep the timekeeper state for every tick,
> but it might be better than the bailout you have if you didn't luck
> into the current interval.
>
> My apologies, as my intention isn't to dissuade you here, but just to
> make sure the complexities (and unfortunately there are a lot) are
> well worked out and well tested.
>

Thank you very much for your patience and all the details, I really
appreciate it. I think I understand the problem and the concerns now. I
will reevaluate the current design and try to find a proper solution.

Thanks,
Kan

2023-01-27 13:30:27

by Thomas Gleixner

Subject: Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

On Tue, Jan 24 2023 at 01:10, Stephane Eranian wrote:
> On Tue, Jan 24, 2023 at 12:52 AM Thomas Gleixner <[email protected]> wrote:
>> > It's useful to provide a NMI safe function to convert a given time to
>> > monotonic. For example, the perf_event subsystem wants to convert a TSC
>> > of a PEBS record to a monotonic clock in a NMI handler.
>>
>> Why? That's a postprocessing problem, really.
>>
> Because you want to correlate samples captured by PEBS with samples
> from applications timestamped with a user available clock such as
> CLOCK_MONOTONIC, for instance.

Sure. Postprocessing can do that if the kernel provides the relevant
conversion information. Absolutely no need for doing this in the kernel.

> When I create a perf_event event and I stipulate that
> event_attr.clockid=MONOTONIC, I expect all the samples from that event
> to be timestamped using the same clock source, regardless of PEBS or
> IBS.

Just because the kernel can do the conversion, there is no requirement
to actually do so. Instrumentation has to be as lightweight as possible
and a lot of instrumentation goes the extra mile of doing postprocessing
exactly for this reason.

So if we implement this conversion in the kernel then there needs to be
a compelling technical reason to do so.

So far the provided arguments are in the 'I want a pony' realm.

Thanks,

tglx