2022-02-14 20:31:58

by Adrian Hunter

Subject: [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing

Hi

These patches add 2 new perf event clocks based on TSC for use with VMs.

The first patch is a minor fix, the next 2 patches add each of the 2 new
clocks. The remaining patches add minimal tools support and are based on
top of the Intel PT Event Trace tools' patches.

The future work, to add the ability to use perf inject to inject perf
events from a VM guest perf.data file into a VM host perf.data file,
has yet to be implemented.


Changes in V2:
perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
Add __sched_clock_offset unconditionally

perf/x86: Add support for TSC as a perf event clock
Use an attribute bit 'ns_clockid' to identify non-standard clockids

perf/x86: Add support for TSC in nanoseconds as a perf event clock
Do not affect use of __sched_clock_offset
Adjust to use 'ns_clockid'

perf tools: Add new perf clock IDs
perf tools: Add API probes for new clock IDs
perf tools: Add new clock IDs to "perf time to TSC" test
perf tools: Add perf_read_tsc_conv_for_clockid()
perf intel-pt: Add support for new clock IDs
perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
perf intel-pt: Add config variables for timing parameters
perf intel-pt: Add documentation for new clock IDs
Adjust to use 'ns_clockid'


Adrian Hunter (11):
perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
perf/x86: Add support for TSC as a perf event clock
perf/x86: Add support for TSC in nanoseconds as a perf event clock
perf tools: Add new perf clock IDs
perf tools: Add API probes for new clock IDs
perf tools: Add new clock IDs to "perf time to TSC" test
perf tools: Add perf_read_tsc_conv_for_clockid()
perf intel-pt: Add support for new clock IDs
perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
perf intel-pt: Add config variables for timing parameters
perf intel-pt: Add documentation for new clock IDs

arch/x86/events/core.c | 39 ++++++++++--
arch/x86/include/asm/perf_event.h | 5 ++
arch/x86/kernel/tsc.c | 2 +-
include/uapi/linux/perf_event.h | 18 +++++-
kernel/events/core.c | 63 +++++++++++++-------
tools/include/uapi/linux/perf_event.h | 18 +++++-
tools/perf/Documentation/perf-config.txt | 18 ++++++
tools/perf/Documentation/perf-intel-pt.txt | 47 +++++++++++++++
tools/perf/Documentation/perf-record.txt | 9 ++-
tools/perf/arch/x86/util/intel-pt.c | 95 ++++++++++++++++++++++++++++--
tools/perf/builtin-record.c | 2 +-
tools/perf/tests/perf-time-to-tsc.c | 42 ++++++++++---
tools/perf/util/clockid.c | 14 +++++
tools/perf/util/evsel.c | 1 +
tools/perf/util/intel-pt.c | 27 +++++++--
tools/perf/util/intel-pt.h | 7 ++-
tools/perf/util/perf_api_probe.c | 24 ++++++++
tools/perf/util/perf_api_probe.h | 2 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/record.h | 2 +
tools/perf/util/tsc.c | 58 ++++++++++++++++++
tools/perf/util/tsc.h | 2 +
22 files changed, 444 insertions(+), 52 deletions(-)


Regards
Adrian


2022-02-14 20:39:09

by Adrian Hunter

Subject: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

Currently, using Intel PT to trace a VM guest is limited to kernel space
because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
While these events can be collected for the host, there is not a way to do
that yet for a guest. One approach would be to collect them inside the
guest, but that would require being able to synchronize with host
timestamps.

The motivation for this patch is to provide a clock that can be used within
a VM guest, and that correlates to a VM host clock. In the case of TSC, if
the hypervisor leaves rdtsc alone, the TSC value will be subject only to
the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
to inject events from a guest perf.data file into a host perf.data file.

Thus making possible the collection of VM guest side band for Intel PT
decoding.

There are other potential benefits of TSC as a perf event clock:
- ability to work directly with TSC
- ability to inject non-Intel-PT-related events from a guest

Signed-off-by: Adrian Hunter <[email protected]>
---
arch/x86/events/core.c | 16 +++++++++
arch/x86/include/asm/perf_event.h | 3 ++
include/uapi/linux/perf_event.h | 12 ++++++-
kernel/events/core.c | 57 +++++++++++++++++++------------
4 files changed, 65 insertions(+), 23 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e686c5e0537b..51d5345de30a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
userpg->pmc_width = x86_pmu.cntval_bits;

+ if (event->attr.use_clockid &&
+ event->attr.ns_clockid &&
+ event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
+ userpg->cap_user_time_zero = 1;
+ userpg->time_mult = 1;
+ userpg->time_shift = 0;
+ userpg->time_offset = 0;
+ userpg->time_zero = 0;
+ return;
+ }
+
if (!using_native_sched_clock() || !sched_clock_stable())
return;

@@ -2980,6 +2991,11 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
return misc;
}

+u64 perf_hw_clock(void)
+{
+ return rdtsc_ordered();
+}
+
void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
{
cap->version = x86_pmu.version;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58d9e4b1fa0a..5288ea1ae2ba 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -451,6 +451,9 @@ extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
extern unsigned long perf_misc_flags(struct pt_regs *regs);
#define perf_misc_flags(regs) perf_misc_flags(regs)

+extern u64 perf_hw_clock(void);
+#define perf_hw_clock perf_hw_clock
+
#include <asm/stacktrace.h>

/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..e8617efd552b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -290,6 +290,15 @@ enum {
PERF_TXN_ABORT_SHIFT = 32,
};

+/*
+ * If supported, clockid value to select an architecture dependent hardware
+ * clock. Note this means the unit of time is ticks not nanoseconds.
+ * Requires ns_clockid to be set in addition to use_clockid.
+ * On x86, this clock is provided by the rdtsc instruction, and is not
+ * paravirtualized.
+ */
+#define CLOCK_PERF_HW_CLOCK 0x10000000
+
/*
* The format of the data returned by read() on a perf event fd,
* as specified by attr.read_format:
@@ -409,7 +418,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ ns_clockid : 1, /* non-standard clockid */
+ __reserved_1 : 25;

union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57249f37c37d..15dee265a5b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12008,35 +12008,48 @@ static void mutex_lock_double(struct mutex *a, struct mutex *b)
mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
}

-static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
+static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id, bool ns_clockid)
{
bool nmi_safe = false;

- switch (clk_id) {
- case CLOCK_MONOTONIC:
- event->clock = &ktime_get_mono_fast_ns;
- nmi_safe = true;
- break;
+ if (ns_clockid) {
+ switch (clk_id) {
+#ifdef perf_hw_clock
+ case CLOCK_PERF_HW_CLOCK:
+ event->clock = &perf_hw_clock;
+ nmi_safe = true;
+ break;
+#endif
+ default:
+ return -EINVAL;
+ }
+ } else {
+ switch (clk_id) {
+ case CLOCK_MONOTONIC:
+ event->clock = &ktime_get_mono_fast_ns;
+ nmi_safe = true;
+ break;

- case CLOCK_MONOTONIC_RAW:
- event->clock = &ktime_get_raw_fast_ns;
- nmi_safe = true;
- break;
+ case CLOCK_MONOTONIC_RAW:
+ event->clock = &ktime_get_raw_fast_ns;
+ nmi_safe = true;
+ break;

- case CLOCK_REALTIME:
- event->clock = &ktime_get_real_ns;
- break;
+ case CLOCK_REALTIME:
+ event->clock = &ktime_get_real_ns;
+ break;

- case CLOCK_BOOTTIME:
- event->clock = &ktime_get_boottime_ns;
- break;
+ case CLOCK_BOOTTIME:
+ event->clock = &ktime_get_boottime_ns;
+ break;

- case CLOCK_TAI:
- event->clock = &ktime_get_clocktai_ns;
- break;
+ case CLOCK_TAI:
+ event->clock = &ktime_get_clocktai_ns;
+ break;

- default:
- return -EINVAL;
+ default:
+ return -EINVAL;
+ }
}

if (!nmi_safe && !(event->pmu->capabilities & PERF_PMU_CAP_NO_NMI))
@@ -12245,7 +12258,7 @@ SYSCALL_DEFINE5(perf_event_open,
pmu = event->pmu;

if (attr.use_clockid) {
- err = perf_event_set_clock(event, attr.clockid);
+ err = perf_event_set_clock(event, attr.clockid, attr.ns_clockid);
if (err)
goto err_alloc;
}
--
2.25.1

2022-02-14 20:50:53

by Adrian Hunter

Subject: [PATCH V2 04/11] perf tools: Add new perf clock IDs

Add support for new clock IDs CLOCK_PERF_HW_CLOCK and
CLOCK_PERF_HW_CLOCK_NS.

Signed-off-by: Adrian Hunter <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 18 +++++++++++++++++-
tools/perf/Documentation/perf-record.txt | 9 ++++++++-
tools/perf/builtin-record.c | 2 +-
tools/perf/util/clockid.c | 13 +++++++++++++
tools/perf/util/evsel.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/record.h | 1 +
7 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 1b65042ab1db..7b3455dfda23 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -290,6 +290,21 @@ enum {
PERF_TXN_ABORT_SHIFT = 32,
};

+/*
+ * If supported, clockid value to select an architecture dependent hardware
+ * clock. Note this means the unit of time is ticks not nanoseconds.
+ * Requires ns_clockid to be set in addition to use_clockid.
+ * On x86, this clock is provided by the rdtsc instruction, and is not
+ * paravirtualized.
+ */
+#define CLOCK_PERF_HW_CLOCK 0x10000000
+/*
+ * Same as CLOCK_PERF_HW_CLOCK but in nanoseconds. Note support of
+ * CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+ * CLOCK_PERF_HW_CLOCK or vice versa.
+ */
+#define CLOCK_PERF_HW_CLOCK_NS 0x10000001
+
/*
* The format of the data returned by read() on a perf event fd,
* as specified by attr.read_format:
@@ -409,7 +424,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ ns_clockid : 1, /* non-standard clockid */
+ __reserved_1 : 25;

union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 9ccc75935bc5..a5ef4813093a 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -444,7 +444,14 @@ Record running and enabled time for read events (:S)
Sets the clock id to use for the various time fields in the perf_event_type
records. See clock_gettime(). In particular CLOCK_MONOTONIC and
CLOCK_MONOTONIC_RAW are supported, some events might also allow
-CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI.
+CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI. In addition, the kernel might
+support CLOCK_PERF_HW_CLOCK to select an architecture dependent hardware
+clock, for which the unit of time is ticks not nanoseconds. On x86,
+CLOCK_PERF_HW_CLOCK is provided by the rdtsc instruction, and is not
+paravirtualized. There is also CLOCK_PERF_HW_CLOCK_NS which is the same as
+CLOCK_PERF_HW_CLOCK, but converted to nanoseconds. Note support of
+CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+CLOCK_PERF_HW_CLOCK or vice versa.

-S::
--snapshot::
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index bb716c953d02..febb51bac6ac 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1553,7 +1553,7 @@ static int record__init_clock(struct record *rec)
struct timeval ref_tod;
u64 ref;

- if (!rec->opts.use_clockid)
+ if (!rec->opts.use_clockid || rec->opts.ns_clockid)
return 0;

if (rec->opts.use_clockid && rec->opts.clockid_res_ns)
diff --git a/tools/perf/util/clockid.c b/tools/perf/util/clockid.c
index 74365a5d99c1..2fcffee690e1 100644
--- a/tools/perf/util/clockid.c
+++ b/tools/perf/util/clockid.c
@@ -12,11 +12,15 @@
struct clockid_map {
const char *name;
int clockid;
+ bool non_standard;
};

#define CLOCKID_MAP(n, c) \
{ .name = n, .clockid = (c), }

+#define CLOCKID_MAP_NS(n, c) \
+ { .name = n, .clockid = (c), .non_standard = true, }
+
#define CLOCKID_END { .name = NULL, }


@@ -49,6 +53,10 @@ static const struct clockid_map clockids[] = {
CLOCKID_MAP("real", CLOCK_REALTIME),
CLOCKID_MAP("boot", CLOCK_BOOTTIME),

+ /* non-standard clocks */
+ CLOCKID_MAP_NS("perf_hw_clock", CLOCK_PERF_HW_CLOCK),
+ CLOCKID_MAP_NS("perf_hw_clock_ns", CLOCK_PERF_HW_CLOCK_NS),
+
CLOCKID_END,
};

@@ -97,6 +105,11 @@ int parse_clockid(const struct option *opt, const char *str, int unset)
for (cm = clockids; cm->name; cm++) {
if (!strcasecmp(str, cm->name)) {
opts->clockid = cm->clockid;
+ if (cm->non_standard) {
+ opts->ns_clockid = true;
+ opts->clockid_res_ns = 0;
+ return 0;
+ }
return get_clockid_res(opts->clockid,
&opts->clockid_res_ns);
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 22d3267ce294..be1d30490a43 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1294,6 +1294,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
clockid = opts->clockid;
if (opts->use_clockid) {
attr->use_clockid = 1;
+ attr->ns_clockid = opts->ns_clockid;
attr->clockid = opts->clockid;
}

diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 98af3fa4ea35..398f05f2e5b3 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -128,6 +128,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(mmap2, p_unsigned);
PRINT_ATTRf(comm_exec, p_unsigned);
PRINT_ATTRf(use_clockid, p_unsigned);
+ PRINT_ATTRf(ns_clockid, p_unsigned);
PRINT_ATTRf(context_switch, p_unsigned);
PRINT_ATTRf(write_backward, p_unsigned);
PRINT_ATTRf(namespaces, p_unsigned);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ef6c2715fdd9..1dbbf6b314dc 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -67,6 +67,7 @@ struct record_opts {
bool sample_transaction;
int initial_delay;
bool use_clockid;
+ bool ns_clockid;
clockid_t clockid;
u64 clockid_res_ns;
int nr_cblocks;
--
2.25.1

2022-02-14 21:00:58

by Adrian Hunter

Subject: [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset

native_perf_sched_clock_from_tsc() is used to produce a time value that can
be consistent with perf_clock(). Consequently, it should be adjusted by
__sched_clock_offset, the same as perf_clock() would be.

Fixes: 698eff6355f735 ("sched/clock, x86/perf: Fix perf test tsc")
Signed-off-by: Adrian Hunter <[email protected]>
---
arch/x86/kernel/tsc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a698196377be..d9fe277c2471 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -242,7 +242,7 @@ u64 native_sched_clock(void)
*/
u64 native_sched_clock_from_tsc(u64 tsc)
{
- return cycles_2_ns(tsc);
+ return cycles_2_ns(tsc) + __sched_clock_offset;
}

/* We need to define a real function for sched_clock, to override the
--
2.25.1

2022-02-21 09:38:31

by Adrian Hunter

Subject: Re: [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing

On 14/02/2022 13:09, Adrian Hunter wrote:
> Hi
>
> These patches add 2 new perf event clocks based on TSC for use with VMs.
>
> The first patch is a minor fix, the next 2 patches add each of the 2 new
> clocks. The remaining patches add minimal tools support and are based on
> top of the Intel PT Event Trace tools' patches.
>
> The future work, to add the ability to use perf inject to inject perf
> events from a VM guest perf.data file into a VM host perf.data file,
> has yet to be implemented.
>
>
> Changes in V2:
> perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
> Add __sched_clock_offset unconditionally
>
> perf/x86: Add support for TSC as a perf event clock
> Use an attribute bit 'ns_clockid' to identify non-standard clockids
>
> perf/x86: Add support for TSC in nanoseconds as a perf event clock
> Do not affect use of __sched_clock_offset
> Adjust to use 'ns_clockid'

Any comments on version 2?

>
> perf tools: Add new perf clock IDs
> perf tools: Add API probes for new clock IDs
> perf tools: Add new clock IDs to "perf time to TSC" test
> perf tools: Add perf_read_tsc_conv_for_clockid()
> perf intel-pt: Add support for new clock IDs
> perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
> perf intel-pt: Add config variables for timing parameters
> perf intel-pt: Add documentation for new clock IDs
> Adjust to use 'ns_clockid'
>
>
> Adrian Hunter (11):
> perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
> perf/x86: Add support for TSC as a perf event clock
> perf/x86: Add support for TSC in nanoseconds as a perf event clock
> perf tools: Add new perf clock IDs
> perf tools: Add API probes for new clock IDs
> perf tools: Add new clock IDs to "perf time to TSC" test
> perf tools: Add perf_read_tsc_conv_for_clockid()
> perf intel-pt: Add support for new clock IDs
> perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
> perf intel-pt: Add config variables for timing parameters
> perf intel-pt: Add documentation for new clock IDs
>
> arch/x86/events/core.c | 39 ++++++++++--
> arch/x86/include/asm/perf_event.h | 5 ++
> arch/x86/kernel/tsc.c | 2 +-
> include/uapi/linux/perf_event.h | 18 +++++-
> kernel/events/core.c | 63 +++++++++++++-------
> tools/include/uapi/linux/perf_event.h | 18 +++++-
> tools/perf/Documentation/perf-config.txt | 18 ++++++
> tools/perf/Documentation/perf-intel-pt.txt | 47 +++++++++++++++
> tools/perf/Documentation/perf-record.txt | 9 ++-
> tools/perf/arch/x86/util/intel-pt.c | 95 ++++++++++++++++++++++++++++--
> tools/perf/builtin-record.c | 2 +-
> tools/perf/tests/perf-time-to-tsc.c | 42 ++++++++++---
> tools/perf/util/clockid.c | 14 +++++
> tools/perf/util/evsel.c | 1 +
> tools/perf/util/intel-pt.c | 27 +++++++--
> tools/perf/util/intel-pt.h | 7 ++-
> tools/perf/util/perf_api_probe.c | 24 ++++++++
> tools/perf/util/perf_api_probe.h | 2 +
> tools/perf/util/perf_event_attr_fprintf.c | 1 +
> tools/perf/util/record.h | 2 +
> tools/perf/util/tsc.c | 58 ++++++++++++++++++
> tools/perf/util/tsc.h | 2 +
> 22 files changed, 444 insertions(+), 52 deletions(-)
>
>
> Regards
> Adrian

2022-03-01 20:22:08

by Adrian Hunter

Subject: Re: [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing

On 21/02/2022 08:54, Adrian Hunter wrote:
> On 14/02/2022 13:09, Adrian Hunter wrote:
>> Hi
>>
>> These patches add 2 new perf event clocks based on TSC for use with VMs.
>>
>> The first patch is a minor fix, the next 2 patches add each of the 2 new
>> clocks. The remaining patches add minimal tools support and are based on
>> top of the Intel PT Event Trace tools' patches.
>>
>> The future work, to add the ability to use perf inject to inject perf
>> events from a VM guest perf.data file into a VM host perf.data file,
>> has yet to be implemented.
>>
>>
>> Changes in V2:
>> perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
>> Add __sched_clock_offset unconditionally
>>
>> perf/x86: Add support for TSC as a perf event clock
>> Use an attribute bit 'ns_clockid' to identify non-standard clockids
>>
>> perf/x86: Add support for TSC in nanoseconds as a perf event clock
>> Do not affect use of __sched_clock_offset
>> Adjust to use 'ns_clockid'
>
> Any comments on version 2?

☺/

2022-03-04 14:03:11

by Adrian Hunter

Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On 04/03/2022 14:30, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> Currently, using Intel PT to trace a VM guest is limited to kernel space
>> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
>> While these events can be collected for the host, there is not a way to do
>> that yet for a guest. One approach would be to collect them inside the
>> guest, but that would require being able to synchronize with host
>> timestamps.
>>
>> The motivation for this patch is to provide a clock that can be used within
>> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
>> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
>> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
>> to inject events from a guest perf.data file into a host perf.data file.
>>
>> Thus making possible the collection of VM guest side band for Intel PT
>> decoding.
>>
>> There are other potential benefits of TSC as a perf event clock:
>> - ability to work directly with TSC
>> - ability to inject non-Intel-PT-related events from a guest
>>
>> Signed-off-by: Adrian Hunter <[email protected]>
>> ---
>> arch/x86/events/core.c | 16 +++++++++
>> arch/x86/include/asm/perf_event.h | 3 ++
>> include/uapi/linux/perf_event.h | 12 ++++++-
>> kernel/events/core.c | 57 +++++++++++++++++++------------
>> 4 files changed, 65 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index e686c5e0537b..51d5345de30a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>> userpg->pmc_width = x86_pmu.cntval_bits;
>>
>> + if (event->attr.use_clockid &&
>> + event->attr.ns_clockid &&
>> + event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
>> + userpg->cap_user_time_zero = 1;
>> + userpg->time_mult = 1;
>> + userpg->time_shift = 0;
>> + userpg->time_offset = 0;
>> + userpg->time_zero = 0;
>> + return;
>> + }
>> +
>> if (!using_native_sched_clock() || !sched_clock_stable())
>> return;
>
> This looks the wrong way around. If TSC is found unstable, we should
> never expose it.

Intel PT traces contain TSC whether or not it is stable, and it could
still be usable in some cases e.g. short traces on a single CPU.

Ftrace seems to offer x86-tsc unconditionally as a clock.

We could add warnings to comments and documentation about its potential
pitfalls.

>
> And I'm not at all sure about the whole virt thing. Last time I looked
> at pvclock it made no sense at all.

It is certainly not useful for synchronizing events against TSC.

2022-03-04 15:40:14

by Peter Zijlstra

Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> +u64 perf_hw_clock(void)
> +{
> + return rdtsc_ordered();
> +}

Why the _ordered ?

2022-03-04 18:03:49

by Peter Zijlstra

Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> Currently, using Intel PT to trace a VM guest is limited to kernel space
> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
> While these events can be collected for the host, there is not a way to do
> that yet for a guest. One approach would be to collect them inside the
> guest, but that would require being able to synchronize with host
> timestamps.
>
> The motivation for this patch is to provide a clock that can be used within
> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
> to inject events from a guest perf.data file into a host perf.data file.
>
> Thus making possible the collection of VM guest side band for Intel PT
> decoding.
>
> There are other potential benefits of TSC as a perf event clock:
> - ability to work directly with TSC
> - ability to inject non-Intel-PT-related events from a guest
>
> Signed-off-by: Adrian Hunter <[email protected]>
> ---
> arch/x86/events/core.c | 16 +++++++++
> arch/x86/include/asm/perf_event.h | 3 ++
> include/uapi/linux/perf_event.h | 12 ++++++-
> kernel/events/core.c | 57 +++++++++++++++++++------------
> 4 files changed, 65 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e686c5e0537b..51d5345de30a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
> !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> userpg->pmc_width = x86_pmu.cntval_bits;
>
> + if (event->attr.use_clockid &&
> + event->attr.ns_clockid &&
> + event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
> + userpg->cap_user_time_zero = 1;
> + userpg->time_mult = 1;
> + userpg->time_shift = 0;
> + userpg->time_offset = 0;
> + userpg->time_zero = 0;
> + return;
> + }
> +
> if (!using_native_sched_clock() || !sched_clock_stable())
> return;

This looks the wrong way around. If TSC is found unstable, we should
never expose it.

And I'm not at all sure about the whole virt thing. Last time I looked
at pvclock it made no sense at all.

2022-03-04 18:51:08

by Peter Zijlstra

Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..e8617efd552b 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -290,6 +290,15 @@ enum {
> PERF_TXN_ABORT_SHIFT = 32,
> };
>
> +/*
> + * If supported, clockid value to select an architecture dependent hardware
> + * clock. Note this means the unit of time is ticks not nanoseconds.
> + * Requires ns_clockid to be set in addition to use_clockid.
> + * On x86, this clock is provided by the rdtsc instruction, and is not
> + * paravirtualized.
> + */
> +#define CLOCK_PERF_HW_CLOCK 0x10000000
> +
> /*
> * The format of the data returned by read() on a perf event fd,
> * as specified by attr.read_format:
> @@ -409,7 +418,8 @@ struct perf_event_attr {
> inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
> remove_on_exec : 1, /* event is removed from task on exec */
> sigtrap : 1, /* send synchronous SIGTRAP on event */
> - __reserved_1 : 26;
> + ns_clockid : 1, /* non-standard clockid */
> + __reserved_1 : 25;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */

Thomas, do we want to gate this behind this magic flag, or can that
CLOCKID be granted unconditionally?

2022-03-04 20:06:21

by Adrian Hunter

[permalink] [raw]
Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On 04/03/2022 14:33, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> +u64 perf_hw_clock(void)
>> +{
>> + return rdtsc_ordered();
>> +}
>
> Why the _ordered ?

To be on the safe side, in case it matters. trace_clock_x86_tsc() also uses the ordered variant.

2022-03-04 20:17:51

by Thomas Gleixner

Subject: Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock

On Fri, Mar 04 2022 at 13:32, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 82858b697c05..e8617efd552b 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -290,6 +290,15 @@ enum {
>> PERF_TXN_ABORT_SHIFT = 32,
>> };
>>
>> +/*
>> + * If supported, clockid value to select an architecture dependent hardware
>> + * clock. Note this means the unit of time is ticks not nanoseconds.
>> + * Requires ns_clockid to be set in addition to use_clockid.
>> + * On x86, this clock is provided by the rdtsc instruction, and is not
>> + * paravirtualized.
>> + */
>> +#define CLOCK_PERF_HW_CLOCK 0x10000000
>> +
>> /*
>> * The format of the data returned by read() on a perf event fd,
>> * as specified by attr.read_format:
>> @@ -409,7 +418,8 @@ struct perf_event_attr {
>> inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
>> remove_on_exec : 1, /* event is removed from task on exec */
>> sigtrap : 1, /* send synchronous SIGTRAP on event */
>> - __reserved_1 : 26;
>> + ns_clockid : 1, /* non-standard clockid */
>> + __reserved_1 : 25;
>>
>> union {
>> __u32 wakeup_events; /* wakeup every n events */
>
> Thomas, do we want to gate this behind this magic flag, or can that
> CLOCKID be granted unconditionally?

I'm not seeing a point in that flag and please define the clock id where
the other clockids are defined. We want a proper ID range for such
magically defined clocks.

We use INT_MIN < id < 16 today. I have plans to expand the ID space past
16, so using something like the above is fine.

Thanks,

tglx