2021-11-19 10:21:48

by Nicolas Saenz Julienne

Subject: [RFC PATCH 0/2] KVM: arm64: Host/Guest trace synchronization

This small series introduces the necessary infrastructure to be able to
synchronize host and guest traces. The approach I'm following is a bit
biased, since I tried to replicate the methods I've been using in the
past with x86.

This was tested on an Ampere Mt. Jade-based machine.

---

Nicolas Saenz Julienne (2):
arm64/tracing: add cntvct based trace clock
KVM: arm64: export cntvoff in debugfs

arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/include/asm/trace_clock.h | 12 ++++++++++++
arch/arm64/kernel/Makefile | 2 +-
arch/arm64/kernel/trace_clock.c | 12 ++++++++++++
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arch_timer.c | 2 +-
arch/arm64/kvm/debugfs.c | 25 +++++++++++++++++++++++++
include/kvm/arm_arch_timer.h | 3 +++
8 files changed, 56 insertions(+), 3 deletions(-)
create mode 100644 arch/arm64/include/asm/trace_clock.h
create mode 100644 arch/arm64/kernel/trace_clock.c
create mode 100644 arch/arm64/kvm/debugfs.c

--
2.33.1



2021-11-19 10:21:50

by Nicolas Saenz Julienne

Subject: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

Add a new arm64-specific trace clock based on the cntvct register,
similar to x86-tsc. This gives us:
- A clock that is relatively fast (1GHz on ARMv8.6, 1-50MHz otherwise),
  monotonic, and resilient to low-power modes.
- A clock that can be used to correlate events across CPUs as well as
  across hypervisor and guests.

By using arch_timer_read_counter() we make sure that ARMv8.6 CPUs use
the less expensive CNTVCTSS_EL0, which cannot be accessed speculatively.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
arch/arm64/include/asm/trace_clock.h | 12 ++++++++++++
arch/arm64/kernel/Makefile | 2 +-
arch/arm64/kernel/trace_clock.c | 12 ++++++++++++
3 files changed, 25 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/include/asm/trace_clock.h
create mode 100644 arch/arm64/kernel/trace_clock.c

diff --git a/arch/arm64/include/asm/trace_clock.h b/arch/arm64/include/asm/trace_clock.h
new file mode 100644
index 000000000000..ce4a66d63108
--- /dev/null
+++ b/arch/arm64/include/asm/trace_clock.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_TRACE_CLOCK_H
+#define _ASM_ARM64_TRACE_CLOCK_H
+
+#include <linux/types.h>
+
+extern u64 notrace trace_clock_arm64_cntvct(void);
+
+# define ARCH_TRACE_CLOCKS \
+ { trace_clock_arm64_cntvct, "cntvct", .in_ns = 0 },
+
+#endif /* _ASM_ARM64_TRACE_CLOCK_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 88b3e2a21408..ec9180f0ac90 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -29,7 +29,7 @@ obj-y := debug-monitors.o entry.o irq.o fpsimd.o \
cpufeature.o alternative.o cacheinfo.o \
smp.o smp_spin_table.o topology.o smccc-call.o \
syscall.o proton-pack.o idreg-override.o idle.o \
- patching.o
+ patching.o trace_clock.o

targets += efi-entry.o

diff --git a/arch/arm64/kernel/trace_clock.c b/arch/arm64/kernel/trace_clock.c
new file mode 100644
index 000000000000..fe1315f368cb
--- /dev/null
+++ b/arch/arm64/kernel/trace_clock.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * arm64 trace clocks
+ */
+#include <asm/trace_clock.h>
+
+#include <clocksource/arm_arch_timer.h>
+
+u64 notrace trace_clock_arm64_cntvct(void)
+{
+ return arch_timer_read_counter();
+}
--
2.33.1


2021-11-19 10:21:52

by Nicolas Saenz Julienne

Subject: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

When using cntvct as the raw clock for tracing, it's possible to
synchronize host/guest traces just by knowing the virtual offset applied
to the guest's virtual counter.

This is also the case on x86 when the TSC is available: the offset is
exposed in debugfs as 'tsc-offset' on a per-vcpu basis. So let's
implement the same for arm64.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arch_timer.c | 2 +-
arch/arm64/kvm/debugfs.c | 25 +++++++++++++++++++++++++
include/kvm/arm_arch_timer.h | 3 +++
5 files changed, 31 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/kvm/debugfs.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2a5f7f38006f..130534c9079e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -29,6 +29,7 @@
#include <asm/thread_info.h>

#define __KVM_HAVE_ARCH_INTC_INITIALIZED
+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS

#define KVM_HALT_POLL_NS_DEFAULT 500000

diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..17be7cf770f2 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
$(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
inject_fault.o va_layout.o handle_exit.o \
- guest.o debug.o reset.o sys_regs.o \
+ guest.o debug.o debugfs.o reset.o sys_regs.o \
vgic-sys-reg-v3.o fpsimd.o pmu.o \
arch_timer.o trng.o\
vgic/vgic.o vgic/vgic-init.o \
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 3df67c127489..ee69387f7fb6 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
}
}

-static u64 timer_get_offset(struct arch_timer_context *ctxt)
+u64 timer_get_offset(struct arch_timer_context *ctxt)
{
struct kvm_vcpu *vcpu = ctxt->vcpu;

diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
new file mode 100644
index 000000000000..f0f5083ea8d4
--- /dev/null
+++ b/arch/arm64/kvm/debugfs.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Red Hat Inc.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/debugfs.h>
+
+#include <kvm/arm_arch_timer.h>
+
+static int vcpu_get_cntv_offset(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
+
+ *val = timer_get_offset(vcpu_vtimer(vcpu));
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
+
+void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
+{
+ debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
+}
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index 51c19381108c..de0cd9be825c 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -106,4 +106,7 @@ void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu,
u32 timer_get_ctl(struct arch_timer_context *ctxt);
u64 timer_get_cval(struct arch_timer_context *ctxt);

+/* Needed for debugfs */
+u64 timer_get_offset(struct arch_timer_context *ctxt);
+
#endif
--
2.33.1


2021-11-19 11:12:25

by Marcelo Tosatti

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Fri, Nov 19, 2021 at 11:21:18AM +0100, Nicolas Saenz Julienne wrote:
> While using cntvct as the raw clock for tracing, it's possible to
> synchronize host/guest traces just by knowing the virtual offset applied
> to the guest's virtual counter.
>
> This is also the case on x86 when TSC is available. The offset is
> exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> implement the same for arm64.
>
> Signed-off-by: Nicolas Saenz Julienne <[email protected]>

Hi Nicolas,

ARM:

CNTVCTSS_EL0, Counter-timer Self-Synchronized Virtual Count register
The CNTVCTSS_EL0 characteristics are:

Purpose
Holds the 64-bit virtual count value. The virtual count value is equal to the
physical count value visible in CNTPCT_EL0 minus the virtual offset visible in CNTVOFF_EL2.
^^^^^

x86:

24.6.5 Time-Stamp Counter Offset and Multiplier
The VM-execution control fields include a 64-bit TSC-offset field. If the “RDTSC exiting” control is 0 and the “use
TSC offsetting” control is 1, this field controls executions of the RDTSC and RDTSCP instructions. It also controls
executions of the RDMSR instruction that read from the IA32_TIME_STAMP_COUNTER MSR. For all of these, the
value of the TSC offset is added to the value of the time-stamp counter, and the sum is returned to guest software
^^^^^
in EDX:EAX.

So it would be nice to keep the formula consistent for userspace:

GUEST_CLOCK_VAL = HOST_CLOCK_VAL + CLOCK_OFFSET

So you would have to add a negative sign to the value exposed to userspace.

Other than that, are both the clock value (CNTVCT_EL0) and the offset
(CNTVOFF_EL2) left unmodified during guest execution? That is, is CNTVOFF_EL2
written once during guest initialization?


> ---
> arch/arm64/include/asm/kvm_host.h | 1 +
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arch_timer.c | 2 +-
> arch/arm64/kvm/debugfs.c | 25 +++++++++++++++++++++++++
> include/kvm/arm_arch_timer.h | 3 +++
> 5 files changed, 31 insertions(+), 2 deletions(-)
> create mode 100644 arch/arm64/kvm/debugfs.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..130534c9079e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -29,6 +29,7 @@
> #include <asm/thread_info.h>
>
> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>
> #define KVM_HALT_POLL_NS_DEFAULT 500000
>
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..17be7cf770f2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
> $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
> arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
> inject_fault.o va_layout.o handle_exit.o \
> - guest.o debug.o reset.o sys_regs.o \
> + guest.o debug.o debugfs.o reset.o sys_regs.o \
> vgic-sys-reg-v3.o fpsimd.o pmu.o \
> arch_timer.o trng.o\
> vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..ee69387f7fb6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
> }
> }
>
> -static u64 timer_get_offset(struct arch_timer_context *ctxt)
> +u64 timer_get_offset(struct arch_timer_context *ctxt)
> {
> struct kvm_vcpu *vcpu = ctxt->vcpu;
>
> diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> new file mode 100644
> index 000000000000..f0f5083ea8d4
> --- /dev/null
> +++ b/arch/arm64/kvm/debugfs.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021 Red Hat Inc.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/debugfs.h>
> +
> +#include <kvm/arm_arch_timer.h>
> +
> +static int vcpu_get_cntv_offset(void *data, u64 *val)
> +{
> + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> + *val = timer_get_offset(vcpu_vtimer(vcpu));
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> +
> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> +{
> + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> +}
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index 51c19381108c..de0cd9be825c 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -106,4 +106,7 @@ void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu,
> u32 timer_get_ctl(struct arch_timer_context *ctxt);
> u64 timer_get_cval(struct arch_timer_context *ctxt);
>
> +/* Nedded for debugfs */
> +u64 timer_get_offset(struct arch_timer_context *ctxt);
> +
> #endif
> --
> 2.33.1
>
>


2021-11-19 11:26:36

by Marcelo Tosatti

Subject: Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

On Fri, Nov 19, 2021 at 11:21:17AM +0100, Nicolas Saenz Julienne wrote:
> Add a new arm64-specific trace clock using the cntvct register, similar
> to x64-tsc. This gives us:
> - A clock that is relatively fast (1GHz on armv8.6, 1-50MHz otherwise),
> monotonic, and resilient to low power modes.
> - It can be used to correlate events across cpus as well as across
> hypervisor and guests.
>
> By using arch_timer_read_counter() we make sure that armv8.6 cpus use
> the less expensive CNTVCTSS_EL0, which cannot be accessed speculatively.

Can this register be read by userspace? (Otherwise it won't be possible
to correlate userspace events.)


2021-11-19 12:00:46

by Marc Zyngier

Subject: Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

On Fri, 19 Nov 2021 11:26:24 +0000,
Marcelo Tosatti <[email protected]> wrote:
>
> On Fri, Nov 19, 2021 at 11:21:17AM +0100, Nicolas Saenz Julienne wrote:
> > Add a new arm64-specific trace clock using the cntvct register, similar
> > to x64-tsc. This gives us:
> > - A clock that is relatively fast (1GHz on armv8.6, 1-50MHz otherwise),
> > monotonic, and resilient to low power modes.
> > - It can be used to correlate events across cpus as well as across
> > hypervisor and guests.
> >
> > By using arch_timer_read_counter() we make sure that armv8.6 cpus use
> > the less expensive CNTVCTSS_EL0, which cannot be accessed speculatively.
>
> Can this register be read by userspace ? (otherwise it won't be possible
> to correlate userspace events).

Yes. That's part of the userspace ABI. Although this particular
accessor is only available from ARMv8.6 and is advertised via a hwcap
to userspace.

For currently existing implementations, userspace will use the
CNTVCT_EL0 accessor, which requires extra synchronisation as it can be
speculated.

M.

--
Without deviation from the norm, progress is not possible.

2021-11-19 12:17:10

by Marc Zyngier

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Fri, 19 Nov 2021 10:21:18 +0000,
Nicolas Saenz Julienne <[email protected]> wrote:
>
> While using cntvct as the raw clock for tracing, it's possible to
> synchronize host/guest traces just by knowing the virtual offset applied
> to the guest's virtual counter.
>
> This is also the case on x86 when TSC is available. The offset is
> exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> implement the same for arm64.

How does this work with NV, where the guest hypervisor is in control
of the virtual offset? How does userspace know which vcpu to pick so
that it gets the right offset?

I also wonder why we need this when userspace already has direct
access to that information without any extra kernel support (read the
CNTVCT view of the vcpu using the ONEREG API, subtract it from the
host view of the counter, job done).

>
> Signed-off-by: Nicolas Saenz Julienne <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 1 +
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arch_timer.c | 2 +-
> arch/arm64/kvm/debugfs.c | 25 +++++++++++++++++++++++++
> include/kvm/arm_arch_timer.h | 3 +++
> 5 files changed, 31 insertions(+), 2 deletions(-)
> create mode 100644 arch/arm64/kvm/debugfs.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..130534c9079e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -29,6 +29,7 @@
> #include <asm/thread_info.h>
>
> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>
> #define KVM_HALT_POLL_NS_DEFAULT 500000
>
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..17be7cf770f2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
> $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
> arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
> inject_fault.o va_layout.o handle_exit.o \
> - guest.o debug.o reset.o sys_regs.o \
> + guest.o debug.o debugfs.o reset.o sys_regs.o \
> vgic-sys-reg-v3.o fpsimd.o pmu.o \
> arch_timer.o trng.o\
> vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..ee69387f7fb6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
> }
> }
>
> -static u64 timer_get_offset(struct arch_timer_context *ctxt)
> +u64 timer_get_offset(struct arch_timer_context *ctxt)
> {
> struct kvm_vcpu *vcpu = ctxt->vcpu;
>
> diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> new file mode 100644
> index 000000000000..f0f5083ea8d4
> --- /dev/null
> +++ b/arch/arm64/kvm/debugfs.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021 Red Hat Inc.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/debugfs.h>
> +
> +#include <kvm/arm_arch_timer.h>
> +
> +static int vcpu_get_cntv_offset(void *data, u64 *val)
> +{
> + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> + *val = timer_get_offset(vcpu_vtimer(vcpu));
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> +
> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> +{
> + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> +}

This should be left in arch_timer.c until we actually need it for
multiple subsystems. When (and if) that happens, we will expose
per-subsystem debugfs initialisers instead of exposing the guts of the
timer code.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-11-19 12:59:59

by Marcelo Tosatti

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Fri, Nov 19, 2021 at 12:17:00PM +0000, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 10:21:18 +0000,
> Nicolas Saenz Julienne <[email protected]> wrote:
> >
> > While using cntvct as the raw clock for tracing, it's possible to
> > synchronize host/guest traces just by knowing the virtual offset applied
> > to the guest's virtual counter.
> >
> > This is also the case on x86 when TSC is available. The offset is
> > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > implement the same for arm64.
>
> How does this work with NV, where the guest hypervisor is in control
> of the virtual offset? How does userspace knows which vcpu to pick so
> that it gets the right offset?

On x86, the offsets for different vcpus are the same due to the logic
in the kvm_synchronize_tsc() function:

During guest vcpu creation, when the TSC-clock values are written
within a short window of time (or the clock value is zero), the code
uses the same TSC offset.

This logic is problematic (since "short window of time" is a heuristic
which can fail), and is being replaced by writing the same offset
for each vCPU:

commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <[email protected]>
Date: Thu Sep 16 18:15:38 2021 +0000

KVM: x86: Expose TSC offset controls to userspace

To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

So with that in place, the answer to

How does userspace knows which vcpu to pick so
that it gets the right offset?

is any vcpu, since the offsets are the same.

> I also wonder why we need this when userspace already has direct
> access to that information without any extra kernel support (read the
> CNTVCT view of the vcpu using the ONEREG API, subtract it from the
> host view of the counter, job done).

If the guest has access to the clock offset (between guest and host), then
in the guest:

clockval = hostclockval - clockoffset

Adding "clockoffset" to that will retrieve the host clock.

Is that what you mean?


2021-11-19 13:26:10

by Nicolas Saenz Julienne

Subject: Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

On Fri, 2021-11-19 at 12:00 +0000, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 11:26:24 +0000,
> Marcelo Tosatti <[email protected]> wrote:
> >
> > On Fri, Nov 19, 2021 at 11:21:17AM +0100, Nicolas Saenz Julienne wrote:
> > > Add a new arm64-specific trace clock using the cntvct register, similar
> > > to x64-tsc. This gives us:
> > > - A clock that is relatively fast (1GHz on armv8.6, 1-50MHz otherwise),
> > > monotonic, and resilient to low power modes.
> > > - It can be used to correlate events across cpus as well as across
> > > hypervisor and guests.
> > >
> > > By using arch_timer_read_counter() we make sure that armv8.6 cpus use
> > > the less expensive CNTVCTSS_EL0, which cannot be accessed speculatively.
> >
> > Can this register be read by userspace ? (otherwise it won't be possible
> > to correlate userspace events).
>
> Yes. That's part of the userspace ABI. Although this particular
> accessor is only available from ARMv8.6 and is advertised via a hwcap
> to userspace.
>
> For currently existing implementations, userspace will use the
> CNTVCT_EL0 accessor, which requires extra synchronisation as it can be
> speculated.

To complete Marc's reply, here's an example of how CNTVCT_EL0 is being used in
rt-tests' oslat:

https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/oslat/oslat.c#n87

--
Nicolás Sáenz


2021-11-19 13:31:17

by Marc Zyngier

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Fri, 19 Nov 2021 12:59:46 +0000,
Marcelo Tosatti <[email protected]> wrote:
>
> On Fri, Nov 19, 2021 at 12:17:00PM +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <[email protected]> wrote:
> > >
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > >
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> >
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset? How does userspace knows which vcpu to pick so
> > that it gets the right offset?
>
> On x86, the offsets for different vcpus are the same due to the logic at
> kvm_synchronize_tsc function:
>
> During guest vcpu creation, when the TSC-clock values are written
> in a short window of time (or the clock value is zero), the code uses
> the same TSC.
>
> This logic is problematic (since "short window of time" is a heuristic
> which can fail), and is being replaced by writing the same offset
> for each vCPU:
>
> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
> Author: Oliver Upton <[email protected]>
> Date: Thu Sep 16 18:15:38 2021 +0000
>
> KVM: x86: Expose TSC offset controls to userspace
>
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
>
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
>
> So with that in place, the answer to
>
> How does userspace knows which vcpu to pick so
> that it gets the right offset?
>
> is any vcpu, since the offsets are the same.

As I just said above, this assertion doesn't hold true once you have
nested virt, because the offset is per-cpu, and is adjusted to mean
different things on different hypervisors (some hypervisors expose
stolen time through it, for example).

What this patch does is expose a Linux-specific behaviour and try to
derive properties from it. That really doesn't work in general.

>
> > I also wonder why we need this when userspace already has direct
> > access to that information without any extra kernel support (read the
> > CNTVCT view of the vcpu using the ONEREG API, subtract it from the
> > host view of the counter, job done).
>
> If guest has access to the clock offset (between guest and host), then
> in the guest:
>
> clockval = hostclockval - clockoffset
>
> Adding "clockoffset" to that will retrieve the host clock.
>
> Is that what you mean?

No. The *VMM* (qemu, kvmtool, crosvm, insertyourfavouriteonehere)
already has access to it. Why do we need an extra interface?

M.

--
Without deviation from the norm, progress is not possible.

2021-11-22 14:57:42

by Steven Rostedt

Subject: Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

On Fri, 19 Nov 2021 11:21:17 +0100
Nicolas Saenz Julienne <[email protected]> wrote:

> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM64_TRACE_CLOCK_H
> +#define _ASM_ARM64_TRACE_CLOCK_H
> +
> +#include <linux/types.h>
> +
> +extern u64 notrace trace_clock_arm64_cntvct(void);
> +
> +# define ARCH_TRACE_CLOCKS \
> + { trace_clock_arm64_cntvct, "cntvct", .in_ns = 0 },
> +
> +#endif /* _ASM_ARM64_TRACE_CLOCK_H */

So this will appear as a usable clock in trace-cmd.

And since this will be used to synchronize between host and guest like the
x86_tsc is used, that means that trace-cmd needs to know that this is
an arch "CPU" clock. I wonder if we should rename x86_clock (or at least
make it an alias) to "kvm_clock". Then we can have trace-cmd use
"kvm_clock" as the clock for synchronization between host and guests for
all architectures?

Thinking about this, instead of renaming it, I'll add code to create an
alias for these clocks. Then every arch can pick the clock that is the
same between host and guests, so that user-space tooling doesn't have to
keep a database of which clock each arch uses for host/guest
synchronization.

I'll go add some code ;-)

-- Steve

2021-11-22 20:40:59

by Nicolas Saenz Julienne

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

Hi Marc, thanks for the review.

On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 10:21:18 +0000,
> Nicolas Saenz Julienne <[email protected]> wrote:
> >
> > While using cntvct as the raw clock for tracing, it's possible to
> > synchronize host/guest traces just by knowing the virtual offset applied
> > to the guest's virtual counter.
> >
> > This is also the case on x86 when TSC is available. The offset is
> > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > implement the same for arm64.
>
> How does this work with NV, where the guest hypervisor is in control
> of the virtual offset?

TBH I hadn't thought about NV. Looking at it from that angle, I now see my
approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
entering into a guest, we change CNTVOFF before the host is done with tracing,
so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
the hosts I was testing with use CNTPCT.

I believe the solution would be to be able to force a 0 offset between
guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
imposes a non-zero one by default? I checked out the commits that introduced
that code, but couldn't find a compelling reason. VMMs can always change it
through KVM_REG_ARM_TIMER_CNT afterwards.

> I also wonder why we need this when userspace already has direct access to
> that information without any extra kernel support (read the CNTVCT view of
> the vcpu using the ONEREG API, subtract it from the host view of the counter,
> job done).

Well IIUC, you're at the mercy of how long it takes the ONEREG ioctl to
return, so the results will be skewed. For some workloads, where low
latency is key, we really need high-precision traces in the order of
single-digit us or even 100s of ns. I'm not sure you'll be able to get
there with that approach.

[...]

> > diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> > new file mode 100644
> > index 000000000000..f0f5083ea8d4
> > --- /dev/null
> > +++ b/arch/arm64/kvm/debugfs.c
> > @@ -0,0 +1,25 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2021 Red Hat Inc.
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <linux/debugfs.h>
> > +
> > +#include <kvm/arm_arch_timer.h>
> > +
> > +static int vcpu_get_cntv_offset(void *data, u64 *val)
> > +{
> > + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > +
> > + *val = timer_get_offset(vcpu_vtimer(vcpu));
> > +
> > + return 0;
> > +}
> > +
> > +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> > +
> > +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> > +{
> > + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> > +}
>
> This should be left in arch_timer.c until we actually need it for
> multiple subsystems. When (and if) that happens, we will expose
> per-subsystem debugfs initialisers instead of exposing the guts of the
> timer code.

Noted.

--
Nicolás Sáenz


2021-11-23 11:09:24

by Marc Zyngier

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Mon, 22 Nov 2021 20:40:52 +0000,
Nicolas Saenz Julienne <[email protected]> wrote:
>
> Hi Marc, thanks for the review.
>
> On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <[email protected]> wrote:
> > >
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > >
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> >
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset?
>
> TBH I handn't thought about NV. Looking at it from that angle, I now see my
> approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
> entering into a guest, we change CNTVOFF before the host is done with tracing,
> so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
> the hosts I was testing with use CNTPCT.

There are multiple things at play here:

- if the system is a host, the kernel will use CNTPCT. Userspace will
still use CNTVCT, and the offset is guaranteed to be 0 *when running
userspace*.

- if the system isn't a host (which doesn't necessarily means a
guest), CNTVCT is the only thing that is being used, and the offset
is unknown (Linux requires it to be constant across vcpus though).

So I doubt you'd get a bad timestamp on the host. It is just that you
have named your trace clock incorrectly (and Steven's idea of an
indirected clock could help here).

> I believe the solution would be to be able to force a 0 offset between
> guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
> imposes a non-zero one by default? I checked out the commits that introduced
> that code, but couldn't find a compelling reason. VMMs can always change it
> through KVM_REG_ARM_TIMER_CNT afterwards.

We want to minimise the chance for an observable rollover of the
virtual counter, so time starts at 0 *in the guest*. The VMM can
change the view of that time for the purpose of migration.

If you want a 0 offset, set the counter to the physical value in the
VMM (imprecise) or have a look at Oliver Upton's patches that were
allowing an offset to be specified directly. But migration, by
definition, breaks this.

>
> > I also wonder why we need this when userspace already has direct access to
> > that information without any extra kernel support (read the CNTVCT view of
> > the vcpu using the ONEREG API, subtract it from the host view of the counter,
> > job done).
>
> Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
> ioctl. The results will be skewed. For some workloads, where low latency is
> key, we really need high precision traces in the order of single digit us or
> even 100s of ns. I'm not sure you'll be able to get there with that approach.

The PTP clock does exactly that from the guest PoV, with a lot more
overhead, and this results in single digit ns precision. Why isn't
that possible from userspace?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-11-24 09:45:51

by Nicolas Saenz Julienne

Subject: Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

On Mon, 2021-11-22 at 09:57 -0500, Steven Rostedt wrote:
> On Fri, 19 Nov 2021 11:21:17 +0100
> Nicolas Saenz Julienne <[email protected]> wrote:
>
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_ARM64_TRACE_CLOCK_H
> > +#define _ASM_ARM64_TRACE_CLOCK_H
> > +
> > +#include <linux/types.h>
> > +
> > +extern u64 notrace trace_clock_arm64_cntvct(void);
> > +
> > +# define ARCH_TRACE_CLOCKS \
> > + { trace_clock_arm64_cntvct, "cntvct", .in_ns = 0 },
> > +
> > +#endif /* _ASM_ARM64_TRACE_CLOCK_H */
>
> So this will appear as a usable clock in trace-cmd.
>
> And since this will be used to synchronize between host and guest like the
> > x86_tsc is used, that means that trace-cmd needs to know that this is
> an arch "CPU" clock. I wonder if we should rename x86_clock (or at least
> make it an alias) to "kvm_clock". Then we can have trace-cmd use
> "kvm_clock" as the clock for synchronization between host and guests for
> all architectures?
>
> Thinking about this, instead of renaming it, I'll add code to create an
> alias to these clocks. Then every arch can pick what clock is used that is
> the same between hosts and guests such that user space tooling doesn't have
> to keep a database of what clocks are used for synchronization between
> hosts and guests for each arch.
>
> I'll go add some code ;-)

I really like the idea, please keep me in the loop if you send something
upstream.

--
Nicolás Sáenz


2021-11-29 13:03:14

by Marcelo Tosatti

Subject: Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

On Mon, Nov 22, 2021 at 09:40:52PM +0100, Nicolas Saenz Julienne wrote:
> Hi Marc, thanks for the review.
>
> On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <[email protected]> wrote:
> > >
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > >
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> >
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset?
>
> TBH I hadn't thought about NV. Looking at it from that angle, I now see my
> approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
> entering into a guest, we change CNTVOFF before the host is done with tracing,
> so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
> the hosts I was testing with use CNTPCT.
>
> I believe the solution would be to be able to force a 0 offset between
> guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
> imposes a non-zero one by default? I checked out the commits that introduced
> that code, but couldn't find a compelling reason. VMMs can always change it
> through KVM_REG_ARM_TIMER_CNT afterwards.

One reason is that you leak information from host to guest (the host's
TSC value).

Another reason would be that you introduce a configuration which is
different from what the hardware has, which can in theory trigger
guest bugs.

> > I also wonder why we need this when userspace already has direct access to
> > that information without any extra kernel support (read the CNTVCT view of
> > the vcpu using the ONEREG API, subtract it from the host view of the counter,
> > job done).
>
> Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
> ioctl. The results will be skewed. For some workloads, where low latency is
> key, we really need high precision traces in the order of single digit us or
> even 100s of ns. I'm not sure you'll be able to get there with that approach.

If the guest can read the host to guest HW clock offset already, it
could directly do the conversion.

> [...]
>
> > > diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> > > new file mode 100644
> > > index 000000000000..f0f5083ea8d4
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/debugfs.c
> > > @@ -0,0 +1,25 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Copyright (C) 2021 Red Hat Inc.
> > > + */
> > > +
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/debugfs.h>
> > > +
> > > +#include <kvm/arm_arch_timer.h>
> > > +
> > > +static int vcpu_get_cntv_offset(void *data, u64 *val)
> > > +{
> > > + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > > +
> > > + *val = timer_get_offset(vcpu_vtimer(vcpu));
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> > > +
> > > +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> > > +{
> > > + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> > > +}
> >
> > This should be left in arch_timer.c until we actually need it for
> > multiple subsystems. When (and if) that happens, we will expose
> > per-subsystem debugfs initialisers instead of exposing the guts of the
> > timer code.
>
> Noted.
>
> --
> Nicolás Sáenz