This is a dramatic simplification and speedup of the vdso pvclock read
code. Is it correct?
Andy Lutomirski (2):
x86, vdso: Use asm volatile in __getcpu
x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
arch/x86/include/asm/vgtod.h | 6 ++--
arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
2 files changed, 51 insertions(+), 37 deletions(-)
--
2.1.0
In Linux 3.18 and below, GCC hoists the lsl instructions in the
pvclock code all the way to the beginning of __vdso_clock_gettime,
slowing the non-paravirt case significantly. For unknown reasons,
presumably related to the removal of a branch, the performance issue
is gone as of
e76b027e6408 x86,vdso: Use LSL unconditionally for vgetcpu
but I don't trust GCC enough to expect the problem to stay fixed.
There should be no correctness issue, because the __getcpu calls in
__vdso_clock_gettime were never necessary in the first place.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vgtod.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index e7e9682a33e9..f556c4843aa1 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -80,9 +80,11 @@ static inline unsigned int __getcpu(void)
/*
* Load per CPU data from GDT. LSL is faster than RDTSCP and
- * works on all CPUs.
+ * works on all CPUs. This is volatile so that it orders
+ * correctly wrt barrier() and to keep gcc from cleverly
+ * hoisting it out of the calling function.
*/
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+ asm volatile ("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
return p;
}
--
2.1.0
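A minimal illustration of the hoisting hazard the commit message describes (an illustrative sketch, not part of the patch; MY_PER_CPU_SEG is a placeholder for the real __PER_CPU_SEG selector value): without volatile, GCC treats the asm as a pure function of its inputs and is free to merge or move it, while volatile pins it to its call site and lets it order against barrier().

/* Illustrative sketch only; MY_PER_CPU_SEG stands in for __PER_CPU_SEG. */
#define MY_PER_CPU_SEG 0x7b

static inline unsigned int getcpu_hoistable(void)
{
	unsigned int p;
	/* Plain asm: the output depends only on a constant input, so GCC
	 * may CSE it or hoist it ahead of unrelated branches. */
	asm("lsl %1,%0" : "=r" (p) : "r" (MY_PER_CPU_SEG));
	return p;
}

static inline unsigned int getcpu_pinned(void)
{
	unsigned int p;
	/* asm volatile: stays at its call site and orders correctly with
	 * respect to barrier(). */
	asm volatile("lsl %1,%0" : "=r" (p) : "r" (MY_PER_CPU_SEG));
	return p;
}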
The pvclock vdso code was too abstracted to understand easily and
excessively paranoid. Simplify it for a huge speedup.
This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.
Before, vclock_gettime using kvm-clock took about 64ns on my machine.
With this change, it takes 19ns, which is almost as fast as the pure TSC
implementation.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
1 file changed, 47 insertions(+), 35 deletions(-)
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322751e0..f2e0396d5629 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
static notrace cycle_t vread_pvclock(int *mode)
{
- const struct pvclock_vsyscall_time_info *pvti;
+ const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
- u64 last;
- u32 version;
- u8 flags;
- unsigned cpu, cpu1;
-
+ u64 tsc, pvti_tsc;
+ u64 last, delta, pvti_system_time;
+ u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
/*
- * Note: hypervisor must guarantee that:
- * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
- * 2. that per-CPU pvclock time info is updated if the
- * underlying CPU changes.
- * 3. that version is increased whenever underlying CPU
- * changes.
+ * Note: The kernel and hypervisor must guarantee that cpu ID
+ * number maps 1:1 to per-CPU pvclock time info.
+ *
+ * Because the hypervisor is entirely unaware of guest userspace
+ * preemption, it cannot guarantee that per-CPU pvclock time
+ * info is updated if the underlying CPU changes or that that
+ * version is increased whenever underlying CPU changes.
+ *
+ * On KVM, we are guaranteed that pvti updates for any vCPU are
+ * atomic as seen by *all* vCPUs. This is an even stronger
+ * guarantee than we get with a normal seqlock.
*
+ * On Xen, we don't appear to have that guarantee, but Xen still
+ * supplies a valid seqlock using the version field.
+
+ * We only do pvclock vdso timing at all if
+ * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+ * mean that all vCPUs have matching pvti and that the TSC is
+ * synced, so we can just look at vCPU 0's pvti.
*/
- do {
- cpu = __getcpu() & VGETCPU_CPU_MASK;
- /* TODO: We can put vcpu id into higher bits of pvti.version.
- * This will save a couple of cycles by getting rid of
- * __getcpu() calls (Gleb).
- */
-
- pvti = get_pvti(cpu);
-
- version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
- /*
- * Test we're still on the cpu as well as the version.
- * We could have been migrated just after the first
- * vgetcpu but before fetching the version, so we
- * wouldn't notice a version change.
- */
- cpu1 = __getcpu() & VGETCPU_CPU_MASK;
- } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
- if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+ if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+ return 0;
+ }
+
+ do {
+ version = pvti->version;
+
+ /* This is also a read barrier, so we'll read version first. */
+ rdtsc_barrier();
+ tsc = __native_read_tsc();
+
+ pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+ pvti_tsc_shift = pvti->tsc_shift;
+ pvti_system_time = pvti->system_time;
+ pvti_tsc = pvti->tsc_timestamp;
+
+ /* Make sure that the version double-check is last. */
+ smp_rmb();
+ } while (unlikely((version & 1) || version != pvti->version));
+
+ delta = tsc - pvti_tsc;
+ ret = pvti_system_time +
+ pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+ pvti_tsc_shift);
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
--
2.1.0
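For readers unfamiliar with pvclock_scale_delta(), the nanosecond computation at the end of the new reader amounts to the usual pvclock fixed-point scaling. A hedged sketch of that arithmetic follows (the semantics of tsc_shift and tsc_to_system_mul are assumed from the pvclock ABI; the helper name is illustrative):

/* Sketch of the scaling vread_pvclock() relies on (assumed pvclock ABI
 * semantics): shift the TSC delta by tsc_shift, then multiply by the
 * 32.32 fixed-point factor tsc_to_system_mul and keep the high 64 bits
 * of the product. */
static inline unsigned long long scale_delta_sketch(unsigned long long delta,
						    unsigned int mul_frac,
						    signed char shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (unsigned long long)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* ns = pvti_system_time + scale_delta_sketch(tsc - pvti_tsc,
 *					       pvti_tsc_to_system_mul,
 *					       pvti_tsc_shift); */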
On 23/12/2014 01:39, Andy Lutomirski wrote:
> This is a dramatic simplification and speedup of the vdso pvclock read
> code. Is it correct?
>
> Andy Lutomirski (2):
> x86, vdso: Use asm volatile in __getcpu
> x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
Patch 1 is ok,
Acked-by: Paolo Bonzini <[email protected]>
For patch 2 I will defer to Marcelo and Glauber (and the Xen folks).
Paolo
On Mon, Dec 22, 2014 at 11:21 PM, Paolo Bonzini <[email protected]> wrote:
>
>
> On 23/12/2014 01:39, Andy Lutomirski wrote:
>> This is a dramatic simplification and speedup of the vdso pvclock read
>> code. Is it correct?
>>
>> Andy Lutomirski (2):
>> x86, vdso: Use asm volatile in __getcpu
>> x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
>
> Patch 1 is ok,
>
> Acked-by: Paolo Bonzini <[email protected]>
Any thoughts as to whether it should be tagged for stable? I haven't
looked closely enough at the old pvclock code or the generated code to
have much of an opinion there. It'll be a big speedup for non-pvclock
users at least.
--Andy
>
> For patch 2 I will defer to Marcelo and Glauber (and the Xen folks).
>
> Paolo
--
Andy Lutomirski
AMA Capital Management, LLC
On 23/12/2014 09:16, Andy Lutomirski wrote:
> Any thoughts as to whether it should be tagged for stable? I haven't
> looked closely enough at the old pvclock code or the generated code to
> have much of an opinion there. It'll be a big speedup for non-pvclock
> users at least.
Yes, please.
Paolo
On 23/12/14 00:39, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid. Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
This sounds plausible but I'm not going to be able to give it a detailed
look until the new year.
David
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
> static notrace cycle_t vread_pvclock(int *mode)
> {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
> /*
> - * Note: hypervisor must guarantee that:
> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> - * 2. that per-CPU pvclock time info is updated if the
> - * underlying CPU changes.
> - * 3. that version is increased whenever underlying CPU
> - * changes.
> + * Note: The kernel and hypervisor must guarantee that cpu ID
> + * number maps 1:1 to per-CPU pvclock time info.
> + *
> + * Because the hypervisor is entirely unaware of guest userspace
> + * preemption, it cannot guarantee that per-CPU pvclock time
> + * info is updated if the underlying CPU changes or that that
> + * version is increased whenever underlying CPU changes.
> + *
> + * On KVM, we are guaranteed that pvti updates for any vCPU are
> + * atomic as seen by *all* vCPUs. This is an even stronger
> + * guarantee than we get with a normal seqlock.
> *
> + * On Xen, we don't appear to have that guarantee, but Xen still
> + * supplies a valid seqlock using the version field.
> +
> + * We only do pvclock vdso timing at all if
> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> + * mean that all vCPUs have matching pvti and that the TSC is
> + * synced, so we can just look at vCPU 0's pvti.
> */
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> - * This will save a couple of cycles by getting rid of
> - * __getcpu() calls (Gleb).
> - */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> - * Test we're still on the cpu as well as the version.
> - * We could have been migrated just after the first
> - * vgetcpu but before fetching the version, so we
> - * wouldn't notice a version change.
> - */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> - (pvti->pvti.version & 1) ||
> - pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + rdtsc_barrier();
> + tsc = __native_read_tsc();
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>
> /* refer to tsc.c read_tsc() comment for rationale */
> last = gtod->cycle_last;
>
On 12/22/2014 07:39 PM, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid. Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> 1 file changed, 47 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
> static notrace cycle_t vread_pvclock(int *mode)
> {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
> /*
> - * Note: hypervisor must guarantee that:
> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> - * 2. that per-CPU pvclock time info is updated if the
> - * underlying CPU changes.
> - * 3. that version is increased whenever underlying CPU
> - * changes.
> + * Note: The kernel and hypervisor must guarantee that cpu ID
> + * number maps 1:1 to per-CPU pvclock time info.
> + *
> + * Because the hypervisor is entirely unaware of guest userspace
> + * preemption, it cannot guarantee that per-CPU pvclock time
> + * info is updated if the underlying CPU changes or that that
> + * version is increased whenever underlying CPU changes.
> + *
> + * On KVM, we are guaranteed that pvti updates for any vCPU are
> + * atomic as seen by *all* vCPUs. This is an even stronger
> + * guarantee than we get with a normal seqlock.
> *
> + * On Xen, we don't appear to have that guarantee, but Xen still
> + * supplies a valid seqlock using the version field.
> +
> + * We only do pvclock vdso timing at all if
> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> + * mean that all vCPUs have matching pvti and that the TSC is
> + * synced, so we can just look at vCPU 0's pvti.
> */
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> - * This will save a couple of cycles by getting rid of
> - * __getcpu() calls (Gleb).
> - */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> - * Test we're still on the cpu as well as the version.
> - * We could have been migrated just after the first
> - * vgetcpu but before fetching the version, so we
> - * wouldn't notice a version change.
> - */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> - (pvti->pvti.version & 1) ||
> - pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + rdtsc_barrier();
> + tsc = __native_read_tsc();
This will cause VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE which is
used, for example, after guest migrated (unless HW is capable of scaling
TSC rate).
-boris
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>
> /* refer to tsc.c read_tsc() comment for rationale */
> last = gtod->cycle_last;
On 23/12/2014 16:14, Boris Ostrovsky wrote:
>> + do {
>> + version = pvti->version;
>> +
>> + /* This is also a read barrier, so we'll read version first. */
>> + rdtsc_barrier();
>> + tsc = __native_read_tsc();
>
>
> This will cause VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE which is
> used, for example, after guest migrated (unless HW is capable of scaling
> TSC rate).
So does the __pvclock_read_cycles this is replacing (via
pvclock_get_nsec_offset).
Paolo
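For context, the helper chain being replaced looks roughly like the sketch below (reconstructed from memory of the 3.18-era code rather than quoted from it), which is why the old reader already executed a raw TSC read on every call:

/* Rough sketch of the old path (not verbatim):
 * __pvclock_read_cycles() -> pvclock_get_nsec_offset() */
static __always_inline
u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
{
	u64 delta = __native_read_tsc() - src->tsc_timestamp; /* same raw RDTSC */
	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
				   src->tsc_shift);
}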
On 12/23/2014 10:14 AM, Paolo Bonzini wrote:
>
> On 23/12/2014 16:14, Boris Ostrovsky wrote:
>>> + do {
>>> + version = pvti->version;
>>> +
>>> + /* This is also a read barrier, so we'll read version first. */
>>> + rdtsc_barrier();
>>> + tsc = __native_read_tsc();
>>
>> This will cause VMEXIT on Xen with TSC_MODE_ALWAYS_EMULATE which is
>> used, for example, after guest migrated (unless HW is capable of scaling
>> TSC rate).
> So does the __pvclock_read_cycles this is replacing (via
> pvclock_get_nsec_offset).
Right, I didn't notice that.
-boris
On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski <[email protected]> wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid. Simplify it for a huge speedup.
>
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
>
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
> 1 file changed, 47 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>
> static notrace cycle_t vread_pvclock(int *mode)
> {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>
> /*
> - * Note: hypervisor must guarantee that:
> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> - * 2. that per-CPU pvclock time info is updated if the
> - * underlying CPU changes.
> - * 3. that version is increased whenever underlying CPU
> - * changes.
> + * Note: The kernel and hypervisor must guarantee that cpu ID
> + * number maps 1:1 to per-CPU pvclock time info.
> + *
> + * Because the hypervisor is entirely unaware of guest userspace
> + * preemption, it cannot guarantee that per-CPU pvclock time
> + * info is updated if the underlying CPU changes or that that
> + * version is increased whenever underlying CPU changes.
> + *
> + * On KVM, we are guaranteed that pvti updates for any vCPU are
> + * atomic as seen by *all* vCPUs. This is an even stronger
> + * guarantee than we get with a normal seqlock.
> *
> + * On Xen, we don't appear to have that guarantee, but Xen still
> + * supplies a valid seqlock using the version field.
> +
Forgotten * here?
> + * We only do pvclock vdso timing at all if
> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> + * mean that all vCPUs have matching pvti and that the TSC is
> + * synced, so we can just look at vCPU 0's pvti.
> */
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> - * This will save a couple of cycles by getting rid of
> - * __getcpu() calls (Gleb).
> - */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> - * Test we're still on the cpu as well as the version.
> - * We could have been migrated just after the first
> - * vgetcpu but before fetching the version, so we
> - * wouldn't notice a version change.
> - */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> - (pvti->pvti.version & 1) ||
> - pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + rdtsc_barrier();
> + tsc = __native_read_tsc();
Is there a reason why you read the tsc inside the loop rather than once
after the loop?
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>
> /* refer to tsc.c read_tsc() comment for rationale */
> last = gtod->cycle_last;
> --
> 2.1.0
>
On Wed, Dec 24, 2014 at 1:30 PM, David Matlack <[email protected]> wrote:
> On Mon, Dec 22, 2014 at 4:39 PM, Andy Lutomirski <[email protected]> wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid. Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------
>> 1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
>>
>> static notrace cycle_t vread_pvclock(int *mode)
>> {
>> - const struct pvclock_vsyscall_time_info *pvti;
>> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>> cycle_t ret;
>> - u64 last;
>> - u32 version;
>> - u8 flags;
>> - unsigned cpu, cpu1;
>> -
>> + u64 tsc, pvti_tsc;
>> + u64 last, delta, pvti_system_time;
>> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>> /*
>> - * Note: hypervisor must guarantee that:
>> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> - * 2. that per-CPU pvclock time info is updated if the
>> - * underlying CPU changes.
>> - * 3. that version is increased whenever underlying CPU
>> - * changes.
>> + * Note: The kernel and hypervisor must guarantee that cpu ID
>> + * number maps 1:1 to per-CPU pvclock time info.
>> + *
>> + * Because the hypervisor is entirely unaware of guest userspace
>> + * preemption, it cannot guarantee that per-CPU pvclock time
>> + * info is updated if the underlying CPU changes or that that
>> + * version is increased whenever underlying CPU changes.
>> + *
>> + * On KVM, we are guaranteed that pvti updates for any vCPU are
>> + * atomic as seen by *all* vCPUs. This is an even stronger
>> + * guarantee than we get with a normal seqlock.
>> *
>> + * On Xen, we don't appear to have that guarantee, but Xen still
>> + * supplies a valid seqlock using the version field.
>> +
>
> Forgotten * here?
>
>> + * We only do pvclock vdso timing at all if
>> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> + * mean that all vCPUs have matching pvti and that the TSC is
>> + * synced, so we can just look at vCPU 0's pvti.
>> */
>> - do {
>> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>> - /* TODO: We can put vcpu id into higher bits of pvti.version.
>> - * This will save a couple of cycles by getting rid of
>> - * __getcpu() calls (Gleb).
>> - */
>> -
>> - pvti = get_pvti(cpu);
>> -
>> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> - /*
>> - * Test we're still on the cpu as well as the version.
>> - * We could have been migrated just after the first
>> - * vgetcpu but before fetching the version, so we
>> - * wouldn't notice a version change.
>> - */
>> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> - } while (unlikely(cpu != cpu1 ||
>> - (pvti->pvti.version & 1) ||
>> - pvti->pvti.version != version));
>> -
>> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>> *mode = VCLOCK_NONE;
>> + return 0;
>> + }
>> +
>> + do {
>> + version = pvti->version;
>> +
>> + /* This is also a read barrier, so we'll read version first. */
>> + rdtsc_barrier();
>> + tsc = __native_read_tsc();
>
> Is there a reason why you read the tsc inside the loop rather than once
> after the loop?
I want to make sure that the tsc value used is consistent with the
scale and offset. Otherwise it would be possible to read the pvti
data, then get preempted and sleep for a long time before rdtsc. The
result could be a time value larger than an immediate subsequent call
would return.
--Andy
>
>> +
>> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> + pvti_tsc_shift = pvti->tsc_shift;
>> + pvti_system_time = pvti->system_time;
>> + pvti_tsc = pvti->tsc_timestamp;
>> +
>> + /* Make sure that the version double-check is last. */
>> + smp_rmb();
>> + } while (unlikely((version & 1) || version != pvti->version));
>> +
>> + delta = tsc - pvti_tsc;
>> + ret = pvti_system_time +
>> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>> + pvti_tsc_shift);
>>
>> /* refer to tsc.c read_tsc() comment for rationale */
>> last = gtod->cycle_last;
>> --
>> 2.1.0
>>
--
Andy Lutomirski
AMA Capital Management, LLC
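The hazard Andy describes can be sketched as follows (an illustrative broken variant, not code from the thread; rdtsc() and scale() are placeholder names, and the pvti fields follow the pvclock ABI): sampling the TSC after the version-checked loop lets the parameters go stale relative to the TSC sample.

/* Broken variant for illustration only: TSC sampled outside the loop. */
static u64 vread_pvclock_broken(const struct pvclock_vcpu_time_info *pvti)
{
	u32 version, mul;
	s8 shift;
	u64 base, stamp, tsc;

	do {
		version = pvti->version;
		smp_rmb();
		mul   = pvti->tsc_to_system_mul;
		shift = pvti->tsc_shift;
		base  = pvti->system_time;
		stamp = pvti->tsc_timestamp;
		smp_rmb();
	} while ((version & 1) || version != pvti->version);

	/* The task can sleep here for a long time, during which the host
	 * may republish pvti with a new (base, stamp) pair ... */

	tsc = rdtsc();	/* fresh TSC, stale mul/shift/base/stamp */

	/* ... so this can exceed what an immediately following, correct
	 * read (TSC sampled inside the loop) would return. */
	return base + scale(tsc - stamp, mul, shift);
}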
On Thu, Jan 8, 2015 at 2:43 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <[email protected]> wrote:
>> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <[email protected]> wrote:
>>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <[email protected]> wrote:
>>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <[email protected]> wrote:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>>> >> >> > > > > Still confused. So we can freeze all vCPUs in the host, then update
>>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have
>>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>>> >> >> > > > > doesn't increment the version pre-update, and we can return completely
>>> >> >> > > > > bogus results.
>>> >> >> > > > Yes.
>>> >> >> > > But then the getcpu test would fail (1->0). Even if you have an ABA
>>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>>> >> >> > > one returned by the first getcpu.
>>> >> >> >
>>> >> >> > ... this case of partial update of pvti, which is caught by the version
>>> >> >> > field, is of course different from the other (extremely unlikely) that
>>> >> >> > Andy pointed out. That is when the getcpus are done on the same vCPU,
>>> >> >> > but the rdtsc is another.
>>> >> >> >
>>> >> >> > That one can be fixed by rdtscp, like
>>> >> >> >
>>> >> >> > do {
>>> >> >> > // get a consistent (pvti, v, tsc) tuple
>>> >> >> > do {
>>> >> >> > cpu = get_cpu();
>>> >> >> > pvti = get_pvti(cpu);
>>> >> >> > v = pvti->version & ~1;
>>> >> >> > // also acts as rmb();
>>> >> >> > rdtsc_barrier();
>>> >> >> > tsc = rdtscp(&cpu1);
>>> >> >>
>>> >> >> Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD
>>> >> >> specified it that way and both AMD and Intel implement it correctly.
>>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>>> >> >>
>>> >> >> > // control dependency, no need for rdtsc_barrier?
>>> >> >> > } while(cpu != cpu1);
>>> >> >> >
>>> >> >> > // ... compute nanoseconds from pvti and tsc ...
>>> >> >> > rmb();
>>> >> >> > } while(v != pvti->version);
>>> >> >>
>>> >> >> Still no good. We can migrate a bunch of times so we see the same CPU
>>> >> >> all three times and *still* don't get a consistent read, unless we
>>> >> >> play nasty games with lots of version checks (I have a patch for that,
>>> >> >> but I don't like it very much). The patch is here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>> >> >>
>>> >> >> but I don't like it.
>>> >> >>
>>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>>> >> >> while it's being written, and I think you're now telling me that this
>>> >> >> isn't true and that a guest *can* observe pvti while it's being
>>> >> >> written while the low bit of the version field is not set. If so,
>>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>>> >> >>
>>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>>> >> >> semantics that KVM provides are and could write it down in a way that
>>> >> >> people who haven't spent a long time staring at the request code
>>> >> >> understand? And maybe you could even fix the implementation while
>>> >> >> you're at it if the implementation is, indeed, broken. I have ugly
>>> >> >> patches to fix it here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>>> >> >>
>>> >> >> but I'm not thrilled with them.
>>> >> >>
>>> >> >> --Andy
>>> >> >
>>> >> > I suppose that separating the version write from the rest of the pvclock
>>> >> > structure is sufficient, as that would guarantee the writes are not
>>> >> > reordered even with fast string REP MOVS.
>>> >> >
>>> >> > Thanks for catching this Andy!
>>> >> >
>>> >>
>>> >> Don't you stil need:
>>> >>
>>> >> version++;
>>> >> write the rest;
>>> >> version++;
>>> >>
>>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>>> >
>>> > Correct. Could just as well follow the protocol and use odd/even, which
>>> > is what your patch does.
>>> >
>>> > What is the point with the new flags bit though?
>>>
>>> To try to work around the problem on old hosts. I'm not at all
>>> convinced that this is worthwhile or that it helps, though.
>>
>> Andy,
>>
>> Are you going to submit the fix or should i?
>>
>
> I'd prefer if you did it. I'm not familiar enough with the KVM memory
> management stuff to do it confidently. Feel free to mooch from my
> patch if it's helpful.
Any update here? I can try it myself if no one else wants to do it.
--Andy
>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
Andy Lutomirski
AMA Capital Management, LLC
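The writer-side fix Marcelo and Andy converge on (bump the version to odd, publish the payload, bump it back to even, with write barriers in between) can be sketched as below; this is an illustrative outline of the odd/even seqcount convention, not the actual KVM patch:

/* Illustrative writer-side sketch (not the actual KVM code); assumes the
 * pvclock ABI struct and kernel barriers. */
static void publish_pvti(struct pvclock_vcpu_time_info *dst,
			 const struct pvclock_vcpu_time_info *src)
{
	dst->version++;			/* odd: readers retry */
	smp_wmb();			/* bump visible before payload */

	dst->tsc_timestamp     = src->tsc_timestamp;
	dst->system_time       = src->system_time;
	dst->tsc_to_system_mul = src->tsc_to_system_mul;
	dst->tsc_shift         = src->tsc_shift;
	dst->flags             = src->flags;

	smp_wmb();			/* payload visible before second bump */
	dst->version++;			/* even: payload consistent */
}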