2015-04-23 11:47:11

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH] kvm: x86: fix kvmclock update protocol

From: Radim Krčmář <[email protected]>

The kvmclock spec says that the host will increment a version field to
an odd number, then update stuff, then increment it to an even number.
The host is buggy and doesn't do this, and the result is observable
when one vcpu reads another vcpu's kvmclock data.

There's no good way for a guest kernel to keep its vdso from reading
a different vcpu's kvmclock data, but we don't need to care about
changing VCPUs as long as we read a consistent data from kvmclock.
(VCPU can change outside of this loop too, so it doesn't matter if we
return a value not fit for this VCPU.)

Based on a patch by Radim Krčmář.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed31c31b2485..c73efcd03e29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1669,12 +1669,28 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
&guest_hv_clock, sizeof(guest_hv_clock))))
return 0;

- /*
- * The interface expects us to write an even number signaling that the
- * update is finished. Since the guest won't see the intermediate
- * state, we just increase by 2 at the end.
+ /* This VCPU is paused, but it's legal for a guest to read another
+ * VCPU's kvmclock, so we really have to follow the specification where
+ * it says that version is odd if data is being modified, and even after
+ * it is consistent.
+ *
+ * Version field updates must be kept separate. This is because
+ * kvm_write_guest_cached might use a "rep movs" instruction, and
+ * writes within a string instruction are weakly ordered. So there
+ * are three writes overall.
+ *
+ * As a small optimization, only write the version field in the first
+ * and third write. The vcpu->pv_time cache is still valid, because the
+ * version field is the first in the struct.
*/
- vcpu->hv_clock.version = guest_hv_clock.version + 2;
+ BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
+
+ vcpu->hv_clock.version = guest_hv_clock.version + 1;
+ kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+ &vcpu->hv_clock,
+ sizeof(vcpu->hv_clock.version));
+
+ smp_wmb();

/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1695,6 +1711,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
&vcpu->hv_clock,
sizeof(vcpu->hv_clock));
+
+ smp_wmb();
+
+ vcpu->hv_clock.version++;
+ kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+ &vcpu->hv_clock,
+ sizeof(vcpu->hv_clock.version));
return 0;
}

--
1.8.3.1


2015-04-23 15:24:13

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH] kvm: x86: fix kvmclock update protocol

2015-04-23 13:46+0200, Paolo Bonzini:
> From: Radim Krčmář <[email protected]>
>
> The kvmclock spec says that the host will increment a version field to
> an odd number, then update stuff, then increment it to an even number.
> The host is buggy and doesn't do this, and the result is observable
> when one vcpu reads another vcpu's kvmclock data.
>
> There's no good way for a guest kernel to keep its vdso from reading
> a different vcpu's kvmclock data, but we don't need to care about
> changing VCPUs as long as we read a consistent data from kvmclock.
> (VCPU can change outside of this loop too, so it doesn't matter if we
> return a value not fit for this VCPU.)
>
> Based on a patch by Radim Krčmář.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Nice,

Reviewed-by: Radim Krčmář <[email protected]>

2015-04-24 19:17:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] kvm: x86: fix kvmclock update protocol

On Thu, Apr 23, 2015 at 01:46:55PM +0200, Paolo Bonzini wrote:
> From: Radim Krčmář <[email protected]>
>
> The kvmclock spec says that the host will increment a version field to
> an odd number, then update stuff, then increment it to an even number.
> The host is buggy and doesn't do this, and the result is observable
> when one vcpu reads another vcpu's kvmclock data.
>
> There's no good way for a guest kernel to keep its vdso from reading
> a different vcpu's kvmclock data, but we don't need to care about
> changing VCPUs as long as we read a consistent data from kvmclock.
> (VCPU can change outside of this loop too, so it doesn't matter if we
> return a value not fit for this VCPU.)
>
> Based on a patch by Radim Krčmář.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ed31c31b2485..c73efcd03e29 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1669,12 +1669,28 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> &guest_hv_clock, sizeof(guest_hv_clock))))
> return 0;
>
> - /*
> - * The interface expects us to write an even number signaling that the
> - * update is finished. Since the guest won't see the intermediate
> - * state, we just increase by 2 at the end.
> + /* This VCPU is paused, but it's legal for a guest to read another
> + * VCPU's kvmclock, so we really have to follow the specification where
> + * it says that version is odd if data is being modified, and even after
> + * it is consistent.
> + *
> + * Version field updates must be kept separate. This is because
> + * kvm_write_guest_cached might use a "rep movs" instruction, and
> + * writes within a string instruction are weakly ordered. So there
> + * are three writes overall.
> + *
> + * As a small optimization, only write the version field in the first
> + * and third write. The vcpu->pv_time cache is still valid, because the
> + * version field is the first in the struct.
> */
> - vcpu->hv_clock.version = guest_hv_clock.version + 2;
> + BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
> +
> + vcpu->hv_clock.version = guest_hv_clock.version + 1;
> + kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
> + &vcpu->hv_clock,
> + sizeof(vcpu->hv_clock.version));
> +
> + smp_wmb();
>
> /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
> pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
> @@ -1695,6 +1711,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
> &vcpu->hv_clock,
> sizeof(vcpu->hv_clock));
> +
> + smp_wmb();
> +
> + vcpu->hv_clock.version++;
> + kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
> + &vcpu->hv_clock,
> + sizeof(vcpu->hv_clock.version));
> return 0;
> }
>
> --
> 1.8.3.1

Acked-by: Marcelo Tosatti <[email protected]>