2020-01-30 18:04:29

by Paolo Bonzini

Subject: [FYI PATCH 0/5] Missing TLB flushes

From: Boris Ostrovsky <[email protected]>

The KVM hypervisor may provide a guest with the ability to defer a remote
TLB flush when the remote VCPU is not running. When this feature is used,
the TLB flush will happen only when the remote VCPU is scheduled to run
again. This avoids unnecessary (and expensive) IPIs.
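
For context, the guest requests such a deferred flush roughly as follows;
this is a simplified sketch of the PV TLB flush path in arch/x86/kernel/kvm.c
(identifier names such as __pv_tlb_mask vary between kernel versions), not
the exact code:

	static void kvm_flush_tlb_others(const struct cpumask *cpumask,
					 const struct flush_tlb_info *info)
	{
		u8 state;
		int cpu;
		struct kvm_steal_time *src;
		struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);

		cpumask_copy(flushmask, cpumask);
		/*
		 * Running vCPUs are flushed via IPI as usual.  For a preempted
		 * vCPU the guest only sets KVM_VCPU_FLUSH_TLB in the shared
		 * steal_time page and relies on the hypervisor to perform the
		 * flush before that vCPU runs again.
		 */
		for_each_cpu(cpu, flushmask) {
			src = &per_cpu(steal_time, cpu);
			state = READ_ONCE(src->preempted);
			if (state & KVM_VCPU_PREEMPTED) {
				if (try_cmpxchg(&src->preempted, &state,
						state | KVM_VCPU_FLUSH_TLB))
					__cpumask_clear_cpu(cpu, flushmask);
			}
		}

		native_flush_tlb_others(flushmask, info);
	}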

Under certain circumstances, when a guest initiates such a deferred action,
the hypervisor may miss the request. It is also possible that the guest
may mistakenly assume that it has already marked the remote VCPU as needing
a flush when in fact that request had already been processed by the
hypervisor. In both cases this results in an invalid translation remaining
present on a vCPU, potentially allowing accesses to memory locations in
that guest's address space that should not be accessible.

Note that only intra-guest memory is vulnerable.

The attached patches address both of these problems:
1. The first patch makes sure the hypervisor doesn't accidentally clear
   the guest's remote flush request.
2. The rest of the patches prevent the race between the hypervisor
   acknowledging a remote flush request and the guest issuing a new one.

Boris Ostrovsky (5):
x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
x86/kvm: Introduce kvm_(un)map_gfn()
x86/kvm: Cache gfn to pfn translation
x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
x86/KVM: Clean up host's steal time structure

arch/x86/include/asm/kvm_host.h | 4 +-
arch/x86/kvm/x86.c | 69 +++++++++++++++---------
include/linux/kvm_host.h | 5 ++
include/linux/kvm_types.h | 9 +++-
virt/kvm/kvm_main.c | 113 ++++++++++++++++++++++++++++++++++------
5 files changed, 154 insertions(+), 46 deletions(-)

--
1.8.3.1


2020-01-30 18:04:48

by Paolo Bonzini

Subject: [FYI PATCH 4/5] x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed

From: Boris Ostrovsky <[email protected]>

There is a potential race in record_steal_time() between setting
the host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing
KVM_VCPU_PREEMPTED) and propagating this value to the guest with
kvm_write_guest_cached(). Between those two events the guest may
still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set
KVM_VCPU_FLUSH_TLB and assume that the hypervisor will do the right
thing. Which it won't.
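
Concretely, the problematic interleaving is roughly the following
(illustrative timeline, not code from the patch):

	host:  record_steal_time() does
	       xchg(&vcpu->arch.st.steal.preempted, 0)    /* host-local copy */
	guest: reads preempted from the shared page and still sees
	       KVM_VCPU_PREEMPTED
	guest: cmpxchg() sets KVM_VCPU_FLUSH_TLB in the shared page
	host:  kvm_write_guest_cached() copies the host-local structure back,
	       overwriting the shared page and losing the flush request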

Instead of copying, we should map kvm_steal_time; that guarantees
atomicity of accesses to @preempted.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 51 ++++++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0795bc8..f1845df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2581,45 +2581,47 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)

static void record_steal_time(struct kvm_vcpu *vcpu)
{
+ struct kvm_host_map map;
+ struct kvm_steal_time *st;
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

- if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
- &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+ /* -EAGAIN is returned in atomic context so we can just return. */
+ if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
+ &map, &vcpu->arch.st.cache, false))
return;

+ st = map.hva +
+ offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+
/*
* Doing a TLB flush here, on the guest's behalf, can avoid
* expensive IPIs.
*/
trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
- vcpu->arch.st.steal.preempted & KVM_VCPU_FLUSH_TLB);
- if (xchg(&vcpu->arch.st.steal.preempted, 0) & KVM_VCPU_FLUSH_TLB)
+ st->preempted & KVM_VCPU_FLUSH_TLB);
+ if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb(vcpu, false);

- if (vcpu->arch.st.steal.version & 1)
- vcpu->arch.st.steal.version += 1; /* first time write, random junk */
+ vcpu->arch.st.steal.preempted = 0;

- vcpu->arch.st.steal.version += 1;
+ if (st->version & 1)
+ st->version += 1; /* first time write, random junk */

- kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
- &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+ st->version += 1;

smp_wmb();

- vcpu->arch.st.steal.steal += current->sched_info.run_delay -
+ st->steal += current->sched_info.run_delay -
vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;

- kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
- &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
-
smp_wmb();

- vcpu->arch.st.steal.version += 1;
+ st->version += 1;

- kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
- &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+ kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
}

int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -3501,18 +3503,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
+ struct kvm_host_map map;
+ struct kvm_steal_time *st;
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

if (vcpu->arch.st.steal.preempted)
return;

- vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;
+ if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT, &map,
+ &vcpu->arch.st.cache, true))
+ return;
+
+ st = map.hva +
+ offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+
+ st->preempted = vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;

- kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
- &vcpu->arch.st.steal.preempted,
- offsetof(struct kvm_steal_time, preempted),
- sizeof(vcpu->arch.st.steal.preempted));
+ kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, true);
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
--
1.8.3.1

2020-01-30 18:04:50

by Paolo Bonzini

Subject: [FYI PATCH 1/5] x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit

From: Boris Ostrovsky <[email protected]>

kvm_steal_time_set_preempted() may accidentally clear the KVM_VCPU_FLUSH_TLB
bit if it is called more than once while the VCPU is preempted.
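
For illustration, the sequence that loses a flush request is roughly
(illustrative timeline, not code from the patch):

	host:  vCPU is preempted; kvm_steal_time_set_preempted() writes
	       steal.preempted = KVM_VCPU_PREEMPTED to the shared page
	guest: another vCPU sets KVM_VCPU_FLUSH_TLB, so preempted is now
	       KVM_VCPU_PREEMPTED | KVM_VCPU_FLUSH_TLB
	host:  kvm_steal_time_set_preempted() is called again and writes
	       KVM_VCPU_PREEMPTED, silently dropping KVM_VCPU_FLUSH_TLB

The fix below simply turns the second write into a no-op while the vCPU
is already marked preempted.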

This is part of CVE-2019-3016.

(This bug was also independently discovered by Jim Mattson
<[email protected]>)

Signed-off-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf91713..8c93691 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3504,6 +3504,9 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

+ if (vcpu->arch.st.steal.preempted)
+ return;
+
vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;

kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
--
1.8.3.1

2020-01-30 18:05:17

by Paolo Bonzini

Subject: [FYI PATCH 3/5] x86/kvm: Cache gfn to pfn translation

From: Boris Ostrovsky <[email protected]>

__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* not callable in certain cases (such as from atomic context)

Stashing the gfn-to-pfn mapping should help in both cases.
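
With the cache in place, callers are expected to follow a pattern roughly
like the one below (a sketch based on the interfaces added in this series;
gpa stands for a hypothetical guest physical address, and real callers
differ in detail):

	struct kvm_host_map map;
	struct kvm_steal_time *st;

	/*
	 * With a valid cached translation this also works in atomic context;
	 * on a cache miss in atomic context it returns -EAGAIN.
	 */
	if (kvm_map_gfn(vcpu, gpa >> PAGE_SHIFT, &map,
			&vcpu->arch.st.cache, true))
		return;

	st = map.hva + offset_in_page(gpa);
	/* ... read or write guest memory through st ... */

	/* Mark the page dirty and keep the pfn cached for the next access. */
	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, true);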

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 10 +++++
include/linux/kvm_host.h | 7 ++-
include/linux/kvm_types.h | 9 +++-
virt/kvm/kvm_main.c | 98 +++++++++++++++++++++++++++++++++--------
5 files changed, 103 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b79cd6a..f48a306e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -689,6 +689,7 @@ struct kvm_vcpu_arch {
u64 last_steal;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
+ struct gfn_to_pfn_cache cache;
} st;

u64 tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c93691..0795bc8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9088,6 +9088,9 @@ static void fx_init(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
{
void *wbinvd_dirty_mask = vcpu->arch.wbinvd_dirty_mask;
+ struct gfn_to_pfn_cache *cache = &vcpu->arch.st.cache;
+
+ kvm_release_pfn(cache->pfn, cache->dirty, cache);

kvmclock_reset(vcpu);

@@ -9761,11 +9764,18 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,

void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
{
+ struct kvm_vcpu *vcpu;
+ int i;
+
/*
* memslots->generation has been incremented.
* mmio generation may have reached its maximum value.
*/
kvm_mmu_invalidate_mmio_sptes(kvm, gen);
+
+ /* Force re-initialization of steal_time cache */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_vcpu_kick(vcpu);
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0cb78f5..71cb9cc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -723,6 +723,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_set_pfn_accessed(kvm_pfn_t pfn);
void kvm_get_pfn(kvm_pfn_t pfn);

+void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int len);
int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
@@ -775,10 +776,12 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
-int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map);
+int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache, bool atomic);
struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty);
-int kvm_unmap_gfn(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty);
+int kvm_unmap_gfn(struct kvm_vcpu *vcpu, struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache, bool dirty, bool atomic);
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1c88e69..68e84cf 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -18,7 +18,7 @@

enum kvm_mr_change;

-#include <asm/types.h>
+#include <linux/types.h>

/*
* Address types:
@@ -51,4 +51,11 @@ struct gfn_to_hva_cache {
struct kvm_memory_slot *memslot;
};

+struct gfn_to_pfn_cache {
+ u64 generation;
+ gfn_t gfn;
+ kvm_pfn_t pfn;
+ bool dirty;
+};
+
#endif /* __KVM_TYPES_H__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9ef58a2..67eb302 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1821,27 +1821,72 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_page);

+void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache)
+{
+ if (pfn == 0)
+ return;
+
+ if (cache)
+ cache->pfn = cache->gfn = 0;
+
+ if (dirty)
+ kvm_release_pfn_dirty(pfn);
+ else
+ kvm_release_pfn_clean(pfn);
+}
+
+static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+ struct gfn_to_pfn_cache *cache, u64 gen)
+{
+ kvm_release_pfn(cache->pfn, cache->dirty, cache);
+
+ cache->pfn = gfn_to_pfn_memslot(slot, gfn);
+ cache->gfn = gfn;
+ cache->dirty = false;
+ cache->generation = gen;
+}
+
static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
- struct kvm_host_map *map)
+ struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache,
+ bool atomic)
{
kvm_pfn_t pfn;
void *hva = NULL;
struct page *page = KVM_UNMAPPED_PAGE;
struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);
+ u64 gen = slots->generation;

if (!map)
return -EINVAL;

- pfn = gfn_to_pfn_memslot(slot, gfn);
+ if (cache) {
+ if (!cache->pfn || cache->gfn != gfn ||
+ cache->generation != gen) {
+ if (atomic)
+ return -EAGAIN;
+ kvm_cache_gfn_to_pfn(slot, gfn, cache, gen);
+ }
+ pfn = cache->pfn;
+ } else {
+ if (atomic)
+ return -EAGAIN;
+ pfn = gfn_to_pfn_memslot(slot, gfn);
+ }
if (is_error_noslot_pfn(pfn))
return -EINVAL;

if (pfn_valid(pfn)) {
page = pfn_to_page(pfn);
- hva = kmap(page);
+ if (atomic)
+ hva = kmap_atomic(page);
+ else
+ hva = kmap(page);
#ifdef CONFIG_HAS_IOMEM
- } else {
+ } else if (!atomic) {
hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
+ } else {
+ return -EINVAL;
#endif
}

@@ -1856,20 +1901,25 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
return 0;
}

-int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
+int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache, bool atomic)
{
- return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map);
+ return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map,
+ cache, atomic);
}
EXPORT_SYMBOL_GPL(kvm_map_gfn);

int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
{
- return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map);
+ return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map,
+ NULL, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_map);

static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
- struct kvm_host_map *map, bool dirty)
+ struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache,
+ bool dirty, bool atomic)
{
if (!map)
return;
@@ -1877,34 +1927,44 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
if (!map->hva)
return;

- if (map->page != KVM_UNMAPPED_PAGE)
- kunmap(map->page);
+ if (map->page != KVM_UNMAPPED_PAGE) {
+ if (atomic)
+ kunmap_atomic(map->hva);
+ else
+ kunmap(map->page);
+ }
#ifdef CONFIG_HAS_IOMEM
- else
+ else if (!atomic)
memunmap(map->hva);
+ else
+ WARN_ONCE(1, "Unexpected unmapping in atomic context");
#endif

- if (dirty) {
+ if (dirty)
mark_page_dirty_in_slot(memslot, map->gfn);
- kvm_release_pfn_dirty(map->pfn);
- } else {
- kvm_release_pfn_clean(map->pfn);
- }
+
+ if (cache)
+ cache->dirty |= dirty;
+ else
+ kvm_release_pfn(map->pfn, dirty, NULL);

map->hva = NULL;
map->page = NULL;
}

-int kvm_unmap_gfn(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
+int kvm_unmap_gfn(struct kvm_vcpu *vcpu, struct kvm_host_map *map,
+ struct gfn_to_pfn_cache *cache, bool dirty, bool atomic)
{
- __kvm_unmap_gfn(gfn_to_memslot(vcpu->kvm, map->gfn), map, dirty);
+ __kvm_unmap_gfn(gfn_to_memslot(vcpu->kvm, map->gfn), map,
+ cache, dirty, atomic);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_unmap_gfn);

void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
{
- __kvm_unmap_gfn(kvm_vcpu_gfn_to_memslot(vcpu, map->gfn), map, dirty);
+ __kvm_unmap_gfn(kvm_vcpu_gfn_to_memslot(vcpu, map->gfn), map, NULL,
+ dirty, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);

--
1.8.3.1

2020-02-01 05:55:46

by Wanpeng Li

Subject: Re: [FYI PATCH 0/5] Missing TLB flushes

On Fri, 31 Jan 2020 at 02:02, Paolo Bonzini <[email protected]> wrote:
>
> From: Boris Ostrovsky <[email protected]>
>
> The KVM hypervisor may provide a guest with the ability to defer a remote
> TLB flush when the remote VCPU is not running. When this feature is used,
> the TLB flush will happen only when the remote VCPU is scheduled to run
> again. This avoids unnecessary (and expensive) IPIs.
>
> Under certain circumstances, when a guest initiates such a deferred action,
> the hypervisor may miss the request. It is also possible that the guest
> may mistakenly assume that it has already marked the remote VCPU as needing
> a flush when in fact that request had already been processed by the
> hypervisor. In both cases this results in an invalid translation remaining
> present on a vCPU, potentially allowing accesses to memory locations in
> that guest's address space that should not be accessible.
>
> Note that only intra-guest memory is vulnerable.
>
> The attached patches address both of these problems:
> 1. The first patch makes sure the hypervisor doesn't accidentally clear
>    the guest's remote flush request.
> 2. The rest of the patches prevent the race between the hypervisor
>    acknowledging a remote flush request and the guest issuing a new one.

Looks good, thanks for the patchset.

Wanpeng