2015-08-05 04:10:12

by Xiao Guangrong

Subject: [PATCH v2 0/9] KVM: MMU: fix and improve validation of mmio page fault

Changelog in v2:
- rename reset_*_rsvds_bits_mask() to reset_*_zero_bits_mask() and
is_shadow_rsvd_bits_set() to is_shadow_zero_bits_set() to better
match what we are checking. Thanks to Paolo for the suggestion.

The current code validating mmio #PF is buggy; the bug was spotted by Pavel
Shirshov: qemu complained with "KVM: unknown exit, hardware reason 31"
and KVM printed the following:
[84245.284948] EPT: Misconfiguration.
[84245.285056] EPT: GPA: 0xfeda848
[84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4
[84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3
[84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2
[84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1

This is because we got an mmio #PF and the handler saw that the mmio spte
had become a normal spte (pointing to the ram page).

However, this is valid after the introduction of fast mmio spte invalidation,
which increases the generation number instead of zapping mmio sptes. An
example is as follows (a simplified sketch of the generation check is given
after the example):
1. QEMU drops mmio region by adding a new memslot
2. invalidate all mmio sptes
3.

       VCPU 0                             VCPU 1
   access the invalid mmio spte

                                      access the region that was originally MMIO
                                      set the spte to the normal ram map

   mmio #PF
   check the spte and see it has become a normal ram mapping !!!
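
For readers unfamiliar with fast mmio spte invalidation: each mmio spte
carries the memslot generation it was created under, and the fault handler
only trusts sptes whose generation matches the current one. The following is
a simplified standalone sketch of that idea (hypothetical helper names and
bit layout, not the actual KVM encoding):

#include <stdbool.h>
#include <stdint.h>

#define MMIO_GEN_MASK 0x7ffULL  /* hypothetical: 11 generation bits in the spte */

static uint64_t mmio_spte_generation(uint64_t spte)
{
        return spte & MMIO_GEN_MASK;
}

/*
 * Instead of zapping every mmio spte when memslots change, fast
 * invalidation just bumps the global generation; stale mmio sptes then
 * fail this check and are re-faulted.
 */
static bool mmio_spte_is_valid(uint64_t spte, uint64_t current_generation)
{
        return mmio_spte_generation(spte) ==
               (current_generation & MMIO_GEN_MASK);
}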

The first patch simply fixes the bug by dropping the validation in the mmio
handler, which makes it suitable for backporting.

Later patches enable fully checking the reserved bits on shadow page table
entries. Since shadow page tables and guest page tables have the same
format, these patches reuse the logic that checks reserved bits on guest
ptes to check sptes.

In order to catch as many bugs as possible, we check not only the bits
reserved by hardware but also the other bits that sptes never use.
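
As a rough illustration of what "bits reserved by hardware" means here: bits
above the reported physical-address width must be zero in any present pte,
and the masks are built with the rsvd_bits() helper from mmu.h. A minimal
standalone sketch (the helper mirrors the kernel one; the check itself is a
simplified example, not the series code):

#include <stdbool.h>
#include <stdint.h>

/* Same contract as KVM's rsvd_bits(): a mask with bits s..e set. */
static uint64_t rsvd_bits(int s, int e)
{
        return ((1ULL << (e - s + 1)) - 1) << s;
}

/*
 * A pte whose physical-address field has bits set above maxphyaddr
 * (up to bit 51) is malformed; checking is a single mask-and-test.
 */
static bool pte_sets_hw_reserved_bits(uint64_t pte, int maxphyaddr)
{
        return (pte & rsvd_bits(maxphyaddr, 51)) != 0;
}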

Xiao Guangrong (9):
KVM: MMU: fix validation of mmio page fault
KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
KVM: MMU: introduce rsvd_bits_validate
KVM: MMU: split reset_rsvds_bits_mask
KVM: MMU: split reset_rsvds_bits_mask_ept
KVM: MMU: introduce the framework to check zero bits on sptes
KVM: MMU: introduce is_shadow_zero_bits_set()
KVM: MMU: fully check zero bits for sptes
KVM: VMX: drop ept misconfig check

arch/x86/include/asm/kvm_host.h | 16 ++-
arch/x86/kvm/mmu.c | 283 ++++++++++++++++++++++++----------------
arch/x86/kvm/mmu.h | 4 +-
arch/x86/kvm/paging_tmpl.h | 13 +-
arch/x86/kvm/svm.c | 1 +
arch/x86/kvm/vmx.c | 74 +----------
arch/x86/kvm/x86.c | 3 +-
7 files changed, 193 insertions(+), 201 deletions(-)

--
2.1.0


2015-08-05 04:10:14

by Xiao Guangrong

Subject: [PATCH v2 1/9] KVM: MMU: fix validation of mmio page fault

We hit a bug where qemu complained with "KVM: unknown exit, hardware
reason 31" and KVM printed the following:
[84245.284948] EPT: Misconfiguration.
[84245.285056] EPT: GPA: 0xfeda848
[84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4
[84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3
[84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2
[84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1

This is because we got an mmio #PF and the handler saw that the mmio spte
had become a normal spte (pointing to the ram page).

However, this is valid after the introduction of fast mmio spte invalidation,
which increases the generation number instead of zapping mmio sptes. An
example is as follows:
1. QEMU drops mmio region by adding a new memslot
2. invalidate all mmio sptes
3.

       VCPU 0                             VCPU 1
   access the invalid mmio spte
                                      access the region that was originally MMIO
                                      set the spte to the normal ram map

   mmio #PF
   check the spte and see it has become a normal ram mapping !!!

This patch fixes the bug simply by dropping the check in the mmio handler,
which makes it suitable for backporting. The full check will be introduced
in later patches.

Reported-by: Pavel Shirshov <[email protected]>
Tested-by: Pavel Shirshov <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 45 ---------------------------------------------
1 file changed, 45 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6de896f..f432e9b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -357,12 +357,6 @@ static u64 __get_spte_lockless(u64 *sptep)
{
return ACCESS_ONCE(*sptep);
}
-
-static bool __check_direct_spte_mmio_pf(u64 spte)
-{
- /* It is valid if the spte is zapped. */
- return spte == 0ull;
-}
#else
union split_spte {
struct {
@@ -478,23 +472,6 @@ retry:

return spte.spte;
}
-
-static bool __check_direct_spte_mmio_pf(u64 spte)
-{
- union split_spte sspte = (union split_spte)spte;
- u32 high_mmio_mask = shadow_mmio_mask >> 32;
-
- /* It is valid if the spte is zapped. */
- if (spte == 0ull)
- return true;
-
- /* It is valid if the spte is being zapped. */
- if (sspte.spte_low == 0ull &&
- (sspte.spte_high & high_mmio_mask) == high_mmio_mask)
- return true;
-
- return false;
-}
#endif

static bool spte_is_locklessly_modifiable(u64 spte)
@@ -3299,21 +3276,6 @@ static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return vcpu_match_mmio_gva(vcpu, addr);
}

-
-/*
- * On direct hosts, the last spte is only allows two states
- * for mmio page fault:
- * - It is the mmio spte
- * - It is zapped or it is being zapped.
- *
- * This function completely checks the spte when the last spte
- * is not the mmio spte.
- */
-static bool check_direct_spte_mmio_pf(u64 spte)
-{
- return __check_direct_spte_mmio_pf(spte);
-}
-
static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
{
struct kvm_shadow_walk_iterator iterator;
@@ -3356,13 +3318,6 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
}

/*
- * It's ok if the gva is remapped by other cpus on shadow guest,
- * it's a BUG if the gfn is not a mmio page.
- */
- if (direct && !check_direct_spte_mmio_pf(spte))
- return RET_MMIO_PF_BUG;
-
- /*
* If the page table is zapped by other cpus, let CPU fault again on
* the address.
*/
--
2.1.0

2015-08-05 04:13:00

by Xiao Guangrong

Subject: [PATCH v2 2/9] KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c

FNAME(is_rsvd_bits_set) does not depend on the guest mmu mode, so move it
to mmu.c to stop it from being compiled multiple times.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 8 ++++++++
arch/x86/kvm/paging_tmpl.h | 13 ++-----------
2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f432e9b..3f1c403 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3546,6 +3546,14 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
return mmu->last_pte_bitmap & (1 << index);
}

+static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+ int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;
+
+ return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
+ ((mmu->bad_mt_xwr & (1ull << low6)) != 0);
+}
+
#define PTTYPE_EPT 18 /* arbitrary */
#define PTTYPE PTTYPE_EPT
#include "paging_tmpl.h"
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 0f67d7e..736e6ab 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -128,14 +128,6 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
*access &= mask;
}

-static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
-{
- int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;
-
- return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
- ((mmu->bad_mt_xwr & (1ull << low6)) != 0);
-}
-
static inline int FNAME(is_present_gpte)(unsigned long pte)
{
#if PTTYPE != PTTYPE_EPT
@@ -172,7 +164,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, u64 *spte,
u64 gpte)
{
- if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+ if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
goto no_present;

if (!FNAME(is_present_gpte)(gpte))
@@ -353,8 +345,7 @@ retry_walk:
if (unlikely(!FNAME(is_present_gpte)(pte)))
goto error;

- if (unlikely(FNAME(is_rsvd_bits_set)(mmu, pte,
- walker->level))) {
+ if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
}
--
2.1.0

2015-08-05 04:11:52

by Xiao Guangrong

Subject: [PATCH v2 3/9] KVM: MMU: introduce rsvd_bits_validate

The two fields rsvd_bits_mask and bad_mt_xwr in "struct kvm_mmu" are used
to check whether reserved bits are set on guest ptes. Move them into a
dedicated data struct so that the same approach can be applied to check
host shadow page table entries as well.
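
For orientation, the check that consumes this struct indexes rsvd_bits_mask
by the pte's bit 7 (the large-page bit) and by the paging level, while
bad_mt_xwr is a 64-entry bitmap indexed by the pte's low six bits (EPT
memtype plus XWR). A standalone sketch of that lookup, mirroring the code in
the diff below with simplified types:

#include <stdbool.h>
#include <stdint.h>

struct rsvd_bits_validate {
        uint64_t rsvd_bits_mask[2][4];  /* [pte bit 7][level - 1] */
        uint64_t bad_mt_xwr;            /* EPT: bad memtype/XWR combinations */
};

static bool rsvd_bits_set(const struct rsvd_bits_validate *rsvd_check,
                          uint64_t pte, int level)
{
        int bit7 = (pte >> 7) & 1, low6 = pte & 0x3f;

        return (pte & rsvd_check->rsvd_bits_mask[bit7][level - 1]) != 0 ||
               (rsvd_check->bad_mt_xwr & (1ULL << low6)) != 0;
}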

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++--
arch/x86/kvm/mmu.c | 75 +++++++++++++++++++++++------------------
arch/x86/kvm/x86.c | 3 +-
3 files changed, 50 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6e851d5..3e33c0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -252,6 +252,11 @@ struct kvm_pio_request {
int size;
};

+struct rsvd_bits_validate {
+ u64 rsvd_bits_mask[2][4];
+ u64 bad_mt_xwr;
+};
+
/*
* x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
* 32-bit). The kvm_mmu structure abstracts the details of the current mmu
@@ -289,8 +294,7 @@ struct kvm_mmu {

u64 *pae_root;
u64 *lm_root;
- u64 rsvd_bits_mask[2][4];
- u64 bad_mt_xwr;
+ struct rsvd_bits_validate guest_rsvd_check;

/*
* Bitmap: bit set = last pte in walk
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3f1c403..23633f5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3548,10 +3548,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp

static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
{
+ struct rsvd_bits_validate *rsvd_check = &mmu->guest_rsvd_check;
int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;

- return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
- ((mmu->bad_mt_xwr & (1ull << low6)) != 0);
+ return (gpte & rsvd_check->rsvd_bits_mask[bit7][level-1]) |
+ ((rsvd_check->bad_mt_xwr & (1ull << low6)) != 0);
}

#define PTTYPE_EPT 18 /* arbitrary */
@@ -3570,12 +3571,13 @@ static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
struct kvm_mmu *context)
{
+ struct rsvd_bits_validate *rsvd_check = &context->guest_rsvd_check;
int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;
u64 gbpages_bit_rsvd = 0;
u64 nonleaf_bit8_rsvd = 0;

- context->bad_mt_xwr = 0;
+ rsvd_check->bad_mt_xwr = 0;

if (!context->nx)
exb_bit_rsvd = rsvd_bits(63, 63);
@@ -3592,52 +3594,58 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
switch (context->root_level) {
case PT32_ROOT_LEVEL:
/* no rsvd bits for 2 level 4K page table entries */
- context->rsvd_bits_mask[0][1] = 0;
- context->rsvd_bits_mask[0][0] = 0;
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+ rsvd_check->rsvd_bits_mask[0][1] = 0;
+ rsvd_check->rsvd_bits_mask[0][0] = 0;
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];

if (!is_pse(vcpu)) {
- context->rsvd_bits_mask[1][1] = 0;
+ rsvd_check->rsvd_bits_mask[1][1] = 0;
break;
}

if (is_cpuid_PSE36())
/* 36bits PSE 4MB page */
- context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+ rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
else
/* 32 bits PSE 4MB page */
- context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+ rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
break;
case PT32E_ROOT_LEVEL:
- context->rsvd_bits_mask[0][2] =
+ rsvd_check->rsvd_bits_mask[0][2] =
rsvd_bits(maxphyaddr, 63) |
rsvd_bits(5, 8) | rsvd_bits(1, 2); /* PDPTE */
- context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62); /* PDE */
- context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62); /* PTE */
- context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 62) |
rsvd_bits(13, 20); /* large page */
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];
break;
case PT64_ROOT_LEVEL:
- context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
- nonleaf_bit8_rsvd | rsvd_bits(7, 7) | rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
- nonleaf_bit8_rsvd | gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+ nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+ nonleaf_bit8_rsvd | gbpages_bit_rsvd |
rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
- context->rsvd_bits_mask[1][2] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
+ rsvd_bits(maxphyaddr, 51);
+ rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
+ rsvd_bits(maxphyaddr, 51);
+ rsvd_check->rsvd_bits_mask[1][3] =
+ rsvd_check->rsvd_bits_mask[0][3];
+ rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) |
rsvd_bits(13, 29);
- context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+ rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
rsvd_bits(maxphyaddr, 51) |
rsvd_bits(13, 20); /* large page */
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];
break;
}
}
@@ -3645,24 +3653,25 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
struct kvm_mmu *context, bool execonly)
{
+ struct rsvd_bits_validate *rsvd_check = &context->guest_rsvd_check;
int maxphyaddr = cpuid_maxphyaddr(vcpu);
int pte;

- context->rsvd_bits_mask[0][3] =
+ rsvd_check->rsvd_bits_mask[0][3] =
rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
- context->rsvd_bits_mask[0][2] =
+ rsvd_check->rsvd_bits_mask[0][2] =
rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
- context->rsvd_bits_mask[0][1] =
+ rsvd_check->rsvd_bits_mask[0][1] =
rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
- context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+ rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);

/* large page */
- context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
- context->rsvd_bits_mask[1][2] =
+ rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
+ rsvd_check->rsvd_bits_mask[1][2] =
rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
- context->rsvd_bits_mask[1][1] =
+ rsvd_check->rsvd_bits_mask[1][1] =
rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+ rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];

for (pte = 0; pte < 64; pte++) {
int rwx_bits = pte & 7;
@@ -3670,7 +3679,7 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
rwx_bits == 0x2 || rwx_bits == 0x6 ||
(rwx_bits == 0x4 && !execonly))
- context->bad_mt_xwr |= (1ull << pte);
+ rsvd_check->bad_mt_xwr |= (1ull << pte);
}
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ec7bf2b..e585007 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -525,7 +525,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
}
for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
if (is_present_gpte(pdpte[i]) &&
- (pdpte[i] & vcpu->arch.mmu.rsvd_bits_mask[0][2])) {
+ (pdpte[i] &
+ vcpu->arch.mmu.guest_rsvd_check.rsvd_bits_mask[0][2])) {
ret = 0;
goto out;
}
--
2.1.0

2015-08-05 04:10:18

by Xiao Guangrong

Subject: [PATCH v2 4/9] KVM: MMU: split reset_rsvds_bits_mask

Since softmmu & AMD nested shadow page tables and guest page tables have
the same format, split reset_rsvds_bits_mask so that the logic can be
reused by later patches which check zero bits on sptes

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 23633f5..693d565 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3568,20 +3568,21 @@ static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
#include "paging_tmpl.h"
#undef PTTYPE

-static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context)
+static void
+__reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
+ struct rsvd_bits_validate *rsvd_check,
+ int maxphyaddr, int level, bool nx, bool gbpages,
+ bool pse)
{
- struct rsvd_bits_validate *rsvd_check = &context->guest_rsvd_check;
- int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;
u64 gbpages_bit_rsvd = 0;
u64 nonleaf_bit8_rsvd = 0;

rsvd_check->bad_mt_xwr = 0;

- if (!context->nx)
+ if (!nx)
exb_bit_rsvd = rsvd_bits(63, 63);
- if (!guest_cpuid_has_gbpages(vcpu))
+ if (!gbpages)
gbpages_bit_rsvd = rsvd_bits(7, 7);

/*
@@ -3591,7 +3592,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
if (guest_cpuid_is_amd(vcpu))
nonleaf_bit8_rsvd = rsvd_bits(8, 8);

- switch (context->root_level) {
+ switch (level) {
case PT32_ROOT_LEVEL:
/* no rsvd bits for 2 level 4K page table entries */
rsvd_check->rsvd_bits_mask[0][1] = 0;
@@ -3599,7 +3600,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
rsvd_check->rsvd_bits_mask[1][0] =
rsvd_check->rsvd_bits_mask[0][0];

- if (!is_pse(vcpu)) {
+ if (!pse) {
rsvd_check->rsvd_bits_mask[1][1] = 0;
break;
}
@@ -3650,6 +3651,15 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
}
}

+static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+ __reset_rsvds_bits_mask(vcpu, &context->guest_rsvd_check,
+ cpuid_maxphyaddr(vcpu), context->root_level,
+ context->nx, guest_cpuid_has_gbpages(vcpu),
+ is_pse(vcpu));
+}
+
static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
struct kvm_mmu *context, bool execonly)
{
--
2.1.0

2015-08-05 04:11:50

by Xiao Guangrong

Subject: [PATCH v2 5/9] KVM: MMU: split reset_rsvds_bits_mask_ept

Since shadow EPT page tables and Intel nested guest page tables have the
same format, split reset_rsvds_bits_mask_ept so that the logic can be
reused by later patches which check zero bits on sptes.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 693d565..d11d212 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3660,11 +3660,10 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
is_pse(vcpu));
}

-static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context, bool execonly)
+static void
+__reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
+ int maxphyaddr, bool execonly)
{
- struct rsvd_bits_validate *rsvd_check = &context->guest_rsvd_check;
- int maxphyaddr = cpuid_maxphyaddr(vcpu);
int pte;

rsvd_check->rsvd_bits_mask[0][3] =
@@ -3693,6 +3692,13 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
}
}

+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context, bool execonly)
+{
+ __reset_rsvds_bits_mask_ept(&context->guest_rsvd_check,
+ cpuid_maxphyaddr(vcpu), execonly);
+}
+
static void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
{
--
2.1.0

2015-08-05 04:10:22

by Xiao Guangrong

Subject: [PATCH v2 6/9] KVM: MMU: introduce the framework to check zero bits on sptes

We have abstracted the data struct and functions used to check reserved
bits on guest page tables; now we extend the logic to check zero bits on
shadow page tables.

The zero bits on sptes include not only the bits reserved by hardware but
also the bits that sptes never use.
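
In other words, the shadow-side masks are built by the same code as the
guest-side masks, just parameterized with host properties
(boot_cpu_data.x86_phys_bits) instead of the guest's CPUID values. A toy
sketch of that parameterization (hypothetical names, only the address-width
part of the masks):

#include <stdint.h>

static uint64_t rsvd_bits(int s, int e)
{
        return ((1ULL << (e - s + 1)) - 1) << s;
}

struct toy_check {
        uint64_t must_be_zero;  /* bits that may never appear in a pte */
};

static void toy_reset(struct toy_check *check, int phys_bits)
{
        check->must_be_zero = rsvd_bits(phys_bits, 51);
}

void toy_setup(struct toy_check *guest_check, struct toy_check *shadow_check,
               int guest_maxphyaddr, int host_phys_bits)
{
        /* Guest ptes are judged by what the guest's CPUID advertises. */
        toy_reset(guest_check, guest_maxphyaddr);
        /* Sptes are judged by what the host supports, so any bit KVM
         * would never legitimately set is treated as a bug. */
        toy_reset(shadow_check, host_phys_bits);
}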

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++++++
arch/x86/kvm/mmu.c | 50 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/svm.c | 1 +
4 files changed, 62 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3e33c0d..09acaa6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -294,6 +294,14 @@ struct kvm_mmu {

u64 *pae_root;
u64 *lm_root;
+
+ /*
+ * check zero bits on shadow page table entries, these
+ * bits include not only hardware reserved bits but also
+ * the bits spte never used.
+ */
+ struct rsvd_bits_validate shadow_zero_check;
+
struct rsvd_bits_validate guest_rsvd_check;

/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d11d212..edf1ec5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3699,6 +3699,53 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
cpuid_maxphyaddr(vcpu), execonly);
}

+/*
+ * the page table on host is the shadow page table for the page
+ * table in guest or amd nested guest, its mmu features completely
+ * follow the features in guest.
+ */
+void
+reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+ __reset_rsvds_bits_mask(vcpu, &context->shadow_zero_check,
+ boot_cpu_data.x86_phys_bits,
+ context->shadow_root_level, context->nx,
+ guest_cpuid_has_gbpages(vcpu), is_pse(vcpu));
+}
+EXPORT_SYMBOL_GPL(reset_shadow_zero_bits_mask);
+
+/*
+ * the direct page table on host, use as much mmu features as
+ * possible, however, kvm currently does not do execution-protection.
+ */
+static void
+reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+ if (guest_cpuid_is_amd(vcpu))
+ __reset_rsvds_bits_mask(vcpu, &context->shadow_zero_check,
+ boot_cpu_data.x86_phys_bits,
+ context->shadow_root_level, false,
+ cpu_has_gbpages, true);
+ else
+ __reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
+ boot_cpu_data.x86_phys_bits,
+ false);
+
+}
+
+/*
+ * as the comments in reset_shadow_zero_bits_mask() except it
+ * is the shadow page table for intel nested guest.
+ */
+static void
+reset_ept_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context, bool execonly)
+{
+ __reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
+ boot_cpu_data.x86_phys_bits, execonly);
+}
+
static void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
{
@@ -3877,6 +3924,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)

update_permission_bitmask(vcpu, context, false);
update_last_pte_bitmap(vcpu, context);
+ reset_tdp_shadow_zero_bits_mask(vcpu, context);
}

void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
@@ -3904,6 +3952,7 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
context->base_role.smap_andnot_wp
= smap && !is_write_protection(vcpu);
context->base_role.smm = is_smm(vcpu);
+ reset_shadow_zero_bits_mask(vcpu, context);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);

@@ -3927,6 +3976,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)

update_permission_bitmask(vcpu, context, true);
reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+ reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 398d21c..2299d15 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -53,6 +53,9 @@ static inline u64 rsvd_bits(int s, int e)
int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);

+void
+reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+
/*
* Return values of handle_mmio_page_fault_common:
* RET_MMIO_PF_EMULATE: it is a real mmio page fault, emulate the instruction
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 568cd0f..189e464 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2107,6 +2107,7 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
vcpu->arch.mmu.get_pdptr = nested_svm_get_tdp_pdptr;
vcpu->arch.mmu.inject_page_fault = nested_svm_inject_npf_exit;
vcpu->arch.mmu.shadow_root_level = get_npt_level();
+ reset_shadow_zero_bits_mask(vcpu, &vcpu->arch.mmu);
vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
}

--
2.1.0

2015-08-05 04:10:24

by Xiao Guangrong

Subject: [PATCH v2 7/9] KVM: MMU: introduce is_shadow_zero_bits_set()

We now have the same data struct to check reserved bits on both guest page
tables and shadow page tables. Split is_rsvd_bits_set() so that the logic
can be shared between these two paths.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index edf1ec5..e6a7ed0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3268,6 +3268,25 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
return vcpu->arch.nested_mmu.translate_gpa(vcpu, vaddr, access, exception);
}

+static bool
+__is_rsvd_bits_set(struct rsvd_bits_validate *rsvd_check, u64 pte, int level)
+{
+ int bit7 = (pte >> 7) & 1, low6 = pte & 0x3f;
+
+ return (pte & rsvd_check->rsvd_bits_mask[bit7][level-1]) |
+ ((rsvd_check->bad_mt_xwr & (1ull << low6)) != 0);
+}
+
+static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+ return __is_rsvd_bits_set(&mmu->guest_rsvd_check, gpte, level);
+}
+
+static bool is_shadow_zero_bits_set(struct kvm_mmu *mmu, u64 spte, int level)
+{
+ return __is_rsvd_bits_set(&mmu->shadow_zero_check, spte, level);
+}
+
static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
{
if (direct)
@@ -3546,15 +3565,6 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
return mmu->last_pte_bitmap & (1 << index);
}

-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
- struct rsvd_bits_validate *rsvd_check = &mmu->guest_rsvd_check;
- int bit7 = (gpte >> 7) & 1, low6 = gpte & 0x3f;
-
- return (gpte & rsvd_check->rsvd_bits_mask[bit7][level-1]) |
- ((rsvd_check->bad_mt_xwr & (1ull << low6)) != 0);
-}
-
#define PTTYPE_EPT 18 /* arbitrary */
#define PTTYPE PTTYPE_EPT
#include "paging_tmpl.h"
--
2.1.0

2015-08-05 04:11:05

by Xiao Guangrong

Subject: [PATCH v2 8/9] KVM: MMU: fully check zero bits for sptes

The #PF with PFEC.RSV = 1 is designed to speed up MMIO emulation; however,
it is possible that an RSV #PF is caused by a real BUG, i.e. by
misconfigured shadow page table entries.

This patch enables a full check of the zero bits on shadow page table
entries, which include not only the bits reserved by hardware but also the
bits that sptes never use, and dumps the shadow page table hierarchy if a
real bug is detected.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 41 +++++++++++++++++++++++++++++++++++------
1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e6a7ed0..1393317 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3295,31 +3295,60 @@ static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return vcpu_match_mmio_gva(vcpu, addr);
}

-static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
+/* return true if reserved bit is detected on spte. */
+static bool
+walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
{
struct kvm_shadow_walk_iterator iterator;
- u64 spte = 0ull;
+ u64 sptes[PT64_ROOT_LEVEL], spte = 0ull;
+ int root = 0, leaf;
+ bool reserved = false;

if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
- return spte;
+ goto exit;

walk_shadow_page_lockless_begin(vcpu);
- for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
+ for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+ leaf = iterator.level;
+
+ if (!root)
+ root = leaf;
+
+ sptes[leaf - 1] = spte;
+
if (!is_shadow_present_pte(spte))
break;
+
+ reserved |= is_shadow_zero_bits_set(&vcpu->arch.mmu, spte,
+ leaf);
+ }
walk_shadow_page_lockless_end(vcpu);

- return spte;
+ if (reserved) {
+ pr_err("%s: detect reserved bits on spte, addr 0x%llx, dump hierarchy:\n",
+ __func__, addr);
+ while (root >= leaf) {
+ pr_err("------ spte 0x%llx level %d.\n",
+ sptes[root - 1], root);
+ root--;
+ }
+ }
+exit:
+ *sptep = spte;
+ return reserved;
}

int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
{
u64 spte;
+ bool reserved;

if (quickly_check_mmio_pf(vcpu, addr, direct))
return RET_MMIO_PF_EMULATE;

- spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
+ reserved = walk_shadow_page_get_mmio_spte(vcpu, addr, &spte);
+ if (unlikely(reserved))
+ return RET_MMIO_PF_BUG;

if (is_mmio_spte(spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
--
2.1.0

2015-08-05 04:10:37

by Xiao Guangrong

Subject: [PATCH v2 9/9] KVM: VMX: drop ept misconfig check

The logic used to check for an EPT misconfig is completely contained in the
common reserved-bits check for sptes, so it can be removed.
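
As a quick worked example of why the common check covers the old inspection:
the bad_mt_xwr bitmap built by __reset_rsvds_bits_mask_ept() earlier in the
series already flags, for instance, a write-only entry (XWR = 010b), which is
one of the combinations ept_misconfig_inspect_spte() used to WARN about.
A small standalone sketch following that construction (simplified types, not
the kernel code itself):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t bad_mt_xwr = 0;
        bool execonly = false;
        int pte;

        /* Build the bitmap over all 6-bit memtype/XWR combinations,
         * following the logic in the series. */
        for (pte = 0; pte < 64; pte++) {
                int rwx_bits = pte & 7;
                int mt = pte >> 3;      /* memtype field, bits 5:3 */

                if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
                    rwx_bits == 0x2 || rwx_bits == 0x6 ||
                    (rwx_bits == 0x4 && !execonly))
                        bad_mt_xwr |= 1ULL << pte;
        }

        /* A write-only spte (low bits = 010b) is rejected by the bitmap. */
        printf("write-only flagged: %d\n", !!(bad_mt_xwr & (1ULL << 0x2)));
        return 0;
}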

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 22 ----------------
arch/x86/kvm/mmu.h | 1 -
arch/x86/kvm/vmx.c | 74 ++----------------------------------------------------
3 files changed, 2 insertions(+), 95 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1393317..1dabaec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4937,28 +4937,6 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
return nr_mmu_pages;
}

-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
-{
- struct kvm_shadow_walk_iterator iterator;
- u64 spte;
- int nr_sptes = 0;
-
- if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
- return nr_sptes;
-
- walk_shadow_page_lockless_begin(vcpu);
- for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
- sptes[iterator.level-1] = spte;
- nr_sptes++;
- if (!is_shadow_present_pte(spte))
- break;
- }
- walk_shadow_page_lockless_end(vcpu);
-
- return nr_sptes;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);
-
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2299d15..e4202e4 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -50,7 +50,6 @@ static inline u64 rsvd_bits(int s, int e)
return ((1ULL << (e - s + 1)) - 1) << s;
}

-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);

void
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 217f663..4cf25b9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5748,73 +5748,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

-static u64 ept_rsvd_mask(u64 spte, int level)
-{
- int i;
- u64 mask = 0;
-
- for (i = 51; i > boot_cpu_data.x86_phys_bits; i--)
- mask |= (1ULL << i);
-
- if (level == 4)
- /* bits 7:3 reserved */
- mask |= 0xf8;
- else if (spte & (1ULL << 7))
- /*
- * 1GB/2MB page, bits 29:12 or 20:12 reserved respectively,
- * level == 1 if the hypervisor is using the ignored bit 7.
- */
- mask |= (PAGE_SIZE << ((level - 1) * 9)) - PAGE_SIZE;
- else if (level > 1)
- /* bits 6:3 reserved */
- mask |= 0x78;
-
- return mask;
-}
-
-static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
- int level)
-{
- printk(KERN_ERR "%s: spte 0x%llx level %d\n", __func__, spte, level);
-
- /* 010b (write-only) */
- WARN_ON((spte & 0x7) == 0x2);
-
- /* 110b (write/execute) */
- WARN_ON((spte & 0x7) == 0x6);
-
- /* 100b (execute-only) and value not supported by logical processor */
- if (!cpu_has_vmx_ept_execute_only())
- WARN_ON((spte & 0x7) == 0x4);
-
- /* not 000b */
- if ((spte & 0x7)) {
- u64 rsvd_bits = spte & ept_rsvd_mask(spte, level);
-
- if (rsvd_bits != 0) {
- printk(KERN_ERR "%s: rsvd_bits = 0x%llx\n",
- __func__, rsvd_bits);
- WARN_ON(1);
- }
-
- /* bits 5:3 are _not_ reserved for large page or leaf page */
- if ((rsvd_bits & 0x38) == 0) {
- u64 ept_mem_type = (spte & 0x38) >> 3;
-
- if (ept_mem_type == 2 || ept_mem_type == 3 ||
- ept_mem_type == 7) {
- printk(KERN_ERR "%s: ept_mem_type=0x%llx\n",
- __func__, ept_mem_type);
- WARN_ON(1);
- }
- }
- }
-}
-
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
- u64 sptes[4];
- int nr_sptes, i, ret;
+ int ret;
gpa_t gpa;

gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
@@ -5835,13 +5771,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
return 1;

/* It is the real ept misconfig */
- printk(KERN_ERR "EPT: Misconfiguration.\n");
- printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
-
- nr_sptes = kvm_mmu_get_spte_hierarchy(vcpu, gpa, sptes);
-
- for (i = PT64_ROOT_LEVEL; i > PT64_ROOT_LEVEL - nr_sptes; --i)
- ept_misconfig_inspect_spte(vcpu, sptes[i-1], i);
+ WARN_ON(1);

vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
--
2.1.0

2015-08-05 10:12:56

by Paolo Bonzini

Subject: Re: [PATCH v2 8/9] KVM: MMU: fully check zero bits for sptes



On 05/08/2015 06:04, Xiao Guangrong wrote:
> - for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
> + for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
> + leaf = iterator.level;
> +
> + if (!root)
> + root = leaf;
> +
> + sptes[leaf - 1] = spte;
> +

I'm a bit undecided between this and open-coding the macro:

for (shadow_walk_init(&iterator, vcpu, addr), root = iterator.level;
shadow_walk_okay(&iterator);
__shadow_walk_next(&iterator, spte)) {
leaf = iterator.level;
spte = mmu_spte_get_lockless(iterator.sptep);

Any second opinions?

Paolo

2015-08-06 02:59:09

by Xiao Guangrong

Subject: Re: [PATCH v2 8/9] KVM: MMU: fully check zero bits for sptes



On 08/05/2015 06:12 PM, Paolo Bonzini wrote:
>
>
> On 05/08/2015 06:04, Xiao Guangrong wrote:
>> - for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
>> + for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
>> + leaf = iterator.level;
>> +
>> + if (!root)
>> + root = leaf;
>> +
>> + sptes[leaf - 1] = spte;
>> +
>
> I'm a bit undecided between this and open-coding the macro:
>
> for (shadow_walk_init(&iterator, vcpu, addr), root = iterator.level;
> shadow_walk_okay(&iterator);
> __shadow_walk_next(&iterator, spte)) {
> leaf = iterator.level;
> spte = mmu_spte_get_lockless(iterator.sptep);
>
> Any second opinions?

Your adjustment looks good to me, I do not have other ideas... :)