2020-06-19 15:45:15

by Mohammed Gamal

Subject: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

When EPT/NPT is enabled, KVM does not really look at the guest physical
address size. Address bits above the maximum physical address size are
reserved. Because KVM does not look at these guest physical addresses, it
currently effectively supports guest physical address sizes equal to the
host's.

This can be a problem when having a mixed setup of machines with 5-level page
tables and machines with 4-level page tables, as live migration can change
MAXPHYADDR while the guest runs, which can theoretically introduce bugs.

In this patch series we add checks on guest physical addresses in EPT
violation/misconfig and NPF vmexits and, if needed, inject the proper
page faults into the guest.

A more subtle issue is when the host MAXPHYADDR is larger than that of the
guest. Page faults caused by reserved bits on the guest won't cause an EPT
violation/NPF, and hence we also check the guest MAXPHYADDR and add the
PFERR_RSVD_MASK error code to the page fault if needed.
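
Roughly speaking, the new checks compare the faulting guest physical
address against the guest's MAXPHYADDR as reported by CPUID. A minimal
sketch of the idea (illustrative only; the actual helpers and call sites
are introduced by the individual patches):

        /* Illustrative sketch, not the exact code from this series */
        if (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu))) {
                /*
                 * The GPA has bits the guest considers reserved: emulate
                 * the faulting instruction so that a #PF carrying
                 * PFERR_RSVD_MASK can be injected with a correct error
                 * code.
                 */
                return kvm_emulate_instruction(vcpu, 0);
        }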

The last 3 patches (i.e. SVM bits and patch 11) are not intended for
immediate inclusion and probably need more discussion.
We've been noticing some unexpected behavior in handling NPF vmexits
on AMD CPUs (see individual patches for details), and thus we are
proposing a workaround (see last patch) that adds a capability that
userspace can use to decide how to deal with hosts that might have
issues supporting guest MAXPHYADDR < host MAXPHYADDR.


Mohammed Gamal (7):
KVM: x86: Add helper functions for illegal GPA checking and page fault
injection
KVM: x86: mmu: Move translate_gpa() to mmu.c
KVM: x86: mmu: Add guest physical address check in translate_gpa()
KVM: VMX: Add guest physical address check in EPT violation and
misconfig
KVM: SVM: introduce svm_need_pf_intercept
KVM: SVM: Add guest physical address check in NPF/PF interception
KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
configurable

Paolo Bonzini (4):
KVM: x86: rename update_bp_intercept to update_exception_bitmap
KVM: x86: update exception bitmap on CPUID changes
KVM: VMX: introduce vmx_need_pf_intercept
KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

arch/x86/include/asm/kvm_host.h | 10 ++------
arch/x86/kvm/cpuid.c | 2 ++
arch/x86/kvm/mmu.h | 6 +++++
arch/x86/kvm/mmu/mmu.c | 12 +++++++++
arch/x86/kvm/svm/svm.c | 41 +++++++++++++++++++++++++++---
arch/x86/kvm/svm/svm.h | 6 +++++
arch/x86/kvm/vmx/nested.c | 28 ++++++++++++--------
arch/x86/kvm/vmx/vmx.c | 45 +++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/vmx.h | 6 +++++
arch/x86/kvm/x86.c | 29 ++++++++++++++++++++-
arch/x86/kvm/x86.h | 1 +
include/uapi/linux/kvm.h | 1 +
12 files changed, 158 insertions(+), 29 deletions(-)

--
2.26.2


2020-06-19 15:45:18

by Mohammed Gamal

Subject: [PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

This patch adds two helper functions that will be used to support virtualizing
MAXPHYADDR in both kvm-intel.ko and kvm.ko.

kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.
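
For illustration, the intended usage looks roughly like this (a minimal
sketch; fault_gpa, gva, error_code and handle_guest_reserved_gpa() are
hypothetical placeholders, the real call sites are added by later patches
in this series):

        /*
         * 1) On an EPT violation/NPF exit, check whether the faulting GPA
         *    has bits at or above the guest's MAXPHYADDR:
         */
        if (kvm_mmu_is_illegal_gpa(vcpu, fault_gpa))
                handle_guest_reserved_gpa(vcpu);        /* hypothetical */

        /*
         * 2) On an intercepted #PF, re-deliver the fault to the guest; the
         *    helper walks the guest page tables and corrects the error code
         *    when the walk hits a GPA above the guest's MAXPHYADDR:
         */
        kvm_fixup_and_inject_pf_error(vcpu, gva, error_code);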

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.h | 6 ++++++
arch/x86/kvm/x86.c | 21 +++++++++++++++++++++
arch/x86/kvm/x86.h | 1 +
3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0ad06bfe2c2c..555237dfb91c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -4,6 +4,7 @@

#include <linux/kvm_host.h>
#include "kvm_cache_regs.h"
+#include "cpuid.h"

#define PT64_PT_BITS 9
#define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
}

+static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
+}
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..ac8642e890b1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);

+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
+{
+ struct x86_exception fault;
+
+ if (!(error_code & PFERR_PRESENT_MASK) ||
+ vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) {
+ /*
+ * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page
+ * tables probably do not match the TLB. Just proceed
+ * with the error code that the processor gave.
+ */
+ fault.vector = PF_VECTOR;
+ fault.error_code_valid = true;
+ fault.error_code = error_code;
+ fault.nested_page_fault = false;
+ fault.address = gva;
+ }
+ vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
+}
+EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6eb62e97e59f..239ae0f3e40b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
int page_num);
bool kvm_vector_hashing_enabled(void);
+void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len);
fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
--
2.26.2

2020-06-20 00:54:26

by Mohammed Gamal

Subject: [PATCH v2 08/11] KVM: VMX: optimize #PF injection when MAXPHYADDR does not match

From: Paolo Bonzini <[email protected]>

Ignore non-present page faults, since those cannot have reserved
bits set.

When running access.flat with "-cpu Haswell,phys-bits=36", the
number of trapped page faults goes down from 8872644 to 3978948.
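
This works because, with EB.PF=1, hardware generates a #PF vmexit only
when (error_code & PFEC_MASK) == PFEC_MATCH. A sketch of the filtering
that the MASK/MATCH values below implement (illustrative, not code added
by this patch):

        /* With EB.PF = 1 and MASK = MATCH = PFERR_PRESENT_MASK ... */
        static bool pf_causes_vmexit(u32 error_code)
        {
                /* ... only faults on present pages are trapped */
                return (error_code & PFERR_PRESENT_MASK) == PFERR_PRESENT_MASK;
        }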

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f38cbadcb3a5..8daf78b2d4cb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4358,6 +4358,16 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
+
+ /*
+ * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+ * between guest and host. In that case we only care about present
+ * faults.
+ */
+ if (enable_ept) {
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
+ }
}

static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
--
2.26.2

2020-06-20 00:59:12

by Mohammed Gamal

Subject: [PATCH v2 06/11] KVM: VMX: introduce vmx_need_pf_intercept

From: Paolo Bonzini <[email protected]>

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 28 +++++++++++++++++-----------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 5 +++++
3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d1af20b050a8..328411919518 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)

/*
* Whether page-faults are trapped is determined by a combination of
- * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
- * If enable_ept, L0 doesn't care about page faults and we should
- * set all of these to L1's desires. However, if !enable_ept, L0 does
- * care about (at least some) page faults, and because it is not easy
- * (if at all possible?) to merge L0 and L1's desires, we simply ask
- * to exit on each and every L2 page fault. This is done by setting
- * MASK=MATCH=0 and (see below) EB.PF=1.
+ * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. If L0
+ * doesn't care about page faults then we should set all of these to
+ * L1's desires. However, if L0 does care about (some) page faults, it
+ * is not easy (if at all possible?) to merge L0 and L1's desires, we
+ * simply ask to exit on each and every L2 page fault. This is done by
+ * setting MASK=MATCH=0 and (see below) EB.PF=1.
* Note that below we don't need special code to set EB.PF beyond the
* "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
* vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
* !enable_ept, EB.PF is 1, so the "or" will always be 1.
*/
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
- enable_ept ? vmcs12->page_fault_error_code_mask : 0);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
- enable_ept ? vmcs12->page_fault_error_code_match : 0);
+ if (vmx_need_pf_intercept(&vmx->vcpu)) {
+ /*
+ * TODO: if both L0 and L1 need the same MASK and MATCH,
+ * go ahead and use it?
+ */
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
+ } else {
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match);
+ }

if (cpu_has_vmx_apicv()) {
vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f82c42ac87f9..46d522ee5cb1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -783,7 +783,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
eb |= 1u << BP_VECTOR;
if (to_vmx(vcpu)->rmode.vm86_active)
eb = ~0;
- if (enable_ept)
+ if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);

/* When we are running a nested L2 guest and L1 specified for it a
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 8a83b5edc820..5e2da15fe94f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -552,6 +552,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
}

+static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
+{
+ return !enable_ept;
+}
+
void dump_vmcs(void);

#endif /* __KVM_X86_VMX_H */
--
2.26.2

2020-06-20 01:00:44

by Mohammed Gamal

Subject: [PATCH v2 04/11] KVM: x86: rename update_bp_intercept to update_exception_bitmap

From: Paolo Bonzini <[email protected]>

We would like to introduce a callback to update the #PF intercept
when CPUID changes. Just reuse update_bp_intercept, since VMX already
implements it with update_exception_bitmap rather than a bespoke function.

While at it, remove an unnecessary assignment in the SVM version,
which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
and has nothing to do with the exception bitmap.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm/svm.c | 7 +++----
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bc0fb116cc5c..7ebdb43632e0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1098,7 +1098,7 @@ struct kvm_x86_ops {
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);

- void (*update_bp_intercept)(struct kvm_vcpu *vcpu);
+ void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8ccfa4197d9c..94108e6cc6da 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1627,7 +1627,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
mark_dirty(svm->vmcb, VMCB_SEG);
}

-static void update_bp_intercept(struct kvm_vcpu *vcpu)
+static void update_exception_bitmap(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);

@@ -1636,8 +1636,7 @@ static void update_bp_intercept(struct kvm_vcpu *vcpu)
if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
set_exception_intercept(svm, BP_VECTOR);
- } else
- vcpu->guest_debug = 0;
+ }
}

static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
@@ -3989,7 +3988,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_blocking = svm_vcpu_blocking,
.vcpu_unblocking = svm_vcpu_unblocking,

- .update_bp_intercept = update_bp_intercept,
+ .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = svm_get_msr_feature,
.get_msr = svm_get_msr,
.set_msr = svm_set_msr,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 36c771728c8c..f82c42ac87f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7881,7 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,

- .update_bp_intercept = update_exception_bitmap,
+ .update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vmx_get_msr,
.set_msr = vmx_set_msr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac8642e890b1..84f1f0084d2e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9275,7 +9275,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
*/
kvm_set_rflags(vcpu, rflags);

- kvm_x86_ops.update_bp_intercept(vcpu);
+ kvm_x86_ops.update_exception_bitmap(vcpu);

r = 0;

--
2.26.2

2020-06-20 01:10:25

by Paolo Bonzini

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 19/06/20 17:39, Mohammed Gamal wrote:
> The last 3 patches (i.e. SVM bits and patch 11) are not intended for
> immediate inclusion and probably need more discussion.
> We've been noticing some unexpected behavior in handling NPF vmexits
> on AMD CPUs (see individual patches for details), and thus we are
> proposing a workaround (see last patch) that adds a capability that
> userspace can use to decide how to deal with hosts that might have
> issues supporting guest MAXPHYADDR < host MAXPHYADDR.

I think patch 11 can be committed (more or less). You would actually
move it to the beginning of the series and have
"allow_smaller_maxphyaddr = !enable_ept;" for VMX. It is then changed
to "allow_smaller_maxphyaddr = true;" in "KVM: VMX: Add guest physical
address check in EPT violation and misconfig".

In fact, it would be a no-brainer to commit patch 11 in that form, so
feel free to submit it separately.

Thanks,

Paolo

2020-06-20 01:26:15

by Mohammed Gamal

Subject: [PATCH v2 11/11] KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support configurable

The reason behind including this patch is unexpected behaviour we see
with NPF vmexit handling on AMD processors.

With the previous patch ("KVM: SVM: Add guest physical address check in
NPF/PF interception") we see the following error multiple times in
the 'access' test in kvm-unit-tests:

test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
Dump mapping: address: 0x123400000000
------L4: 24c3027
------L3: 24c4027
------L2: 24c5021
------L1: 1002000021

This shows that the PTE's accessed bit is apparently being set by
the CPU hardware before the NPF vmexit. This is completely handled by
hardware and cannot be fixed in software.

This patch introduces a workaround. We add a boolean variable,
'allow_smaller_maxphyaddr', which is set individually by the VMX and SVM
init routines. On VMX it's always set to true; on SVM it's only set to
true when NPT is not enabled.

We also add a new capability, KVM_CAP_SMALLER_MAXPHYADDR, which
allows userspace to query whether the underlying architecture would
support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
(e.g. QEMU can decide whether to ignore the -cpu ..,phys-bits=X option).
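
For illustration, userspace could probe the new capability along these
lines (a minimal sketch, not part of this patch):

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Returns non-zero if guest MAXPHYADDR < host MAXPHYADDR is supported */
        static int host_supports_smaller_maxphyaddr(void)
        {
                int kvm_fd = open("/dev/kvm", O_RDWR);
                int r;

                if (kvm_fd < 0)
                        return 0;
                r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);
                close(kvm_fd);
                return r > 0;
        }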

CC: Tom Lendacky <[email protected]>
CC: Babu Moger <[email protected]>
Signed-off-by: Mohammed Gamal <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm/svm.c | 15 +++++++++++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++++
arch/x86/kvm/x86.c | 6 ++++++
include/uapi/linux/kvm.h | 1 +
5 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ebdb43632e0..b25f7497307d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1304,7 +1304,7 @@ struct kvm_arch_async_pf {
};

extern u64 __read_mostly host_efer;
-
+extern bool __read_mostly allow_smaller_maxphyaddr;
extern struct kvm_x86_ops kvm_x86_ops;

#define __KVM_HAVE_ARCH_VM_ALLOC
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ec3224a2e7c2..1b8880b89e9f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -924,6 +924,21 @@ static __init int svm_hardware_setup(void)

svm_set_cpu_caps();

+ /*
+ * It seems that on AMD processors PTE's accessed bit is
+ * being set by the CPU hardware before the NPF vmexit.
+ * This is not expected behaviour and our tests fail because
+ * of it.
+ * A workaround here is to disable support for
+ * GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled.
+ * In this case userspace can know if there is support using
+ * KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle
+ * it
+ * If future AMD CPU models change the behaviour described above,
+ * this variable can be changed accordingly
+ */
+ allow_smaller_maxphyaddr = !npt_enabled;
+
return 0;

err:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8daf78b2d4cb..fe0ca39c0887 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8316,6 +8316,13 @@ static int __init vmx_init(void)
#endif
vmx_check_vmcs12_offsets();

+ /*
+ * Intel processors don't have problems with
+ * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
+ * it for VMX by default
+ */
+ allow_smaller_maxphyaddr = true;
+
return 0;
}
module_init(vmx_init);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 84f1f0084d2e..5bca6d6d24e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,6 +187,9 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);

+bool __read_mostly allow_smaller_maxphyaddr;
+EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
+
static u64 __read_mostly host_xss;
u64 __read_mostly supported_xss;
EXPORT_SYMBOL_GPL(supported_xss);
@@ -3533,6 +3536,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
r = kvm_x86_ops.nested_ops->enable_evmcs != NULL;
break;
+ case KVM_CAP_SMALLER_MAXPHYADDR:
+ r = (int) allow_smaller_maxphyaddr;
+ break;
default:
break;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..68cd3a0af9bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_SECURE_GUEST 181
#define KVM_CAP_HALT_POLL 182
#define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_SMALLER_MAXPHYADDR 184

#ifdef KVM_CAP_IRQ_ROUTING

--
2.26.2

2020-06-20 01:36:43

by Mohammed Gamal

Subject: [PATCH v2 10/11] KVM: SVM: Add guest physical address check in NPF/PF interception

Check guest physical address against it's maximum physical memory. If
the guest's physical address exceeds the maximum (i.e. has reserved bits
set), inject a guest page fault with PFERR_RSVD_MASK set.

Similar to VMX, this has to be done both in the NPF and page fault interceptions,
as there are complications in both cases with respect to the computation
of the correct error code.

For NPF interceptions, unfortunately the only possibility is to emulate,
because the access type in the exit qualification might refer to an
access to a paging structure, rather than to the access performed by
the program.

Trapping page faults instead is needed in order to correct the error code,
but the access type can be obtained from the original error code and
passed to gva_to_gpa. The corrections required in the error code are
subtle. For example, imagine that a PTE for a supervisor page has a bit set
that the guest considers reserved. On a supervisor-mode access, the NPF path
would trigger.
However, on a user-mode access, the processor will not notice the reserved
bit and not include PFERR_RSVD_MASK in the error code.
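
As a concrete illustration of the fixup for the user-mode access described
above (values only, not code from this patch):

        /*
         * Illustrative example: the error code reported by hardware lacks
         * RSVD because the host sees no reserved bits in the GPA.
         */
        reported_error_code = PFERR_PRESENT_MASK | PFERR_USER_MASK;
        injected_error_code = reported_error_code | PFERR_RSVD_MASK;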

CC: Tom Lendacky <[email protected]>
CC: Babu Moger <[email protected]>
Signed-off-by: Mohammed Gamal <[email protected]>
---
arch/x86/kvm/svm/svm.c | 11 +++++++++++
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 05412818027d..ec3224a2e7c2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1702,6 +1702,12 @@ static int pf_interception(struct vcpu_svm *svm)
u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2);
u64 error_code = svm->vmcb->control.exit_info_1;

+ if (npt_enabled && !svm->vcpu.arch.apf.host_apf_flags) {
+ kvm_fixup_and_inject_pf_error(&svm->vcpu,
+ fault_address, error_code);
+ return 1;
+ }
+
return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -1714,6 +1720,11 @@ static int npf_interception(struct vcpu_svm *svm)
u64 error_code = svm->vmcb->control.exit_info_1;

trace_kvm_page_fault(fault_address, error_code);
+
+ /* Check if guest gpa doesn't exceed physical memory limits */
+ if (unlikely(kvm_mmu_is_illegal_gpa(&svm->vcpu, fault_address)))
+ return kvm_emulate_instruction(&svm->vcpu, 0);
+
return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2b7469f3db0e..12b502e36dbd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -348,7 +348,7 @@ static inline bool gif_set(struct vcpu_svm *svm)

static inline bool svm_need_pf_intercept(struct vcpu_svm *svm)
{
- return !npt_enabled;
+ return !npt_enabled || cpuid_maxphyaddr(&svm->vcpu) < boot_cpu_data.x86_phys_bits;
}

/* svm.c */
--
2.26.2

2020-06-20 01:46:44

by Mohammed Gamal

Subject: [PATCH v2 03/11] KVM: x86: mmu: Add guest physical address check in translate_gpa()

When running a guest with 4-level page tables on a 5-level page
table host, the guest might have a physical address with bits set that it
considers reserved, but the host won't see them as reserved and won't trap
the access.

Hence, we need to check page faults' physical addresses against the
guest's maximum physical address and, if it is exceeded, add the
PFERR_RSVD_MASK bit to the #PF error code.

Also make sure the error code isn't overwritten by the page table walker.

Signed-off-by: Mohammed Gamal <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ee113fc1f1bf..10409b76b2d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -518,6 +518,12 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
struct x86_exception *exception)
{
+ /* Check if guest physical address doesn't exceed guest maximum */
+ if (kvm_mmu_is_illegal_gpa(vcpu, gpa)) {
+ exception->error_code |= PFERR_RSVD_MASK;
+ return UNMAPPED_GVA;
+ }
+
return gpa;
}

--
2.26.2

2020-06-20 04:48:50

by Tom Lendacky

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/19/20 10:39 AM, Mohammed Gamal wrote:
> When EPT/NPT is enabled, KVM does not really look at guest physical
> address size. Address bits above maximum physical memory size are reserved.
> Because KVM does not look at these guest physical addresses, it currently
> effectively supports guest physical address sizes equal to the host.
>
> This can be a problem when having a mixed setup of machines with 5-level page
> tables and machines with 4-level page tables, as live migration can change
> MAXPHYADDR while the guest runs, which can theoretically introduce bugs.
>
> In this patch series we add checks on guest physical addresses in EPT
> violation/misconfig and NPF vmexits and if needed inject the proper
> page faults in the guest.
>
> A more subtle issue is when the host MAXPHYADDR is larger than that of the
> guest. Page faults caused by reserved bits on the guest won't cause an EPT
> violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK
> error code to the page fault if needed.

I'm probably missing something here, but I'm confused by this statement.
Is this for a case where a page has been marked not present and the guest
has also set what it believes are reserved bits? Then when the page is
accessed, the guest sees a page fault without the error code for reserved
bits? If so, my understanding is that is architecturally correct. P=0 is
considered higher priority than other page faults, at least on AMD. So if
you have a P=0 and other issues exist within the PTE, AMD will report the
P=0 fault and that's it.

The priority of other page fault conditions when P=1 is not defined and I
don't think we guarantee that you would get all error codes on fault.
Software is always expected to address the page fault and retry, and it
may get another page fault when it does, with a different error code.
Assuming the other errors are addressed, eventually the reserved bits
would cause an NPF and that could be detected by the HV and handled
appropriately.

>
> The last 3 patches (i.e. SVM bits and patch 11) are not intended for
> immediate inclusion and probably need more discussion.
> We've been noticing some unexpected behavior in handling NPF vmexits
> on AMD CPUs (see individual patches for details), and thus we are
> proposing a workaround (see last patch) that adds a capability that
> userspace can use to decide how to deal with hosts that might have
> issues supporting guest MAXPHYADDR < host MAXPHYADDR.

Also, something to consider. On AMD, when memory encryption is enabled
(via the SYS_CFG MSR), a guest can actually have a larger MAXPHYADDR than
the host. How do these patches all play into that?

Thanks,
Tom

>
>
> Mohammed Gamal (7):
> KVM: x86: Add helper functions for illegal GPA checking and page fault
> injection
> KVM: x86: mmu: Move translate_gpa() to mmu.c
> KVM: x86: mmu: Add guest physical address check in translate_gpa()
> KVM: VMX: Add guest physical address check in EPT violation and
> misconfig
> KVM: SVM: introduce svm_need_pf_intercept
> KVM: SVM: Add guest physical address check in NPF/PF interception
> KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
> configurable
>
> Paolo Bonzini (4):
> KVM: x86: rename update_bp_intercept to update_exception_bitmap
> KVM: x86: update exception bitmap on CPUID changes
> KVM: VMX: introduce vmx_need_pf_intercept
> KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
>
> arch/x86/include/asm/kvm_host.h | 10 ++------
> arch/x86/kvm/cpuid.c | 2 ++
> arch/x86/kvm/mmu.h | 6 +++++
> arch/x86/kvm/mmu/mmu.c | 12 +++++++++
> arch/x86/kvm/svm/svm.c | 41 +++++++++++++++++++++++++++---
> arch/x86/kvm/svm/svm.h | 6 +++++
> arch/x86/kvm/vmx/nested.c | 28 ++++++++++++--------
> arch/x86/kvm/vmx/vmx.c | 45 +++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/vmx.h | 6 +++++
> arch/x86/kvm/x86.c | 29 ++++++++++++++++++++-
> arch/x86/kvm/x86.h | 1 +
> include/uapi/linux/kvm.h | 1 +
> 12 files changed, 158 insertions(+), 29 deletions(-)
>

2020-06-20 04:55:24

by Jim Mattson

Subject: Re: [PATCH v2 06/11] KVM: VMX: introduce vmx_need_pf_intercept

On Fri, Jun 19, 2020 at 8:40 AM Mohammed Gamal <[email protected]> wrote:
>
> From: Paolo Bonzini <[email protected]>
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 28 +++++++++++++++++-----------
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/vmx/vmx.h | 5 +++++
> 3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index d1af20b050a8..328411919518 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2433,22 +2433,28 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>
> /*
> * Whether page-faults are trapped is determined by a combination of
> - * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
> - * If enable_ept, L0 doesn't care about page faults and we should
> - * set all of these to L1's desires. However, if !enable_ept, L0 does
> - * care about (at least some) page faults, and because it is not easy
> - * (if at all possible?) to merge L0 and L1's desires, we simply ask
> - * to exit on each and every L2 page fault. This is done by setting
> - * MASK=MATCH=0 and (see below) EB.PF=1.
> + * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. If L0
> + * doesn't care about page faults then we should set all of these to
> + * L1's desires. However, if L0 does care about (some) page faults, it
> + * is not easy (if at all possible?) to merge L0 and L1's desires, we
> + * simply ask to exit on each and every L2 page fault. This is done by
> + * setting MASK=MATCH=0 and (see below) EB.PF=1.
> * Note that below we don't need special code to set EB.PF beyond the
> * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
> * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
> * !enable_ept, EB.PF is 1, so the "or" will always be 1.
> */
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> - enable_ept ? vmcs12->page_fault_error_code_mask : 0);
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> - enable_ept ? vmcs12->page_fault_error_code_match : 0);
> + if (vmx_need_pf_intercept(&vmx->vcpu)) {
> + /*
> + * TODO: if both L0 and L1 need the same MASK and MATCH,
> + * go ahead and use it?
> + */

I'm not sure there's much "TODO", since L0's MASK and MATCH are both
0. So, if L1's MASK and MATCH are also both 0, we would go ahead and
use 0's, which it seems we already do here:

> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> + } else {
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask);
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match);
> + }
>
> if (cpu_has_vmx_apicv()) {
> vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f82c42ac87f9..46d522ee5cb1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -783,7 +783,7 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
> eb |= 1u << BP_VECTOR;
> if (to_vmx(vcpu)->rmode.vm86_active)
> eb = ~0;
> - if (enable_ept)
> + if (!vmx_need_pf_intercept(vcpu))
> eb &= ~(1u << PF_VECTOR);
>
> /* When we are running a nested L2 guest and L1 specified for it a
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 8a83b5edc820..5e2da15fe94f 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -552,6 +552,11 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
> SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
> }
>
> +static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> +{
> + return !enable_ept;
> +}
> +
> void dump_vmcs(void);
>
> #endif /* __KVM_X86_VMX_H */
> --
> 2.26.2
>
Reviewed-by: Jim Mattson <[email protected]>

2020-06-20 04:59:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 19/06/20 23:52, Tom Lendacky wrote:
>> A more subtle issue is when the host MAXPHYADDR is larger than that
>> of the guest. Page faults caused by reserved bits on the guest won't
>> cause an EPT violation/NPF and hence we also check guest MAXPHYADDR
>> and add PFERR_RSVD_MASK error code to the page fault if needed.
>
> I'm probably missing something here, but I'm confused by this
> statement. Is this for a case where a page has been marked not
> present and the guest has also set what it believes are reserved
> bits? Then when the page is accessed, the guest sees a page fault
> without the error code for reserved bits?

No, for a non-present page there is no issue because there are no reserved
bits in that case. If the page is present and no reserved bits are set
according to the host, however, there are two cases to consider:

- if the page is not accessible to the guest according to the
permissions in the page table, it will cause a #PF. We need to trap it
and change the error code into P|RSVD if the guest physical address has
any guest-reserved bits.

- if the page is accessible to the guest according to the permissions in
the page table, it will cause a #NPF. Again, we need to trap it, check
the guest physical address and inject a P|RSVD #PF if the guest physical
address has any guest-reserved bits.

The AMD specific issue happens in the second case. By the time the NPF
vmexit occurs, the accessed and/or dirty bits have been set and this
should not have happened before the RSVD page fault that we want to
inject. On Intel processors, instead, EPT violations trigger before
accessed and dirty bits are set. I cannot find an explicit mention of
the intended behavior in either the
Intel SDM or the AMD APM.

Paolo

2020-06-22 04:34:50

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On Fri, Jun 19, 2020 at 05:39:14PM +0200, Mohammed Gamal wrote:
> When EPT/NPT is enabled, KVM does not really look at guest physical
> address size. Address bits above maximum physical memory size are reserved.
> Because KVM does not look at these guest physical addresses, it currently
> effectively supports guest physical address sizes equal to the host.
>
> This can be a problem when having a mixed setup of machines with 5-level page
> tables and machines with 4-level page tables, as live migration can change
> MAXPHYADDR while the guest runs, which can theoretically introduce bugs.
>
> In this patch series we add checks on guest physical addresses in EPT
> violation/misconfig and NPF vmexits and if needed inject the proper
> page faults in the guest.
>
> A more subtle issue is when the host MAXPHYADDR is larger than that of the
> guest. Page faults caused by reserved bits on the guest won't cause an EPT
> violation/NPF and hence we also check guest MAXPHYADDR and add PFERR_RSVD_MASK
> error code to the page fault if needed.
>
> The last 3 patches (i.e. SVM bits and patch 11) are not intended for
> immediate inclusion and probably need more discussion.
> We've been noticing some unexpected behavior in handling NPF vmexits
> on AMD CPUs (see individual patches for details), and thus we are
> proposing a workaround (see last patch) that adds a capability that
> userspace can use to decide how to deal with hosts that might have
> issues supporting guest MAXPHYADDR < host MAXPHYADDR.
>
>
> Mohammed Gamal (7):
> KVM: x86: Add helper functions for illegal GPA checking and page fault
> injection
> KVM: x86: mmu: Move translate_gpa() to mmu.c
> KVM: x86: mmu: Add guest physical address check in translate_gpa()
> KVM: VMX: Add guest physical address check in EPT violation and
> misconfig
> KVM: SVM: introduce svm_need_pf_intercept
> KVM: SVM: Add guest physical address check in NPF/PF interception
> KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
> configurable
>
> Paolo Bonzini (4):
> KVM: x86: rename update_bp_intercept to update_exception_bitmap
> KVM: x86: update exception bitmap on CPUID changes
> KVM: VMX: introduce vmx_need_pf_intercept
> KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
>
> arch/x86/include/asm/kvm_host.h | 10 ++------
> arch/x86/kvm/cpuid.c | 2 ++
> arch/x86/kvm/mmu.h | 6 +++++
> arch/x86/kvm/mmu/mmu.c | 12 +++++++++
> arch/x86/kvm/svm/svm.c | 41 +++++++++++++++++++++++++++---
> arch/x86/kvm/svm/svm.h | 6 +++++
> arch/x86/kvm/vmx/nested.c | 28 ++++++++++++--------
> arch/x86/kvm/vmx/vmx.c | 45 +++++++++++++++++++++++++++++----
> arch/x86/kvm/vmx/vmx.h | 6 +++++
> arch/x86/kvm/x86.c | 29 ++++++++++++++++++++-
> arch/x86/kvm/x86.h | 1 +
> include/uapi/linux/kvm.h | 1 +
> 12 files changed, 158 insertions(+), 29 deletions(-)
>
> --
> 2.26.2
>

2020-06-22 04:49:26

by Yuan Yao

Subject: Re: [PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

On Fri, Jun 19, 2020 at 05:39:15PM +0200, Mohammed Gamal wrote:
> This patch adds two helper functions that will be used to support virtualizing
> MAXPHYADDR in both kvm-intel.ko and kvm.ko.
>
> kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
> while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.
>
> Signed-off-by: Mohammed Gamal <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 6 ++++++
> arch/x86/kvm/x86.c | 21 +++++++++++++++++++++
> arch/x86/kvm/x86.h | 1 +
> 3 files changed, 28 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 0ad06bfe2c2c..555237dfb91c 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -4,6 +4,7 @@
>
> #include <linux/kvm_host.h>
> #include "kvm_cache_regs.h"
> +#include "cpuid.h"
>
> #define PT64_PT_BITS 9
> #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
> @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
> return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
> }
>
> +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
> +}
> +
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 00c88c2f34e4..ac8642e890b1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
>
> +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
> +{
> + struct x86_exception fault;
> +
> + if (!(error_code & PFERR_PRESENT_MASK) ||
> + vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code, &fault) != UNMAPPED_GVA) {
> + /*
> + * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page
> + * tables probably do not match the TLB. Just proceed
> + * with the error code that the processor gave.
> + */
> + fault.vector = PF_VECTOR;
> + fault.error_code_valid = true;
> + fault.error_code = error_code;
> + fault.nested_page_fault = false;
> + fault.address = gva;
> + }
> + vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);

Should this "vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault)" be inside the last brace?
Otherwise an uninitialized fault variable will be passed to the walk_mmu->inject_page_fault.

> +}
> +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
> +
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 6eb62e97e59f..239ae0f3e40b 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> int page_num);
> bool kvm_vector_hashing_enabled(void);
> +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
> int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int emulation_type, void *insn, int insn_len);
> fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
> --
> 2.26.2
>

2020-06-22 12:26:05

by Mohammed Gamal

Subject: Re: [PATCH v2 01/11] KVM: x86: Add helper functions for illegal GPA checking and page fault injection

On Mon, 2020-06-22 at 12:44 +0800, Yuan Yao wrote:
> On Fri, Jun 19, 2020 at 05:39:15PM +0200, Mohammed Gamal wrote:
> > This patch adds two helper functions that will be used to support
> > virtualizing
> > MAXPHYADDR in both kvm-intel.ko and kvm.ko.
> >
> > kvm_fixup_and_inject_pf_error() injects a page fault for a user-
> > specified GVA,
> > while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU
> > address limits.
> >
> > Signed-off-by: Mohammed Gamal <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 6 ++++++
> > arch/x86/kvm/x86.c | 21 +++++++++++++++++++++
> > arch/x86/kvm/x86.h | 1 +
> > 3 files changed, 28 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 0ad06bfe2c2c..555237dfb91c 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -4,6 +4,7 @@
> >
> > #include <linux/kvm_host.h>
> > #include "kvm_cache_regs.h"
> > +#include "cpuid.h"
> >
> > #define PT64_PT_BITS 9
> > #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
> > @@ -158,6 +159,11 @@ static inline bool is_write_protection(struct
> > kvm_vcpu *vcpu)
> > return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
> > }
> >
> > +static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu,
> > gpa_t gpa)
> > +{
> > + return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
> > +}
> > +
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S
> > bits of a
> > * page fault error code pfec) causes a permission fault with the
> > given PTE
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 00c88c2f34e4..ac8642e890b1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10693,6 +10693,27 @@ u64 kvm_spec_ctrl_valid_bits(struct
> > kvm_vcpu *vcpu)
> > }
> > EXPORT_SYMBOL_GPL(kvm_spec_ctrl_valid_bits);
> >
> > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t
> > gva, u16 error_code)
> > +{
> > + struct x86_exception fault;
> > +
> > + if (!(error_code & PFERR_PRESENT_MASK) ||
> > + vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, error_code,
> > &fault) != UNMAPPED_GVA) {
> > + /*
> > + * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the
> > page
> > + * tables probably do not match the TLB. Just proceed
> > + * with the error code that the processor gave.
> > + */
> > + fault.vector = PF_VECTOR;
> > + fault.error_code_valid = true;
> > + fault.error_code = error_code;
> > + fault.nested_page_fault = false;
> > + fault.address = gva;
> > + }
> > + vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
>
> Should this "vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault)"
> inside the last brace?
> Otherwise an uninitialized fault variable will be passed to the
> walk_mmu->inject_page_fault.

Good catch. You're right. Will fix it in v3

>
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_fixup_and_inject_pf_error);
> > +
> > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 6eb62e97e59f..239ae0f3e40b 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -272,6 +272,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32
> > msr, u64 *pdata);
> > bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu,
> > gfn_t gfn,
> > int page_num);
> > bool kvm_vector_hashing_enabled(void);
> > +void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t
> > gva, u16 error_code);
> > int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t
> > cr2_or_gpa,
> > int emulation_type, void *insn, int
> > insn_len);
> > fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
> > --
> > 2.26.2
> >

2020-06-22 14:00:27

by Paolo Bonzini

Subject: Re: [PATCH v2 06/11] KVM: VMX: introduce vmx_need_pf_intercept

On 20/06/20 00:45, Jim Mattson wrote:
>> + /*
>> + * TODO: if both L0 and L1 need the same MASK and MATCH,
>> + * go ahead and use it?
>> + */
> I'm not sure there's much "TODO", since L0's MASK and MATCH are both
> 0. So, if L1's MASK and MATCH are also both 0, we would go ahead and
> use 0's, which it seems we already do here:

True, the comment should be moved to patch 8.

Paolo

>> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
>> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);

2020-06-22 15:14:00

by Mohammed Gamal

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On Fri, 2020-06-19 at 16:52 -0500, Tom Lendacky wrote:
> On 6/19/20 10:39 AM, Mohammed Gamal wrote:
> > When EPT/NPT is enabled, KVM does not really look at guest physical
> > address size. Address bits above maximum physical memory size are
> > reserved.
> > Because KVM does not look at these guest physical addresses, it
> > currently
> > effectively supports guest physical address sizes equal to the
> > host.
> >
> > This can be a problem when having a mixed setup of machines with 5-
> > level page
> > tables and machines with 4-level page tables, as live migration can
> > change
> > MAXPHYADDR while the guest runs, which can theoretically introduce
> > bugs.
> >
> > In this patch series we add checks on guest physical addresses in
> > EPT
> > violation/misconfig and NPF vmexits and if needed inject the proper
> > page faults in the guest.
> >
> > A more subtle issue is when the host MAXPHYADDR is larger than that
> > of the
> > guest. Page faults caused by reserved bits on the guest won't cause
> > an EPT
> > violation/NPF and hence we also check guest MAXPHYADDR and add
> > PFERR_RSVD_MASK
> > error code to the page fault if needed.
>
> I'm probably missing something here, but I'm confused by this
> statement.
> Is this for a case where a page has been marked not present and the
> guest
> has also set what it believes are reserved bits? Then when the page
> is
> accessed, the guest sees a page fault without the error code for
> reserved
> bits? If so, my understanding is that is architecturally correct. P=0
> is
> considered higher priority than other page faults, at least on AMD.
> So if
> you have a P=0 and other issues exist within the PTE, AMD will report
> the
> P=0 fault and that's it.
>
> The priority of other page fault conditions when P=1 is not defined
> and I
> don't think we guarantee that you would get all error codes on
> fault.
> Software is always expected to address the page fault and retry, and
> it
> may get another page fault when it does, with a different error
> code.
> Assuming the other errors are addressed, eventually the reserved
> bits
> would cause an NPF and that could be detected by the HV and handled
> appropriately.
>
> > The last 3 patches (i.e. SVM bits and patch 11) are not intended
> > for
> > immediate inclusion and probably need more discussion.
> > We've been noticing some unexpected behavior in handling NPF
> > vmexits
> > on AMD CPUs (see individual patches for details), and thus we are
> > proposing a workaround (see last patch) that adds a capability that
> > userspace can use to decide how to deal with hosts that might have
> > issues supporting guest MAXPHYADDR < host MAXPHYADDR.
>
> Also, something to consider. On AMD, when memory encryption is
> enabled
> (via the SYS_CFG MSR), a guest can actually have a larger MAXPHYADDR
> than
> the host. How do these patches all play into that?

Well the patches definitely don't address that case. It's assumed a
guest VM's MAXPHYADDR <= host MAXPHYADDR, and hence we handle the case
where a guest's physical address space is smaller and try to trap
faults that may go unnoticed by the host.

My question is, in the case of guest MAXPHYADDR > host MAXPHYADDR, do we
expect somehow that there might be guest physical addresses that
contain what the host could see as reserved bits? And how'd the host
handle that?

Thanks,
Mohammed

>
> Thanks,
> Tom
>
> >
> > Mohammed Gamal (7):
> > KVM: x86: Add helper functions for illegal GPA checking and page
> > fault
> > injection
> > KVM: x86: mmu: Move translate_gpa() to mmu.c
> > KVM: x86: mmu: Add guest physical address check in
> > translate_gpa()
> > KVM: VMX: Add guest physical address check in EPT violation and
> > misconfig
> > KVM: SVM: introduce svm_need_pf_intercept
> > KVM: SVM: Add guest physical address check in NPF/PF
> > interception
> > KVM: x86: SVM: VMX: Make GUEST_MAXPHYADDR < HOST_MAXPHYADDR
> > support
> > configurable
> >
> > Paolo Bonzini (4):
> > KVM: x86: rename update_bp_intercept to update_exception_bitmap
> > KVM: x86: update exception bitmap on CPUID changes
> > KVM: VMX: introduce vmx_need_pf_intercept
> > KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
> >
> > arch/x86/include/asm/kvm_host.h | 10 ++------
> > arch/x86/kvm/cpuid.c | 2 ++
> > arch/x86/kvm/mmu.h | 6 +++++
> > arch/x86/kvm/mmu/mmu.c | 12 +++++++++
> > arch/x86/kvm/svm/svm.c | 41 +++++++++++++++++++++++++++-
> > --
> > arch/x86/kvm/svm/svm.h | 6 +++++
> > arch/x86/kvm/vmx/nested.c | 28 ++++++++++++--------
> > arch/x86/kvm/vmx/vmx.c | 45
> > +++++++++++++++++++++++++++++----
> > arch/x86/kvm/vmx/vmx.h | 6 +++++
> > arch/x86/kvm/x86.c | 29 ++++++++++++++++++++-
> > arch/x86/kvm/x86.h | 1 +
> > include/uapi/linux/kvm.h | 1 +
> > 12 files changed, 158 insertions(+), 29 deletions(-)
> >

2020-06-22 15:28:31

by Paolo Bonzini

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 22/06/20 17:08, Mohammed Gamal wrote:
>> Also, something to consider. On AMD, when memory encryption is
>> enabled (via the SYS_CFG MSR), a guest can actually have a larger
>> MAXPHYADDR than the host. How do these patches all play into that?

As long as the NPT page tables handle the guest MAXPHYADDR just fine,
there's no need to do anything. I think that's the case?

Paolo

> Well the patches definitely don't address that case. It's assumed a
> guest VM's MAXPHYADDR <= host MAXPHYADDR, and hence we handle the case
> where a guest's physical address space is smaller and try to trap
> faults that may go unnoticed by the host.
>
> My question is in the case of guest MAXPHYADDR > host MAXPHYADDR, do we
> expect somehow that there might be guest physical addresses that
> contain what the host could see as reserved bits? And how'd the host
> handle that?

2020-06-22 16:37:57

by Tom Lendacky

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/19/20 6:07 PM, Paolo Bonzini wrote:
> On 19/06/20 23:52, Tom Lendacky wrote:
>>> A more subtle issue is when the host MAXPHYADDR is larger than that
>>> of the guest. Page faults caused by reserved bits on the guest won't
>>> cause an EPT violation/NPF and hence we also check guest MAXPHYADDR
>>> and add PFERR_RSVD_MASK error code to the page fault if needed.
>>
>> I'm probably missing something here, but I'm confused by this
>> statement. Is this for a case where a page has been marked not
>> present and the guest has also set what it believes are reserved
>> bits? Then when the page is accessed, the guest sees a page fault
>> without the error code for reserved bits?
>
> No, for non-present page there is no issue because there are no reserved
> bits in that case. If the page is present and no reserved bits are set
> according to the host, however, there are two cases to consider:
>
> - if the page is not accessible to the guest according to the
> permissions in the page table, it will cause a #PF. We need to trap it
> and change the error code into P|RSVD if the guest physical address has
> any guest-reserved bits.

I'm not a big fan of trapping #PF for this. Can't this have a performance
impact on the guest? If I'm not mistaken, QEMU will default to the TCG
physical address size (40 bits), unless told otherwise, causing #PF to now
be trapped. Maybe libvirt defaults to matching host/guest CPU MAXPHYADDR?

On bare metal, there's no guarantee a CPU will report all the faults in a
single PF error code. And because of race conditions, software can never
rely on that behavior. Whenever the OS thinks it has cured an error, it
must always be able to handle another #PF for the same access when it
retries because another processor could have modified the PTE in the
meantime. What's the purpose of reporting RSVD in the error code in the
guest in regards to live migration?

>
> - if the page is accessible to the guest according to the permissions in
> the page table, it will cause a #NPF. Again, we need to trap it, check
> the guest physical address and inject a P|RSVD #PF if the guest physical
> address has any guest-reserved bits.
>
> The AMD specific issue happens in the second case. By the time the NPF
> vmexit occurs, the accessed and/or dirty bits have been set and this
> should not have happened before the RSVD page fault that we want to
> inject. On Intel processors, instead, EPT violations trigger before
> accessed and dirty bits are set. I cannot find an explicit mention of
> the intended behavior in either the
> Intel SDM or the AMD APM.

Section 15.25.6 of the AMD APM volume 2 talks about page faults (nested vs
guest) and fault ordering. It does talk about setting guest A/D bits
during the walk, before an #NPF is taken. I don't see any way around that
given a virtual MAXPHYADDR in the guest being less than the host MAXPHYADDR.

Thanks,
Tom

>
> Paolo
>

2020-06-22 16:38:26

by Tom Lendacky

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/22/20 10:23 AM, Paolo Bonzini wrote:
> On 22/06/20 17:08, Mohammed Gamal wrote:
>>> Also, something to consider. On AMD, when memory encryption is
>>> enabled (via the SYS_CFG MSR), a guest can actually have a larger
>>> MAXPHYADDR than the host. How do these patches all play into that?
>
> As long as the NPT page tables handle the guest MAXPHYADDR just fine,
> there's no need to do anything. I think that's the case?

Yes, it does.

Thanks,
Tom

>
> Paolo
>
>> Well the patches definitely don't address that case. It's assumed a
>> guest VM's MAXPHYADDR <= host MAXPHYADDR, and hence we handle the case
>> where a guest's physical address space is smaller and try to trap
>> faults that may go unnoticed by the host.
>>
>> My question is in the case of guest MAXPHYADDR > host MAXPHYADDR, do we
>> expect somehow that there might be guest physical addresses that
>> contain what the host could see as reserved bits? And how'd the host
>> handle that?
>

2020-06-22 17:05:59

by Paolo Bonzini

Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 22/06/20 18:33, Tom Lendacky wrote:
> I'm not a big fan of trapping #PF for this. Can't this have a performance
> impact on the guest? If I'm not mistaken, Qemu will default to TCG
> physical address size (40-bits), unless told otherwise, causing #PF to now
> be trapped. Maybe libvirt defaults to matching host/guest CPU MAXPHYADDR?

Yes, this is true. We should change it similarly to how we handle TSC
frequency (and having support for guest MAXPHYADDR < host MAXPHYADDR is
a prerequisite).

> In bare-metal, there's no guarantee a CPU will report all the faults in a
> single PF error code. And because of race conditions, software can never
> rely on that behavior. Whenever the OS thinks it has cured an error, it
> must always be able to handle another #PF for the same access when it
> retries because another processor could have modified the PTE in the
> meantime.

I agree, but I don't understand the relation to this patch. Can you
explain?

> What's the purpose of reporting RSVD in the error code in the
> guest in regards to live migration?
>
>> - if the page is accessible to the guest according to the permissions in
>> the page table, it will cause a #NPF. Again, we need to trap it, check
>> the guest physical address and inject a P|RSVD #PF if the guest physical
>> address has any guest-reserved bits.
>>
>> The AMD specific issue happens in the second case. By the time the NPF
>> vmexit occurs, the accessed and/or dirty bits have been set and this
>> should not have happened before the RSVD page fault that we want to
>> inject. On Intel processors, instead, EPT violations trigger before
>> accessed and dirty bits are set. I cannot find an explicit mention of
>> the intended behavior in either the
>> Intel SDM or the AMD APM.
>
> Section 15.25.6 of the AMD APM volume 2 talks about page faults (nested vs
> guest) and fault ordering. It does talk about setting guest A/D bits
> during the walk, before an #NPF is taken. I don't see any way around that
> given a virtual MAXPHYADDR in the guest being less than the host MAXPHYADDR.

Right you are... Then this behavior cannot be implemented on AMD.

Paolo

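To make the #NPF-path fixup discussed above concrete, here is a minimal
standalone sketch (not the code from this series; the names and structure
are illustrative only): on an EPT violation/NPF vmexit, the faulting guest
physical address is checked against the guest's MAXPHYADDR, and if it uses
guest-reserved bits, the error code of the page fault to be injected gains
the P and RSVD bits.

#include <stdbool.h>
#include <stdint.h>

/* #PF error-code bits (hardware-defined positions). */
#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_RSVD_MASK    (1u << 3)

/* Any address bit at or above the guest's MAXPHYADDR is reserved from
 * the guest's point of view. */
static bool gpa_has_guest_reserved_bits(uint64_t gpa,
                                        unsigned int guest_maxphyaddr)
{
        return (gpa >> guest_maxphyaddr) != 0;
}

/* Turn the error code of a trapped fault into the P|RSVD #PF the guest
 * would have seen on hardware whose MAXPHYADDR matched its own. */
static uint32_t fixup_error_code(uint64_t gpa,
                                 unsigned int guest_maxphyaddr,
                                 uint32_t error_code)
{
        if (gpa_has_guest_reserved_bits(gpa, guest_maxphyaddr))
                error_code |= PFERR_PRESENT_MASK | PFERR_RSVD_MASK;
        return error_code;
}
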
2020-06-22 18:01:56

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/22/20 12:03 PM, Paolo Bonzini wrote:
> On 22/06/20 18:33, Tom Lendacky wrote:
>> I'm not a big fan of trapping #PF for this. Can't this have a performance
>> impact on the guest? If I'm not mistaken, Qemu will default to TCG
>> physical address size (40-bits), unless told otherwise, causing #PF to now
>> be trapped. Maybe libvirt defaults to matching host/guest CPU MAXPHYADDR?
>
> Yes, this is true. We should change it similar to how we handle TSC
> frequency (and having support for guest MAXPHYADDR < host MAXPHYADDR is
> a prerequisite).
>
>> In bare-metal, there's no guarantee a CPU will report all the faults in a
>> single PF error code. And because of race conditions, software can never
>> rely on that behavior. Whenever the OS thinks it has cured an error, it
>> must always be able to handle another #PF for the same access when it
>> retries because another processor could have modified the PTE in the
>> meantime.
>
> I agree, but I don't understand the relation to this patch. Can you
> explain?

I guess I'm trying to understand why RSVD has to be reported to the guest
on a #PF (vs an NPF) when there's no guarantee that it can receive that
error code today even when guest MAXPHYADDR == host MAXPHYADDR. That would
eliminate the need to trap #PF.

Thanks,
Tom

>
>> What's the purpose of reporting RSVD in the error code in the
>> guest in regards to live migration?
>>
>>> - if the page is accessible to the guest according to the permissions in
>>> the page table, it will cause a #NPF. Again, we need to trap it, check
>>> the guest physical address and inject a P|RSVD #PF if the guest physical
>>> address has any guest-reserved bits.
>>>
>>> The AMD specific issue happens in the second case. By the time the NPF
>>> vmexit occurs, the accessed and/or dirty bits have been set and this
>>> should not have happened before the RSVD page fault that we want to
>>> inject. On Intel processors, instead, EPT violations trigger before
>>> accessed and dirty bits are set. I cannot find an explicit mention of
>>> the intended behavior in either the
>>> Intel SDM or the AMD APM.
>>
>> Section 15.25.6 of the AMD APM volume 2 talks about page faults (nested vs
>> guest) and fault ordering. It does talk about setting guest A/D bits
>> during the walk, before an #NPF is taken. I don't see any way around that
>> given a virtual MAXPHYADDR in the guest being less than the host MAXPHYADDR.
>
> Right you are... Then this behavior cannot be implemented on AMD.
>
> Paolo
>

2020-06-22 18:04:06

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 22/06/20 19:57, Tom Lendacky wrote:
>>> In bare-metal, there's no guarantee a CPU will report all the faults in a
>>> single PF error code. And because of race conditions, software can never
>>> rely on that behavior. Whenever the OS thinks it has cured an error, it
>>> must always be able to handle another #PF for the same access when it
>>> retries because another processor could have modified the PTE in the
>>> meantime.
>> I agree, but I don't understand the relation to this patch. Can you
>> explain?
>
> I guess I'm trying to understand why RSVD has to be reported to the guest
> on a #PF (vs an NPF) when there's no guarantee that it can receive that
> error code today even when guest MAXPHYADDR == host MAXPHYADDR. That would
> eliminate the need to trap #PF.

That's an interesting observation! But do processors exist where either:

1) RSVD doesn't win over all other bits, assuming no race conditions

2) A/D bits can be clobbered in a page table entry that has reserved
bits set?

Running the x86/access.flat testcase from kvm-unit-tests on bare metal
suggests that all existing processors do neither of the above.

In particular, the second would be a showstopper on AMD.

Paolo

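For reference, the bits involved in these two questions, as a standalone
sketch (bit positions per the architecture manuals; the helper names are
made up for illustration):

#include <stdbool.h>
#include <stdint.h>

/* #PF error-code bits. */
#define PFERR_PRESENT_MASK (1u << 0)   /* fault on a present entry */
#define PFERR_WRITE_MASK   (1u << 1)   /* write access */
#define PFERR_USER_MASK    (1u << 2)   /* user-mode access */
#define PFERR_RSVD_MASK    (1u << 3)   /* reserved bit set in a paging entry */
#define PFERR_FETCH_MASK   (1u << 4)   /* instruction fetch */

/* Accessed/Dirty bits in a page-table entry. */
#define PTE_ACCESSED_MASK  (1ull << 5)
#define PTE_DIRTY_MASK     (1ull << 6)

/* 1) Does RSVD get reported even when W/U/F conditions also apply? */
static bool rsvd_wins(uint32_t pf_error_code)
{
        return pf_error_code & PFERR_RSVD_MASK;
}

/* 2) Are A/D bits left alone in an entry that has reserved bits set? */
static bool ad_bits_untouched(uint64_t pte_after_fault)
{
        return !(pte_after_fault & (PTE_ACCESSED_MASK | PTE_DIRTY_MASK));
}
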
2020-06-22 19:17:04

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/22/20 1:01 PM, Paolo Bonzini wrote:
> On 22/06/20 19:57, Tom Lendacky wrote:
>>>> In bare-metal, there's no guarantee a CPU will report all the faults in a
>>>> single PF error code. And because of race conditions, software can never
>>>> rely on that behavior. Whenever the OS thinks it has cured an error, it
>>>> must always be able to handle another #PF for the same access when it
>>>> retries because another processor could have modified the PTE in the
>>>> meantime.
>>> I agree, but I don't understand the relation to this patch. Can you
>>> explain?
>>
>> I guess I'm trying to understand why RSVD has to be reported to the guest
>> on a #PF (vs an NPF) when there's no guarantee that it can receive that
>> error code today even when guest MAXPHYADDR == host MAXPHYADDR. That would
>> eliminate the need to trap #PF.
>
> That's an interesting observation! But do processors exist where either:
>
> 1) RSVD doesn't win over all other bits, assuming no race conditions

There may not be any today, but, present bit aside (which is always
checked), there is no architectural statement that says every error
condition has to be checked and reported in a #PF error code. So software
can't rely on RSVD being present when there are other errors present.
That's why I'm saying I don't think trapping #PF just to check and report
RSVD should be done.

>
> 2) A/D bits can be clobbered in a page table entry that has reserved
> bits set?

There is nothing we can do about this one. The APM documents this when
using nested page tables. If the guest is using the same MAXPHYADDR as the
host, then I'm pretty sure this doesn't happen, correct? So it's only when
the guest is using something less than the host MAXPHYADDR that this occurs.

I'm not arguing against injecting a #PF with the RSVD on an NPF where it's
detected that bits are set above the guest MAXPHYADDR, just the #PF trapping.

Thanks,
Tom

>
> Running the x86/access.flat testcase from kvm-unit-tests on bare metal
> suggests that all existing processors do neither of the above.
>
> In particular, the second would be a showstopper on AMD.
>
> Paolo
>

2020-06-22 22:24:47

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 22/06/20 21:14, Tom Lendacky wrote:
>>> I guess I'm trying to understand why RSVD has to be reported to the guest
>>> on a #PF (vs an NPF) when there's no guarantee that it can receive that
>>> error code today even when guest MAXPHYADDR == host MAXPHYADDR. That would
>>> eliminate the need to trap #PF.
>>
>> That's an interesting observation! But do processors exist where either:
>>
>> 1) RSVD doesn't win over all other bits, assuming no race conditions
>
> There may not be any today, but, present bit aside (which is always
> checked), there is no architectural statement that says every error
> condition has to be checked and reported in a #PF error code. So software
> can't rely on RSVD being present when there are other errors present.
> That's why I'm saying I don't think trapping #PF just to check and report
> RSVD should be done.

Fair enough---if I could get rid of the #PF case I would be only too happy.
But I'm worried about guests being upset if they see non-RSVD page
faults for a page table entry that has one or more reserved bits set.

>> 2) A/D bits can be clobbered in a page table entry that has reserved
>> bits set?
>
> There is nothing we can do about this one. The APM documents this when
> using nested page tables.

Understood.

> If the guest is using the same MAXPHYADDR as the
> host, then I'm pretty sure this doesn't happen, correct? So it's only when
> the guest is using something less than the host MAXPHYADDR that this occurs.
> I'm not arguing against injecting a #PF with the RSVD on an NPF where it's
> detected that bits are set above the guest MAXPHYADDR, just the #PF trapping.

Got it. My question is: is there an architectural guarantee that the
dirty bit is not set if the instruction raises a page fault? (And what
about the accessed bit?).

If so, the NPF behavior makes it impossible to emulate lower MAXPHYADDR
values from the NPF vmexit handler. It would be incorrect to inject a
#PF with the RSVD error code from the NPF handler, because the guest
would not expect the dirty bits to be set in the page table entry.

Even if there's no such guarantee, I would be reluctant to break it
because software could well be expecting it.

Paolo

> Thanks,
> Tom
>
>>
>> Running the x86/access.flat testcase from kvm-unit-tests on bare metal
>> suggests that all existing processors do neither of the above.
>>
>> In particular, the second would be a showstopper on AMD.
>>
>> Paolo
>>
>

2020-06-22 23:52:56

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 6/19/20 4:07 PM, Paolo Bonzini wrote:
> On 19/06/20 23:52, Tom Lendacky wrote:
>>> A more subtle issue is when the host MAXPHYADDR is larger than that
>>> of the guest. Page faults caused by reserved bits on the guest won't
>>> cause an EPT violation/NPF and hence we also check guest MAXPHYADDR
>>> and add PFERR_RSVD_MASK error code to the page fault if needed.
>>
>> I'm probably missing something here, but I'm confused by this
>> statement. Is this for a case where a page has been marked not
>> present and the guest has also set what it believes are reserved
>> bits? Then when the page is accessed, the guest sees a page fault
>> without the error code for reserved bits?
>
> No, for non-present page there is no issue because there are no reserved
> bits in that case. If the page is present and no reserved bits are set
> according to the host, however, there are two cases to consider:
>
> - if the page is not accessible to the guest according to the
> permissions in the page table, it will cause a #PF. We need to trap it
> and change the error code into P|RSVD if the guest physical address has
> any guest-reserved bits.

You say "we need to trap it". I think this should have a clear
justification for exactly what this accomplishes and which real-world
guests it benefits. The performance implications and the exact
condition under which they apply should IMO be clearly documented in
Documentation/, in the code, or, at the very least, in the changelog. I
don't see such docs right now.

As I understand it, the problematic case is where a guest OS
intentionally creates a present PTE with reserved bits set. (I believe
that Xen does this. Linux does not.) For a guest to actually be
functional in this case, the guest needs to make sure that it is not
setting bits that are not, in fact, reserved on the CPU. This means the
guest needs to check MAXPHYADDR and do something different on different
CPUs.

Do such guests exist? As far as I know, Xen is busted on systems with
unusually large MAXPHYADDR regardless of any virtualization issues, so,
at best, this series would make Xen, running as a KVM guest, work better
on new hardware than it does running bare metal on that hardware. This
seems like an insufficient justification for a performance-eating series
like this.

And, unless I've misunderstood, this will eat performance quite badly.
Linux guests [0] (and probably many other guests), in quite a few
workloads, are fairly sensitive to the performance of ordinary
write-protect or not-present faults. Promoting these to VM exits
because you
want to check for bits above the guest's MAXPHYADDR is going to hurt.

(Also, I'm confused. Wouldn't faults like this be EPT/NPT violations,
not page faults?)

--Andy


[0] From rather out-of-date memory, Linux doesn't make as much use as
one might expect of the A bit. Instead it uses minor faults. Ouch.

2020-06-23 00:55:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/11] KVM: Support guest MAXPHYADDR < host MAXPHYADDR

On 23/06/20 01:47, Andy Lutomirski wrote:
> I believe that Xen does this. Linux does not.) For a guest to
> actually be functional in this case, the guest needs to make sure
> that it is not setting bits that are not, in fact, reserved on the
> CPU. This means the guest needs to check MAXPHYADDR and do something
> different on different CPUs.
>
> Do such guests exist?

I don't know; at least KVM does it too when EPT is disabled, though. It
tries to minimize the effect of this issue by preferring bit 51, but
this does not help if the host MAXPHYADDR is 52.

> As far as I know, Xen is busted on systems
> with unusually large MAXPHYADDR regardless of any virtualization
> issues, so, at best, this series would make Xen, running as a KVM
> guest, work better on new hardware than it does running bare metal on
> that hardware. This seems like an insufficient justification for a
> performance-eating series like this.
>
> And, unless I've misunderstood, this will eat performance quite
> badly. Linux guests [0] (and probably many other guests), in quite a
> few workloads, are fairly sensitive to the performance of ordinary
> write-protect or not-present faults. Promoting these to VM exits
> because you want to check for bits above the guest's MAXPHYADDR is
> going to hurt.

The series does indeed need benchmarking; note, however, that the vmexits
do not occur for not-present faults. QEMU sets a fixed MAXPHYADDR of 40,
but that is generally a bad idea, and several distros change it to just
use the host MAXPHYADDR instead (which would disable the new code).

> (Also, I'm confused. Wouldn't faults like this be EPT/NPT
> violations, not page faults?)

Only if the pages are actually accessible. Otherwise, W/U/F faults
would prevail over the RSVD fault. Tom is saying that there's no
architectural promise that RSVD faults prevail, either, so that would
remove the need to trap #PF.

Paolo

> --Andy
>
>
> [0] From rather out-of-date memory, Linux doesn't make as much use
> as one might expect of the A bit. Instead it uses minor faults.
> Ouch.
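
A minimal standalone sketch of the trade-off discussed in this thread
(illustrative names only, not the code from the series): the #PF intercept
is only needed while the guest's MAXPHYADDR is narrower than the host's,
so a guest configured to match the host MAXPHYADDR takes no additional
#PF vmexits.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative helper; the real series makes an equivalent per-vCPU
 * decision inside KVM rather than on plain integers. */
static bool need_pf_intercept(unsigned int guest_maxphyaddr,
                              unsigned int host_maxphyaddr)
{
        /* Only a narrower guest can have "guest-reserved" address bits
         * that the host hardware does not itself treat as reserved. */
        return guest_maxphyaddr < host_maxphyaddr;
}

int main(void)
{
        /* QEMU's 40-bit default on a 52-bit host: #PF must be trapped. */
        printf("40-bit guest, 52-bit host: %d\n", need_pf_intercept(40, 52));
        /* Guest matching the host (the common distro setup): no trapping. */
        printf("48-bit guest, 48-bit host: %d\n", need_pf_intercept(48, 48));
        return 0;
}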