This patch series is taken from SEV RFC series [1]. These patches do not
depend on the SEV feature and can be reviewed and merged on their own.
- Add support for additional SVM NPF error codes
- Add kvm_fast_pio_in support
- Use the hardware provided GPA instead of page walk
[1] http://marc.info/?l=linux-mm&m=147190814023863&w=2
---
Changes since v1:
- remove redundant gpa_avail check
Tom Lendacky (3):
kvm: svm: Add support for additional SVM NPF error codes
kvm: svm: Add kvm_fast_pio_in support
kvm: svm: Use the hardware provided GPA instead of page walk
arch/x86/include/asm/kvm_emulate.h | 3 ++
arch/x86/include/asm/kvm_host.h | 15 ++++++++-
arch/x86/kvm/mmu.c | 20 +++++++++++-
arch/x86/kvm/svm.c | 9 ++++-
arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++-
5 files changed, 100 insertions(+), 7 deletions(-)
--
Brijesh Singh
From: Tom Lendacky <[email protected]>
AMD hardware adds two additional bits to aid in nested page fault handling.
Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables
The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.
Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.
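For reference, with the PFERR_* bits added below, the specific error code
this patch tests for works out to (a worked value, assuming the standard
x86 present/write/user bit positions 0/1/2):

  PFERR_NESTED_GUEST_PAGE = PFERR_GUEST_PAGE_MASK | PFERR_USER_MASK |
                            PFERR_WRITE_MASK | PFERR_PRESENT_MASK
                          = (1ULL << 33) | (1 << 2) | (1 << 1) | (1 << 0)
                          = 0x200000007

Bit 33 does not fit in 32 bits, which is why the error code passed to
kvm_mmu_page_fault() is widened from u32 to u64.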
Signed-off-by: Tom Lendacky <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 ++++++++++-
arch/x86/kvm/mmu.c | 20 ++++++++++++++++++--
arch/x86/kvm/svm.c | 2 +-
3 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bdde807..da07e17 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -191,6 +191,8 @@ enum {
#define PFERR_RSVD_BIT 3
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
+#define PFERR_GUEST_FINAL_BIT 32
+#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
@@ -198,6 +200,13 @@ enum {
#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
+#define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
+#define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
+
+#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
+ PFERR_USER_MASK | \
+ PFERR_WRITE_MASK | \
+ PFERR_PRESENT_MASK)
/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
@@ -1203,7 +1212,7 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u64 error_code,
void *insn, int insn_len);
void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..f633d29 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4508,7 +4508,7 @@ static void make_mmu_pages_available(struct kvm_vcpu *vcpu)
kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
}
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_RETRY;
@@ -4527,12 +4527,28 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
return r;
}
- r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
+ r = vcpu->arch.mmu.page_fault(vcpu, cr2, lower_32_bits(error_code),
+ false);
if (r < 0)
return r;
if (!r)
return 1;
+ /*
+ * Before emulating the instruction, check if the error code
+ * was due to a RO violation while translating the guest page.
+ * This can occur when using nested virtualization with nested
+ * paging in both guests. If true, we simply unprotect the page
+ * and resume the guest.
+ *
+ * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
+ * in PFERR_NESTED_GUEST_PAGE)
+ */
+ if (error_code == PFERR_NESTED_GUEST_PAGE) {
+ kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
+ return 1;
+ }
+
if (mmio_info_in_cache(vcpu, cr2, direct))
emulation_type = 0;
emulate:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8ca1eca..4e462bb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2074,7 +2074,7 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
static int pf_interception(struct vcpu_svm *svm)
{
u64 fault_address = svm->vmcb->control.exit_info_2;
- u32 error_code;
+ u64 error_code;
int r = 1;
switch (svm->apf_reason) {
From: Tom Lendacky <[email protected]>
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.
The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.
Signed-off-by: Tom Lendacky <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 3 +++
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/svm.c | 2 ++
arch/x86/kvm/x86.c | 17 ++++++++++++++++-
4 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index e9cd7be..2d1ac09 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -344,6 +344,9 @@ struct x86_emulate_ctxt {
struct read_cache mem_read;
};
+/* String operation identifier (matches the definition in emulate.c) */
+#define CTXT_STRING_OP (1 << 13)
+
/* Repeat String Operation Prefix */
#define REPE_PREFIX 0xf3
#define REPNE_PREFIX 0xf2
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77cb3f9..fd5b1c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -668,6 +668,9 @@ struct kvm_vcpu_arch {
int pending_ioapic_eoi;
int pending_external_vector;
+
+ /* GPA available (AMD only) */
+ bool gpa_available;
};
struct kvm_lpage_info {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5e64e656..1bbd04c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4246,6 +4246,8 @@ static int handle_exit(struct kvm_vcpu *vcpu)
return 1;
}
+ vcpu->arch.gpa_available = (exit_code == SVM_EXIT_NPF);
+
return svm_exit_handlers[exit_code](svm);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c30f62dc..5002eea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4441,7 +4441,19 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
return 1;
}
- *gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+ /*
+ * If the exit was due to a NPF we may already have a GPA.
+ * If the GPA is present, use it to avoid the GVA to GPA table
+ * walk. Note, this cannot be used on string operations since
+ * string operation using rep will only have the initial GPA
+ * from when the NPF occurred.
+ */
+ if (vcpu->arch.gpa_available &&
+ !(vcpu->arch.emulate_ctxt.d & CTXT_STRING_OP))
+ *gpa = exception->address;
+ else
+ *gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access,
+ exception);
if (*gpa == UNMAPPED_GVA)
return -1;
@@ -5563,6 +5575,9 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
}
restart:
+ /* Save the faulting GPA (cr2) in the address field */
+ ctxt->exception.address = cr2;
+
r = x86_emulate_insn(ctxt);
if (r == EMULATION_INTERCEPTED)
From: Tom Lendacky <[email protected]>
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.
Signed-off-by: Tom Lendacky <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm.c | 5 +++--
arch/x86/kvm/x86.c | 43 +++++++++++++++++++++++++++++++++++++++
3 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da07e17..77cb3f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1133,6 +1133,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
struct x86_emulate_ctxt;
int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port);
+int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size, unsigned short port);
void kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
int kvm_emulate_halt(struct kvm_vcpu *vcpu);
int kvm_vcpu_halt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4e462bb..5e64e656 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2270,7 +2270,7 @@ static int io_interception(struct vcpu_svm *svm)
++svm->vcpu.stat.io_exits;
string = (io_info & SVM_IOIO_STR_MASK) != 0;
in = (io_info & SVM_IOIO_TYPE_MASK) != 0;
- if (string || in)
+ if (string)
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
port = io_info >> 16;
@@ -2278,7 +2278,8 @@ static int io_interception(struct vcpu_svm *svm)
svm->next_rip = svm->vmcb->control.exit_info_2;
skip_emulated_instruction(&svm->vcpu);
- return kvm_fast_pio_out(vcpu, size, port);
+ return in ? kvm_fast_pio_in(vcpu, size, port)
+ : kvm_fast_pio_out(vcpu, size, port);
}
static int nmi_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04c5d96..c30f62dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5638,6 +5638,49 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port)
}
EXPORT_SYMBOL_GPL(kvm_fast_pio_out);
+static int complete_fast_pio_in(struct kvm_vcpu *vcpu)
+{
+ unsigned long val;
+
+ /* We should only ever be called with arch.pio.count equal to 1 */
+ BUG_ON(vcpu->arch.pio.count != 1);
+
+ /* For size less than 4 we merge, else we zero extend */
+ val = (vcpu->arch.pio.size < 4) ? kvm_register_read(vcpu, VCPU_REGS_RAX)
+ : 0;
+
+ /*
+ * Since vcpu->arch.pio.count == 1 let emulator_pio_in_emulated perform
+ * the copy and tracing
+ */
+ emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, vcpu->arch.pio.size,
+ vcpu->arch.pio.port, &val, 1);
+ kvm_register_write(vcpu, VCPU_REGS_RAX, val);
+
+ return 1;
+}
+
+int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size, unsigned short port)
+{
+ unsigned long val;
+ int ret;
+
+ /* For size less than 4 we merge, else we zero extend */
+ val = (size < 4) ? kvm_register_read(vcpu, VCPU_REGS_RAX) : 0;
+
+ ret = emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, size, port,
+ &val, 1);
+ if (ret) {
+ kvm_register_write(vcpu, VCPU_REGS_RAX, val);
+ return ret;
+ }
+
+ vcpu->arch.complete_userspace_io = complete_fast_pio_in;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_fast_pio_in);
+
static int kvmclock_cpu_down_prep(unsigned int cpu)
{
__this_cpu_write(cpu_tsc_khz, 0);
On 23/11/2016 18:02, Brijesh Singh wrote:
> From: Tom Lendacky <[email protected]>
>
> When a guest causes a NPF which requires emulation, KVM sometimes walks
> the guest page tables to translate the GVA to a GPA. This is unnecessary
> most of the time on AMD hardware since the hardware provides the GPA in
> EXITINFO2.
>
> The only exception cases involve string operations involving rep or
> operations that use two memory locations. With rep, the GPA will only be
> the value of the initial NPF and with dual memory locations we won't know
> which memory address was translated into EXITINFO2.
>
> Signed-off-by: Tom Lendacky <[email protected]>
> Reviewed-by: Borislav Petkov <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/kvm_emulate.h | 3 +++
> arch/x86/include/asm/kvm_host.h | 3 +++
> arch/x86/kvm/svm.c | 2 ++
> arch/x86/kvm/x86.c | 17 ++++++++++++++++-
> 4 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
> index e9cd7be..2d1ac09 100644
> --- a/arch/x86/include/asm/kvm_emulate.h
> +++ b/arch/x86/include/asm/kvm_emulate.h
> @@ -344,6 +344,9 @@ struct x86_emulate_ctxt {
> struct read_cache mem_read;
> };
>
> +/* String operation identifier (matches the definition in emulate.c) */
> +#define CTXT_STRING_OP (1 << 13)
I'm not sure how I missed this in v1, so this is my fault, but it is
wrong in many ways:
- it doesn't say *what* it matches (it matches "String")
- it doesn't include a check that it actually does match the definition
- it doesn't say with what field you're supposed to use it.
- it is not clear from the code that vcpu->arch.emulate_ctxt is valid at
all in vcpu_mmio_gva_to_gpa.
So please make a non-static function in emulate.c that takes the
emulation context pointer and returns a bool.
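A minimal sketch of that kind of helper (names and placement are only a
suggestion to illustrate the request): the String flag stays private to
emulate.c and only a predicate is exported.

  /* arch/x86/include/asm/kvm_emulate.h */
  bool emulator_is_string_op(struct x86_emulate_ctxt *ctxt);

  /* arch/x86/kvm/emulate.c */
  bool emulator_is_string_op(struct x86_emulate_ctxt *ctxt)
  {
  	return ctxt->d & String;
  }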
Also, in fact, now that I look at it, this path is relatively unlikely
to ever trigger, because vcpu_match_mmio_gva probably matches first
anyway. But it _is_ cheaper, since it doesn't need to test
permission_fault, so it is indeed a worthwhile patch to have.
However, please place it straight in emulator_read_write_onepage, like
if (vcpu->arch.gpa_available &&
... test CTXT_STRING_OP ... &&
vcpu_is_mmio_gpa(vcpu, exception->address)) {
gpa = exception->address;
goto mmio;
}
where vcpu_is_mmio_gpa is essentially the APIC_DEFAULT_PHYS_BASE +
vcpu_match_mmio_gpa pair of checks, extracted from vcpu_mmio_gva_to_gpa.
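A sketch of what that helper could look like, assuming it simply carries
those two checks over from vcpu_mmio_gva_to_gpa() (the name and signature
here are only a suggestion):

  static bool vcpu_is_mmio_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
  			     gpa_t gpa, bool write)
  {
  	/* For APIC access vmexit */
  	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
  		return true;

  	if (vcpu_match_mmio_gpa(vcpu, gpa)) {
  		trace_vcpu_match_mmio(gva, gpa, write, true);
  		return true;
  	}

  	return false;
  }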
Thanks, and sorry for the bad earlier review.
Paolo
2016-11-23 12:01-0500, Brijesh Singh:
> This patch series is taken from SEV RFC series [1]. These patches do not
> depend on the SEV feature and can be reviewed and merged on their own.
>
> - Add support for additional SVM NPF error codes
> - Add kvm_fast_pio_in support
First two applied to kvm/queue, thanks.
> - Use the hardware provided GPA instead of page walk
>
> [1] http://marc.info/?l=linux-mm&m=147190814023863&w=2
On 23/11/2016 18:02, Brijesh Singh wrote:
> From: Tom Lendacky <[email protected]>
>
> When a guest causes a NPF which requires emulation, KVM sometimes walks
> the guest page tables to translate the GVA to a GPA. This is unnecessary
> most of the time on AMD hardware since the hardware provides the GPA in
> EXITINFO2.
>
> The only exception cases involve string operations involving rep or
> operations that use two memory locations. With rep, the GPA will only be
> the value of the initial NPF and with dual memory locations we won't know
> which memory address was translated into EXITINFO2.
>
> Signed-off-by: Tom Lendacky <[email protected]>
> Reviewed-by: Borislav Petkov <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
Tom, Brijesh,
I would like to confirm that you have run kvm-unit-tests with this patch?
I haven't yet tried AMD (will do before sending the pull request to Linus),
but a similar patch for Intel gives me these 4 failures on emulator.flat:
FAIL: push mem
FAIL: pop mem
FAIL: cross-page mmio read
FAIL: cross-page mmio write
The VMX patch to set gpa_available is just this:
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 25d48380c312..5d7b60d4795b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6393,6 +6393,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
/* ept page table is present? */
error_code |= (exit_qualification & 0x38) != 0;
+ vcpu->arch.gpa_available = true;
vcpu->arch.exit_qualification = exit_qualification;
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
@@ -6410,6 +6411,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
}
ret = handle_mmio_page_fault(vcpu, gpa, true);
+ vcpu->arch.gpa_available = true;
if (likely(ret == RET_MMIO_PF_EMULATE))
return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
EMULATE_DONE;
@@ -8524,6 +8526,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
u32 vectoring_info = vmx->idt_vectoring_info;
trace_kvm_exit(exit_reason, vcpu, KVM_ISA_VMX);
+ vcpu->arch.gpa_available = false;
/*
* Flush logged GPAs PML buffer, this will make dirty_bitmap more
Thanks,
Paolo
Hi Paolo,
On 12/08/2016 08:52 AM, Paolo Bonzini wrote:
>
>
> On 23/11/2016 18:02, Brijesh Singh wrote:
>> From: Tom Lendacky <[email protected]>
>>
>> When a guest causes a NPF which requires emulation, KVM sometimes walks
>> the guest page tables to translate the GVA to a GPA. This is unnecessary
>> most of the time on AMD hardware since the hardware provides the GPA in
>> EXITINFO2.
>>
>> The only exception cases involve string operations involving rep or
>> operations that use two memory locations. With rep, the GPA will only be
>> the value of the initial NPF and with dual memory locations we won't know
>> which memory address was translated into EXITINFO2.
>>
>> Signed-off-by: Tom Lendacky <[email protected]>
>> Reviewed-by: Borislav Petkov <[email protected]>
>> Signed-off-by: Brijesh Singh <[email protected]>
>
> Tom, Brijesh,
>
> I would like to confirm that you have run kvm-unit-tests with this patch?
> I haven't yet tried AMD (will do before sending the pull request to Linus),
> but a similar patch for Intel gives me these 4 failures on emulator.flat:
>
> FAIL: push mem
> FAIL: pop mem
> FAIL: cross-page mmio read
> FAIL: cross-page mmio write
>
I did not run kvm-unit-tests. However, I was able to boot a Linux
guest and run some stress tests. I will run kvm-unit-tests and let
you know soon.
-Brijesh
Hi Paolo,
On 12/08/2016 09:39 AM, Brijesh Singh wrote:
> Hi Paolo,
>
> On 12/08/2016 08:52 AM, Paolo Bonzini wrote:
>>
>>
>> On 23/11/2016 18:02, Brijesh Singh wrote:
>>> From: Tom Lendacky <[email protected]>
>>>
>>> When a guest causes a NPF which requires emulation, KVM sometimes walks
>>> the guest page tables to translate the GVA to a GPA. This is unnecessary
>>> most of the time on AMD hardware since the hardware provides the GPA in
>>> EXITINFO2.
>>>
>>> The only exception cases involve string operations involving rep or
>>> operations that use two memory locations. With rep, the GPA will only be
>>> the value of the initial NPF and with dual memory locations we won't
>>> know
>>> which memory address was translated into EXITINFO2.
>>>
>>> Signed-off-by: Tom Lendacky <[email protected]>
>>> Reviewed-by: Borislav Petkov <[email protected]>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>
>> Tom, Brijesh,
>>
>> I would like to confirm that you have run kvm-unit-tests with this patch?
>> I haven't yet tried AMD (will do before sending the pull request to
>> Linus),
>> but a similar patch for Intel gives me these 4 failures on emulator.flat:
>>
>> FAIL: push mem
>> FAIL: pop mem
>> FAIL: cross-page mmio read
>> FAIL: cross-page mmio write
>>
>
> I did not ran kvm-unit-tests. However, I was able to boot Linux
> guest and run some stress test. I will run kvm-unit-tests and let
> you know soon.
>
I am able to reproduce it on AMD HW using kvm-unit-tests. Looking at
the test, my initial thought is that "push mem" has two operands (the
memory being pushed and the stack pointer). The provided GPA could be
either one of those.
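To make the ambiguity concrete, a PUSH with a memory source performs two
memory accesses in a single instruction, so the NPF (and therefore
EXITINFO2) can refer to either one:

  /*
   * push [mem]:
   *   access 1: read the source operand from [mem]
   *   access 2: write the pushed value onto the stack at [rsp - op_size]
   *
   * The hardware reports the GPA of whichever access faulted, but the
   * emulator may end up applying that single "available" GPA to the
   * other access.
   */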
If we can detect those cases, we should not set the gpa_available on
them (like what we do with string move).
We probably haven't hit this case during guest boot. I will investigate
a bit further and provide an updated patch to handle it.
> I am able to reproduce it on AMD HW using kvm-unit-tests. Looking at
> test, the initial thought is "push mem" has two operands (the memory
> being pushed and the stack pointer). The provided GPA could be either
> one of those.
Aha, this makes sense---and it's easy to handle too, since you can just
add a flag to the decode table and extend the string op case to cover
that flag too. Detecting cross-page MMIO is more problematic, I don't
have an idea offhand of how to solve it.
> If we can detect those cases, we should not set the gpa_available on
> them (like what we do with string move).
It would forbid usage of "push/pop mem" instructions with MMIO for SEV,
right? It probably doesn't happen much in practice, but it's unfortunate.
Paolo
Hi Paolo,
On 12/09/2016 09:41 AM, Paolo Bonzini wrote:
>
>> I am able to reproduce it on AMD HW using kvm-unit-tests. Looking at
>> test, the initial thought is "push mem" has two operands (the memory
>> being pushed and the stack pointer). The provided GPA could be either
>> one of those.
>
> Aha, this makes sense---and it's easy to handle too, since you can just
> add a flag to the decode table and extend the string op case to cover
> that flag too. Detecting cross-page MMIO is more problematic, I don't
> have an idea offhand of how to solve it.
>
I added a new flag "TwoMemOp" and this seems to be passing the "push
mem" and "pop mem" tests. If you are okay with this then I can convert
it into a patch for review.
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0ea543e..c86dc1d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -171,6 +171,7 @@
#define NearBranch ((u64)1 << 52) /* Near branches */
#define No16 ((u64)1 << 53) /* No 16 bit operand */
#define IncSP ((u64)1 << 54) /* SP is incremented before ModRM calc */
+#define TwoMemOp ((u64)1 << 55) /* Instruction has two memory operand */
#define DstXacc (DstAccLo | SrcAccHi | SrcWrite)
@@ -4104,7 +4105,7 @@ static const struct opcode group1[] = {
};
static const struct opcode group1A[] = {
- I(DstMem | SrcNone | Mov | Stack | IncSP, em_pop), N, N, N, N, N, N, N,
+ I(DstMem | SrcNone | Mov | Stack | IncSP | TwoMemOp, em_pop), N, N, N, N, N, N, N,
};
static const struct opcode group2[] = {
@@ -4142,7 +4143,7 @@
I(SrcMemFAddr | ImplicitOps, em_jmp_far),
- I(SrcMem | Stack, em_push), D(Undefined),
+ I(SrcMem | Stack | TwoMemOp, em_push), D(Undefined),
static const struct opcode group6[] = {
@@ -4331,8 +4332,8 @@
I(DstReg | SrcMem | ModRM | Src2ImmByte, em_imul_3op),
- I2bvIP(DstDI | SrcDX | Mov | String | Unaligned, em_in, ins, check_perm_in), /* insb, insw/insd */
- I2bvIP(SrcSI | DstDX | String, em_out, outs, check_perm_out), /* outsb, outsw/outsd */
+ I2bvIP(DstDI | SrcDX | Mov | String | Unaligned | TwoMemOp, em_in, ins, check_perm_in), /* insb, insw/insd */
+ I2bvIP(SrcSI | DstDX | String | TwoMemOp, em_out, outs, check_perm_out), /* outsb, outsw/outsd */
/* 0x80 - 0x87 */
@@ -4360,13 +4361,13 @@ static const struct opcode opcode_table[256] = {
/* 0xA0 - 0xA7 */
I2bv(DstAcc | SrcMem | Mov | MemAbs, em_mov),
I2bv(DstMem | SrcAcc | Mov | MemAbs | PageTable, em_mov),
- I2bv(SrcSI | DstDI | Mov | String, em_mov),
- F2bv(SrcSI | DstDI | String | NoWrite, em_cmp_r),
+ I2bv(SrcSI | DstDI | Mov | String | TwoMemOp, em_mov),
+ F2bv(SrcSI | DstDI | String | NoWrite | TwoMemOp, em_cmp_r),
/* 0xA8 - 0xAF */
F2bv(DstAcc | SrcImm | NoWrite, em_test),
- I2bv(SrcAcc | DstDI | Mov | String, em_mov),
- I2bv(SrcSI | DstAcc | Mov | String, em_mov),
- F2bv(SrcAcc | DstDI | String | NoWrite, em_cmp_r),
+ I2bv(SrcAcc | DstDI | Mov | String | TwoMemOp, em_mov),
+ I2bv(SrcSI | DstAcc | Mov | String | TwoMemOp, em_mov),
+ F2bv(SrcAcc | DstDI | String | NoWrite | TwoMemOp, em_cmp_r),
/* 0xB0 - 0xB7 */
X8(I(ByteOp | DstReg | SrcImm | Mov, em_mov)),
/* 0xB8 - 0xBF */
@@ -5484,10 +5485,7 @@ void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt)
writeback_registers(ctxt);
}
-bool emulator_is_string_op(struct x86_emulate_ctxt *ctxt)
+bool emulator_is_two_memory_op(struct x86_emulate_ctxt *ctxt)
{
- if (ctxt->d & String)
- return true;
-
- return false;
+ return ctxt->d & TwoMemOp ? true : false;
}
>> If we can detect those cases, we should not set the gpa_available on
>> them (like what we do with string move).
>
> It would forbid usage of "push/pop mem" instructions with MMIO for SEV,
> right? It probably doesn't happen much in practice, but it's unfortunate.
>
As per the AMD BKDG [1] Section 2.7.1, we should not be using any of
these instructions for MMIO access; the behavior is undefined.
The question is, do we really need to add logic to detect cross-page
MMIO accesses and push/pop mem operations so that we pass
kvm-unit-tests, or should we update the unit test? Like you said,
cross-page MMIO access detection is going to be a bit tricky.
Thoughts?
[1] http://support.amd.com/TechDocs/52740_16h_Models_30h-3Fh_BKDG.pdf
On 12/12/2016 18:51, Brijesh Singh wrote:
> As per the AMD BKDG [1] Section 2.7.1, we should not be using any of
> these instruction for MMIO access, the behavior is undefined.
>
> The question is, do we really need to add logic to detect the cross-page
> MMIO accesses and push/pop mem operations so that we pass the
> kvm-unit-test or we should update the unit test? Like you said
> cross-page MMIO access detection is going to be a bit tricky.
Actually there is a nice trick you can do to support cross-page
MMIO access detection:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 37cd31645d45..754d251dc611 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4549,6 +4549,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
*/
if (vcpu->arch.gpa_available &&
!emulator_is_string_op(ctxt) &&
+ (addr & ~PAGE_MASK) == (exception->address & ~PAGE_MASK) &&
vcpu_is_mmio_gpa(vcpu, addr, exception->address, write)) {
gpa = exception->address;
goto mmio;
It fixes the testcase for push/pop with two memory ops too,
but it's not reliable, so your change for TwoMemOp is still
necessary. Feel free to include it in your patch!
Regarding the replacement of emulator_is_string_op with
emulator_is_two_memory_op, what about REP prefixes? In that
case I think that you do need to reject string ops. So the
function would have to reject all TwoMemOps, and REP-prefixed
String operations.
Paolo
Hi Paolo,
On 12/13/2016 11:09 AM, Paolo Bonzini wrote:
>
>
> On 12/12/2016 18:51, Brijesh Singh wrote:
>> As per the AMD BKDG [1] Section 2.7.1, we should not be using any of
>> these instruction for MMIO access, the behavior is undefined.
>>
>> The question is, do we really need to add logic to detect the cross-page
>> MMIO accesses and push/pop mem operations so that we pass the
>> kvm-unit-test or we should update the unit test? Like you said
>> cross-page MMIO access detection is going to be a bit tricky.
>
> Actually there is a nice trick you can do to support cross-page
> MMIO access detection:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 37cd31645d45..754d251dc611 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4549,6 +4549,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
> */
> if (vcpu->arch.gpa_available &&
> !emulator_is_string_op(ctxt) &&
> + (addr & ~PAGE_MASK) == (exception->address & ~PAGE_MASK) &&
> vcpu_is_mmio_gpa(vcpu, addr, exception->address, write)) {
> gpa = exception->address;
> goto mmio;
>
>
> It fixes the testcase for push/pop with two memory ops too,
> but it's not reliable, so your change for TwoMemOp is still
> necessary. Feel free to include it in your patch!
>
> Regarding the replacement of emulator_is_string_op with
> emulator_is_two_memory_op, what about REP prefixes? In that
> case I think that you do need to reject string ops. So the
> function would have to reject all TwoMemOps, and REP-prefixed
> String operations.
>
Since we are now going to perform multiple conditional checks before
concluding that it's safe to use the HW-provided GPA, how about adding
two functions, "emulator_is_rep_string_op" and "emulator_is_two_mem_op",
to emulator.c and using these functions inside x86.c to determine whether
it's safe to use the HW-provided GPA?
Please let me know if you are okay with this approach.
diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 777eea2..29e44cb 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -441,6 +441,7 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq);
void emulator_invalidate_register_cache(struct x86_emulate_ctxt *ctxt);
void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt);
-bool emulator_is_string_op(struct x86_emulate_ctxt *ctxt);
+bool emulator_is_rep_string_op(struct x86_emulate_ctxt *ctxt);
+bool emulator_is_two_mem_op(struct x86_emulate_ctxt *ctxt);
#endif /* _ASM_X86_KVM_X86_EMULATE_H */
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8e7d09f..16149ad 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5485,10 +5485,12 @@ void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt)
writeback_registers(ctxt);
}
-bool emulator_is_string_op(struct x86_emulate_ctxt *ctxt)
+bool emulator_is_rep_string_op(struct x86_emulate_ctxt *ctxt)
{
- if (ctxt->d & String)
- return true;
+ return ctxt->rep_prefix && (ctxt->d & String) ? true: false;
+}
- return false;
+bool emulator_is_two_mem_op(struct x86_emulate_ctxt *ctxt)
+{
+ return ctxt->d & TwoMemOp ? true : false;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 640527b..0bc814a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4548,6 +4548,12 @@ static const struct read_write_emulator_ops write_emultor = {
.write = true,
};
+static bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt)
+{
+ return emulator_is_rep_string_op(ctxt) &&
+ emulator_is_two_mem_op(ctxt) ? true : false;
+}
+
static int emulator_read_write_onepage(unsigned long addr, void *val,
unsigned int bytes,
struct x86_exception *exception,
@@ -4568,7 +4574,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
* occurred.
*/
if (vcpu->arch.gpa_available &&
- !emulator_is_string_op(ctxt) &&
+ !emulator_can_use_gpa(ctxt) &&
vcpu_is_mmio_gpa(vcpu, addr, exception->address, write) &&
(addr & ~PAGE_MASK) == (exception->address & ~PAGE_MASK)) {
On 14/12/2016 18:07, Brijesh Singh wrote:
>>
>
> Since now we are going to perform multiple conditional checks before
> concluding that its safe to use HW provided GPA. How about if we add two
> functions "emulator_is_rep_string_op" and "emulator_is_two_mem_op" into
> emulator.c and use these functions inside the x86.c to determine if its
> safe to use HW provided gpa?
Why not export only emulator_can_use_gpa from emulate.c? (So in the end
leaving emulator_is_string_op in emulate.c was the right thing to do, it
was just the test that was wrong :)).
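A sketch of such an exported helper, assuming it rejects both TwoMemOp
instructions and REP-prefixed string operations as discussed earlier:

  bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt)
  {
  	if (ctxt->d & TwoMemOp)
  		return false;

  	if (ctxt->rep_prefix && (ctxt->d & String))
  		return false;

  	return true;
  }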
The patch below is still missing the check for cross-page MMIO. Your
reference to the BKDG only covers MMCONFIG (sometimes referred to as
ECAM), not MMIO in general. Doing AND or OR into video memory for
example is perfectly legal, and I'm fairly sure that some obscure legacy
software does PUSH/POP into vram as well!
Thanks,
Paolo
On 14/12/2016 19:39, Brijesh Singh wrote:
>
> On 12/14/2016 11:23 AM, Paolo Bonzini wrote:
>>
>>
>> On 14/12/2016 18:07, Brijesh Singh wrote:
>>>>
>>>
>>> Since now we are going to perform multiple conditional checks before
>>> concluding that its safe to use HW provided GPA. How about if we add two
>>> functions "emulator_is_rep_string_op" and "emulator_is_two_mem_op" into
>>> emulator.c and use these functions inside the x86.c to determine if its
>>> safe to use HW provided gpa?
>>
>> Why not export only emulator_can_use_gpa from emulate.c? (So in the end
>> leaving emulator_is_string_op in emulate.c was the right thing to do, it
>> was just the test that was wrong :)).
>>
>
> Actually, I was not sure if putting emulator_can_use_gpa() in emulate.c
> was right thing - mainly because emulator.c does not deal with GPA. I
> will go with your advice and put it in emulator.c, it makes easy :)
>
>
>> The patch below is still missing the check for cross-page MMIO. Your
>> reference to the BKDG only covers MMCONFIG (sometimes referred to as
>> ECAM), not MMIO in general. Doing AND or OR into video memory for
>> example is perfectly legal, and I'm fairly sure that some obscure legacy
>> software does PUSH/POP into vram as well!
>>
>>
>
> I used your below code snippet to detect cross-page MMIO access. After
> applying these changes cross-page MMIO read/write unit test is passing
> just fine. I will include it in patch.
Great, thanks. I hope we can include it in 4.10.
Paolo
On 12/14/2016 11:23 AM, Paolo Bonzini wrote:
>
>
> On 14/12/2016 18:07, Brijesh Singh wrote:
>>>
>>
>> Since now we are going to perform multiple conditional checks before
>> concluding that its safe to use HW provided GPA. How about if we add two
>> functions "emulator_is_rep_string_op" and "emulator_is_two_mem_op" into
>> emulator.c and use these functions inside the x86.c to determine if its
>> safe to use HW provided gpa?
>
> Why not export only emulator_can_use_gpa from emulate.c? (So in the end
> leaving emulator_is_string_op in emulate.c was the right thing to do, it
> was just the test that was wrong :)).
>
Actually, I was not sure if putting emulator_can_use_gpa() in emulate.c
was the right thing - mainly because emulator.c does not deal with GPAs. I
will go with your advice and put it in emulator.c, it makes things easier :)
> The patch below is still missing the check for cross-page MMIO. Your
> reference to the BKDG only covers MMCONFIG (sometimes referred to as
> ECAM), not MMIO in general. Doing AND or OR into video memory for
> example is perfectly legal, and I'm fairly sure that some obscure legacy
> software does PUSH/POP into vram as well!
>
>
I used your below code snippet to detect cross-page MMIO access. After
applying these changes, the cross-page MMIO read/write unit tests pass
just fine. I will include it in the patch.
> Actually there is a nice trick you can do to support cross-page
> MMIO access detection:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 37cd31645d45..754d251dc611 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4549,6 +4549,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
*/
if (vcpu->arch.gpa_available &&
!emulator_can_use_hw_gpa(ctxt) &&
+ (addr & ~PAGE_MASK) == (exception->address & ~PAGE_MASK) &&
vcpu_is_mmio_gpa(vcpu, addr, exception->address, write)) {
gpa = exception->address;
goto mmio;
On 23/11/2016 18:01, Brijesh Singh wrote:
>
> + /*
> + * Before emulating the instruction, check if the error code
> + * was due to a RO violation while translating the guest page.
> + * This can occur when using nested virtualization with nested
> + * paging in both guests. If true, we simply unprotect the page
> + * and resume the guest.
> + *
> + * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
> + * in PFERR_NESTED_GUEST_PAGE)
> + */
> + if (error_code == PFERR_NESTED_GUEST_PAGE) {
> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
> + return 1;
> + }
What happens if L1 is mapping some memory that is read only in L0? That
is, the L1 nested page tables make it read-write, but the L0 shadow
nested page tables make it read-only.
Accessing it would cause an NPF, and then my guess is that the L1 guest
would loop on the failing instruction instead of just dropping the write.
Paolo
Hi Paolo,
On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
> On 23/11/2016 18:01, Brijesh Singh wrote:
>>
>> + /*
>> + * Before emulating the instruction, check if the error code
>> + * was due to a RO violation while translating the guest page.
>> + * This can occur when using nested virtualization with nested
>> + * paging in both guests. If true, we simply unprotect the page
>> + * and resume the guest.
>> + *
>> + * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>> + * in PFERR_NESTED_GUEST_PAGE)
>> + */
>> + if (error_code == PFERR_NESTED_GUEST_PAGE) {
>> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>> + return 1;
>> + }
>
>
> What happens if L1 is mapping some memory that is read only in L0? That
> is, the L1 nested page tables make it read-write, but the L0 shadow
> nested page tables make it read-only.
>
> Accessing it would cause an NPF, and then my guess is that the L1 guest
> would loop on the failing instruction instead of just dropping the write.
>
I am not sure I am able to follow your use case. Could you please explain it
in a bit more detail?
The purpose of the code above was really for when we resume from the L2 guest
back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
(for shadow paging) as I recall, so when we come back to the L1 guest, it can
get a fault since its page tables are not marked writeable at L0 as they need to be.
-Brijesh
On 31/07/2017 15:30, Brijesh Singh wrote:
> Hi Paolo,
>
> On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
>> On 23/11/2016 18:01, Brijesh Singh wrote:
>>> + /*
>>> + * Before emulating the instruction, check if the error code
>>> + * was due to a RO violation while translating the guest page.
>>> + * This can occur when using nested virtualization with nested
>>> + * paging in both guests. If true, we simply unprotect the page
>>> + * and resume the guest.
>>> + *
>>> + * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>>> + * in PFERR_NESTED_GUEST_PAGE)
>>> + */
>>> + if (error_code == PFERR_NESTED_GUEST_PAGE) {
>>> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>>> + return 1;
>>> + }
>>
>>
>> What happens if L1 is mapping some memory that is read only in L0? That
>> is, the L1 nested page tables make it read-write, but the L0 shadow
>> nested page tables make it read-only.
>>
>> Accessing it would cause an NPF, and then my guess is that the L1 guest
>> would loop on the failing instruction instead of just dropping the write.
>>
>
>
> I am not sure I am able to follow your use case. Could you please explain it
> in a bit more detail?
>
> The purpose of the code above was really for when we resume from the L2 guest
> back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
> (for shadow paging) as I recall, so when we come back to the L1 guest, it can
> get a fault since its page tables are not marked writeable at L0 as they
> need to be.
There can be different cases where an L0->L2 shadow nested page table is
marked read only, in particular when a page is read only in L1's nested
page tables. If such a page is accessed by L2 while walking page tables
it will cause a nested page fault (page table walks are write accesses).
However, after kvm_mmu_unprotect_page you will get another page fault,
and again in an endless stream.
Instead, emulation would have caused a nested page fault vmexit, I think.
Paolo
On 07/31/2017 10:44 AM, Paolo Bonzini wrote:
> On 31/07/2017 15:30, Brijesh Singh wrote:
>> Hi Paolo,
>>
>> On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
>>> On 23/11/2016 18:01, Brijesh Singh wrote:
>>>> + /*
>>>> + * Before emulating the instruction, check if the error code
>>>> + * was due to a RO violation while translating the guest page.
>>>> + * This can occur when using nested virtualization with nested
>>>> + * paging in both guests. If true, we simply unprotect the page
>>>> + * and resume the guest.
>>>> + *
>>>> + * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>>>> + * in PFERR_NESTED_GUEST_PAGE)
>>>> + */
>>>> + if (error_code == PFERR_NESTED_GUEST_PAGE) {
>>>> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>>>> + return 1;
>>>> + }
>>>
>>>
>>> What happens if L1 is mapping some memory that is read only in L0? That
>>> is, the L1 nested page tables make it read-write, but the L0 shadow
>>> nested page tables make it read-only.
>>>
>>> Accessing it would cause an NPF, and then my guess is that the L1 guest
>>> would loop on the failing instruction instead of just dropping the write.
>>>
>>
>>
>> Not sure if I am able to follow your use case. Could you please explain me
>> in bit detail.
>>
>> The purpose of the code above was really for when we resume from the L2 guest
>> back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
>> (for shadow paging) as I recall, so when we come back to the L1 guest, it can
>> get a fault since its page tables are not marked writeable at L0 as they
>> need to be.
>
> There can be different cases where an L0->L2 shadow nested page table is
> marked read only, in particular when a page is read only in L1's nested
> page tables. If such a page is accessed by L2 while walking page tables
> it will cause a nested page fault (page table walks are write accesses).
> However, after kvm_mmu_unprotect_page you will get another page fault,
> and again in an endless stream.
>
> Instead, emulation would have caused a nested page fault vmexit, I think.
>
If possible, could you please give me some pointers on how to create this use
case so that we can get a definitive answer.
Looking at the code path gives me the impression that the new code
(the kvm_mmu_unprotect_page call) only runs if vcpu->arch.mmu.page_fault()
indicates that the instruction should be emulated. I would not expect
that to happen in the scenario you described above, since L1 making a page
read-only (a page that is a page table for L2) is an error and should result
in a #NPF being injected into L1. It's a bit hard for me to visualize the code
flow and figure out exactly how that would happen, but I just tried booting
nested virtualization and it seems to be working okay.
Is there a kvm-unit-test which I can run to trigger this scenario? Thanks.
-Brijesh
> > There can be different cases where an L0->L2 shadow nested page table is
> > marked read only, in particular when a page is read only in L1's nested
> > page tables. If such a page is accessed by L2 while walking page tables
> > it will cause a nested page fault (page table walks are write accesses).
> > However, after kvm_mmu_unprotect_page you will get another page fault,
> > and again in an endless stream.
> >
> > Instead, emulation would have caused a nested page fault vmexit, I think.
>
> If possible, could you please give me some pointers on how to create this use
> case so that we can get a definitive answer.
>
> Looking at the code path gives me the impression that the new code
> (the kvm_mmu_unprotect_page call) only runs if vcpu->arch.mmu.page_fault()
> indicates that the instruction should be emulated. I would not
> expect that to happen in the scenario you described above, since L1 making a page
> read-only (a page that is a page table for L2) is an error and should result in a #NPF
> being injected into L1.
The flow is:
hardware walks page table; L2 page table points to read only memory
-> pf_interception (code =
-> kvm_handle_page_fault (need_unprotect = false)
-> kvm_mmu_page_fault
-> paging64_page_fault (for example)
-> try_async_pf
map_writable set to false
-> paging64_fetch(write_fault = true, map_writable = false, prefault = false)
-> mmu_set_spte(speculative = false, host_writable = false, write_fault = true)
-> set_spte
mmu_need_write_protect returns true
return true
write_fault == true -> set emulate = true
return true
return true
return true
emulate
Without this patch, emulation would have called
..._gva_to_gpa_nested
-> translate_nested_gpa
-> paging64_gva_to_gpa
-> paging64_walk_addr
-> paging64_walk_addr_generic
set fault (nested_page_fault=true)
and then:
kvm_propagate_fault
-> nested_svm_inject_npf_exit
> It's a bit hard for me to visualize the code flow and
> figure out exactly how that would happen, but I just tried booting nested
> virtualization and it seems to be working okay.
I don't expect the above to happen when booting a normal guest (usual L1
guests hardly have readonly mappings).
> Is there a kvm-unit-test which I can run to trigger this scenario? Thanks.
No, there isn't.
Paolo
> -Brijesh
>
On 07/31/2017 03:05 PM, Paolo Bonzini wrote:
>
>>> There can be different cases where an L0->L2 shadow nested page table is
>>> marked read only, in particular when a page is read only in L1's nested
>>> page tables. If such a page is accessed by L2 while walking page tables
>>> it will cause a nested page fault (page table walks are write accesses).
>>> However, after kvm_mmu_unprotect_page you will get another page fault,
>>> and again in an endless stream.
>>>
>>> Instead, emulation would have caused a nested page fault vmexit, I think.
>>
>> If possible could you please give me some pointer on how to create this use
>> case so that we can get definitive answer.
>>
>> Looking at the code path is giving me indication that the new code
>> (the kvm_mmu_unprotect_page call) only happens if vcpu->arch.mmu_page_fault()
>> returns an indication that the instruction should be emulated. I would not
>> expect that to be the case scenario you described above since L1 making a page
>> read-only (this is a page table for L2) is an error and should result in #NPF
>> being injected into L1.
>
> The flow is:
>
> hardware walks page table; L2 page table points to read only memory
> -> pf_interception (code =
> -> kvm_handle_page_fault (need_unprotect = false)
> -> kvm_mmu_page_fault
> -> paging64_page_fault (for example)
> -> try_async_pf
> map_writable set to false
> -> paging64_fetch(write_fault = true, map_writable = false, prefault = false)
> -> mmu_set_spte(speculative = false, host_writable = false, write_fault = true)
> -> set_spte
> mmu_need_write_protect returns true
> return true
> write_fault == true -> set emulate = true
> return true
> return true
> return true
> emulate
>
> Without this patch, emulation would have called
>
> ..._gva_to_gpa_nested
> -> translate_nested_gpa
> -> paging64_gva_to_gpa
> -> paging64_walk_addr
> -> paging64_walk_addr_generic
> set fault (nested_page_fault=true)
>
> and then:
>
> kvm_propagate_fault
> -> nested_svm_inject_npf_exit
>
Maybe the safer thing then would be to qualify the new error_code check with
!mmu_is_nested(vcpu) or something like that, so that it would only run for the
L1 guest and not the L2 guest. I believe that would restrict it enough to avoid
hitting this case. Are you okay with this change?
IIRC, the main place where this check was valuable was when the L1 guest had
a fault (when coming out of the L2 guest) and emulation was not needed.
-Brijesh
On 01/08/2017 15:36, Brijesh Singh wrote:
>>
>> The flow is:
>>
>> hardware walks page table; L2 page table points to read only memory
>> -> pf_interception (code =
>> -> kvm_handle_page_fault (need_unprotect = false)
>> -> kvm_mmu_page_fault
>> -> paging64_page_fault (for example)
>> -> try_async_pf
>> map_writable set to false
>> -> paging64_fetch(write_fault = true, map_writable = false,
>> prefault = false)
>> -> mmu_set_spte(speculative = false, host_writable = false,
>> write_fault = true)
>> -> set_spte
>> mmu_need_write_protect returns true
>> return true
>> write_fault == true -> set emulate = true
>> return true
>> return true
>> return true
>> emulate
>>
>> Without this patch, emulation would have called
>>
>> ..._gva_to_gpa_nested
>> -> translate_nested_gpa
>> -> paging64_gva_to_gpa
>> -> paging64_walk_addr
>> -> paging64_walk_addr_generic
>> set fault (nested_page_fault=true)
>>
>> and then:
>>
>> kvm_propagate_fault
>> -> nested_svm_inject_npf_exit
>>
>
> Maybe the safer thing then would be to qualify the new error_code check with
> !mmu_is_nested(vcpu) or something like that, so that it would only run for the
> L1 guest and not the L2 guest. I believe that would restrict it enough to avoid
> hitting this case. Are you okay with this change?
Or check "vcpu->arch.mmu.direct_map"? That would be true when not using
shadow pages.
> IIRC, the main place where this check was valuable was when L1 guest had
> a fault (when coming out of the L2 guest) and emulation was not needed.
How do I measure the effect? I tried counting the number of emulations,
and any difference from the patch was lost in noise.
Paolo
On 8/2/17 5:42 AM, Paolo Bonzini wrote:
> On 01/08/2017 15:36, Brijesh Singh wrote:
>>> The flow is:
>>>
>>> hardware walks page table; L2 page table points to read only memory
>>> -> pf_interception (code =
>>> -> kvm_handle_page_fault (need_unprotect = false)
>>> -> kvm_mmu_page_fault
>>> -> paging64_page_fault (for example)
>>> -> try_async_pf
>>> map_writable set to false
>>> -> paging64_fetch(write_fault = true, map_writable = false,
>>> prefault = false)
>>> -> mmu_set_spte(speculative = false, host_writable = false,
>>> write_fault = true)
>>> -> set_spte
>>> mmu_need_write_protect returns true
>>> return true
>>> write_fault == true -> set emulate = true
>>> return true
>>> return true
>>> return true
>>> emulate
>>>
>>> Without this patch, emulation would have called
>>>
>>> ..._gva_to_gpa_nested
>>> -> translate_nested_gpa
>>> -> paging64_gva_to_gpa
>>> -> paging64_walk_addr
>>> -> paging64_walk_addr_generic
>>> set fault (nested_page_fault=true)
>>>
>>> and then:
>>>
>>> kvm_propagate_fault
>>> -> nested_svm_inject_npf_exit
>>>
>> maybe then safer thing would be to qualify the new error_code check with
>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>> hitting this case. Are you okay with this change ?
> Or check "vcpu->arch.mmu.direct_map"? That would be true when not using
> shadow pages.
Yes that can be used.
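So the check in kvm_mmu_page_fault() would become something like this (just a
sketch of the combined condition for discussion; I still need to verify it):
	/*
	 * Only unprotect and resume when the MMU is in direct (TDP) mode.
	 * When shadowing an L2's nested page tables, let the fault go
	 * through emulation so a nested page fault can be injected into
	 * L1 instead of looping on kvm_mmu_unprotect_page.
	 */
	if (vcpu->arch.mmu.direct_map &&
	    error_code == PFERR_NESTED_GUEST_PAGE) {
		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
		return 1;
	}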
>> IIRC, the main place where this check was valuable was when L1 guest had
>> a fault (when coming out of the L2 guest) and emulation was not needed.
> How do I measure the effect? I tried counting the number of emulations,
> and any difference from the patch was lost in noise.
I think this patch is necessary for functional reasons (not just perf), because we added the other patch to look at the GPA and stop walking the guest page tables on an NPF.
The issue, I think, was that the hardware took an NPF because the page table was marked RO, and it saved the GPA in the VMCB. KVM then went on to emulate the instruction and saw that a GPA was available. But that GPA was not the GPA of the instruction it was emulating, since it was the GPA of the table-walk page that had the fault. We debugged that at the time and realized that emulating the instruction was unnecessary, so we added this new code, which fixed the functional issue and helps perf.
I don't have any data on how much it helps perf; as I recall, it was most effective when the L1 guest page tables and the L2 nested page tables were exactly the same. In that case, it avoided emulations for code that L1 executes, which I think could be as much as one emulation per 4KB code page.
On 04/08/2017 02:30, Brijesh Singh wrote:
>
>
> On 8/2/17 5:42 AM, Paolo Bonzini wrote:
>> On 01/08/2017 15:36, Brijesh Singh wrote:
>>>> The flow is:
>>>>
>>>> hardware walks page table; L2 page table points to read only memory
>>>> -> pf_interception (code =
>>>> -> kvm_handle_page_fault (need_unprotect = false)
>>>> -> kvm_mmu_page_fault
>>>> -> paging64_page_fault (for example)
>>>> -> try_async_pf
>>>> map_writable set to false
>>>> -> paging64_fetch(write_fault = true, map_writable = false,
>>>> prefault = false)
>>>> -> mmu_set_spte(speculative = false, host_writable = false,
>>>> write_fault = true)
>>>> -> set_spte
>>>> mmu_need_write_protect returns true
>>>> return true
>>>> write_fault == true -> set emulate = true
>>>> return true
>>>> return true
>>>> return true
>>>> emulate
>>>>
>>>> Without this patch, emulation would have called
>>>>
>>>> ..._gva_to_gpa_nested
>>>> -> translate_nested_gpa
>>>> -> paging64_gva_to_gpa
>>>> -> paging64_walk_addr
>>>> -> paging64_walk_addr_generic
>>>> set fault (nested_page_fault=true)
>>>>
>>>> and then:
>>>>
>>>> kvm_propagate_fault
>>>> -> nested_svm_inject_npf_exit
>>>>
>>> maybe then safer thing would be to qualify the new error_code check with
>>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>>> hitting this case. Are you okay with this change ?
>> Or check "vcpu->arch.mmu.direct_map"? That would be true when not using
>> shadow pages.
>
> Yes that can be used.
Are you going to send a patch for this?
Paolo
>>> IIRC, the main place where this check was valuable was when L1 guest had
>>> a fault (when coming out of the L2 guest) and emulation was not needed.
>> How do I measure the effect? I tried counting the number of emulations,
>> and any difference from the patch was lost in noise.
>
> I think this patch is necessary for functional reasons (not just
> perf), because we added the other patch to look at the GPA and stop
> walking the guest page tables on a NPF.
>
> The issue I think was that hardware has taken an NPF because the page
> table is marked RO, and it saves the GPA in the VMCB. KVM was then going
> and emulating the instruction and it saw that a GPA was available. But
> that GPA was not the GPA of the instruction it is emulating, since it
> was the GPA of the tablewalk page that had the fault. It was debugged
> that at the time and realized that emulating the instruction was
> unnecessary so we added this new code in there which fixed the
> functional issue and helps perf.
>
> I don't have any data on how much perf, as I recall it was most
> effective when the L1 guest page tables and L2 nested page tables were
> exactly the same. In that case, it avoided emulations for code that L1
> executes which I think could be as much as one emulation per 4kb code page.
>
Hi Paolo,
On 08/04/2017 09:05 AM, Paolo Bonzini wrote:
> On 04/08/2017 02:30, Brijesh Singh wrote:
>>
>>
>> On 8/2/17 5:42 AM, Paolo Bonzini wrote:
>>> On 01/08/2017 15:36, Brijesh Singh wrote:
>>>>> The flow is:
>>>>>
>>>>> hardware walks page table; L2 page table points to read only memory
>>>>> -> pf_interception (code =
>>>>> -> kvm_handle_page_fault (need_unprotect = false)
>>>>> -> kvm_mmu_page_fault
>>>>> -> paging64_page_fault (for example)
>>>>> -> try_async_pf
>>>>> map_writable set to false
>>>>> -> paging64_fetch(write_fault = true, map_writable = false,
>>>>> prefault = false)
>>>>> -> mmu_set_spte(speculative = false, host_writable = false,
>>>>> write_fault = true)
>>>>> -> set_spte
>>>>> mmu_need_write_protect returns true
>>>>> return true
>>>>> write_fault == true -> set emulate = true
>>>>> return true
>>>>> return true
>>>>> return true
>>>>> emulate
>>>>>
>>>>> Without this patch, emulation would have called
>>>>>
>>>>> ..._gva_to_gpa_nested
>>>>> -> translate_nested_gpa
>>>>> -> paging64_gva_to_gpa
>>>>> -> paging64_walk_addr
>>>>> -> paging64_walk_addr_generic
>>>>> set fault (nested_page_fault=true)
>>>>>
>>>>> and then:
>>>>>
>>>>> kvm_propagate_fault
>>>>> -> nested_svm_inject_npf_exit
>>>>>
>>>> maybe then safer thing would be to qualify the new error_code check with
>>>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>>>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>>>> hitting this case. Are you okay with this change ?
>>> Or check "vcpu->arch.mmu.direct_map"? That would be true when not using
>>> shadow pages.
>>
>> Yes that can be used.
>
> Are you going to send a patch for this?
>
Yes. I should be posting it by Monday or Tuesday - I need some time to verify it.
-Brijesh