2021-07-27 10:43:05

by Praveen Kumar

Subject: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

For the root partition, the VP assist pages are predetermined by the
hypervisor, and the root kernel is not allowed to move them to
different locations. The current implementation nevertheless has the
root kernel write its own page address to the VP assist page MSR,
which fails with the following stack trace:

[ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to
write 0x0000000145ac5001) at rIP: 0xffffffff810c1084
(native_write_msr+0x4/0x30)
[ 2.784867] Call Trace:
[ 2.791507] hv_cpu_init+0xf1/0x1c0
[ 2.798144] ? hyperv_report_panic+0xd0/0xd0
[ 2.804806] cpuhp_invoke_callback+0x11a/0x440
[ 2.811465] ? hv_resume+0x90/0x90
[ 2.818137] cpuhp_issue_call+0x126/0x130
[ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0
[ 2.831427] ? hyperv_report_panic+0xd0/0xd0
[ 2.838075] ? hyperv_report_panic+0xd0/0xd0
[ 2.844723] ? hv_resume+0x90/0x90
[ 2.851375] __cpuhp_setup_state+0x3d/0x90
[ 2.858030] hyperv_init+0x14e/0x410
[ 2.864689] ? enable_IR_x2apic+0x190/0x1a0
[ 2.871349] apic_intr_mode_init+0x8b/0x100
[ 2.878017] x86_late_time_init+0x20/0x30
[ 2.884675] start_kernel+0x459/0x4fb
[ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb

Since the hypervisor already provides the VP assist page for the root
partition, the root kernel needs to memremap() that page instead of
allocating its own. The mapping is done in hv_cpu_init() during CPU
bringup and is unmapped in hv_cpu_die() during teardown.

Signed-off-by: Praveen Kumar <[email protected]>
---
arch/x86/hyperv/hv_init.c | 61 +++++++++++++++++++++---------
arch/x86/include/asm/hyperv-tlfs.h | 9 +++++
2 files changed, 53 insertions(+), 17 deletions(-)

changelog:
v1: initial patch
v2: commit message changes, removal of HV_MSR_APIC_ACCESS_AVAILABLE
check and addition of null check before reading the VP assist MSR
for root partition
v3: added a new data structure for the VP ASSIST page MSR contents,
and handled the mapping in hv_cpu_init and the unmapping in hv_cpu_die

---
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 6f247e7e07eb..b859e42b4943 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -44,6 +44,7 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);

static int hv_cpu_init(unsigned int cpu)
{
+ union hv_vp_assist_msr_contents msr;
struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
int ret;

@@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
if (!hv_vp_assist_page)
return 0;

- /*
- * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
- * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
- * we always write the EOI MSR in hv_apic_eoi_write() *after* the
- * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
- * not be stopped in the case of CPU offlining and the VM will hang.
- */
- if (!*hvp) {
- *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+ if (hv_root_partition) {
+ /*
+ * For Root partition we get the hypervisor provided VP ASSIST
+ * PAGE, instead of allocating a new page.
+ */
+ rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
+
+ /* remapping to root partition address space */
+ if (!*hvp)
+ *hvp = memremap(msr.guest_physical_address <<
+ HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
+ PAGE_SIZE, MEMREMAP_WB);
+ } else {
+ /*
+ * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's
+ * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
+ * out to make sure we always write the EOI MSR in
+ * hv_apic_eoi_write() *after* the EOI optimization is disabled
+ * in hv_cpu_die(), otherwise a CPU may not be stopped in the
+ * case of CPU offlining and the VM will hang.
+ */
+ if (!*hvp)
+ *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+
}

- if (*hvp) {
- u64 val;
+ WARN_ON(!(*hvp));

- val = vmalloc_to_pfn(*hvp);
- val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
- HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
+ if (*hvp) {
+ if (!hv_root_partition)
+ msr.guest_physical_address = vmalloc_to_pfn(*hvp);

- wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
+ msr.enable = 1;
+ wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
}
-
return 0;
}

@@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)

hv_common_cpu_die(cpu);

- if (hv_vp_assist_page && hv_vp_assist_page[cpu])
+ if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

+ if (hv_root_partition) {
+ /*
+ * For Root partition the VP ASSIST page is mapped to
+ * hypervisor provided page, and thus, we unmap the
+ * page here and nullify it, so that in future we have
+ * correct page address mapped in hv_cpu_init
+ */
+ memunmap(hv_vp_assist_page[cpu]);
+ hv_vp_assist_page[cpu] = NULL;
+ }
+ }
+
if (hv_reenlightenment_cb == NULL)
return 0;

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index f1366ce609e3..2e4e87046aa7 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -288,6 +288,15 @@ union hv_x64_msr_hypercall_contents {
} __packed;
};

+union hv_vp_assist_msr_contents {
+ u64 as_uint64;
+ struct {
+ u64 enable:1;
+ u64 reserved:11;
+ u64 guest_physical_address:52;
+ } __packed;
+};
+
struct hv_reenlightenment_control {
__u64 vector:8;
__u64 reserved1:8;
--
2.25.1



2021-07-27 16:36:28

by Michael Kelley (LINUX)

Subject: RE: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

From: Praveen Kumar <[email protected]> Sent: Tuesday, July 27, 2021 3:41 AM
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 6f247e7e07eb..b859e42b4943 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -44,6 +44,7 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>
> static int hv_cpu_init(unsigned int cpu)
> {
> + union hv_vp_assist_msr_contents msr;
> struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> int ret;
>
> @@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
> if (!hv_vp_assist_page)
> return 0;
>
> - /*
> - * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> - * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
> - * we always write the EOI MSR in hv_apic_eoi_write() *after* the
> - * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
> - * not be stopped in the case of CPU offlining and the VM will hang.
> - */
> - if (!*hvp) {
> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> + if (hv_root_partition) {
> + /*
> + * For Root partition we get the hypervisor provided VP ASSIST
> + * PAGE, instead of allocating a new page.
> + */
> + rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> +
> + /* remapping to root partition address space */
> + if (!*hvp)
> + *hvp = memremap(msr.guest_physical_address <<
> + HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
> + PAGE_SIZE, MEMREMAP_WB);
> + } else {
> + /*
> + * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's
> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
> + * out to make sure we always write the EOI MSR in
> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
> + * case of CPU offlining and the VM will hang.
> + */
> + if (!*hvp)
> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> +
> }

The tests here could be reversed to eliminate some duplication. For example:

	if (!*hvp) {
		if (hv_root_partition) {
			rdmsrl(....);
			*hvp = memremap( .....);
		} else {
			*hvp = __vmalloc(....);
		}
	}


>
> - if (*hvp) {
> - u64 val;
> + WARN_ON(!(*hvp));
>
> - val = vmalloc_to_pfn(*hvp);
> - val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
> - HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
> + if (*hvp) {
> + if (!hv_root_partition)
> + msr.guest_physical_address = vmalloc_to_pfn(*hvp);
>
> - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
> + msr.enable = 1;
> + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);

This version has a substantive difference compared with previous versions
in that the "enable" bit is being set and written back to the MSR even when
running in the root partition. Is that intentional?

> }
> -
> return 0;
> }
>
> @@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)
>
> hv_common_cpu_die(cpu);
>
> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

This will set the guest_physical_address in the MSR to zero,
even in the root partition case. Is that OK? It seems inconsistent
with hv_cpu_init() where the existing guest_physical_address
in the MSR is carefully preserved for the root partition case.
Or is the intent here simply to clear the "enable" flag?

>
> + if (hv_root_partition) {
> + /*
> + * For Root partition the VP ASSIST page is mapped to
> + * hypervisor provided page, and thus, we unmap the
> + * page here and nullify it, so that in future we have
> + * correct page address mapped in hv_cpu_init
> + */
> + memunmap(hv_vp_assist_page[cpu]);
> + hv_vp_assist_page[cpu] = NULL;
> + }
> + }
> +
> if (hv_reenlightenment_cb == NULL)
> return 0;
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index f1366ce609e3..2e4e87046aa7 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -288,6 +288,15 @@ union hv_x64_msr_hypercall_contents {
> } __packed;
> };
>
> +union hv_vp_assist_msr_contents {
> + u64 as_uint64;
> + struct {
> + u64 enable:1;
> + u64 reserved:11;
> + u64 guest_physical_address:52;

This field really should be named "guest_physical_page", as
it is a page number, not an address. You've matched the
field names used in hv_x64_msr_hypercall_contents, which
is good for consistency, except that the field name is
wrong in hv_x64_msr_hypercall_contents. :-( I think
the Hyper-V TLFS originally called it a "physical address", but
the TLFS has since been fixed to describe it as a page number.
I'd suggest getting this one named correctly; fixing the field
name in hv_x64_msr_hypercall_contents is a separate cleanup
that doesn't need to be done now.
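
To illustrate the page-number-versus-address point, the sketch below
decodes the value from the failing WRMSR in the commit message
(0x0000000145ac5001). It is plain user-space C, not kernel code, and
uses explicit shifts instead of the bit-field union because C bit-field
layout is compiler-defined. The macro and helper names are hypothetical;
the layout (bit 0 enable, bits 1-11 reserved, bits 12-63 page number) is
the one in the proposed union.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical names mirroring the proposed union hv_vp_assist_msr_contents:
 * bit 0 = enable, bits 1..11 = reserved, bits 12..63 = page number.
 */
#define VP_ASSIST_ENABLE_BIT 0x1ULL
#define VP_ASSIST_PAGE_SHIFT 12

/* The upper 52 bits hold a page frame number, not an address. */
static inline uint64_t vp_assist_pfn(uint64_t msr)
{
	return msr >> VP_ASSIST_PAGE_SHIFT;
}

/* Shifting the page number back up gives the page's physical address. */
static inline uint64_t vp_assist_phys_addr(uint64_t msr)
{
	return vp_assist_pfn(msr) << VP_ASSIST_PAGE_SHIFT;
}

/* Build an MSR value from a page number, leaving the reserved bits zero. */
static inline uint64_t vp_assist_make(uint64_t pfn, int enable)
{
	assert((pfn >> 52) == 0); /* page number must fit in 52 bits */
	return (pfn << VP_ASSIST_PAGE_SHIFT) |
	       (enable ? VP_ASSIST_ENABLE_BIT : 0);
}
```

For that value, vp_assist_pfn() returns 0x145ac5 and
vp_assist_phys_addr() returns 0x145ac5000: the field stores the page
number, and the address only appears after shifting left by 12.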

> + } __packed;
> +};
> +
> struct hv_reenlightenment_control {
> __u64 vector:8;
> __u64 reserved1:8;
> --
> 2.25.1


2021-07-27 16:46:59

by Sunil Muthuswamy

Subject: RE: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

From: Praveen Kumar <[email protected]> Sent: Tuesday, July 27, 2021 3:41 AM
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 6f247e7e07eb..b859e42b4943 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -44,6 +44,7 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>
> static int hv_cpu_init(unsigned int cpu)
> {
> + union hv_vp_assist_msr_contents msr;
> struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> int ret;
>
> @@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
> if (!hv_vp_assist_page)
> return 0;

Not related to this code, but I am not sure about the usefulness of this NULL check as
we have already accessed this pointer above. If it was NULL, things would already
blow up.

>
> - /*
> - * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> - * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
> - * we always write the EOI MSR in hv_apic_eoi_write() *after* the
> - * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
> - * not be stopped in the case of CPU offlining and the VM will hang.
> - */
> - if (!*hvp) {
> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> + if (hv_root_partition) {
> + /*
> + * For Root partition we get the hypervisor provided VP ASSIST
> + * PAGE, instead of allocating a new page.
> + */
> + rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> +
> + /* remapping to root partition address space */

Better to leave out comments that are obvious from the code.

> + if (!*hvp)
> + *hvp = memremap(msr.guest_physical_address <<
> + HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
> + PAGE_SIZE, MEMREMAP_WB);
> + } else {
> + /*
> + * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's
> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
> + * out to make sure we always write the EOI MSR in
> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
> + * case of CPU offlining and the VM will hang.
> + */
> + if (!*hvp)
> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
> +
> }
>
> - if (*hvp) {
> - u64 val;
> + WARN_ON(!(*hvp));
>
> - val = vmalloc_to_pfn(*hvp);
> - val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
> - HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
> + if (*hvp) {
> + if (!hv_root_partition)
> + msr.guest_physical_address = vmalloc_to_pfn(*hvp);

It's better to move this above into the else section where we vmalloc() the page.
If you just check the page for NULL above and return if it is NULL, that should
clean up the code as well.

>
> - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
> + msr.enable = 1;

We should also set the reserved bits to 0.

> + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
> }
> -
> return 0;
> }
>
> @@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)
>
> hv_common_cpu_die(cpu);
>
> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

It's better to read the MSR, set the enable bit to 0, and write it back.

>
> + if (hv_root_partition) {
> + /*
> + * For Root partition the VP ASSIST page is mapped to
> + * hypervisor provided page, and thus, we unmap the
> + * page here and nullify it, so that in future we have
> + * correct page address mapped in hv_cpu_init
> + */
> + memunmap(hv_vp_assist_page[cpu]);
> + hv_vp_assist_page[cpu] = NULL;
> + }

For the guest case, where are we freeing the page?

> + }
> +
> if (hv_reenlightenment_cb == NULL)
> return 0;
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index f1366ce609e3..2e4e87046aa7 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -288,6 +288,15 @@ union hv_x64_msr_hypercall_contents {
> } __packed;
> };
>
> +union hv_vp_assist_msr_contents {
> + u64 as_uint64;
> + struct {
> + u64 enable:1;
> + u64 reserved:11;
> + u64 guest_physical_address:52;


This field contains the page frame number and not the physical address. It is also
better to drop the phrase 'guest' from the name as this applies to root as well.

- Sunil

2021-07-27 16:54:45

by Sunil Muthuswamy

Subject: RE: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

> >
> > - if (*hvp) {
> > - u64 val;
> > + WARN_ON(!(*hvp));
> >
> > - val = vmalloc_to_pfn(*hvp);
> > - val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
> > - HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
> > + if (*hvp) {
> > + if (!hv_root_partition)
> > + msr.guest_physical_address = vmalloc_to_pfn(*hvp);
> >
> > - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
> > + msr.enable = 1;
> > + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>
> This version has a substantive difference compared with previous versions
> in that the "enable" bit is being set and written back to the MSR even when
> running in the root partition. Is that intentional?

Yes, the hypervisor won't allow the root to change the page address. So, to
enable/disable it, the page address needs to be preserved and just the enable
bit needs to be toggled.
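
A minimal sketch of that read-modify-write, assuming the same layout as
the proposed union (bit 0 enable, bits 1-11 reserved, bits 12-63 page
number). It only computes the value to write back; in the kernel it
would sit between rdmsrl() and wrmsrl() of HV_X64_MSR_VP_ASSIST_PAGE.
The helper and macro names are hypothetical. It preserves the page
number and clears the reserved bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical masks for the VP assist page MSR layout:
 * bit 0 = enable, bits 1..11 = reserved, bits 12..63 = page number.
 */
#define VP_ASSIST_ENABLE_BIT 0x1ULL
#define VP_ASSIST_RESERVED   0xffeULL

/*
 * Compute the value to write back when toggling the enable bit.
 * The page number read from the MSR is kept unchanged (the root
 * partition may not move the page); reserved bits are cleared.
 */
static uint64_t vp_assist_set_enable(uint64_t cur, bool enable)
{
	uint64_t val = cur & ~(VP_ASSIST_RESERVED | VP_ASSIST_ENABLE_BIT);

	if (enable)
		val |= VP_ASSIST_ENABLE_BIT;
	return val;
}
```

Disabling this way keeps the page number in the MSR, unlike
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0), which zeroes it.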

- Sunil

2021-07-27 17:00:20

by Praveen Kumar

Subject: Re: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

On 27-07-2021 22:05, Michael Kelley wrote:
> From: Praveen Kumar <[email protected]> Sent: Tuesday, July 27, 2021 3:41 AM
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 6f247e7e07eb..b859e42b4943 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -44,6 +44,7 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>>
>> static int hv_cpu_init(unsigned int cpu)
>> {
>> + union hv_vp_assist_msr_contents msr;
>> struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> int ret;
>>
>> @@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
>> if (!hv_vp_assist_page)
>> return 0;
>>
>> - /*
>> - * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
>> - * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
>> - * we always write the EOI MSR in hv_apic_eoi_write() *after* the
>> - * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
>> - * not be stopped in the case of CPU offlining and the VM will hang.
>> - */
>> - if (!*hvp) {
>> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> + if (hv_root_partition) {
>> + /*
>> + * For Root partition we get the hypervisor provided VP ASSIST
>> + * PAGE, instead of allocating a new page.
>> + */
>> + rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> +
>> + /* remapping to root partition address space */
>> + if (!*hvp)
>> + *hvp = memremap(msr.guest_physical_address <<
>> + HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> + PAGE_SIZE, MEMREMAP_WB);
>> + } else {
>> + /*
>> + * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's
>> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> + * out to make sure we always write the EOI MSR in
>> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> + * case of CPU offlining and the VM will hang.
>> + */
>> + if (!*hvp)
>> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> +
>> }
>
> The tests here could be reversed to eliminate some duplication. For example:
>
> 	if (!*hvp) {
> 		if (hv_root_partition) {
> 			rdmsrl(....);
> 			*hvp = memremap( .....);
> 		} else {
> 			*hvp = __vmalloc(....);
> 		}
> 	}
>
>
Sure. Thanks.

>>
>> - if (*hvp) {
>> - u64 val;
>> + WARN_ON(!(*hvp));
>>
>> - val = vmalloc_to_pfn(*hvp);
>> - val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
>> - HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
>> + if (*hvp) {
>> + if (!hv_root_partition)
>> + msr.guest_physical_address = vmalloc_to_pfn(*hvp);
>>
>> - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
>> + msr.enable = 1;
>> + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>
> This version has a substantive difference compared with previous versions
> in that the "enable" bit is being set and written back to the MSR even when
> running in the root partition. Is that intentional?
>
Yes, we need to set the enable bit for the root partition as well.

>> }
>> -
>> return 0;
>> }
>>
>> @@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)
>>
>> hv_common_cpu_die(cpu);
>>
>> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
>> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
>
> This will set the guest_physical_address in the MSR to zero,
> even in the root partition case. Is that OK? It seems inconsistent
> with hv_cpu_init() where the existing guest_physical_address
> in the MSR is carefully preserved for the root partition case.
> Or is the intent here simply to clear the "enable" flag?
>
>>
>> + if (hv_root_partition) {
>> + /*
>> + * For Root partition the VP ASSIST page is mapped to
>> + * hypervisor provided page, and thus, we unmap the
>> + * page here and nullify it, so that in future we have
>> + * correct page address mapped in hv_cpu_init
>> + */
>> + memunmap(hv_vp_assist_page[cpu]);
>> + hv_vp_assist_page[cpu] = NULL;
>> + }
>> + }
>> +
>> if (hv_reenlightenment_cb == NULL)
>> return 0;
>>
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>> index f1366ce609e3..2e4e87046aa7 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -288,6 +288,15 @@ union hv_x64_msr_hypercall_contents {
>> } __packed;
>> };
>>
>> +union hv_vp_assist_msr_contents {
>> + u64 as_uint64;
>> + struct {
>> + u64 enable:1;
>> + u64 reserved:11;
>> + u64 guest_physical_address:52;
>
> This field really should be named "guest_physical_page", as
> it is a page number, not an address. You've matched the
> field names used in hv_x64_msr_hypercall_contents, which
> is good for consistency, except that the field name is
> wrong in hv_x64_msr_hypercall_contents. :-( I think
> the Hyper-V TLFS originally called it a "physical address", but
> the TLFS has since been fixed to describe it as a page number.
> I'd suggest getting this one named correctly; fixing the field
> name in hv_x64_msr_hypercall_contents is a separate cleanup
> that doesn't need to be done now.
>
Sure. Will do it for this new data structure.

>> + } __packed;
>> +};
>> +
>> struct hv_reenlightenment_control {
>> __u64 vector:8;
>> __u64 reserved1:8;
>> --
>> 2.25.1

Regards,

~Praveen.

2021-07-27 17:14:04

by Praveen Kumar

Subject: Re: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

On 27-07-2021 22:15, Sunil Muthuswamy wrote:
> From: Praveen Kumar <[email protected]> Sent: Tuesday, July 27, 2021 3:41 AM
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 6f247e7e07eb..b859e42b4943 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -44,6 +44,7 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>>
>> static int hv_cpu_init(unsigned int cpu)
>> {
>> + union hv_vp_assist_msr_contents msr;
>> struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> int ret;
>>
>> @@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
>> if (!hv_vp_assist_page)
>> return 0;
>
> Not related to this code, but I am not sure about the usefulness of this NULL check as
> we have already accessed this pointer above. If it was NULL, things would already
> blow up.
>
As I understand it, hvp points to the address hv_vp_assist_page + smp_processor_id(),
which is computed without dereferencing hv_vp_assist_page. So this NULL check is still
required: without it, the dereference of *hvp later in the code could fault.
Please do correct me if my understanding is wrong here. Thanks.

>>
>> - /*
>> - * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
>> - * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
>> - * we always write the EOI MSR in hv_apic_eoi_write() *after* the
>> - * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
>> - * not be stopped in the case of CPU offlining and the VM will hang.
>> - */
>> - if (!*hvp) {
>> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> + if (hv_root_partition) {
>> + /*
>> + * For Root partition we get the hypervisor provided VP ASSIST
>> + * PAGE, instead of allocating a new page.
>> + */
>> + rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> +
>> + /* remapping to root partition address space */
>
> Better to leave out comments that are obvious from the code.
>

Sure, will remove the comment here. Thanks.

>> + if (!*hvp)
>> + *hvp = memremap(msr.guest_physical_address <<
>> + HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> + PAGE_SIZE, MEMREMAP_WB);
>> + } else {
>> + /*
>> + * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's
>> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> + * out to make sure we always write the EOI MSR in
>> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> + * case of CPU offlining and the VM will hang.
>> + */
>> + if (!*hvp)
>> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> +
>> }
>>
>> - if (*hvp) {
>> - u64 val;
>> + WARN_ON(!(*hvp));
>>
>> - val = vmalloc_to_pfn(*hvp);
>> - val = (val << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) |
>> - HV_X64_MSR_VP_ASSIST_PAGE_ENABLE;
>> + if (*hvp) {
>> + if (!hv_root_partition)
>> + msr.guest_physical_address = vmalloc_to_pfn(*hvp);
>
> It's better to move this above into the else section where we vmalloc() the page.
> If you just check the page for NULL above and return if it is NULL, that should
> clean up the code as well.
>
>>

OK, I was trying to make the code cleaner, but I could have done better.
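For illustration, a minimal userspace sketch of the restructuring suggested above: each branch works out the PFN itself, and a single NULL check returns early. `vp_assist_init()` and everything prefixed `mock_` (the MSR variable and the stand-ins for `__vmalloc()`, `vmalloc_to_pfn()` and `memremap()`) are hypothetical; the kernel code would use `rdmsrl()`/`wrmsrl()` and the real allocators. The field is called `pfn`, following the review comment on the union.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
#define MOCK_PFN 0x1234u /* made-up PFN that mock_page pretends to live at */

union hv_vp_assist_msr_contents {
	uint64_t as_uint64;
	struct {
		uint64_t enable : 1;
		uint64_t reserved : 11;
		uint64_t pfn : 52;
	};
};

static uint64_t mock_msr;             /* stands in for HV_X64_MSR_VP_ASSIST_PAGE */
static unsigned char mock_page[4096]; /* stands in for the VP assist page */

/* Hypothetical stand-ins for __vmalloc(), vmalloc_to_pfn() and memremap(). */
static void *mock_alloc_zeroed_page(void) { return mock_page; }
static uint64_t mock_virt_to_pfn(void *p) { (void)p; return MOCK_PFN; }
/* PFN 0 models a failed mapping. */
static void *mock_map_pfn(uint64_t pfn) { return pfn ? mock_page : NULL; }

/* Hypothetical restructured body of hv_cpu_init(); returns 0, or -1 for -ENOMEM. */
static int vp_assist_init(bool root, void **hvp)
{
	union hv_vp_assist_msr_contents msr;

	if (root) {
		/* The hypervisor already chose the page: keep its PFN. */
		msr.as_uint64 = mock_msr; /* rdmsrl() in the kernel */
		if (!*hvp)
			*hvp = mock_map_pfn(msr.pfn);
	} else {
		/* Start from zero so the reserved bits are written as 0. */
		msr.as_uint64 = 0;
		if (!*hvp)
			*hvp = mock_alloc_zeroed_page();
		if (*hvp)
			msr.pfn = mock_virt_to_pfn(*hvp);
	}

	/* One NULL check with an early return instead of a nested block. */
	if (!*hvp)
		return -1;

	msr.enable = 1;
	mock_msr = msr.as_uint64; /* wrmsrl() in the kernel */
	return 0;
}
```

With the MSR pre-set to PFN 0x5678 (enable clear), the root path leaves 0x5678001 in the MSR; with a zeroed MSR, the guest path writes 0x1234001. Starting the guest value from zero also covers the point about writing the reserved bits as 0.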

>> - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
>> + msr.enable = 1;
>
> We should also set the reserved bits to 0.
>
>> + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> }
>> -
>> return 0;
>> }
>>
>> @@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)
>>
>> hv_common_cpu_die(cpu);
>>
>> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
>> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
>
> It's better to read the MSR, set the enable bit to 0 and write it back.
>

OK. This is probably the same code as in hv_cpu_init, just with the enable bit set to 0.
Let me try to come up with an API / macro that both can use.
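A sketch of the kind of shared helper mentioned above, assuming a mocked MSR: it reads the MSR, changes only the enable bit and writes the value back, so the page location survives teardown. `hv_vp_assist_set_enable()` and `mock_vp_assist_msr` are hypothetical names; in the kernel the read and write would be `rdmsrl()`/`wrmsrl()` on `HV_X64_MSR_VP_ASSIST_PAGE`.

```c
#include <stdbool.h>
#include <stdint.h>

union hv_vp_assist_msr_contents {
	uint64_t as_uint64;
	struct {
		uint64_t enable : 1;
		uint64_t reserved : 11;
		uint64_t pfn : 52;
	};
};

static uint64_t mock_vp_assist_msr; /* stands in for HV_X64_MSR_VP_ASSIST_PAGE */

/*
 * Hypothetical helper shared by the init and teardown paths: read-modify-write
 * that flips only the enable bit. Everything else, including the PFN, is
 * written back exactly as it was read.
 */
static void hv_vp_assist_set_enable(bool enable)
{
	union hv_vp_assist_msr_contents msr;

	msr.as_uint64 = mock_vp_assist_msr; /* rdmsrl() in the kernel */
	msr.enable = enable;
	mock_vp_assist_msr = msr.as_uint64; /* wrmsrl() in the kernel */
}
```

hv_cpu_die() could then call `hv_vp_assist_set_enable(false)` instead of `wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0)`, so the hypervisor-provided PFN stays in the MSR for the root partition.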

>>
>> + if (hv_root_partition) {
>> + /*
>> + * For Root partition the VP ASSIST page is mapped to
>> + * hypervisor provided page, and thus, we unmap the
>> + * page here and nullify it, so that in future we have
>> + * correct page address mapped in hv_cpu_init
>> + */
>> + memunmap(hv_vp_assist_page[cpu]);
>> + hv_vp_assist_page[cpu] = NULL;
>> + }
>
> For the guest case, where are we freeing the page?
>

Yes, I have observed this, and it looks like we don't want to allocate the page every time; instead we reuse the same one.
If someone on the list can provide some information or history behind this, it would give some insight. For now,
I am leaving it as is.

>> + }
>> +
>> if (hv_reenlightenment_cb == NULL)
>> return 0;
>>
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>> index f1366ce609e3..2e4e87046aa7 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -288,6 +288,15 @@ union hv_x64_msr_hypercall_contents {
>> } __packed;
>> };
>>
>> +union hv_vp_assist_msr_contents {
>> + u64 as_uint64;
>> + struct {
>> + u64 enable:1;
>> + u64 reserved:11;
>> + u64 guest_physical_address:52;
>
>
> This field contains the page frame number and not the physical address. It is also
> better to drop the phrase 'guest' from the name as this applies to root as well.
>

Ack.
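A sketch of the union with the field renamed to `pfn`, plus a hypothetical helper `vp_assist_page_phys()` that shifts the PFN by `HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT` (12) to get the physical address that `memremap()` needs. It assumes the usual GCC/Clang x86-64 bit-field layout, where the first field occupies the least significant bit.

```c
#include <stdint.h>

#define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT 12

union hv_vp_assist_msr_contents {
	uint64_t as_uint64;
	struct {
		uint64_t enable : 1;
		uint64_t reserved : 11;
		uint64_t pfn : 52; /* page frame number, not an address */
	};
};

_Static_assert(sizeof(union hv_vp_assist_msr_contents) == sizeof(uint64_t),
	       "the union must overlay exactly one 64-bit MSR value");

/* Hypothetical helper: convert the PFN field into a physical address. */
static uint64_t vp_assist_page_phys(union hv_vp_assist_msr_contents msr)
{
	return (uint64_t)msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
}
```

Decoding 0x0000000145ac5001, the value from the WRMSR fault in the commit message, gives enable = 1, reserved = 0 and pfn = 0x145ac5, i.e. a page at physical address 0x145ac5000.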

> - Sunil


Regards,

~Praveen.
>


2021-07-27 17:27:31

by Sunil Muthuswamy

Subject: RE: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

> >> static int hv_cpu_init(unsigned int cpu)
> >> {
> >> + union hv_vp_assist_msr_contents msr;
> >> struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
> >> int ret;
> >>
> >> @@ -54,27 +55,41 @@ static int hv_cpu_init(unsigned int cpu)
> >> if (!hv_vp_assist_page)
> >> return 0;
> >
> > Not related to this code, but I am not sure about the usefulness of this NULL check as
> > we have already accessed this pointer above. If it was NULL, things would already
> > blow up.
> >
> As I understand it, hvp points to "hv_vp_assist_page stack address + smp_processor_id()".
> So we are good, and this NULL check is required: when we dereference the location later in the code, it may fault.
> Please do correct me if my understanding is wrong here. Thanks.
>

'hv_vp_assist_page' comes from the heap, there is nothing on the stack there. As I
mentioned previously, if 'hv_vp_assist_page' was NULL, we would have already
crashed by now, as we are accessing it above. So, the check here is useless in my opinion.
But since this patch doesn't touch that code, it's fine to leave it as is here.
I was just pointing it out.

- Sunil

2021-07-27 17:53:13

by Wei Liu

Subject: Re: [PATCH v3] hyperv: root partition faults writing to VP ASSIST MSR PAGE

On Tue, Jul 27, 2021 at 04:10:44PM +0530, Praveen Kumar wrote:
[...]
>
> @@ -170,9 +185,21 @@ static int hv_cpu_die(unsigned int cpu)
>
> hv_common_cpu_die(cpu);
>
> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
>

The content of the MSR should be preserved; otherwise you hit the same
fault for the root kernel.

Wei.