2020-09-03 14:57:51

by Mohammed Gamal

[permalink] [raw]
Subject: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.

Since smaller physical address spaces are only supported on VMX, the parameter
is only exposed in the kvm_intel module.
Modifications to VMX page fault and EPT violation handling will depend on whether
that parameter is enabled.

Also disable support by default, and let the user decide if they want to enable
it.

Signed-off-by: Mohammed Gamal <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
arch/x86/kvm/vmx/vmx.h | 3 +++
arch/x86/kvm/x86.c | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 819c185adf09..dc778c7b5a06 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
#endif

+extern bool __read_mostly allow_smaller_maxphyaddr;
+module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
+
#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
#define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
#define KVM_VM_CR0_ALWAYS_ON \
@@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)

if (is_page_fault(intr_info)) {
cr2 = vmx_get_exit_qual(vcpu);
- if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
+ if (enable_ept && !vcpu->arch.apf.host_apf_flags
+ && allow_smaller_maxphyaddr) {
/*
* EPT will cause page fault only if we need to
* detect illegal GPAs.
@@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
* would also use advanced VM-exit information for EPT violations to
* reconstruct the page fault error code.
*/
- if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
+ if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
return kvm_emulate_instruction(vcpu, 0);

return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
@@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
#endif
vmx_check_vmcs12_offsets();

- /*
- * Intel processors don't have problems with
- * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
- * it for VMX by default
- */
- allow_smaller_maxphyaddr = true;
-
return 0;
}
module_init(vmx_init);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 26175a4759fa..b859435efa2e 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)

static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
{
+ if (!allow_smaller_maxphyaddr)
+ return false;
+
return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d39d6cf1d473..982f1d73a884 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);

-bool __read_mostly allow_smaller_maxphyaddr;
+bool __read_mostly allow_smaller_maxphyaddr = 0;
EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);

static u64 __read_mostly host_xss;
--
2.26.2


2020-09-03 17:59:42

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <[email protected]>

I think a smaller guest physical address width *should* be allowed.
However, perhaps the pedantic adherence to the architectural
specification could be turned on or off per-VM? And, if we're going to
be pedantic, I think we should go all the way and get MOV-to-CR3
correct.

Does the typical guest care about whether or not setting any of the
bits 51:46 in a PFN results in a fault?

2020-09-03 18:07:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 03/09/20 19:57, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]> wrote:
>> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>>
>> Since smaller physical address spaces are only supported on VMX, the parameter
>> is only exposed in the kvm_intel module.
>> Modifications to VMX page fault and EPT violation handling will depend on whether
>> that parameter is enabled.
>>
>> Also disable support by default, and let the user decide if they want to enable
>> it.
>
> I think a smaller guest physical address width *should* be allowed.
> However, perhaps the pedantic adherence to the architectural
> specification could be turned on or off per-VM? And, if we're going to
> be pedantic, I think we should go all the way and get MOV-to-CR3
> correct.

That would be way too slow. Even the current trapping of present #PF
can introduce some slowdown depending on the workload.

> Does the typical guest care about whether or not setting any of the
> bits 51:46 in a PFN results in a fault?

At least KVM with shadow pages does, which is a bit niche but it shows
that you cannot really rely on no one doing it. As you guessed, the
main usage of the feature is for machines with 5-level page tables where
there are no reserved bits; emulating smaller MAXPHYADDR allows
migrating VMs from 4-level page-table hosts.

Enabling per-VM would not be particularly useful IMO because if you want
to disable this code you can just set host MAXPHYADDR = guest
MAXPHYADDR, which should be the common case unless you want to do that
kind of Skylake to Icelake (or similar) migration.

Paolo

2020-09-03 18:36:50

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Thu, Sep 3, 2020 at 11:03 AM Paolo Bonzini <[email protected]> wrote:
>
> On 03/09/20 19:57, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]> wrote:
> >> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
> >>
> >> Since smaller physical address spaces are only supported on VMX, the parameter
> >> is only exposed in the kvm_intel module.
> >> Modifications to VMX page fault and EPT violation handling will depend on whether
> >> that parameter is enabled.
> >>
> >> Also disable support by default, and let the user decide if they want to enable
> >> it.
> >
> > I think a smaller guest physical address width *should* be allowed.
> > However, perhaps the pedantic adherence to the architectural
> > specification could be turned on or off per-VM? And, if we're going to
> > be pedantic, I think we should go all the way and get MOV-to-CR3
> > correct.
>
> That would be way too slow. Even the current trapping of present #PF
> can introduce some slowdown depending on the workload.

Yes, I was concerned about that...which is why I would not want to
enable pedantic mode. But if you're going to be pedantic, why go
halfway?

> > Does the typical guest care about whether or not setting any of the
> > bits 51:46 in a PFN results in a fault?
>
> At least KVM with shadow pages does, which is a bit niche but it shows
> that you cannot really rely on no one doing it. As you guessed, the
> main usage of the feature is for machines with 5-level page tables where
> there are no reserved bits; emulating smaller MAXPHYADDR allows
> migrating VMs from 4-level page-table hosts.
>
> Enabling per-VM would not be particularly useful IMO because if you want
> to disable this code you can just set host MAXPHYADDR = guest
> MAXPHYADDR, which should be the common case unless you want to do that
> kind of Skylake to Icelake (or similar) migration.

I expect that it will be quite common to run 46-bit wide legacy VMs on
Ice Lake hardware, as Ice Lake machines start showing up in
heterogeneous data centers.

2020-09-03 20:06:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 03/09/20 20:32, Jim Mattson wrote:
>> [Checking writes to CR3] would be way too slow. Even the current
>> trapping of present #PF can introduce some slowdown depending on the
>> workload.
>
> Yes, I was concerned about that...which is why I would not want to
> enable pedantic mode. But if you're going to be pedantic, why go
> halfway?

Because I am not sure about any guest, even KVM, caring about setting
bits 51:46 in CR3.

>>> Does the typical guest care about whether or not setting any of the
>>> bits 51:46 in a PFN results in a fault?
>>
>> At least KVM with shadow pages does, which is a bit niche but it shows
>> that you cannot really rely on no one doing it. As you guessed, the
>> main usage of the feature is for machines with 5-level page tables where
>> there are no reserved bits; emulating smaller MAXPHYADDR allows
>> migrating VMs from 4-level page-table hosts.
>>
>> Enabling per-VM would not be particularly useful IMO because if you want
>> to disable this code you can just set host MAXPHYADDR = guest
>> MAXPHYADDR, which should be the common case unless you want to do that
>> kind of Skylake to Icelake (or similar) migration.
>
> I expect that it will be quite common to run 46-bit wide legacy VMs on
> Ice Lake hardware, as Ice Lake machines start showing up in
> heterogeneous data centers.

If you'll be okay with running _all_ 46-bit wide legacy VMs without
MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
with running _only_ 46-bit wide VMs without emulation, you still don't
need special enabling per-VM beyond the automatic one based on
CPUID[0x8000_0008]. Do you think you'll need to enable it for some
special 46-bit VMs?

Paolo

2020-09-03 21:28:16

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Thu, Sep 3, 2020 at 1:02 PM Paolo Bonzini <[email protected]> wrote:
>
> On 03/09/20 20:32, Jim Mattson wrote:
> >> [Checking writes to CR3] would be way too slow. Even the current
> >> trapping of present #PF can introduce some slowdown depending on the
> >> workload.
> >
> > Yes, I was concerned about that...which is why I would not want to
> > enable pedantic mode. But if you're going to be pedantic, why go
> > halfway?
>
> Because I am not sure about any guest, even KVM, caring about setting
> bits 51:46 in CR3.
>
> >>> Does the typical guest care about whether or not setting any of the
> >>> bits 51:46 in a PFN results in a fault?
> >>
> >> At least KVM with shadow pages does, which is a bit niche but it shows
> >> that you cannot really rely on no one doing it. As you guessed, the
> >> main usage of the feature is for machines with 5-level page tables where
> >> there are no reserved bits; emulating smaller MAXPHYADDR allows
> >> migrating VMs from 4-level page-table hosts.
> >>
> >> Enabling per-VM would not be particularly useful IMO because if you want
> >> to disable this code you can just set host MAXPHYADDR = guest
> >> MAXPHYADDR, which should be the common case unless you want to do that
> >> kind of Skylake to Icelake (or similar) migration.
> >
> > I expect that it will be quite common to run 46-bit wide legacy VMs on
> > Ice Lake hardware, as Ice Lake machines start showing up in
> > heterogeneous data centers.
>
> If you'll be okay with running _all_ 46-bit wide legacy VMs without
> MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
> with running _only_ 46-bit wide VMs without emulation, you still don't
> need special enabling per-VM beyond the automatic one based on
> CPUID[0x8000_0008]. Do you think you'll need to enable it for some
> special 46-bit VMs?

Yes. From what you've said above, we would only want to enable this
for the niche case of 46-bit KVM guests using shadow paging. I would
expect that to be a very small number of VMs. :-)

2020-09-23 13:48:03

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 03/09/20 16:11, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;

This needs to return true if !enable_ept.

if (!enable_ept)
return true;

return allow_smaller_maxphyaddr &&
cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;

Fixed and queued, thanks.

Paolo

> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;
>

2020-09-23 15:21:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 23/09/20 16:32, Jim Mattson wrote:
> You don’t buy my argument that this should be per-VM, then? I’d bet that
> we have very close to 0 customers who care about emulating the reserved
> bits properly for 46-but wide VMs, but there may be someone out there
> using shadow paging in a nested kvm. 
>

I do buy it but I prefer not to have a released version of Linux where
this is enabled by default. I procrastinated hoping to cobble something
together but I didn't have time.

Paolo

2020-09-28 15:36:29

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on
> whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to
> enable
> it.
>
> Signed-off-by: Mohammed Gamal <[email protected]>

Running a simple SR-IOV on Intel will trigger the warning below:

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config

P.S.: I did confirm the linux-next included this hunk as well:
https://lore.kernel.org/kvm/[email protected]/


[ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.763312][ T7441] Modules linked in: loop nls_ascii nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ses enclosure efivarfs ip_tables x_tables sd_mod nvme tg3 firmware_class smartpqi nvme_core libphy scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1119.786660][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1119.796572][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1119.805870][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.812788][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1119.832384][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1119.838353][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1119.846247][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1119.854124][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1119.862012][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1119.869903][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1119.877795][ T7441] FS: 00007fc3277fe700(0000) GS:ffff88901f640000(0000) knlGS:0000000000000000
[ 1119.886649][ T7441] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1119.893127][ T7441] CR2: 0000000000000000 CR3: 0000000ab7376002 CR4: 00000000007726e0
[ 1119.901022][ T7441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1119.908916][ T7441] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1119.916806][ T7441] PKRU: 55555554
[ 1119.920226][ T7441] Call Trace:
[ 1119.923426][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1119.928877][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.934335][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1119.941029][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1119.947177][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1119.951822][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.957272][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1119.962441][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1119.968325][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1119.973433][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1119.979776][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.985945][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.991924][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1119.997047][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.002305][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.006968][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.011533][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.016283][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.021714][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.026287][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.030933][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.036034][ T7441] ? up_read+0x1a3/0x730
[ 1120.040158][ T7441] ? down_read_nested+0x420/0x420
[ 1120.045080][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.050862][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.056311][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.061743][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.067629][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.073410][ T7441] do_syscall_64+0x33/0x40
[ 1120.077723][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.083505][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.087813][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1120.174345][ T7441] Call Trace:
[ 1120.177509][ T7441] dump_stack+0x99/0xcb
[ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
[ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.192157][ T7441] report_bug+0x1af/0x260
[ 1120.196367][ T7441] handle_bug+0x44/0x80
[ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
[ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
[ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
[ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.400088][ T7441] ? up_read+0x1a3/0x730
[ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
[ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.437410][ T7441] do_syscall_64+0x33/0x40
[ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.519104][ T7441] irq event stamp: 4321673
[ 1120.523400][ T7441] hardirqs last enabled at (4321681): [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
[ 1120.532948][ T7441] hardirqs last disabled at (4321690): [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
[ 1120.542495][ T7441] softirqs last enabled at (4321672): [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
[ 1120.551864][ T7441] softirqs last disabled at (4321665): [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
[ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---

> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) &&
> allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) <
> boot_cpu_data.x86_phys_bits;
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;

2020-09-29 12:03:04

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Mon, 2020-09-28 at 11:34 -0400, Qian Cai wrote:
> On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> >
> > Since smaller physical address spaces are only supported on VMX, the
> > parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will depend on
> > whether
> > that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they want to
> > enable
> > it.
> >
> > Signed-off-by: Mohammed Gamal <[email protected]>
>
> Running a simple SR-IOV on Intel will trigger the warning below:
>
> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
>
> P.S.: I did confirm the linux-next included this hunk as well:
> https://lore.kernel.org/kvm/[email protected]/
>
>
> [ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at
> arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]

That is:

WARN_ON_ONCE(!allow_smaller_maxphyaddr);

I noticed the origin patch did not have this WARN_ON_ONCE(), but the mainline
commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
support user-configurable") does have it for some reasons.

Paolo, any idea?

> [ 1119.763312][ T7441] Modules linked in: loop nls_ascii nls_cp437 vfat fat
> kvm_intel kvm irqbypass efivars ses enclosure efivarfs ip_tables x_tables
> sd_mod nvme tg3 firmware_class smartpqi nvme_core libphy scsi_transport_sas
> dm_mirror dm_region_hash dm_log dm_mod
> [ 1119.786660][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted:
> G I 5.9.0-rc7-next-20200928+ #2
> [ 1119.796572][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560
> Gen10, BIOS U34 11/13/2019
> [ 1119.805870][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1119.812788][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00
> 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8
> <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
> [ 1119.832384][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
> [ 1119.838353][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX:
> 0000000000000000
> [ 1119.846247][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffffffc2b98940
> [ 1119.854124][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09:
> ffffed11d1fcd071
> [ 1119.862012][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12:
> ffff8888f6129000
> [ 1119.869903][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15:
> 0000000000000000
> [ 1119.877795][ T7441] FS: 00007fc3277fe700(0000) GS:ffff88901f640000(0000)
> knlGS:0000000000000000
> [ 1119.886649][ T7441] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1119.893127][ T7441] CR2: 0000000000000000 CR3: 0000000ab7376002 CR4:
> 00000000007726e0
> [ 1119.901022][ T7441] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 1119.908916][ T7441] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 1119.916806][ T7441] PKRU: 55555554
> [ 1119.920226][ T7441] Call Trace:
> [ 1119.923426][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
> [ 1119.928877][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1119.934335][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
> [ 1119.941029][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
> [ 1119.947177][ T7441] ? rcu_read_unlock+0x40/0x40
> [ 1119.951822][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1119.957272][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 1119.962441][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1119.968325][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
> [ 1119.973433][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
> [ 1119.979776][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1119.985945][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1119.991924][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
> [ 1119.997047][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
> [ 1120.002305][ T7441] ? find_held_lock+0x33/0x1c0
> [ 1120.006968][ T7441] ? __fget_files+0x1a4/0x2e0
> [ 1120.011533][ T7441] ? lock_downgrade+0x730/0x730
> [ 1120.016283][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
> [ 1120.021714][ T7441] ? __fget_files+0x1c3/0x2e0
> [ 1120.026287][ T7441] __x64_sys_ioctl+0x315/0xfc0
> [ 1120.030933][ T7441] ? generic_block_fiemap+0x60/0x60
> [ 1120.036034][ T7441] ? up_read+0x1a3/0x730
> [ 1120.040158][ T7441] ? down_read_nested+0x420/0x420
> [ 1120.045080][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
> [ 1120.050862][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.056311][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.061743][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.067629][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
> [ 1120.073410][ T7441] do_syscall_64+0x33/0x40
> [ 1120.077723][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1120.083505][ T7441] RIP: 0033:0x7fc3368c687b
> [ 1120.087813][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted:
> G I 5.9.0-rc7-next-20200928+ #2
> [ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560
> Gen10, BIOS U34 11/13/2019
> [ 1120.174345][ T7441] Call Trace:
> [ 1120.177509][ T7441] dump_stack+0x99/0xcb
> [ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
> [ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.192157][ T7441] report_bug+0x1af/0x260
> [ 1120.196367][ T7441] handle_bug+0x44/0x80
> [ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
> [ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
> [ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00
> 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8
> <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
> [ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
> [ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX:
> 0000000000000000
> [ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffffffc2b98940
> [ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09:
> ffffed11d1fcd071
> [ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12:
> ffff8888f6129000
> [ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15:
> 0000000000000000
> [ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
> [ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
> [ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
> [ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
> [ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
> [ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
> [ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
> [ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
> [ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
> [ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
> [ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
> [ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
> [ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
> [ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
> [ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
> [ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
> [ 1120.400088][ T7441] ? up_read+0x1a3/0x730
> [ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
> [ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
> [ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
> [ 1120.437410][ T7441] do_syscall_64+0x33/0x40
> [ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
> [ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.519104][ T7441] irq event stamp: 4321673
> [ 1120.523400][ T7441] hardirqs last enabled at (4321681):
> [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
> [ 1120.532948][ T7441] hardirqs last disabled at (4321690):
> [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
> [ 1120.542495][ T7441] softirqs last enabled at (4321672):
> [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
> [ 1120.551864][ T7441] softirqs last disabled at (4321665):
> [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
> [ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---
>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > arch/x86/kvm/vmx/vmx.h | 3 +++
> > arch/x86/kvm/x86.c | 2 +-
> > 3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> > module_param_named(preemption_timer, enable_preemption_timer, bool,
> > S_IRUGO);
> > #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> > +
> > #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> > #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> > #define KVM_VM_CR0_ALWAYS_ON \
> > @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> >
> > if (is_page_fault(intr_info)) {
> > cr2 = vmx_get_exit_qual(vcpu);
> > - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> > + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> > + && allow_smaller_maxphyaddr) {
> > /*
> > * EPT will cause page fault only if we need to
> > * detect illegal GPAs.
> > @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > * would also use advanced VM-exit information for EPT violations to
> > * reconstruct the page fault error code.
> > */
> > - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> > + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) &&
> > allow_smaller_maxphyaddr)
> > return kvm_emulate_instruction(vcpu, 0);
> >
> > return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> > #endif
> > vmx_check_vmcs12_offsets();
> >
> > - /*
> > - * Intel processors don't have problems with
> > - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> > - * it for VMX by default
> > - */
> > - allow_smaller_maxphyaddr = true;
> > -
> > return 0;
> > }
> > module_init(vmx_init);
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 26175a4759fa..b859435efa2e 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
> >
> > static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> > {
> > + if (!allow_smaller_maxphyaddr)
> > + return false;
> > +
> > return !enable_ept || cpuid_maxphyaddr(vcpu) <
> > boot_cpu_data.x86_phys_bits;
> > }
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d39d6cf1d473..982f1d73a884 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> > u64 __read_mostly host_efer;
> > EXPORT_SYMBOL_GPL(host_efer);
> >
> > -bool __read_mostly allow_smaller_maxphyaddr;
> > +bool __read_mostly allow_smaller_maxphyaddr = 0;
> > EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
> >
> > static u64 __read_mostly host_xss;

2020-09-29 12:27:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 29/09/20 13:59, Qian Cai wrote:
>
> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>
> I noticed the origin patch did not have this WARN_ON_ONCE(), but the mainline
> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
> support user-configurable") does have it for some reasons.

Because that part of the code should not be reached. The exception
bitmap is set up with

if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);

where

static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
{
if (!enable_ept)
return true;

return allow_smaller_maxphyaddr &&
cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
}

We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
genuine; I've sent a patch.

Paolo

2020-09-29 13:42:25

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> On 29/09/20 13:59, Qian Cai wrote:
> > WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >
> > I noticed the origin patch did not have this WARN_ON_ONCE(), but the
> > mainline
> > commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> > space
> > support user-configurable") does have it for some reasons.
>
> Because that part of the code should not be reached. The exception
> bitmap is set up with
>
> if (!vmx_need_pf_intercept(vcpu))
> eb &= ~(1u << PF_VECTOR);
>
> where
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> if (!enable_ept)
> return true;
>
> return allow_smaller_maxphyaddr &&
> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> }
>
> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> genuine; I've sent a patch.

Care to provide a link to the patch? Just curious.

2020-09-29 14:51:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 29/09/20 15:39, Qian Cai wrote:
> On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
>> On 29/09/20 13:59, Qian Cai wrote:
>>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>>>
>>> I noticed the origin patch did not have this WARN_ON_ONCE(), but the
>>> mainline
>>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
>>> space
>>> support user-configurable") does have it for some reasons.
>>
>> Because that part of the code should not be reached. The exception
>> bitmap is set up with
>>
>> if (!vmx_need_pf_intercept(vcpu))
>> eb &= ~(1u << PF_VECTOR);
>>
>> where
>>
>> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
>> {
>> if (!enable_ept)
>> return true;
>>
>> return allow_smaller_maxphyaddr &&
>> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
>> }
>>
>> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
>> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
>> genuine; I've sent a patch.
>
> Care to provide a link to the patch? Just curious.
>

Ok, I haven't sent it yet. :) But here it is:

commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
Author: Paolo Bonzini <[email protected]>
Date: Tue Sep 29 08:31:32 2020 -0400

KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept

The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
the #PF intercept bit in the exception bitmap when they do not match.
This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
hypervisor can get a vmexit for #PF exceptions even when the
corresponding bit is clear in the exception bitmap.

This is unexpected and is promptly reported as a WARN_ON_ONCE.
To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).

Reported-by: Qian Cai <[email protected]>>
Signed-off-by: Paolo Bonzini <[email protected]>

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0384e93548a..f4e9c310032a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
*/
if (is_guest_mode(vcpu))
eb |= get_vmcs12(vcpu)->exception_bitmap;
+ else {
+ /*
+ * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+ * between guest and host. In that case we only care about present
+ * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
+ * prepare_vmcs02_rare.
+ */
+ bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
+ int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
+ }

vmcs_write32(EXCEPTION_BITMAP, eb);
}
@@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
-
- /*
- * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
- * between guest and host. In that case we only care about present
- * faults.
- */
- if (enable_ept) {
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
- }
}

static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

2020-10-02 17:32:25

by Naresh Kamboju

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

Hi Paolo,

Thanks for the patch.

On Tue, 29 Sep 2020 at 20:17, Paolo Bonzini <[email protected]> wrote:
>
> On 29/09/20 15:39, Qian Cai wrote:
> > On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> >> On 29/09/20 13:59, Qian Cai wrote:
> >>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >>>
> >>> I noticed the origin patch did not have this WARN_ON_ONCE(), but the
> >>> mainline
> >>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> >>> space
> >>> support user-configurable") does have it for some reasons.
> >>
> >> Because that part of the code should not be reached. The exception
> >> bitmap is set up with
> >>
> >> if (!vmx_need_pf_intercept(vcpu))
> >> eb &= ~(1u << PF_VECTOR);
> >>
> >> where
> >>
> >> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> >> {
> >> if (!enable_ept)
> >> return true;
> >>
> >> return allow_smaller_maxphyaddr &&
> >> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> >> }
> >>
> >> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> >> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> >> genuine; I've sent a patch.
> >
> > Care to provide a link to the patch? Just curious.
> >
>
> Ok, I haven't sent it yet. :) But here it is:
>
> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
> Author: Paolo Bonzini <[email protected]>
> Date: Tue Sep 29 08:31:32 2020 -0400
>
> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>
> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
> the #PF intercept bit in the exception bitmap when they do not match.
> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
> hypervisor can get a vmexit for #PF exceptions even when the
> corresponding bit is clear in the exception bitmap.
>
> This is unexpected and is promptly reported as a WARN_ON_ONCE.
> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).

I have tested this patch on an x86_64 machine and the reported issue is gone.

>
> Reported-by: Qian Cai <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>

Reported-by: Naresh Kamboju <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>

>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f0384e93548a..f4e9c310032a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
> */
> if (is_guest_mode(vcpu))
> eb |= get_vmcs12(vcpu)->exception_bitmap;
> + else {
> + /*
> + * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> + * between guest and host. In that case we only care about present
> + * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
> + * prepare_vmcs02_rare.
> + */
> + bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
> + int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
> + }
>
> vmcs_write32(EXCEPTION_BITMAP, eb);
> }
> @@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> vmx->pt_desc.guest.output_mask = 0x7F;
> vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
> }
> -
> - /*
> - * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> - * between guest and host. In that case we only care about present
> - * faults.
> - */
> - if (enable_ept) {
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
> - }
> }
>
> static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>

test log link
https://lkft.validation.linaro.org/scheduler/job/1813223

- Naresh

2020-10-02 17:32:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On 02/10/20 19:28, Naresh Kamboju wrote:
>>
>> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
>> Author: Paolo Bonzini <[email protected]>
>> Date: Tue Sep 29 08:31:32 2020 -0400
>>
>> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>>
>> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
>> the #PF intercept bit in the exception bitmap when they do not match.
>> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
>> hypervisor can get a vmexit for #PF exceptions even when the
>> corresponding bit is clear in the exception bitmap.
>>
>> This is unexpected and is promptly reported as a WARN_ON_ONCE.
>> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
>> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
> I have tested this patch on an x86_64 machine and the reported issue is gone.
>

Thanks, the issue with my disk is gone too so it'll get to Linus in time
for rc8.

Paolo

2021-01-16 00:11:19

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;

Since this variable is in the kvm module rather than the kvm_intel
module, its current setting is preserved across "rmmod kvm_intel;
modprobe kvm_intel." That is, if set to true, it doesn't revert to
false after "rmmod kvm_intel." Is that the intended behavior?

2021-01-18 10:53:03

by Mohammed Gamal

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]>
> wrote:
> >
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> >
> > Since smaller physical address spaces are only supported on VMX,
> > the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will
> > depend on whether
> > that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they
> > want to enable
> > it.
> >
> > Signed-off-by: Mohammed Gamal <[email protected]>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> >  arch/x86/kvm/vmx/vmx.h |  3 +++
> >  arch/x86/kvm/x86.c     |  2 +-
> >  3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly
> > enable_preemption_timer = 1;
> >  module_param_named(preemption_timer, enable_preemption_timer,
> > bool, S_IRUGO);
> >  #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
>
> Since this variable is in the kvm module rather than the kvm_intel
> module, its current setting is preserved across "rmmod kvm_intel;
> modprobe kvm_intel." That is, if set to true, it doesn't revert to
> false after "rmmod kvm_intel." Is that the intended behavior?
>

IIRC, this is because this setting was indeed not intended to be just
VMX-specific, but since AMD has an issue with PTE accessed-bits being
set by hardware and thus we can't yet enable this feature on it, it
might make sense to move the variable to the kvm_intel module for now.

Paolo, what do you think?


2021-06-21 18:18:57

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <[email protected]> wrote:
>
> On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]>
> > wrote:
> > >
> > > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > > parameter.
> > >
> > > Since smaller physical address spaces are only supported on VMX,
> > > the parameter
> > > is only exposed in the kvm_intel module.
> > > Modifications to VMX page fault and EPT violation handling will
> > > depend on whether
> > > that parameter is enabled.
> > >
> > > Also disable support by default, and let the user decide if they
> > > want to enable
> > > it.
> > >
> > > Signed-off-by: Mohammed Gamal <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > arch/x86/kvm/vmx/vmx.h | 3 +++
> > > arch/x86/kvm/x86.c | 2 +-
> > > 3 files changed, 10 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 819c185adf09..dc778c7b5a06 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > enable_preemption_timer = 1;
> > > module_param_named(preemption_timer, enable_preemption_timer,
> > > bool, S_IRUGO);
> > > #endif
> > >
> > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> >
> > Since this variable is in the kvm module rather than the kvm_intel
> > module, its current setting is preserved across "rmmod kvm_intel;
> > modprobe kvm_intel." That is, if set to true, it doesn't revert to
> > false after "rmmod kvm_intel." Is that the intended behavior?
> >
>
> IIRC, this is because this setting was indeed not intended to be just
> VMX-specific, but since AMD has an issue with PTE accessed-bits being
> set by hardware and thus we can't yet enable this feature on it, it
> might make sense to move the variable to the kvm_intel module for now.

Um...

We do allow it for SVM, if NPT is not enabled. In fact, we set it
unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86: Add
a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").

Perhaps it should be a module parameter for SVM as well?

And, in any case, it would be nice if the parameter reverted to false
when the kvm_intel module is unloaded.

> Paolo, what do you think?
>
>

2021-06-22 13:01:20

by Mohammed Gamal

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

On Mon, 2021-06-21 at 11:01 -0700, Jim Mattson wrote:
> On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <[email protected]>
> wrote:
> >
> > On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <[email protected]>
> > > wrote:
> > > >
> > > > This patch exposes allow_smaller_maxphyaddr to the user as a
> > > > module
> > > > parameter.
> > > >
> > > > Since smaller physical address spaces are only supported on
> > > > VMX,
> > > > the parameter
> > > > is only exposed in the kvm_intel module.
> > > > Modifications to VMX page fault and EPT violation handling will
> > > > depend on whether
> > > > that parameter is enabled.
> > > >
> > > > Also disable support by default, and let the user decide if
> > > > they
> > > > want to enable
> > > > it.
> > > >
> > > > Signed-off-by: Mohammed Gamal <[email protected]>
> > > > ---
> > > >  arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > >  arch/x86/kvm/vmx/vmx.h |  3 +++
> > > >  arch/x86/kvm/x86.c     |  2 +-
> > > >  3 files changed, 10 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 819c185adf09..dc778c7b5a06 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > > enable_preemption_timer = 1;
> > > >  module_param_named(preemption_timer, enable_preemption_timer,
> > > > bool, S_IRUGO);
> > > >  #endif
> > > >
> > > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > >
> > > Since this variable is in the kvm module rather than the
> > > kvm_intel
> > > module, its current setting is preserved across "rmmod kvm_intel;
> > > modprobe kvm_intel." That is, if set to true, it doesn't revert
> > > to
> > > false after "rmmod kvm_intel." Is that the intended behavior?
> > >
> >
> > IIRC, this is because this setting was indeed not intended to be
> > just
> > VMX-specific, but since AMD has an issue with PTE accessed-bits
> > being
> > set by hardware and thus we can't yet enable this feature on it, it
> > might make sense to move the variable to the kvm_intel module for
> > now.
>
> Um...
>
> We do allow it for SVM, if NPT is not enabled. In fact, we set it
> unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86:
> Add
> a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").
>
> Perhaps it should be a module parameter for SVM as well?

Hmmm, I think given how AMD CPUs' behavior with NPT enabled, maybe it'd
actually be a better idea to move this entirely to VMX for the time
being. And then maybe make it available again on AMD only if the
behavior with NPT is changed.

>
> And, in any case, it would be nice if the parameter reverted to false
> when the kvm_intel module is unloaded.
>
> > Paolo, what do you think?
> >
> >
>