Subject: Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
From: Xiao Guangrong
To: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: stable@vger.kernel.org, Xiao Guangrong, Andy Lutomirski
Date: Thu, 10 Mar 2016 16:27:28 +0800
Message-ID: <56E12FF0.90202@linux.intel.com>
In-Reply-To: <1457437467-65707-2-git-send-email-pbonzini@redhat.com>

On 03/08/2016 07:44 PM, Paolo Bonzini wrote:
> Yes, all of these are needed. :) This is admittedly a bit odd, but
> kvm-unit-tests access.flat tests this if you run it with "-cpu host"
> and of course ept=0.
>
> KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by
> setting U=0 and W=1 in the shadow PTE. This will cause a user write
> to fault and a supervisor write to succeed (which is correct because
> CR0.WP=0). A user read instead will flip U=0 to 1 and W=1 back to 0.
> This enables user reads; it also disables supervisor writes, the next
> of which will then flip the bits again.
>
> When SMEP is in effect, however, U=0 will enable kernel execution of
> this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
> with U=0. If the guest has not enabled NX, the result is a continuous
> stream of page faults due to the NX bit being reserved.
>
> The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
> switch.

Good catch! So it only hurts boxes that have cpu_has_load_ia32_efer
support; otherwise NX is inherited from the kernel (the kernel always
sets NX if the CPU supports it), right?
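(That is, without cpu_has_load_ia32_efer we take the guest_msrs path,
which does

	guest_efer &= ~ignore_bits;
	guest_efer |= host_efer & ignore_bits;

so each bit left in ignore_bits, including EFER_NX before this patch,
simply takes the host's value.)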
> There is another bug in the reserved bit check, which I've split to a
> separate patch for easier application to stable kernels.
>
> Cc: stable@vger.kernel.org
> Cc: Xiao Guangrong
> Cc: Andy Lutomirski
> Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd
> Signed-off-by: Paolo Bonzini
> ---
>  Documentation/virtual/kvm/mmu.txt |  3 ++-
>  arch/x86/kvm/vmx.c                | 25 +++++++++++++++----------
>  2 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index daf9c0f742d2..c81731096a43 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -358,7 +358,8 @@ In the first case there are two additional complications:
>  - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
>    the kernel may now execute it. We handle this by also setting spte.nx.
>    If we get a user fetch or read fault, we'll change spte.u=1 and
> -  spte.nx=gpte.nx back.
> +  spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when
> +  shadow paging is in use.
>  - if CR4.SMAP is disabled: since the page has been changed to a kernel
>    page, it can not be reused when CR4.SMAP is enabled. We set
>    CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6e51493ff4f9..91830809d837 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1863,20 +1863,20 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
>  	guest_efer = vmx->vcpu.arch.efer;
>
>  	/*
> -	 * NX is emulated; LMA and LME handled by hardware; SCE meaningless
> -	 * outside long mode
> +	 * LMA and LME handled by hardware; SCE meaningless outside long mode.
>  	 */
> -	ignore_bits = EFER_NX | EFER_SCE;
> +	ignore_bits = EFER_SCE;
>  #ifdef CONFIG_X86_64
>  	ignore_bits |= EFER_LMA | EFER_LME;
>  	/* SCE is meaningful only in long mode on Intel */
>  	if (guest_efer & EFER_LMA)
>  		ignore_bits &= ~(u64)EFER_SCE;
>  #endif
> -	guest_efer &= ~ignore_bits;
> -	guest_efer |= host_efer & ignore_bits;
> -	vmx->guest_msrs[efer_offset].data = guest_efer;
> -	vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
> +	/* NX is needed to handle CR0.WP=1, CR4.SMEP=1. */
> +	if (!enable_ept) {
> +		guest_efer |= EFER_NX;
> +		ignore_bits |= EFER_NX;

Updating ignore_bits is not necessary, I think.

> +	}
>
>  	clear_atomic_switch_msr(vmx, MSR_EFER);
>
> @@ -1887,16 +1887,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
>  	 */
>  	if (cpu_has_load_ia32_efer ||
>  	    (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
> -		guest_efer = vmx->vcpu.arch.efer;
>  		if (!(guest_efer & EFER_LMA))
>  			guest_efer &= ~EFER_LME;
>  		if (guest_efer != host_efer)
>  			add_atomic_switch_msr(vmx, MSR_EFER,
>  					      guest_efer, host_efer);

So, why not set EFER_NX (if !ept) just in this branch, to make the fix
simpler?
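Something like this (completely untested, just to show what I mean; it
keeps EFER_NX in ignore_bits as before, so the guest_msrs path still
inherits the host's NX):

	if (cpu_has_load_ia32_efer ||
	    (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
		guest_efer = vmx->vcpu.arch.efer;
		/* NX is needed to handle CR0.WP=1, CR4.SMEP=1. */
		if (!enable_ept)
			guest_efer |= EFER_NX;
		if (!(guest_efer & EFER_LMA))
			guest_efer &= ~EFER_LME;
		if (guest_efer != host_efer)
			add_atomic_switch_msr(vmx, MSR_EFER,
					      guest_efer, host_efer);
	}

That way the atomic-switch path forces NX explicitly, and the
user-return path keeps inheriting it from host_efer.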