Subject: Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
From: Xiao Guangrong
To: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: stable@vger.kernel.org, Xiao Guangrong, Andy Lutomirski
Date: Thu, 10 Mar 2016 16:27:28 +0800
Message-ID: <56E12FF0.90202@linux.intel.com>
In-Reply-To: <1457437467-65707-2-git-send-email-pbonzini@redhat.com>

On 03/08/2016 07:44 PM, Paolo Bonzini wrote:
> Yes, all of these are needed. :) This is admittedly a bit odd, but
> kvm-unit-tests access.flat tests this if you run it with "-cpu host"
> and of course ept=0.
>
> KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by
> setting U=0 and W=1 in the shadow PTE. This will cause a user write
> to fault and a supervisor write to succeed (which is correct because
> CR0.WP=0). A user read instead will flip U=0 to 1 and W=1 back to 0.
> This enables user reads; it also disables supervisor writes, the next
> of which will then flip the bits again.
>
> When SMEP is in effect, however, U=0 will enable kernel execution of
> this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
> with U=0. If the guest has not enabled NX, the result is a continuous
> stream of page faults due to the NX bit being reserved.
>
> The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
> switch.

Good catch! So it only hurts boxes that have cpu_has_load_ia32_efer
support; otherwise NX is inherited from the kernel (the kernel always
sets NX if the CPU supports it), right?
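(That is, without cpu_has_load_ia32_efer we take the guest_msrs path,
which does

	guest_efer &= ~ignore_bits;
	guest_efer |= host_efer & ignore_bits;

so each bit left in ignore_bits, including EFER_NX before this patch,
simply takes the host's value.)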
> There is another bug in the reserved bit check, which I've split to a
> separate patch for easier application to stable kernels.
>
> Cc: stable@vger.kernel.org
> Cc: Xiao Guangrong
> Cc: Andy Lutomirski
> Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd
> Signed-off-by: Paolo Bonzini
> ---
>  Documentation/virtual/kvm/mmu.txt |  3 ++-
>  arch/x86/kvm/vmx.c                | 25 +++++++++++++++----------
>  2 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index daf9c0f742d2..c81731096a43 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -358,7 +358,8 @@ In the first case there are two additional complications:
>  - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
>    the kernel may now execute it. We handle this by also setting spte.nx.
>    If we get a user fetch or read fault, we'll change spte.u=1 and
> -  spte.nx=gpte.nx back.
> +  spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when
> +  shadow paging is in use.
>  - if CR4.SMAP is disabled: since the page has been changed to a kernel
>    page, it can not be reused when CR4.SMAP is enabled. We set
>    CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6e51493ff4f9..91830809d837 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1863,20 +1863,20 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
>  	guest_efer = vmx->vcpu.arch.efer;
>
>  	/*
> -	 * NX is emulated; LMA and LME handled by hardware; SCE meaningless
> -	 * outside long mode
> +	 * LMA and LME handled by hardware; SCE meaningless outside long mode.
>  	 */
> -	ignore_bits = EFER_NX | EFER_SCE;
> +	ignore_bits = EFER_SCE;
>  #ifdef CONFIG_X86_64
>  	ignore_bits |= EFER_LMA | EFER_LME;
>  	/* SCE is meaningful only in long mode on Intel */
>  	if (guest_efer & EFER_LMA)
>  		ignore_bits &= ~(u64)EFER_SCE;
>  #endif
> -	guest_efer &= ~ignore_bits;
> -	guest_efer |= host_efer & ignore_bits;
> -	vmx->guest_msrs[efer_offset].data = guest_efer;
> -	vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
> +	/* NX is needed to handle CR0.WP=1, CR4.SMEP=1. */
> +	if (!enable_ept) {
> +		guest_efer |= EFER_NX;
> +		ignore_bits |= EFER_NX;

Updating ignore_bits is not necessary, I think.

> +	}
>
>  	clear_atomic_switch_msr(vmx, MSR_EFER);
>
> @@ -1887,16 +1887,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
>  	 */
>  	if (cpu_has_load_ia32_efer ||
>  	    (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
> -		guest_efer = vmx->vcpu.arch.efer;
>  		if (!(guest_efer & EFER_LMA))
>  			guest_efer &= ~EFER_LME;
>  		if (guest_efer != host_efer)
>  			add_atomic_switch_msr(vmx, MSR_EFER,
>  					      guest_efer, host_efer);

So, why not set EFER_NX (if !ept) just in this branch, to make the fix
simpler?
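Something like this (completely untested, just to show what I mean; it
keeps EFER_NX in ignore_bits as before, so the guest_msrs path still
inherits the host's NX):

	if (cpu_has_load_ia32_efer ||
	    (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
		guest_efer = vmx->vcpu.arch.efer;
		/* NX is needed to handle CR0.WP=1, CR4.SMEP=1. */
		if (!enable_ept)
			guest_efer |= EFER_NX;
		if (!(guest_efer & EFER_LMA))
			guest_efer &= ~EFER_LME;
		if (guest_efer != host_efer)
			add_atomic_switch_msr(vmx, MSR_EFER,
					      guest_efer, host_efer);
	}

That way the atomic-switch path forces NX explicitly, and the
user-return path keeps inheriting it from host_efer.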