Date: Mon, 19 Jun 2017 16:51:45 +0200
From: Radim Krčmář <rkrcmar@redhat.com>
To: Wanpeng Li
Cc: linux-kernel@vger.kernel.org, kvm, Paolo Bonzini, Wanpeng Li
Subject: Re: [PATCH v2 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf
Message-ID: <20170619145144.GA10325@potion>
References: <1497493615-18512-1-git-send-email-wanpeng.li@hotmail.com>
 <1497493615-18512-4-git-send-email-wanpeng.li@hotmail.com>
 <20170616133702.GA6360@potion>
 <20170616153832.GA5980@potion>

2017-06-17 13:52+0800, Wanpeng Li:
> 2017-06-16 23:38 GMT+08:00 Radim Krčmář:
> > 2017-06-16 22:24+0800, Wanpeng Li:
> >> 2017-06-16 21:37 GMT+08:00 Radim Krčmář:
> >> > 2017-06-14 19:26-0700, Wanpeng Li:
> >> >> From: Wanpeng Li
> >> >>
> >> >> Add an async_page_fault field to vcpu->arch.exception to identify an
> >> >> async page fault, and construct the expected VM-exit information
> >> >> fields.  Force a nested VM exit from nested_vmx_check_exception() if
> >> >> the injected #PF is an async page fault.
> >> >>
> >> >> Cc: Paolo Bonzini
> >> >> Cc: Radim Krčmář
> >> >> Signed-off-by: Wanpeng Li
> >> >> ---
> >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> >> @@ -452,7 +452,11 @@ EXPORT_SYMBOL_GPL(kvm_complete_insn_gp);
> >> >>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> >> >>  {
> >> >>  	++vcpu->stat.pf_guest;
> >> >> -	vcpu->arch.cr2 = fault->address;
> >> >> +	vcpu->arch.exception.async_page_fault = fault->async_page_fault;
> >> >
> >> > I think we need to act as if arch.exception.async_page_fault was not
> >> > pending in kvm_vcpu_ioctl_x86_get_vcpu_events().  Otherwise, if we
> >> > migrate with a pending async_page_fault exception, we'd inject it as a
> >> > normal #PF, which could confuse/kill the nested guest.
> >> >
> >> > And kvm_vcpu_ioctl_x86_set_vcpu_events() should clear the flag for
> >> > sanity as well.
> >>
> >> Do you mean we should add a field like async_page_fault to
> >> kvm_vcpu_events::exception, then save arch.exception.async_page_fault
> >> to events->exception.async_page_fault through KVM_GET_VCPU_EVENTS and
> >> restore events->exception.async_page_fault to
> >> arch.exception.async_page_fault through KVM_SET_VCPU_EVENTS?
> >
> > No, I thought we could get away with a disgusting hack of hiding the
> > exception from userspace, which would work for migration, but not if
> > local userspace did KVM_GET_VCPU_EVENTS and KVM_SET_VCPU_EVENTS ...
> >
> > Extending the userspace interface would work, but I'd do it as a last
> > resort, after all conservative solutions have failed.
> > async_pf migration is very crude, so exposing the exception is just an
> > ugly workaround for the local case.  Adding the flag would also require
> > userspace configuration of async_pf features for the guest to keep
> > compatibility.
> >
> > I see two options that might be simpler than adding the userspace flag:
> >
> >  1) do the nested VM exit sooner, at the place where we now queue the #PF,
> >  2) queue the #PF later, save the async_pf in some intermediate
> >     structure and consume it at the place where you proposed the nested
> >     VM exit.
>
> How about something like this, so that we don't report exception events
> when "is_guest_mode(vcpu) && vcpu->arch.exception.nr == PF_VECTOR &&
> vcpu->arch.exception.async_page_fault" holds?  Losing a reschedule
> optimization is not that important in L1.
>
> @@ -3072,13 +3074,16 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  					       struct kvm_vcpu_events *events)
>  {
>  	process_nmi(vcpu);
> -	events->exception.injected =
> -		vcpu->arch.exception.pending &&
> -		!kvm_exception_is_soft(vcpu->arch.exception.nr);
> -	events->exception.nr = vcpu->arch.exception.nr;
> -	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> -	events->exception.pad = 0;
> -	events->exception.error_code = vcpu->arch.exception.error_code;
> +	if (!(is_guest_mode(vcpu) && vcpu->arch.exception.nr == PF_VECTOR &&
> +	      vcpu->arch.exception.async_page_fault)) {
> +		events->exception.injected =
> +			vcpu->arch.exception.pending &&
> +			!kvm_exception_is_soft(vcpu->arch.exception.nr);
> +		events->exception.nr = vcpu->arch.exception.nr;
> +		events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> +		events->exception.pad = 0;
> +		events->exception.error_code = vcpu->arch.exception.error_code;
> +	}

This adds a bug when userspace does KVM_GET_VCPU_EVENTS and
KVM_SET_VCPU_EVENTS without migration -- KVM would drop the async_pf and
an L1 process would get stuck as a result.  I realized this bug only
after the first mail, sorry for the confusing paragraph.

We'd need to add a similar condition to
kvm_vcpu_ioctl_x86_set_vcpu_events(), so that userspace SET doesn't drop
the hidden exception, but that is far beyond the realm of acceptable
code.
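
Roughly, the SET side would need a mirror image of your hunk -- an
untested sketch, only reusing the condition from your GET-side change,
just to illustrate why I don't like it:

	/* Sketch only: keep a hidden pending async_pf #PF instead of
	 * letting userspace overwrite it with the empty exception state
	 * that KVM_GET_VCPU_EVENTS reported above. */
	if (!(is_guest_mode(vcpu) && vcpu->arch.exception.nr == PF_VECTOR &&
	      vcpu->arch.exception.async_page_fault)) {
		vcpu->arch.exception.pending = events->exception.injected;
		vcpu->arch.exception.nr = events->exception.nr;
		vcpu->arch.exception.has_error_code = events->exception.has_error_code;
		vcpu->arch.exception.error_code = events->exception.error_code;
	}

Both ioctls would then silently disagree with the state that userspace
sees and sets, which is why I don't consider it acceptable.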
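
And to make option (2) above more concrete, I mean something along
these lines -- all names below are invented and nothing is tested:

	/* Sketch: stash the async_pf instead of queueing it as a #PF
	 * right away; 'struct kvm_queued_async_pf' and the 'apf_stash'
	 * field are hypothetical. */
	struct kvm_queued_async_pf {
		bool		pending;
		u32		error_code;
		unsigned long	address;	/* what would go into CR2 */
	};

	/* Producer: called where kvm_inject_page_fault() is called today,
	 * so vcpu->arch.exception is not touched at all. */
	static void kvm_stash_async_pf(struct kvm_vcpu *vcpu,
				       struct x86_exception *fault)
	{
		vcpu->arch.apf_stash.pending = true;
		vcpu->arch.apf_stash.error_code = fault->error_code;
		vcpu->arch.apf_stash.address = fault->address;
	}

	/* Consumer: the injection path, where you proposed the nested VM
	 * exit -- turn the stashed fault into a normal #PF for L1, or
	 * into a nested VM exit if L2 is running. */

vcpu->arch.exception never sees the async_pf this way, so
KVM_{GET,SET}_VCPU_EVENTS need no special casing at all.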