Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <7e05e0befa13af05f1e5f0fd8658bc4e7bdf764f.camel@redhat.com>
Subject: Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
From:   Maxim Levitsky <mlevitsk@redhat.com>
To:     Sean Christopherson <seanjc@google.com>,
        Paolo Bonzini <pbonzini@redhat.com>
Cc:     Vitaly Kuznetsov <vkuznets@redhat.com>,
        Wanpeng Li <wanpengli@tencent.com>,
        Jim Mattson <jmattson@google.com>,
        Joerg Roedel <joro@8bytes.org>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org, Oliver Upton <oupton@google.com>,
        Peter Shier <pshier@google.com>
Date:   Wed, 29 Jun 2022 14:16:52 +0300
In-Reply-To: <20220614204730.3359543-1-seanjc@google.com>
References: <20220614204730.3359543-1-seanjc@google.com>
Content-Type: text/plain; charset="UTF-8"
User-Agent: Evolution 3.36.5 (3.36.5-2.fc32) 
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
> 
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
> 
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests.
> 
> v2:
>   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
>     overhaul.
>     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
>   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> 
> v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> 
> Sean Christopherson (21):
>   KVM: nVMX: Unconditionally purge queued/injected events on nested
>     "exit"
>   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
>   KVM: x86: Don't check for code breakpoints when emulating on exception
>   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
>   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
>   KVM: x86: Treat #DBs from the emulator as fault-like (code and
>     DR7.GD=1)
>   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
>   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
>   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
>   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
>   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
>   KVM: x86: Make kvm_queued_exception a properly named, visible struct
>   KVM: x86: Formalize blocking of nested pending exceptions
>   KVM: x86: Use kvm_queue_exception_e() to queue #DF
>   KVM: x86: Hoist nested event checks above event injection logic
>   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
>     VM-Exit
>   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
>   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
>   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
>     behavior
>   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
>   KVM: selftests: Add an x86-only test to verify nested exception
>     queueing
> 
>  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>  arch/x86/include/asm/kvm_host.h               |  35 +-
>  arch/x86/kvm/emulate.c                        |   3 +-
>  arch/x86/kvm/svm/nested.c                     | 102 ++---
>  arch/x86/kvm/svm/svm.c                        |  18 +-
>  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
>  arch/x86/kvm/vmx/sgx.c                        |   2 +-
>  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
>  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
>  arch/x86/kvm/x86.h                            |  11 +-
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
>  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
>  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
>  15 files changed, 886 insertions(+), 418 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> 
> 
> base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb

Hi Sean and everyone!
 
 
Before I continue reviewing the patch series, I would like you to check whether
I understand the monitor trap/pending debug exception/event injection
logic on VMX correctly. I have been looking at the spec for several hours and
I still have more questions than answers about it.
 
So let me state what I understand:
 
1. Event injection (aka eventinj in SVM terms):
 
  (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
 
  If I understand correctly, all event injection types, just like on SVM, simply inject:
  they never create something pending, and never drop the injection if the event is not
  allowed (e.g. if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
  for example if you try to inject type 0 (hardware interrupt) while EFLAGS.IF is 0
  (I haven't checked this).
 
  All event injections happen right away, don't deliver any payload (like DR6), etc.
 
  Injection types 4/5/6 do the same as injection types 0/2/3, but in addition,
  types 4/6 do a DPL check in the IDT, and these types can also advance the RIP prior
  to pushing it to the exception stack, using VM_ENTRY_INSTRUCTION_LEN. This is consistent
  with the cases where these trap-like events are intercepted: the interception happens
  at the start of the instruction despite the exceptions being trap-like.
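
  To make the type numbers above concrete, here is a minimal sketch of how the
  VM-entry interruption-information field is laid out as I read the SDM (vector in
  bits 7:0, type in bits 10:8, "deliver error code" in bit 11, valid in bit 31).
  The enum and helper names are made up for illustration and are not the kernel's
  own definitions:

```c
#include <stdint.h>

/* Event types (bits 10:8 of the field); names are illustrative only. */
enum vmx_event_type {
	EVT_EXT_INTR          = 0, /* hardware (external) interrupt */
	EVT_NMI               = 2,
	EVT_HW_EXCEPTION      = 3,
	EVT_SW_INTR           = 4, /* INT n */
	EVT_PRIV_SW_EXCEPTION = 5, /* INT1/ICEBP */
	EVT_SW_EXCEPTION      = 6, /* INT3, INTO */
	EVT_OTHER             = 7, /* e.g. pending MTF VM exit, vector 0 */
};

#define EVT_DELIVER_ERROR_CODE (1u << 11)
#define EVT_VALID              (1u << 31)

/* Build the 32-bit value that would go into VM_ENTRY_INTR_INFO_FIELD. */
static inline uint32_t make_entry_intr_info(uint8_t vector,
					    enum vmx_event_type type,
					    int has_error_code)
{
	uint32_t info = (uint32_t)vector | ((uint32_t)type << 8) | EVT_VALID;

	if (has_error_code)
		info |= EVT_DELIVER_ERROR_CODE;
	return info;
}

/* Types 4/5/6 are the ones for which VM_ENTRY_INSTRUCTION_LEN matters. */
static inline int uses_instruction_len(enum vmx_event_type type)
{
	return type == EVT_SW_INTR ||
	       type == EVT_PRIV_SW_EXCEPTION ||
	       type == EVT_SW_EXCEPTION;
}
```

  For example, injecting a #PF (vector 14, hardware exception, with error code)
  would use make_entry_intr_info(14, EVT_HW_EXCEPTION, 1).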
 
 
2. #DB is the only trap-like exception that can be pending for one more instruction
   if the MOV SS shadow is on (any other cases?).
   (AMD just ignores the whole thing, rightfully.)
 
   That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
   I understand that it will be written by the CPU if a VM exit happens at a moment
   when a #DB is already pending but not yet delivered.
 
   That field can also (sadly) be used to "inject" a #DB into the guest if the hypervisor sets it,
   and this #DB will actually update DR6 and such, and might be delayed/lost.
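
   For reference, a small sketch of the bits I believe the SDM defines in the
   pending debug exceptions field (B0-B3 in bits 3:0, "enabled breakpoint" in
   bit 12, BS in bit 14, RTM in bit 16). The macro and helper names are
   hypothetical, chosen only for this example:

```c
#include <stdint.h>

/* Bit layout of the pending debug exceptions field as I read the SDM. */
#define PENDING_DBG_B0_B3      0xfull       /* breakpoint conditions met */
#define PENDING_DBG_ENABLED_BP (1ull << 12) /* at least one enabled breakpoint */
#define PENDING_DBG_BS         (1ull << 14) /* single-step trap */
#define PENDING_DBG_RTM        (1ull << 16) /* #DB inside an RTM region */

/* True if the field records a single-step #DB that is not yet delivered. */
static inline int pending_dbg_has_single_step(uint64_t pending)
{
	return (pending & PENDING_DBG_BS) != 0;
}

/* Mask of data/IO breakpoints (B0..B3) recorded as pending. */
static inline unsigned int pending_dbg_breakpoints(uint64_t pending)
{
	return (unsigned int)(pending & PENDING_DBG_B0_B3);
}
```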
 
 
3. Facts about MTF:
 
   * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
     instruction', and is enabled in primary execution controls.
 
   * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
     and that has no connection to the 'feature', although usually this injection will be useful
     when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
 
   * An MTF event can be lost if a higher priority VM exit happens. This is why the SDM talks about a 'pending MTF',
     which means that the MTF VM exit should happen unless something else prevents it and/or a higher priority
     VM exit overrides it.
 
   * An MTF event is raised (when the primary execution controls bit is enabled):

	- After an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
	  (that is, the stack was switched, stuff was pushed to the new exception stack, RIP was updated to the handler).
	  I am not 100% sure about this, but this seems to be what the PRM implies:
 
	  "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
	  26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
	  VM entry."
 
	- If an interrupt and/or a #DB exception happens prior to executing the first instruction of the guest,
	  then once again MTF will happen on the first instruction of the exception/interrupt handler:
 
	  "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
	  (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
	  on the instruction boundary following delivery of the event (or any nested exception)."
 
	  That means that #DB has higher priority than MTF, but it is not specified whether this applies to fault-like or trap-like #DBs.
 
	- If an instruction causes an exception, then once again MTF will happen on the first instruction of the exception handler.
 
	- Otherwise after an instruction (or REP iteration) retires.
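
The four cases above can be summarized as a small decision function. This is only
a model of my current understanding as stated in this mail, not of what the
hardware or KVM actually does; the struct, enum and function names are invented
for the example:

```c
#include <stdbool.h>

/* Where the MTF VM exit becomes pending, per the four cases listed above. */
enum mtf_point {
	MTF_NOT_PENDING,          /* "monitor trap flag" control is 0 */
	MTF_AFTER_INJECTED_EVENT, /* first insn of handler of injected event */
	MTF_AFTER_PENDING_EVENT,  /* after delivery of pending #DB/interrupt */
	MTF_AFTER_FAULT_DELIVERY, /* first insn of the exception handler */
	MTF_AFTER_INSTRUCTION,    /* after the insn (or REP iteration) retires */
};

/* Hypothetical description of one VM entry, for illustration only. */
struct vm_entry_model {
	bool mtf_control;              /* "monitor trap flag" exec control */
	bool injecting_vectored_event; /* VM_ENTRY_INTR_INFO_FIELD is valid */
	bool pending_event_delivered;  /* #DB/IRQ delivered before any insn */
	bool first_insn_faults;        /* first instruction raises an exception */
};

static inline enum mtf_point mtf_pending_point(const struct vm_entry_model *e)
{
	if (!e->mtf_control)
		return MTF_NOT_PENDING;
	if (e->injecting_vectored_event)
		return MTF_AFTER_INJECTED_EVENT;
	if (e->pending_event_delivered)
		return MTF_AFTER_PENDING_EVENT;
	if (e->first_insn_faults)
		return MTF_AFTER_FAULT_DELIVERY;
	return MTF_AFTER_INSTRUCTION;
}
```

The order of the checks mirrors the order of the bullets; whether a higher
priority VM exit then overrides the pending MTF is deliberately not modeled.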
 

If you have more facts about MTF and related stuff, and/or if I made a mistake in the above, I am all ears!

Best regards,
	Maxim Levitsky