Message-ID: <7e05e0befa13af05f1e5f0fd8658bc4e7bdf764f.camel@redhat.com>
Subject: Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
From: Maxim Levitsky
To: Sean Christopherson, Paolo Bonzini
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Oliver Upton,
    Peter Shier
Date: Wed, 29 Jun 2022 14:16:52 +0300
In-Reply-To: <20220614204730.3359543-1-seanjc@google.com>

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g.
> if L0 and L1 are using shadow paging, and L2 hits a #PF, and then hits
> another #PF while vectoring the first #PF due to _L1_ not having a shadow
> page for the IDT, KVM needs to check L1's intercepts before morphing the
> #PF => #PF => #DF so that the #PF is routed to L1, not injected into L2
> as a #DF.
>
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
>
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests.
>
> v2:
>   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
>     overhaul.
>     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
>   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
>
> v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
>
> Sean Christopherson (21):
>   KVM: nVMX: Unconditionally purge queued/injected events on nested
>     "exit"
>   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
>   KVM: x86: Don't check for code breakpoints when emulating on exception
>   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
>   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
>   KVM: x86: Treat #DBs from the emulator as fault-like (code and
>     DR7.GD=1)
>   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
>   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
>   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
>   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
>   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
>   KVM: x86: Make kvm_queued_exception a properly named, visible struct
>   KVM: x86: Formalize blocking of nested pending exceptions
>   KVM: x86: Use kvm_queue_exception_e() to queue #DF
>   KVM: x86: Hoist nested event checks above event injection logic
>   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
>     VM-Exit
>   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
>   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
>   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
>     behavior
>   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
>   KVM: selftests: Add an x86-only test to verify nested exception
>     queueing
>
>  arch/x86/include/asm/kvm-x86-ops.h          |   2 +-
>  arch/x86/include/asm/kvm_host.h             |  35 +-
>  arch/x86/kvm/emulate.c                      |   3 +-
>  arch/x86/kvm/svm/nested.c                   | 102 ++---
>  arch/x86/kvm/svm/svm.c                      |  18 +-
>  arch/x86/kvm/vmx/nested.c                   | 319 +++++++++-----
>  arch/x86/kvm/vmx/sgx.c                      |   2 +-
>  arch/x86/kvm/vmx/vmx.c                      |  53 ++-
>  arch/x86/kvm/x86.c                          | 404 +++++++++++-------
>  arch/x86/kvm/x86.h                          |  11 +-
>  tools/testing/selftests/kvm/.gitignore      |   1 +
>  tools/testing/selftests/kvm/Makefile        |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h |   7 +-
>  .../selftests/kvm/include/x86_64/vmx.h      |  51 +--
>  .../kvm/x86_64/nested_exceptions_test.c     | 295 +++++++++++++
>  15 files changed, 886 insertions(+), 418 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
>
>
> base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb

Hi Sean and everyone!

Before I continue reviewing the patch series, I would like you to check
whether I understand the monitor trap / pending debug exception / event
injection logic on VMX correctly. I have been looking at the spec for
several hours and I still have more questions than answers about it.

So let me state what I understand:

1. Event injection (aka eventinj in SVM terms):

   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)

   If I understand correctly, all event injection types, just like on SVM,
   simply inject. They never create something pending, and they never drop
   the injection if the event is not allowed (like if EFLAGS.IF is 0).
   (VMX might have some checks that could fail VM entry, for example if you
   try to inject type 0 (hardware interrupt) while EFLAGS.IF is 0; I
   haven't checked this.)

   All event injections happen right away, don't deliver any payload
   (like DR6), etc.

   Injection types 4/5/6 do the same as injection types 0/2/3, but in
   addition, types 4/6 do a DPL check in the IDT. These types can also
   advance RIP by VM_ENTRY_INSTRUCTION_LEN before it is pushed onto the
   exception stack. This keeps them consistent with the cases where these
   trap-like events are intercepted: the interception happens at the start
   of the instruction, despite the exceptions being trap-like.

2. #DB is the only trap-like exception that can be pending for one more
   instruction if the MOV SS shadow is on (any other cases?).
   (AMD just ignores the whole thing, rightfully.)

   That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
   I understand that it will be written by the CPU if a VM exit happens at
   a moment where a #DB is already pending but not yet delivered.

   That field can also (sadly) be used to "inject" a #DB into the guest if
   the hypervisor sets it. This #DB will actually update DR6 and such, and
   might be delayed/lost.

3. Facts about MTF:

   * MTF as a feature is basically 'single step the guest by generating MTF
     VM exits after each executed instruction', and is enabled in the
     primary execution controls.

   * MTF is also an 'event', and it can be injected separately by the
     hypervisor with event type 7. That has no connection to the 'feature',
     although this injection will usually be useful when the hypervisor
     does some kind of re-injection triggered by the actual MTF feature.

   * An MTF event can be lost if a higher priority VM exit happens. This is
     why the SDM talks about a 'pending MTF': the MTF VM exit should happen
     unless something else prevents it and/or a higher priority VM exit
     overrides it.

   * An MTF event is raised (when the primary execution controls bit is
     enabled) in the following cases:

     - After an injected (vectored) event, aka
       eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
       (that is, the stack was switched, stuff was pushed to the new
       exception stack, and RIP was updated to the handler).
       I am not 100% sure about this, but this seems to be what the PRM
       implies:

       "If the “monitor trap flag” VM-execution control is 1 and VM entry
       is injecting a vectored event (see Section 26.6.1), an MTF VM exit
       is pending on the instruction boundary before the first instruction
       following the VM entry."
     - If an interrupt and/or #DB exception happens prior to executing the
       first instruction of the guest, then once again MTF will happen on
       the first instruction of the exception/interrupt handler:

       "If the “monitor trap flag” VM-execution control is 1, VM entry is
       not injecting an event, and a pending event (e.g., debug exception
       or interrupt) is delivered before an instruction can execute, an MTF
       VM exit is pending on the instruction boundary following delivery of
       the event (or any nested exception)."

       That means that #DB has higher priority than MTF, but it is not
       specified whether this applies to a fault-like #DB or a trap-like
       #DB.

     - If an instruction causes an exception, once again MTF will happen on
       the first instruction of the exception handler.

     - Otherwise, after an instruction (or REP iteration) retires.

If you have more facts about MTF and related stuff, and/or if I made a
mistake in the above, I am all ears!

Best regards,
	Maxim Levitsky
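
The injection types listed in point 1 can be sketched in C as below. This is only an illustration under stated assumptions: all names (inj_encode() and friends) are hypothetical and are not KVM's. The bit positions follow the SDM's description of the VM-entry interruption-information field, while the DPL-check helper merely encodes the understanding stated in point 1 and may need correcting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; values are the SDM's interruption types. */
enum inj_type {
	INJ_EXT_INTR     = 0,	/* external (hardware) interrupt */
	INJ_NMI          = 2,
	INJ_HW_EXCEPTION = 3,	/* e.g. #PF, #GP, #DB */
	INJ_SW_INTR      = 4,	/* INT n */
	INJ_PRIV_SW_EXC  = 5,	/* INT1 / ICEBP */
	INJ_SW_EXCEPTION = 6,	/* INT3, INTO */
	INJ_OTHER_EVENT  = 7,	/* with vector 0: pending MTF VM exit */
};

#define INJ_VECTOR_MASK		0xffu
#define INJ_TYPE_SHIFT		8
#define INJ_TYPE_MASK		(7u << INJ_TYPE_SHIFT)
#define INJ_DELIVER_CODE	(1u << 11)
#define INJ_VALID		(1u << 31)

/* Build the 32-bit value that would go into VM_ENTRY_INTR_INFO_FIELD. */
static inline uint32_t inj_encode(enum inj_type type, uint8_t vector,
				  bool has_error_code)
{
	uint32_t info = INJ_VALID | ((uint32_t)type << INJ_TYPE_SHIFT) | vector;

	if (has_error_code)
		info |= INJ_DELIVER_CODE;
	return info;
}

static inline enum inj_type inj_type_of(uint32_t info)
{
	return (enum inj_type)((info & INJ_TYPE_MASK) >> INJ_TYPE_SHIFT);
}

/* Types 4/5/6 use VM_ENTRY_INSTRUCTION_LEN to compute the pushed RIP. */
static inline bool inj_uses_instruction_len(enum inj_type type)
{
	return type == INJ_SW_INTR || type == INJ_PRIV_SW_EXC ||
	       type == INJ_SW_EXCEPTION;
}

/* Assumption taken from point 1 above: only types 4/6 do the IDT DPL check. */
static inline bool inj_checks_idt_dpl(enum inj_type type)
{
	return type == INJ_SW_INTR || type == INJ_SW_EXCEPTION;
}

/* An injected MTF is "other event" (type 7) with vector 0. */
static inline bool inj_is_mtf(uint32_t info)
{
	return (info & INJ_VALID) && inj_type_of(info) == INJ_OTHER_EVENT &&
	       (info & INJ_VECTOR_MASK) == 0;
}
```

For example, inj_encode(INJ_HW_EXCEPTION, 14, true) yields 0x80000b0e, i.e. an injected #PF with an error code, and inj_encode(INJ_OTHER_EVENT, 0, false) is the value for which inj_is_mtf() returns true.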