Date: Tue, 3 May 2022 18:30:42 +0000
In-Reply-To: <20220503183045.978509-1-bgardon@google.com>
Message-Id: <20220503183045.978509-9-bgardon@google.com>
Mime-Version: 1.0
References: <20220503183045.978509-1-bgardon@google.com>
X-Mailer: git-send-email 2.36.0.464.gb9c8b46e94-goog
Subject: [PATCH v7 08/11] KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
From: Ben Gardon
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, David Matlack, Jim Mattson, David Dunn, Jing Zhang, Junaid Shahid, Ben Gardon
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions. But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.

Suggested-by: Jim Mattson
Reviewed-by: David Matlack
Reviewed-by: Peter Xu
Signed-off-by: Ben Gardon
---
 Documentation/virt/kvm/api.rst  | 17 +++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.h              |  8 ++++----
 arch/x86/kvm/mmu/spte.c         |  7 ++++---
 arch/x86/kvm/mmu/spte.h         |  3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 arch/x86/kvm/x86.c              | 30 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |  1 +
 8 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 23baf7fce038..ddda1f1440d6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7879,6 +7879,23 @@ At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
 this capability will disable PMU virtualization for that VM. Usermode
 should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
 
+8.36 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+---------------------------
+
+:Capability KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+:Architectures: x86
+:Type: vm
+:Parameters: arg[0] must be 0.
+:Returns 0 on success, -EPERM if the userspace process does not
+         have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
+         created.
+
+This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
+
+The capability has no effect if the nx_huge_pages module parameter is not set.
+
+This capability may only be set before any vCPUs are created.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c59fea4bdb6e..0a1bb01c3229 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1249,6 +1249,8 @@ struct kvm_arch {
 	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
 	 */
 	u32 max_vcpu_ids;
+
+	bool disable_nx_huge_pages;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7e258cc94152..0ce84042c059 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -217,9 +217,9 @@ struct kvm_page_fault {
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 extern int nx_huge_pages;
-static inline bool is_nx_huge_page_enabled(void)
+static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
 {
-	return READ_ONCE(nx_huge_pages);
+	return READ_ONCE(nx_huge_pages) && !kvm->arch.disable_nx_huge_pages;
 }
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
@@ -235,8 +235,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.user = err & PFERR_USER_MASK,
 		.prefetch = prefetch,
 		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
-		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
-
+		.nx_huge_page_workaround_enabled =
+			is_nx_huge_page_enabled(vcpu->kvm),
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c9e87d446a..ecea6cfc965d 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -117,7 +117,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		spte |= spte_shadow_accessed_mask(spte);
 
 	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
-	    is_nx_huge_page_enabled()) {
+	    is_nx_huge_page_enabled(vcpu->kvm)) {
 		pte_access &= ~ACC_EXEC_MASK;
 	}
 
@@ -216,7 +216,8 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
+			      int index)
 {
 	u64 child_spte;
 	int child_level;
@@ -244,7 +245,7 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		 * When splitting to a 4K page, mark the page executable as the
 		 * NX hugepage mitigation no longer applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (is_nx_huge_page_enabled(kvm))
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index fbbab180395e..19eb5ea67d95 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -412,7 +412,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
+			      int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4fabb2cd0ba9..36f33a95a414 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1470,7 +1470,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, level, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ba9c7c8140a..6ce2b18dcaa1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4286,6 +4286,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SYS_ATTRIBUTES:
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
+	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -6093,6 +6094,35 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+		r = -EINVAL;
+
+		/*
+		 * Since the risk of disabling NX hugepages is a guest crashing
+		 * the system, ensure the userspace process has permission to
+		 * reboot the system.
+		 *
+		 * Note that unlike the reboot() syscall, the process must have
+		 * this capability in the root namespace because exposing
+		 * /dev/kvm into a container does not limit the scope of the
+		 * iTLB multihit bug to that container. In other words,
+		 * this must use capable(), not ns_capable().
+		 */
+		if (!capable(CAP_SYS_BOOT)) {
+			r = -EPERM;
+			break;
+		}
+
+		if (cap->args[0])
+			break;
+
+		mutex_lock(&kvm->lock);
+		if (!kvm->created_vcpus) {
+			kvm->arch.disable_nx_huge_pages = true;
+			r = 0;
+		}
+		mutex_unlock(&kvm->lock);
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e10d131edd80..3a09528a9a81 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1153,6 +1153,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DISABLE_QUIRKS2 213
 #define KVM_CAP_VM_TSC_CONTROL 214
 #define KVM_CAP_SYSTEM_EVENT_DATA 215
+#define KVM_CAP_VM_DISABLE_NX_HUGE_PAGES 216
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.36.0.464.gb9c8b46e94-goog
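
For review purposes, the check ordering in the new KVM_CAP_VM_DISABLE_NX_HUGE_PAGES case and the updated is_nx_huge_page_enabled() predicate can be modeled in plain userspace C. This is only a sketch and not kernel code: struct vm_model, model_nx_huge_page_enabled() and model_disable_nx_huge_pages() are hypothetical names, has_cap_sys_boot stands in for capable(CAP_SYS_BOOT), and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-VM state touched by this patch. */
struct vm_model {
	bool disable_nx_huge_pages;	/* models kvm->arch.disable_nx_huge_pages */
	unsigned int created_vcpus;	/* models kvm->created_vcpus */
};

/*
 * Models is_nx_huge_page_enabled(kvm): the mitigation applies only when the
 * module parameter is set AND the VM has not opted out.
 */
bool model_nx_huge_page_enabled(int nx_huge_pages_param,
				const struct vm_model *vm)
{
	assert(vm != NULL);
	return nx_huge_pages_param && !vm->disable_nx_huge_pages;
}

/*
 * Models the KVM_CAP_VM_DISABLE_NX_HUGE_PAGES case in the enable_cap path.
 * The permission check runs first, then the argument check, then the
 * "no vCPUs created yet" check. kvm->lock is not modeled here.
 */
int model_disable_nx_huge_pages(struct vm_model *vm, bool has_cap_sys_boot,
				unsigned long long arg0)
{
	assert(vm != NULL);

	if (!has_cap_sys_boot)
		return -EPERM;
	if (arg0)
		return -EINVAL;
	if (vm->created_vcpus)
		return -EINVAL;

	vm->disable_nx_huge_pages = true;
	return 0;
}
```

Because the permission check comes first in this model, a caller without CAP_SYS_BOOT gets -EPERM even when args[0] is nonzero or vCPUs already exist, which matches the order of the checks in the hunk above.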