Date: Wed, 13 Jul 2022 16:20:16 +0000
From: Sean Christopherson
To: Ben Gardon
Cc: David Matlack, LKML, kvm, Paolo Bonzini, Peter Xu, Jim Mattson,
    David Dunn, Jing Zhang, Junaid Shahid
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
References: <20220321224358.1305530-1-bgardon@google.com>
 <20220321224358.1305530-10-bgardon@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 12, 2022, Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Ben Gardon wrote:
> > On Mon, Mar 28, 2022 at 10:45 AM David Matlack wrote:
> > > If iter.old_spte is not a leaf, the old loop would always continue to
> > > the next SPTE. Now we try to promote it and if that fails we run through
> > > the rest of the loop. This seems broken. For example, in the next line
> > > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > > which is wrong.
> > >
> > > In the worst case we end up zapping an SPTE that we didn't need to, but
> > > we should still fix up this code.
>
> My thought to remedy this is to drop the @pfn argument to kvm_mmu_max_mapping_level().
> It's used only for historical reasons, where KVM didn't walk the host page tables
> to get the max mapping level and instead pulled THP information out of struct page.
> I.e. KVM needed the pfn to get the page.
>
> That would also allow KVM to use huge pages for things that aren't backed by
> struct page (I know of at least one potential use case).
>
> I _think_ we can do the below. It's compile tested only at this point, and I
> want to split some of the changes into separate patches, e.g. the WARN on the
> step-up not going out-of-bounds. I'll put this on the backburner for now, it's
> too late for 5.20, and too many people are OOO :-)

Heh, that was a bit of a lie. _Now_ it's going on the backburner.

Thinking about the pfn coming from the old leaf SPTE made me realize all of the
information we need to use __make_spte() during promotion is available in the
existing leaf SPTE.

If KVM first retrieves a PRESENT leaf SPTE, then the pfn _can't_ be different,
because that would mean KVM done messed up and didn't zap existing entries in
response to an MMU notifier invalidation, and holding mmu_lock prevents new
invalidations. And because the primary MMU must invalidate before splitting a
huge page, having a valid leaf SPTE means that the host mapping level can't
become stale until mmu_lock is dropped. In other words, KVM can compute the
huge pfn by using the smaller pfn and adjusting based on the target mapping
level.

As for the EPT memtype, that can also come from the existing leaf SPTE.
KVM only forces the memtype for host MMIO pfns, and if the primary MMU maps a
huge page that straddles host MMIO (UC) and regular memory (WB), then the
kernel is already hosed. If the VM doesn't have non-coherent DMA, then the EPT
memtype will be WB regardless of the page size. That means KVM just needs to
reject promotion if the VM has non-coherent DMA and the target pfn is not host
MMIO, else KVM can use the leaf's memtype as-is.

Using the pfn avoids gup() (fast-only, but still), and using the memtype avoids
having to split vmx_get_mt_mask() and add another kvm_x86_ops hook.

And digging into all of that yielded another optimization. kvm_tdp_page_fault()
needs to restrict the host mapping level if and only if it may consume the
guest MTRRs. If KVM ignores the guest MTRRs, then the fact that they're
inconsistent across a TDP page is irrelevant because the _guest_ MTRRs are
completely virtual and are not consumed by either EPT or NPT. I doubt this
meaningfully affects whether or not KVM can create huge pages for real world
VMs, but it does avoid having to walk the guest variable MTRRs when faulting
in a huge page.

Compile tested only at this point, but I'm mostly certain my logic is sound.

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
        /*
         * If the guest's MTRRs may be used, restrict the mapping level to
         * ensure KVM uses a consistent memtype across the entire mapping.
         */
        if (kvm_may_need_guest_mtrrs(vcpu->kvm)) {
                for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
                        int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
                        gfn_t base = (fault->addr >> PAGE_SHIFT) & ~(page_num - 1);

                        if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
                                break;
                }
        }

        return direct_page_fault(vcpu, fault);
}

static int try_promote_to_huge_page(struct kvm *kvm,
                                    struct rsvd_bits_validate *rsvd_bits,
                                    const struct kvm_memory_slot *slot,
                                    u64 leaf_spte, struct tdp_iter *iter)
{
        struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
        kvm_pfn_t pfn;
        u64 new_spte;
        u8 mt_mask;

        if (WARN_ON_ONCE(slot->flags & KVM_MEM_READONLY))
                return -EINVAL;

        pfn = spte_to_pfn(leaf_spte) & ~(KVM_PAGES_PER_HPAGE(iter->level) - 1);
        mt_mask = leaf_spte & shadow_memtype_mask;

        /*
         * Bail if KVM needs guest MTRRs to compute the memtype and will not
         * force the memtype (host MMIO). There is no guarantee the guest uses
         * a consistent MTRR memtype for the entire huge page, and MTRRs are
         * tracked per vCPU, not per VM.
         */
        if (kvm_may_need_guest_mtrrs(kvm) && !kvm_is_mmio_pfn(pfn))
                return -EIO;

        __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
                    true, mt_mask, rsvd_bits, &new_spte);

        return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
}
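To spell out the pfn adjustment in try_promote_to_huge_page():
KVM_PAGES_PER_HPAGE() yields the number of 4KiB pages spanned by a huge page
at the given level, so the mask simply rounds the small pfn down to a huge
page boundary. A worked example, with an arbitrary pfn for illustration:

        /*
         * Example for a 2MiB promotion (iter->level == PG_LEVEL_2M):
         *
         *   KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) == 512 (4KiB pages per 2MiB page)
         *   spte_to_pfn(leaf_spte)           == 0xabcd3 (arbitrary 4KiB pfn)
         *   0xabcd3 & ~(512 - 1)             == 0xabc00, the 2MiB-aligned huge pfn
         *
         * Rounding down is safe precisely because the valid leaf SPTE guarantees
         * the host mapping level can't change while mmu_lock is held.
         */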
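For reference, kvm_may_need_guest_mtrrs() doesn't exist yet, it's one of the
pieces that still needs to be split into a separate patch. Roughly what I have
in mind, though the name and exact condition are TBD:

        /*
         * Sketch only: guest MTRRs can affect the final memtype iff KVM honors
         * guest memtypes at all, i.e. iff EPT is enabled (shadow_memtype_mask
         * is non-zero) and the VM has non-coherent DMA (otherwise KVM forces WB).
         */
        static inline bool kvm_may_need_guest_mtrrs(struct kvm *kvm)
        {
                return shadow_memtype_mask && kvm_arch_has_noncoherent_dma(kvm);
        }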
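And to make the earlier "drop the @pfn argument" idea concrete, the
kvm_mmu_max_mapping_level() side is purely a signature change, something like
this (writing from memory, don't quote me on the exact parameter list):

        /* Today: @pfn exists only so legacy code could pull THP info out of struct page. */
        int kvm_mmu_max_mapping_level(struct kvm *kvm,
                                      const struct kvm_memory_slot *slot, gfn_t gfn,
                                      kvm_pfn_t pfn, int max_level);

        /* Proposed: the max mapping level comes from the host page tables, no pfn needed. */
        int kvm_mmu_max_mapping_level(struct kvm *kvm,
                                      const struct kvm_memory_slot *slot, gfn_t gfn,
                                      int max_level);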