Received: by 2002:a05:6358:16cc:b0:ea:6187:17c9 with SMTP id r12csp12215620rwl; Tue, 3 Jan 2023 10:41:56 -0800 (PST) X-Google-Smtp-Source: AMrXdXstR81jXi5QMkOR9aw+tw31w+612RH5m9G4Mb/TMNukIcFrQpvKpLSXRIVQwvBuMm+G8BE2 X-Received: by 2002:a05:6402:298b:b0:479:971e:58f6 with SMTP id eq11-20020a056402298b00b00479971e58f6mr38784955edb.19.1672771316542; Tue, 03 Jan 2023 10:41:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1672771316; cv=none; d=google.com; s=arc-20160816; b=gCLl4MvWrjcBbMlNDBMkgnmh/JBJtQBIpdTTfWVY9pq2zrJ3AmkCxz2NeQirxRWEyd D+arVBQDAKfIdRTRXcTYMfVIxKPNu33GKcGJ8rMHBigmxh+sus0vGvxDQ2SVotPJlAV3 ArM+pczUs9SxFUqQCjuRssQnFEAsjWRj8H+Rms4uxq6Ky1lgjmpkaIUGrXbXTy+ZhvVm mi1DWFtipMBDtOs12+FWg2XswecMnZcvxtOhuZj1DpEMLjKf4wC3z8ABzQhYuLWw/SrA lqKRsGxtCuJQeJX+4Hp/CLl+I0vXijFcA8KsHe86UJ0Y93tchSrqYYOhqDFlbqq3apeP WpjQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=pq3R+dMUCBtQBXaTENIPK+pFT96E7II7s9kyI9TK3FA=; b=Gqq5jaLMAuxrWBMgZUbYvx1DwJ/Cog7vLZTbpx7arhgSL/riSQmm725uqQDdPNTrY3 nixi07mwyy9n12qsp7QH0eFxg2iQi5WhQNTqIAySV5URS4nFiAOv3eredEUzwfOjqNgj MX3Qfa/HmgaiKgae1S137ds/nccEWZGGsp/Z5RNDI4O2GN00aMbqZYJSUBjslxUqecv2 Gr2Cda0OxD8wcrfnZcDGg8UnSYHOHtRLp5EhmmmlLr5E/TzcBw68FP+Zc2suZ9FOsq3o z0DfN6Ok9SIKArD9Jb+wRo1OE77FKQ6GDxvPzsuzip9PdADzvHdHp70iuOrWjWfVq7pB NVpg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=qtM6NgWa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b8-20020aa7d488000000b00473d8c56c90si24371299edr.139.2023.01.03.10.41.41; Tue, 03 Jan 2023 10:41:56 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=qtM6NgWa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231325AbjACS1f (ORCPT + 60 others); Tue, 3 Jan 2023 13:27:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41246 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238955AbjACS1H (ORCPT ); Tue, 3 Jan 2023 13:27:07 -0500 Received: from mail-yb1-xb31.google.com (mail-yb1-xb31.google.com [IPv6:2607:f8b0:4864:20::b31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9AA8DF023 for ; Tue, 3 Jan 2023 10:26:52 -0800 (PST) Received: by mail-yb1-xb31.google.com with SMTP id e141so34070099ybh.3 for ; Tue, 03 Jan 2023 10:26:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=pq3R+dMUCBtQBXaTENIPK+pFT96E7II7s9kyI9TK3FA=; b=qtM6NgWaiVqZCrj8mlfOBfl9chZFnGk7FZeAhK0wy71zu0Jbt2DGUrwpnG+p6HliNC wO2NaOlbCKIgyAgsIjNe/ikj5/5528+86PkkSkuC0kNwPWkn7sa5m7z0RYJ3nJKeFbCh qKMd4AqqUqZL4N4KitDhTrxCoBKchw+0ODvu4s+qk5ApqgNxN5YNl00Xc/+j+jTj5Zf3 pPUGZsWkOgqVLfWCcj3yYEHcaOdmMTAoL6Jc05sJ/Hh4r2FXvyZEmzUYIWm9n0Tt0b7y GzMQ8UT2wOaVw/bOcZ1dvM88egQWS4WawD4MiF/7QBnjCih+TPZ3Cy6Y3OPb5R3FTuf3 76FA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=pq3R+dMUCBtQBXaTENIPK+pFT96E7II7s9kyI9TK3FA=; b=BTzc3AkHxHzgJvfRBAg12XFwTfXoWUQJL9Uu19hzf38o8G5oMyQOj5WghrvqHG/i0b 4DXaQqLmJplu+nio039YaCzDKve6Rd3CGeEZ9/jWcUKmpABaieHTLPifNFywi6NA7sqj RCznVpG2dcpsJIZ9DXXHyT/WlcZi+UZ6n2WPiSTzWKeW23UiWvVvh26JqtRhVAFo9YgQ DTNrBNdW/l2l0iuDtMABPCLn1t9SkL2rhqyO//0krlnq78F5SWH0SGprFM+ITOJBs4Id 1OJi/0jFDhsxDGK6IecWytfoDlqLIfLK/JOAAYjUTN4EyvojLcqfL4lZ4zwYzqgWgilP n+lQ== X-Gm-Message-State: AFqh2ko95KTVqhmEsrFOB7x6Y2ZusQzTQWEqNAtPbxjy+mbAPH84TNGA UFwptt2QUM/7oPYIcbm4gszx4yy56b3UW1neCRvG6g== X-Received: by 2002:a25:da14:0:b0:707:34cf:b4b1 with SMTP id n20-20020a25da14000000b0070734cfb4b1mr4286913ybf.123.1672770411475; Tue, 03 Jan 2023 10:26:51 -0800 (PST) MIME-Version: 1.0 References: <20221222023457.1764-1-vipinsh@google.com> <20221222023457.1764-6-vipinsh@google.com> In-Reply-To: From: Vipin Sharma Date: Tue, 3 Jan 2023 10:26:15 -0800 Message-ID: Subject: Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split To: David Matlack Cc: seanjc@google.com, pbonzini@redhat.com, bgardon@google.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 29, 2022 at 2:30 PM David Matlack wrote: > > On Wed, Dec 21, 2022 at 06:34:53PM -0800, Vipin Sharma wrote: > > When dirty log is enabled, huge pages are split. Page table's pages > > during the split are allocated based on the current thread NUMA node or > > mempolicy. This causes inefficient page table accesses if underlying > > page is on a different NUMA node > > > > Allocate page table's pages on the same NUMA node as the underlying huge > > page when dirty log is enabled and huge pages are split. > > > > The performance gain during the pre-copy phase of live migrations of a > > 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range > > of 130% to 150%. > > Can you be more specific about this. "The performance" is vague. I know > it's an internal workload and fully explaining it would be difficult, > but you can give readers a slightly more specific idea of what improved. > e.g. > > When testing with a synthetic write-heavy workload in a 416 vCPU VM on > an 8 NUMA node host, the throughput increased by 150% from X to Y > operations per second. > > It's also necessary to characterize the improvement relative to the > performance when dirty logging is not enabled. Whithout that information > it would be hard for an unfamiliar reader to understand how useful this > change really is. > > For example, let's say the throughput of your workload is 100,000 > operations per second before dirty logging is enabled, and that drops > down to 1,000 operations per second after dirty logging is enabled. This > commit could increase that by 150% to 2,500 operations per second, but > that's actually not a very meaningful improvement since, either way, > guest performance is degraded by 95+% during dirty logging. > > On the other hand, if performance goes from 100,000 to 30,000 normally, > and this commit increases that 30,000 to 75,000 (150%), that's a much > more meaningful improvement. > Yeah, I will provide more insight in the next version. > > > > Suggested-by: David Matlack > > Signed-off-by: Vipin Sharma > > --- > > arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++---- > > include/linux/kvm_host.h | 18 ++++++++++++++++++ > > 2 files changed, 26 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > > index 4974fa96deff..376b8dceb3f9 100644 > > --- a/arch/x86/kvm/mmu/tdp_mmu.c > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, > > return spte_set; > > } > > > > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) > > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp) > > { > > struct kvm_mmu_page *sp; > > > > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) > > if (!sp) > > return NULL; > > > > - sp->spt = (void *)__get_free_page(gfp); > > + sp->spt = kvm_mmu_get_free_page(nid, gfp); > > + > > if (!sp->spt) { > > kmem_cache_free(mmu_page_header_cache, sp); > > return NULL; > > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > > bool shared) > > { > > struct kvm_mmu_page *sp; > > + int nid; > > + > > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte)); > > > > /* > > * Since we are allocating while under the MMU lock we have to be > > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > > * If this allocation fails we drop the lock and retry with reclaim > > * allowed. > > */ > > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT); > > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT); > > if (sp) > > return sp; > > > > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > > write_unlock(&kvm->mmu_lock); > > > > iter->yielded = true; > > - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT); > > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT); > > > > if (shared) > > read_lock(&kvm->mmu_lock); > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index d48064503b88..a262e15ebd19 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm); > > int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu); > > > > struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn); > > + > > +/* > > + * Tells the appropriate NUMA node location of the page table's page based on > > + * pfn it will point to. > > I know what you are trying to say but the wording is a bit awkward. e.g. > "Tells" instead of "Returns", "location" is redundant, "page table's > page", etc. Suggest this: > > /* > * Returns an appropriate NUMA node on which to allocate a page table that > * maps @pfn. > */ > > > + * > > + * Return the nid of the page if pfn is valid and backed by a refcounted page, > > + * otherwise, return the nearest memory node for the current CPU. > > I would just drop this as it's just restating the code, which is already > very readable. > Okay. > > + */ > > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn) > > +{ > > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > > + > > + if (page) > > + return page_to_nid(page); > > + else > > + return numa_mem_id(); > > +} > > + > > bool kvm_is_zone_device_page(struct page *page); > > > > struct kvm_irq_ack_notifier { > > -- > > 2.39.0.314.g84b9a713c41-goog > >