From: David Matlack
Date: Tue, 28 Mar 2023 17:28:08 -0700
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault
To: Vipin Sharma <vipinsh@google.com>
Cc: seanjc@google.com, pbonzini@redhat.com, bgardon@google.com, jmattson@google.com, mizhang@google.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230306224127.1689967-1-vipinsh@google.com> <20230306224127.1689967-17-vipinsh@google.com>
On Tue, Mar 28, 2023 at 5:21 PM David Matlack wrote:
>
> On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > Allocate page tables on the preferred NUMA node via the memory cache
> > during page faults. If the memory cache doesn't have a preferred NUMA
> > node (the node value is set to NUMA_NO_NODE) then fall back to the
> > default logic, where pages are selected based on the thread's
> > mempolicy. Also, free the NUMA-aware page caches,
> > mmu_shadow_page_cache, when the memory shrinker is invoked.
> >
> > Allocate root pages based on the current thread's NUMA node, as there
> > is no way to know which NUMA node will be ideal in the long run.
> >
> > This commit allocates page tables on the same NUMA node as the
> > physical pages they point to, even if the vCPU causing the page fault
> > is on a different NUMA node. If memory is not available on the
> > requested NUMA node, the nearest other NUMA node is selected by
> > default. NUMA-aware page tables can be beneficial in cases where a
> > thread touches a lot of far memory initially and then divides the
> > work among multiple threads. VMs generally take advantage of the NUMA
> > architecture for faster memory access by moving threads to the NUMA
> > node of the memory they are accessing. This change will help them
> > access pages faster.
> >
> > The downside of this change is that an experimental workload can be
> > created where guest threads always access remote memory rather than
> > the memory local to them. This will cause performance to degrade
> > compared to VMs where NUMA-aware page tables are not enabled.
> > Ideally, these VMs, when running on a non-uniform memory access
> > machine, should be taking advantage of the NUMA architecture to
> > improve their performance in the first place.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          | 63 ++++++++++++++++++++++++---------
> >  arch/x86/kvm/mmu/mmu_internal.h | 24 ++++++++++++-
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  4 +--
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 14 +++++---
> >  include/linux/kvm_types.h       |  6 ++++
> >  virt/kvm/kvm_main.c             |  2 +-
> >  7 files changed, 88 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 64de083cd6b9..77d3aa368e5e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> >         struct kvm_mmu *walk_mmu;
> >
> >         struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > +       struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
>
> I think we need an abstraction for a NUMA-aware MMU cache, since there
> is more than one by the end of this series.
>
> e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
> NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
> common helper functions for operations like initializing, topping up,
> and freeing.
>
> I have some ideas I want to try but I ran out of time today.
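Roughly what I'm picturing for the wrapper-struct option, as a sketch
only (none of these names exist in the series, and the NUMA_NO_NODE
fallback policy is just an assumption on my part):

struct kvm_mmu_numa_memory_cache {
        /* One cache per NUMA node, indexed by node id. */
        struct kvm_mmu_memory_cache nodes[MAX_NUMNODES];
};

/*
 * Hypothetical helper: top up only the cache for @nid, rather than
 * every online node.
 */
static int kvm_mmu_topup_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache,
                                           int nid, int min)
{
        /* No preferred node: fall back to the current thread's node. */
        if (nid == NUMA_NO_NODE)
                nid = numa_node_id();

        return kvm_mmu_topup_memory_cache(&cache->nodes[nid], min);
}

Then mmu_shadow_page_cache and the other NUMA-aware caches added later
in the series could all share the same init/topup/free helpers.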
>
> >         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> >         struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d96afc849ee8..86f0d74d35ed 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
> >
> >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >  {
> > -       int r;
> > +       int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;
> >
> >         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> >         r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > @@ -710,7 +710,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >         if (r)
> >                 return r;
> >
> > -       r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> > +       if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
> > +               for_each_online_node(nid) {
>
> Blegh. This is potentially going to waste a lot of memory. Yes, the
> shrinker can free it, but the next fault will re-allocate all of the
> online-node caches.
>
> The reason we have to top up all nodes is that KVM tops up caches
> before faulting in the PFN, and there is concern that changing this
> will increase the rate of guest page-fault retries [1].
>
> I think we should revisit that concern. Can we do any testing to
> validate that hypothesis? Or can we convince ourselves that
> re-ordering is ok?
>
> [1] https://lore.kernel.org/kvm/CAHVum0cjqsdG2NEjRF3ZRtUY2t2=Tb9H4OyOz9wpmsrN--Sjhg@mail.gmail.com/

Ah, I forgot about patch 18 reducing the default cache size. So at the
end of this series, even with topping up every node, the maximum number
of objects per cache will be 4 * num_online_nodes. So only hosts with
more than 10 online NUMA nodes would have larger caches than today
(40). That seems more reasonable to me.
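Spelling out that arithmetic (taking patch 18's per-node capacity of 4
objects as a given):

        total across per-node caches = 4 * num_online_nodes
        single cache today           = 40 objects

        4 * num_online_nodes > 40  <=>  num_online_nodes > 10

e.g. a typical 2-node host would top out at 8 objects, well under
today's 40.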