From: Vipin Sharma
Date: Wed, 28 Dec 2022 14:07:54 -0800
Subject: Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
To: Ben Gardon
Cc: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20221222023457.1764-1-vipinsh@google.com> <20221222023457.1764-6-vipinsh@google.com>

On Tue, Dec 27, 2022 at 11:02 AM Ben Gardon wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma wrote:
> >
> > When dirty log is enabled, huge pages are split. Page table's pages
>
> Nit: Suggest "When huge pages are split for dirty log" since this can
> happen at various points during dirty logging.
> Same below.
>

Yeah, this should be updated.

> > during the split are allocated based on the current thread NUMA node or
> > mempolicy. This causes inefficient page table accesses if underlying
> > page is on a different NUMA node.
> >
> > Allocate page table's pages on the same NUMA node as the underlying huge
> > page when dirty log is enabled and huge pages are split.
> >
> > The performance gain during the pre-copy phase of live migrations of a
> > 416 vCPUs and 11 TiB memory VM on an 8 node host was seen in the range
> > of 130% to 150%.
> >
> > Suggested-by: David Matlack
> > Signed-off-by: Vipin Sharma
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> >  include/linux/kvm_host.h   | 18 ++++++++++++++++++
> >  2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4974fa96deff..376b8dceb3f9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> >         return spte_set;
> >  }
> >
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> >  {
> >         struct kvm_mmu_page *sp;
> >
> > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> >         if (!sp)
> >                 return NULL;
> >
> > -       sp->spt = (void *)__get_free_page(gfp);
> > +       sp->spt = kvm_mmu_get_free_page(nid, gfp);
> > +
>
> Just so that kvm_mmu_get_free_page isn't dead code in the previous
> commit, I'd do this refactor there and just pass NUMA_NO_NODE here.
>

Agreed.
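
For anyone reading this thread without the earlier patch in this series
handy: kvm_mmu_get_free_page() is introduced there, and is roughly the
following (a sketch, not the exact code from that patch):

/*
 * Sketch of the helper from the earlier patch: fall back to the normal
 * page allocator when no node is requested, otherwise allocate the page
 * on the given NUMA node. Zeroing etc. are governed by the gfp flags
 * the caller passes in.
 */
void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
{
	struct page *page;

	if (nid == NUMA_NO_NODE)
		return (void *)__get_free_page(gfp);

	page = alloc_pages_node(nid, gfp, 0);
	if (!page)
		return NULL;

	return page_address(page);
}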
> >         if (!sp->spt) {
> >                 kmem_cache_free(mmu_page_header_cache, sp);
> >                 return NULL;
> > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >                                                        bool shared)
> >  {
> >         struct kvm_mmu_page *sp;
> > +       int nid;
> > +
> > +       nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
> >
> >         /*
> >          * Since we are allocating while under the MMU lock we have to be
> > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >          * If this allocation fails we drop the lock and retry with reclaim
> >          * allowed.
> >          */
> > -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > +       sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> >         if (sp)
> >                 return sp;
> >
> > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >                 write_unlock(&kvm->mmu_lock);
> >
> >         iter->yielded = true;
> > -       sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> > +       sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
> >
> >         if (shared)
> >                 read_lock(&kvm->mmu_lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d48064503b88..a262e15ebd19 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> >  int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >
> >  struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> > +
> > +/*
> > + * Tells the appropriate NUMA node location of the page table's page based on
> > + * pfn it will point to.
> > + *
> > + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> > + * otherwise, return the nearest memory node for the current CPU.
>
> Nit: Should this be "current thread"?

I will say "current thread CPU", since memory nodes are near CPUs,
whereas a thread can execute on multiple CPUs over its lifetime.

> > + */
> > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
>
> This could just be kvm_pfn_nid (or even better kvm_pfn_node_id) since
> this really has nothing to do with page tables. We just want to know
> which NUMA node backs the given PFN.

Apart from the NUMA node backing the given PFN, it can also return the
nearest NUMA node via numa_mem_id(). So it is actually telling which
NUMA node is the best one for the page table's page, given a PFN.

> > +{
> > +       struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +
> > +       if (page)
> > +               return page_to_nid(page);
> > +       else
> > +               return numa_mem_id();
> > +}
> > +
> >  bool kvm_is_zone_device_page(struct page *page);
> >
> >  struct kvm_irq_ack_notifier {
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
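
Stepping back, the end-to-end flow after this patch is roughly the
following. This is a simplified sketch of the exclusive-lock path only;
the actual code takes the read or write lock depending on "shared", and
the caller restarts the iteration via iter->yielded:

/*
 * Simplified sketch of the allocation flow discussed above. Not the
 * literal patch: the shared (read lock) case is omitted here.
 */
static struct kvm_mmu_page *sp_for_split_sketch(struct kvm *kvm,
						struct tdp_iter *iter)
{
	/* Prefer the NUMA node backing the huge page being split. */
	int nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
	struct kvm_mmu_page *sp;

	/* Opportunistic, non-sleeping attempt while holding mmu_lock. */
	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
	if (sp)
		return sp;

	/* Drop the lock and retry with direct reclaim allowed. */
	write_unlock(&kvm->mmu_lock);
	iter->yielded = true;
	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
	write_lock(&kvm->mmu_lock);

	return sp;
}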