From: David Matlack
Date: Tue, 9 Aug 2022 09:52:10 -0700
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
To: Vipin Sharma
Cc: Paolo Bonzini, Sean Christopherson, kvm list, LKML
On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma wrote:
[...]
> Here are the two approaches; please provide feedback on which one
> looks more appropriate before I start spamming your inbox with my
> patches.
>
> Approach A:
> Have a per-NUMA-node cache for page table page allocations.
>
> Instead of having only one mmu_shadow_page_cache per vCPU, we provide
> one cache per node, either:
>
>   mmu_shadow_page_cache[MAX_NUMNODES]
>
> or:
>
>   mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]

I think the former approach will work better. The objects[] array is
allocated dynamically during top-up now, so if a vCPU never allocates a
page table to map memory on a given node, KVM will never have to
allocate an objects[] array for that node. With the latter approach,
KVM would have to allocate the entire objects[] array up-front.

> We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value
> instead of 40 to control memory consumption.

I'm not sure we are getting any performance benefit from the cache size
being so high. It doesn't fundamentally change the number of times a
vCPU thread will have to call __get_free_page(); it just batches more
of those calls together. Assuming reducing the cache size doesn't
impact performance, I think it's a good idea to reduce it as part of
this feature.

KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a
fault. So we could decrease the mmu_shadow_page_cache.objects[]
capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes
without increasing memory usage (8 nodes * 5 objects = 40, the current
capacity). If a user wants to run a VM on an even larger machine, I
think it's safe to consume a few extra bytes for the vCPU shadow page
caches at that point (the machine probably has 10s of TiB of RAM).

> When a fault happens, use the pfn to find which node the page should
> belong to and use the corresponding cache to get page table pages.
>
>   struct page *page = kvm_pfn_to_refcounted_page(pfn);
>   int nid;
>
>   if (page) {
>           nid = page_to_nid(page);
>   } else {
>           nid = numa_node_id();
>   }
>
>   ...
>   tdp_mmu_alloc_sp(nid, vcpu);
>   ...
>
>   static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu)
>   {
>           ...
>           sp->spt = kvm_mmu_memory_cache_alloc(nid,
>                           &vcpu->arch.mmu_shadow_page_cache);
>           ...
>   }
>
> Since we are changing cache allocation for page table pages, should we
> also make similar changes for other caches like mmu_page_header_cache,
> mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how
> good this idea is.

We don't currently have a reason to make these objects NUMA-aware, so I
would only recommend it if it somehow makes the code a lot simpler.

> Approach B:
> Ask for a page from the specific node on the fault path, with the
> option to fall back to the original cache and the default task policy.
>
> This is what Sean's rough patch looks like.

This would definitely be a simpler approach, but it could increase the
amount of time a vCPU thread holds the MMU lock when handling a fault,
since KVM would start performing GFP_NOWAIT allocations under the lock.
So my preference would be to try the cache approach first and see how
complex it turns out to be.
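
For concreteness, a rough sketch of what the top-up side of Approach A
could look like. This is hypothetical, untested code:
kvm_mmu_topup_memory_cache_node() does not exist today and stands in
for a node-aware variant of kvm_mmu_topup_memory_cache().

  /* Sketch only, untested: per-node shadow page caches (Approach A). */
  static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu, int nid)
  {
          struct kvm_mmu_memory_cache *mc =
                  &vcpu->arch.mmu_shadow_page_cache[nid];

          /*
           * Runs before the MMU lock is taken, so it can sleep.
           * PT64_ROOT_MAX_LEVEL pages are enough for a single fault,
           * and only the cache for the faulting node gets filled.
           */
          return kvm_mmu_topup_memory_cache_node(mc, PT64_ROOT_MAX_LEVEL, nid);
  }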
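
And, for contrast, roughly what the allocation side of Approach B could
look like (again just a sketch; the fallback path uses the existing
single per-vCPU cache, which follows the default task policy):

  /*
   * Sketch only, untested (Approach B): try a node-targeted GFP_NOWAIT
   * allocation under the MMU lock, and fall back to the pre-filled
   * cache if it fails.
   */
  static u64 *mmu_alloc_spt_on_node(struct kvm_vcpu *vcpu, int nid)
  {
          struct page *page;

          /* GFP_NOWAIT: no sleeping or reclaim while holding the MMU lock. */
          page = __alloc_pages_node(nid, GFP_NOWAIT | __GFP_ZERO | __GFP_ACCOUNT, 0);
          if (page)
                  return page_address(page);

          return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
  }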