From: Vipin Sharma
Date: Fri, 5 Aug 2022 16:30:11 -0700
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
To: Paolo Bonzini, Sean Christopherson, David Matlack
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20220801151928.270380-1-vipinsh@google.com> <4ccbafb5-9157-ec73-c751-ec71164f8688@redhat.com>
On Wed, Aug 3, 2022 at 8:08 AM Paolo Bonzini wrote:
>
> On 8/2/22 21:12, Sean Christopherson wrote:
> > Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was
> > getting the node for the existing shadow page, but it's actually getting
> > the node for the underlying physical huge page.
> >
> > That means this patch is broken now that KVM doesn't require a "struct
> > page" to create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't
> > require refcounted "struct page" to create huge SPTEs")). I.e. this will
> > explode as pfn_to_page() may return garbage.
> >
> >     return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
>
> I was about to say that yesterday. However, my knowledge of struct page
> things has proved to be spotty enough that I wasn't 100% sure of that.
> But yeah, with a fresh mind it's quite obvious that anything that goes
> through hva_to_pfn_remap and similar paths will fail.
>
> > That said, I still don't like this patch, at least not as a one-off
> > thing. I do like the idea of having page table allocations follow the
> > node requested for the target pfn; what I don't like is having one-off
> > behavior that silently overrides userspace policy.
>
> Yes, I totally agree with making it all or nothing.
>
> > I would much rather we solve the problem for all page table allocations,
> > maybe with a KVM_CAP or module param? Unfortunately, that's a very
> > non-trivial change because it will probably require having a per-node
> > cache in each of the per-vCPU caches.
>
> A module parameter is fine. If you care about performance to this level,
> your userspace is probably homogeneous enough.

Thank you all for the feedback and suggestions.

Regarding dirty_log_perf_test: I will send out a patch to add an option
which will pin vCPUs to physical CPUs using sched_setaffinity() calls.
This should increase the accuracy of the test.

It seems like we are agreeing on two things:

1. Keep the same behavior for page table page allocation both during
   page faults and during eager page splitting.

2. Page table pages should be allocated on the same node where the
   underlying guest memory page resides.

Here are the two approaches; please provide feedback on which one looks
more appropriate before I start spamming your inboxes with my patches.

Approach A: Have a per-NUMA-node cache for page table page allocation.

Instead of having only one mmu_shadow_page_cache per vCPU, we provide
one cache per node, either as:

  mmu_shadow_page_cache[MAX_NUMNODES]

or

  mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]

We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some value lower
than 40 to control memory consumption.

When a fault happens, use the pfn to find which node the page belongs
to and use the corresponding cache to get page table pages:

  struct page *page = kvm_pfn_to_refcounted_page(pfn);
  int nid;

  if (page) {
          nid = page_to_nid(page);
  } else {
          nid = numa_node_id();
  }

  ...
  tdp_mmu_alloc_sp(nid, vcpu);
  ...

  static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu)
  {
          ...
          sp->spt = kvm_mmu_memory_cache_alloc(nid,
                          &vcpu->arch.mmu_shadow_page_cache);
          ...
  }

Since we are changing cache allocation for page table pages, should we
also make similar changes for other caches like mmu_page_header_cache,
mmu_gfn_array_cache, and mmu_pte_list_desc_cache?
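To make the first layout option concrete, the struct change would look
roughly like the following (just a sketch on my side; turning the
existing kvm_vcpu_arch fields into per-node arrays is the only change,
and whether the caches besides mmu_shadow_page_cache should be converted
is exactly the open question above):

  /*
   * Sketch of layout option 1: one kvm_mmu_memory_cache per NUMA node
   * instead of a single cache per vCPU.  Only mmu_shadow_page_cache is
   * strictly needed for page table pages; the other three are shown for
   * the "convert them too?" question.
   */
  struct kvm_vcpu_arch {
          ...
          struct kvm_mmu_memory_cache mmu_pte_list_desc_cache[MAX_NUMNODES];
          struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
          struct kvm_mmu_memory_cache mmu_gfn_array_cache[MAX_NUMNODES];
          struct kvm_mmu_memory_cache mmu_page_header_cache[MAX_NUMNODES];
          ...
  };

Topping up (and freeing) the caches would then presumably have to cover
all nodes, which is where the memory consumption concern above comes
from.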
I am not sure how good this idea is.

Approach B: Ask for a page from the specific node on the fault path,
with the option to fall back to the original cache and the default
task policy.

This is what Sean's rough patch looks like.
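If it helps the discussion, my mental model of approach B is roughly
the following shape (my own illustration, not Sean's actual diff; the
helper name is made up, and it glosses over the fact that the fault
path normally pre-fills the caches because it cannot sleep under
mmu_lock, hence the non-sleeping allocation):

  /*
   * Illustration only, not Sean's patch: try a non-sleeping allocation
   * on the node of the underlying guest page first, and fall back to
   * the existing per-vCPU cache, which was topped up under the task's
   * default NUMA policy.
   */
  static u64 *mmu_alloc_spt_on_node(struct kvm_vcpu *vcpu, int nid)
  {
          struct page *page;

          page = alloc_pages_node(nid, GFP_NOWAIT | __GFP_ZERO |
                                       __GFP_THISNODE, 0);
          if (page)
                  return page_address(page);

          return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
  }

The attraction of approach B is that it avoids per-node caches
entirely, at the cost of doing the node-specific allocation at fault
time.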