Date: Mon, 1 Aug 2022 23:56:21 +0000
From: Sean Christopherson
To: David Matlack
Cc: Vipin Sharma, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
References: <20220801151928.270380-1-vipinsh@google.com>
List-ID: <linux-kernel.vger.kernel.org>

On Mon, Aug 01, 2022, David
Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.

Userspace can solve this by setting the NUMA policy on a VMA or shared-object
basis.  E.g. create dedicated memslots for each NUMA node, then bind each of
the backing stores to the appropriate host node.  If there is a gap, e.g. a
backing store we want to use doesn't properly support mempolicy for shared
mappings, then we should enhance the backing store.

> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware.  Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as
> > the huge page.
> >
> > __get_free_page() is replaced by alloc_page_nodes().  This introduces
> > two functional changes.
> >
> >   1. __get_free_page() removes gfp flag __GFP_HIGHMEM via its call to
> >      __get_free_pages().  This should not be an issue as the
> >      __GFP_HIGHMEM flag is not passed in tdp_mmu_alloc_sp_for_split()
> >      anyway.
> >
> >   2. __get_free_page() calls alloc_pages() and uses the thread's
> >      mempolicy for the NUMA node allocation.  From this commit, the
> >      thread's mempolicy will not be used and first preference will be
> >      to allocate on the node where the huge page was present.
>
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.

As above, if userspace cares about vNUMA, then it already needs to be aware
of some KVM/kernel details.  Separate memslots aren't strictly necessary, e.g.
userspace could stitch together contiguous VMAs to create a single
mega-memslot, but that seems like it'd be more work than just creating
separate memslots.  And because eager page splitting for dirty logging runs
with mmu_lock held for read, userspace might also benefit from per-node
memslots as it can do the splitting on multiple tasks/CPUs.

Regardless of what we do, the behavior needs to be documented, i.e. KVM
details will bleed into userspace.  E.g. if KVM is overriding the per-task
NUMA policy, then that should be documented.

> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for userspace
> to prefer a different policy.

I can't think of a reason why userspace would want to have a different policy
for the task that's enabling dirty logging, but I also can't think of a
reason why KVM should go out of its way to ignore that policy.

IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea
to document how to effectively configure vNUMA-aware memslots.