From: Wei Xu
Date: Wed, 27 Apr 2022 11:27:14 -0700
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
To: Aneesh Kumar K V
Cc: "ying.huang@intel.com", Jagdish Gediya, Yang Shi, Dave Hansen, Dan Williams,
    Davidlohr Bueso, Linux MM, Linux Kernel Mailing List, Andrew Morton,
    Baolin Wang, Greg Thelen, Michal Hocko, Brice Goglin
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V wrote:
>
> On 4/25/22 10:26 PM, Wei Xu wrote:
> > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com wrote:
> >>
>
> ....
>
> >> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> >>
> >> Node 0 & 2 are cpu + dram nodes and node 1 is a slow
> >> memory node near node 0,
> >>
> >> available: 3 nodes (0-2)
> >> node 0 cpus: 0 1
> >> node 0 size: n MB
> >> node 0 free: n MB
> >> node 1 cpus:
> >> node 1 size: n MB
> >> node 1 free: n MB
> >> node 2 cpus: 2 3
> >> node 2 size: n MB
> >> node 2 free: n MB
> >> node distances:
> >> node   0   1   2
> >>   0:  10  40  20
> >>   1:  40  10  80
> >>   2:  20  80  10
> >>
> >> We have 2 choices,
> >>
> >> a)
> >> node    demotion targets
> >> 0       1
> >> 2       1
> >>
> >> b)
> >> node    demotion targets
> >> 0       1
> >> 2       X
> >>
> >> a) is good to take advantage of PMEM. b) is good to reduce cross-socket
> >> traffic. Both are OK as default configurations. But some users may
> >> prefer the other one. So we need a user space ABI to override the
> >> default configuration.
> >
> > I think 2(a) should be the system-wide configuration and 2(b) can be
> > achieved with NUMA mempolicy (which needs to be added to demotion).
> >
> > In general, we can view the demotion order in a way similar to the
> > allocation fallback order (after all, if we don't demote, or demotion
> > lags behind, the allocations will go to these demotion target nodes
> > according to the allocation fallback order anyway). If we initialize
> > the demotion order in that way (i.e. every node can demote to any node
> > in the next tier, and the target nodes are sorted by priority for each
> > source node), we don't need a per-node demotion order override from
> > userspace. What we need is to specify which nodes belong to each tier
> > and to support NUMA mempolicy in demotion.
> >
>
> I have been wondering how we would handle this. For example: if an
> application has specified an MPOL_BIND policy and restricted its
> allocations to Node0 and Node1, should we demote pages allocated by
> that application to Node10? The other alternative for that demotion is
> swapping. So from the page's point of view, we either demote to slow
> memory or page out to swap. But if we demote, we are also breaking the
> MPOL_BIND rule.

IMHO, the MPOL_BIND policy should be respected and demotion should be
skipped in such cases. Such MPOL_BIND policies can be an important tool
for applications to override and control their memory placement when
transparent memory tiering is enabled. If the application doesn't want
swapping, there are other ways to achieve that (e.g. mlock, disabling
swap globally, setting memcg parameters, etc.).
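To make that concrete, here is a minimal userspace sketch (illustrative
only, not part of the series; it assumes libnuma's numaif.h is installed,
borrows the node numbers from the example above, and builds with -lnuma)
of an application binding its allocations to nodes 0-1 with MPOL_BIND:

#define _GNU_SOURCE
#include <numaif.h>      /* mbind(), MPOL_BIND, MPOL_MF_STRICT */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MiB anonymous region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Bind the range to nodes 0 and 1 only (bits 0 and 1 set). */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);
        if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
                  MPOL_MF_STRICT)) {
                perror("mbind");
                return 1;
        }

        /* First-touch faults now allocate only from nodes 0-1; the open
         * question in this thread is whether reclaim may later demote
         * these pages to a node outside that mask. */
        ((volatile char *)buf)[0] = 1;
        return 0;
}

Demotion that respects this policy would simply never move such pages to
a node outside the bound nodemask and would fall back to the normal
reclaim path instead.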
> The above says we would need some kind of mempolicy interaction, but
> what I am not sure about is how to find the memory policy in the
> demotion path.

This is indeed an important and challenging problem. One possible
approach is to retrieve the allowed demotion nodemask from
page_referenced(), similar to vm_flags.

> > Cross-socket demotion should not be too big a problem in practice
> > because we can optimize the code to do the demotion from the local CPU
> > node (i.e. local writes to the target node and remote reads from the
> > source node). The bigger issue is cross-socket memory access onto the
> > demoted pages from the applications, which is why NUMA mempolicy is
> > important here.
> >
>
> -aneesh
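Going back to the ordering point above, here is a toy sketch (purely
illustrative, not kernel code; the distance matrix and the single
slow-memory node are hard-coded from ying.huang's 3-node example) of
"every fast node demotes to the next-tier nodes, sorted by NUMA distance
from the source":

#include <stdio.h>

#define NR_NODES 3

static const int distance[NR_NODES][NR_NODES] = {
        { 10, 40, 20 },
        { 40, 10, 80 },
        { 20, 80, 10 },
};

/* 1 if the node is a slow-memory (next-tier) node, 0 if it is a fast node. */
static const int is_slow_node[NR_NODES] = { 0, 1, 0 };

int main(void)
{
        for (int src = 0; src < NR_NODES; src++) {
                if (is_slow_node[src])
                        continue;       /* only fast nodes demote here */

                printf("node %d demotes to:", src);

                /* Emit slow nodes in order of increasing distance from src
                 * (simple selection scan; fine for a handful of nodes). */
                int done[NR_NODES] = { 0 };
                for (;;) {
                        int best = -1;
                        for (int t = 0; t < NR_NODES; t++) {
                                if (!is_slow_node[t] || done[t])
                                        continue;
                                if (best < 0 ||
                                    distance[src][t] < distance[src][best])
                                        best = t;
                        }
                        if (best < 0)
                                break;
                        done[best] = 1;
                        printf(" %d", best);
                }
                printf("\n");
        }
        return 0;
}

For that topology it prints "node 0 demotes to: 1" and "node 2 demotes
to: 1", i.e. default 2(a); preference 2(b) would then be expressed
per-workload through mempolicy rather than by changing the default order.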