Message-ID: <4ef0c618cdb61745047ca09f66fdc3a1746952f8.camel@intel.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
From: Ying Huang
To: Jonathan Cameron, Wei Xu
Cc: Andrew Morton, Greg Thelen, "Aneesh Kumar K.V", Yang Shi,
 Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko, Tim C Chen,
 Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang, Davidlohr Bueso,
 Dan Williams, David Rientjes, Linux MM, Brice Goglin, Hesham Almatary
Date: Mon, 30 May 2022 14:54:31 +0800
In-Reply-To: <20220527101036.0000584a@Huawei.com>
References: <20220512160010.00005bc4@Huawei.com>
 <6b7c472b50049592cde912f04ca47c696caa2227.camel@intel.com>
 <6ce724e5c67d4f7530457897fa08d0a8ba5dd6d0.camel@intel.com>
 <4c712c5aa69efc103091077f1d3579efa56015a7.camel@intel.com>
 <41229097bf02364a0dbd9e85c9488160db8a3c89.camel@intel.com>
 <20220527101036.0000584a@Huawei.com>
T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2022-05-27 at 10:10 +0100, Jonathan Cameron wrote: > On Thu, 26 May 2022 13:55:39 -0700 > Wei Xu wrote: > > > On Thu, May 26, 2022 at 12:39 AM Ying Huang wrote: > > > > > > On Thu, 2022-05-26 at 00:08 -0700, Wei Xu wrote: > > > > On Wed, May 25, 2022 at 11:55 PM Ying Huang wrote: > > > > > > > > > > On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote: > > > > > > On Wed, May 25, 2022 at 6:10 PM Ying Huang wrote: > > > > > > > > > > > > > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote: > > > > > > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang wrote: > > > > > > > > > > > > > > > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote: > > > > > > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang wrote: > > > > > > > > > > > > > > > > > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote: > > > > > > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote: > > > > > > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700 > > > > > > > > > > > > > > > Wei Xu wrote: > > > > > > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive > > > > > > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower > > > > > > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier > > > > > > > > > > > > > > > > NUMA node. Frequently accessed pages on a lower tier NUMA node can be > > > > > > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the > > > > > > > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a > > > > > > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during > > > > > > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or > > > > > > > > > > > > > > > > hot-removed. The current implementation puts all nodes with CPU into > > > > > > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing > > > > > > > > > > > > > > > > the per-node demotion targets based on the distances between nodes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This current memory tier kernel interface needs to be improved for > > > > > > > > > > > > > > > > several important use cases: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * The current tier initialization code always initializes > > > > > > > > > > > > > > > >   each memory-only NUMA node into a lower tier. But a memory-only > > > > > > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM > > > > > > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on > > > > > > > > > > > > > > > >   a virtual machine) and should be put into a higher tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top > > > > > > > > > > > > > > > >   tier. But on a system with HBM (e.g. 
GPU memory) devices, these > > > > > > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes > > > > > > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes > > > > > > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and > > > > > > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice > > > > > > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no > > > > > > > > > > > > > > > >   memory node is added or removed. This can make the tier > > > > > > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based > > > > > > > > > > > > > > > >   memory accounting. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the > > > > > > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other > > > > > > > > > > > > > > > >   node from any lower tier. This strict, hard-coded demotion order > > > > > > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to > > > > > > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion > > > > > > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of > > > > > > > > > > > > > > > >   space), and has resulted in the feature request for an interface to > > > > > > > > > > > > > > > >   override the system-wide, per-node demotion order from the > > > > > > > > > > > > > > > >   userspace. This demotion order is also inconsistent with the page > > > > > > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are > > > > > > > > > > > > > > > >   out of space: The page allocation can fall back to any node from > > > > > > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory > > > > > > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on > > > > > > > > > > > > > > > > the discussions in the threads: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/ > > > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/ > > > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > High-level Design Ideas > > > > > > > > > > > > > > > > ======================= > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Define memory tiers explicitly, not implicitly. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory > > > > > > > > > > > > > > > >   nodes, not their relative node distances between each other. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * The tier assignment of each node is independent from each other. 
> > > > > > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier > > > > > > > > > > > > > > > >   assignment of any other node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a > > > > > > > > > > > > > > > >   different tier only under the specific conditions that don't block > > > > > > > > > > > > > > > >   future tier-based memory cgroup accounting. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The > > > > > > > > > > > > > > > >   demotion target node selection follows the allocation fallback order > > > > > > > > > > > > > > > >   of the source node, which is built based on node distances. The > > > > > > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers > > > > > > > > > > > > > > > >   lower than the source node. We no longer need to maintain a separate > > > > > > > > > > > > > > > >   per-node demotion order (node_demotion[]). > > > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Wei, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This proposal looks good to me, though we'll be having fun > > > > > > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's good to hear. > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > A few comments inline. It also seems likely to me that there is little > > > > > > > > > > > > > > > benefit in starting with 3 tiers as the maximum. Seems unlikely the > > > > > > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5. > > > > > > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things. > > > > > > > > > > > > > > > > > > > > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what > > > > > > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware > > > > > > > > > > > > > > performance information from the firmware. > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sysfs Interfaces > > > > > > > > > > > > > > > > ================ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Format: node_list > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Read-only. When read, list the memory nodes in the specified tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning. > > > > > > > > > > > > > > > >   What matters is the relative order of the tier id numbers. 
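[Editorial illustration, not part of the quoted RFC: a minimal kernel-style sketch of how such a read-only nodelist attribute could be backed by the per-tier nodemask_t memory_tiers[] array proposed in the Kernel Representation section further down. The dev->id mapping and the attribute wiring are assumptions for illustration, not code from any actual patch.]

/*
 * Sketch only: a read-only "nodelist" attribute for a memtierN device,
 * assuming memory_tiers[] holds one nodemask per tier and that the
 * memtier device's dev->id is the tier number N.
 */
#include <linux/device.h>
#include <linux/nodemask.h>
#include <linux/sysfs.h>

extern nodemask_t memory_tiers[];	/* assumed: one nodemask per tier */

static ssize_t nodelist_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	/* "%*pbl" prints a nodemask in node_list format, e.g. "0,2-3" */
	return sysfs_emit(buf, "%*pbl\n",
			  nodemask_pr_args(&memory_tiers[dev->id]));
}
static DEVICE_ATTR_RO(nodelist);
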
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier > > > > > > > > > > > > > > > >   sysfs files. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   where N = 0, 1, ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Format: int or empty > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   When read, list the memory tier that the node belongs to. Its value > > > > > > > > > > > > > > > >   is empty for a CPU-only NUMA node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   When written, the kernel moves the node into the specified memory > > > > > > > > > > > > > > > >   tier if the move is allowed. The tier assignment of all other nodes > > > > > > > > > > > > > > > >   are not affected. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Initially, we can make this interface read-only. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Kernel Representation > > > > > > > > > > > > > > > > ===================== > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Support 3 memory tiers for now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   The default tier that a memory node is assigned to. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Store memory nodes by tiers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   Map a node to its tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Memory Tier Initialization > > > > > > > > > > > > > > > > ========================== > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > By default, all memory nodes are assigned to the default tier > > > > > > > > > > > > > > > > (MEMORY_DEFAULT_TIER). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is tighter than it needs to be. In many cases we can easily > > > > > > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into > > > > > > > > > > > > > > > a memory node. If it's CXL attached no way CPUs are going to be > > > > > > > > > > > > > > > turning up their later :) If CPU HP into a given node can't happen > > > > > > > > > > > > > > > we can be more flexible and I think that often results in better decisions. > > > > > > > > > > > > > > > See example below, though obviously I could just use the userspace > > > > > > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around > > > > > > > > > > > > > > > if that's relevant. 
In some other cases I'm fairly sure we know in > > > > > > > > > > > > > > > advance where CPUs can be added but I'd need to check all the > > > > > > > > > > > > > > > relevant specs to be sure there aren't any corner cases. I 'think' > > > > > > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged > > > > > > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only > > > > > > > > > > > > > > > virtual CPU HP is defined). > > > > > > > > > > > > > > > > > > > > > > > > > > > > We may not always want to put a CXL-attached memory device into a > > > > > > > > > > > > > > slower tier because even though CXL does add some additional latency, > > > > > > > > > > > > > > both the memory device and CXL can still be very capable in > > > > > > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM > > > > > > > > > > > > > > (e.g. DRAM from a remote CPU socket). > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, the default tier here is just the initial tier assignment of > > > > > > > > > > > > > > each node, which behaves as if there were no tiering. A tiering > > > > > > > > > > > > > > kernel init function can certainly reassign the tier for each node if > > > > > > > > > > > > > > it knows enough about the hardware performance for these nodes from > > > > > > > > > > > > > > the firmware. > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > A device driver can move up or down its memory nodes from the default > > > > > > > > > > > > > > > > tier. For example, PMEM can move down its memory nodes below the > > > > > > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the > > > > > > > > > > > > > > > > default tier. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier > > > > > > > > > > > > > > > > a memory node should be assigned to based on the requests from the > > > > > > > > > > > > > > > > device drivers as well as the memory device hardware information > > > > > > > > > > > > > > > > provided by the firmware. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Memory Tier Reassignment > > > > > > > > > > > > > > > > ======================== > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a > > > > > > > > > > > > > > > > different memory tier. This is useful for supporting dynamically > > > > > > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different > > > > > > > > > > > > > > > > memory devices across hot-plug events. Such tier changes should > > > > > > > > > > > > > > > > be compatible with tier-based memory accounting. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The userspace may also reassign an existing online memory node to a > > > > > > > > > > > > > > > > different tier. However, this should only be allowed when no pages > > > > > > > > > > > > > > > > are allocated from the memory node or when there are no non-root > > > > > > > > > > > > > > > > memory cgroups (e.g. during the system boot). 
This restriction is > > > > > > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for > > > > > > > > > > > > > > > > tier-based memory cgroup accounting. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Memory Allocation for Demotion > > > > > > > > > > > > > > > > ============================== > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel > > > > > > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the > > > > > > > > > > > > > > > > source page node as the preferred node and the union of all lower > > > > > > > > > > > > > > > > tier nodes as the allowed nodemask. The actual target node selection > > > > > > > > > > > > > > > > then follows the allocation fallback order that the kernel has > > > > > > > > > > > > > > > > already defined. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The pseudo code looks like: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >     targets = NODE_MASK_NONE; > > > > > > > > > > > > > > > >     src_nid = page_to_nid(page); > > > > > > > > > > > > > > > >     src_tier = node_tier_map[src_nid]; > > > > > > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++) > > > > > > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]); > > > > > > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The memopolicy of cpuset, vma and owner task of the source page can > > > > > > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent > > > > > > > > > > > > > > > > demotion or select a particular allowed node as the demotion target. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Memory Allocation for Promotion > > > > > > > > > > > > > > > > =============================== > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1) > > > > > > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can > > > > > > > > > > > > > > > > be the accessing CPU node, not the source page node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Examples > > > > > > > > > > > > > > > > ======== > > > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Example 3: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node2 is drawn as pmem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Typo. Good catch. > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All nodes are in the same tier. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >                   20 > > > > > > > > > > > > > > > >   Node 0 (DRAM) ---- Node 1 (DRAM) > > > > > > > > > > > > > > > >          \ / > > > > > > > > > > > > > > > >           \ 30 / 30 > > > > > > > > > > > > > > > >            \ / > > > > > > > > > > > > > > > >              Node 2 (PMEM) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > > > >    0 10 20 30 > > > > > > > > > > > > > > > >    1 20 10 30 > > > > > > > > > > > > > > > >    2 30 30 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 0-2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion fallback order: > > > > > > > > > > > > > > > > node 0: empty > > > > > > > > > > > > > > > > node 1: empty > > > > > > > > > > > > > > > > node 2: empty > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Example 4: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU. > > > > > > > > > > > > > > > > Node 1 is a PMEM node. > > > > > > > > > > > > > > > > Node 2 is a GPU node. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >                   50 > > > > > > > > > > > > > > > >   Node 0 (DRAM) ---- Node 2 (GPU) > > > > > > > > > > > > > > > >          \ / > > > > > > > > > > > > > > > >           \ 30 / 60 > > > > > > > > > > > > > > > >            \ / > > > > > > > > > > > > > > > >              Node 1 (PMEM) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > > > >    0 10 30 50 > > > > > > > > > > > > > > > >    1 30 10 60 > > > > > > > > > > > > > > > >    2 50 60 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist > > > > > > > > > > > > > > > > 2 > > > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > 2 > > > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion fallback order: > > > > > > > > > > > > > > > > node 0: 1 > > > > > > > > > > > > > > > > node 1: empty > > > > > > > > > > > > > > > > node 2: 0, 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Example 5: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU. > > > > > > > > > > > > > > > > Node 1 is a GPU node. > > > > > > > > > > > > > > > > Node 2 is a PMEM node. > > > > > > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >      Node 2 (PMEM) ---- > > > > > > > > > > > > > > > >    / | \ > > > > > > > > > > > > > > > >   / | 30 \ 120 > > > > > > > > > > > > > > > >  | | 100 \ > > > > > > > > > > > > > > > >  | Node 0 (DRAM) ---- Node 1 (GPU) > > > > > > > > > > > > > > > >   \ \ / > > > > > > > > > > > > > > > >     \ \ 40 / 110 > > > > > > > > > > > > > > > >   80 \ \ / > > > > > > > > > > > > > > > >         --- Node 3 (Slow DRAM) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is close but not quite what was intended for Hesham's > > > > > > > > > > > > > > > example... (note we just checked that Hesham's original node0-1 > > > > > > > > > > > > > > > timing didn't make any sense.). > > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This was inspired by Hesham's example. But I should have also included > > > > > > > > > > > > > > the version that illustrates the need to skip a tier when demoting > > > > > > > > > > > > > > from certain nodes. > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > > node 0 1 2 3 > > > > > > > > > > > > > > > >    0 10 100 30 40 > > > > > > > > > > > > > > > >    1 100 10 120 110 > > > > > > > > > > > > > > > >    2 30 120 10 80 > > > > > > > > > > > > > > > >    3 40 110 80 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > 0,3 > > > > > > > > > > > > > > > > 2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > > > 2 > > > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion fallback order: > > > > > > > > > > > > > > > > node 0: 2 > > > > > > > > > > > > > > > > node 1: 0, 3, 2 > > > > > > > > > > > > > > > > node 2: empty > > > > > > > > > > > > > > > > node 3: 2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is close but not quite the same as the example > > > > > > > > > > > > > > > Hesham gave (note the node timing 1 to 0 on in the table > > > > > > > > > > > > > > > with that example didn't make sense). I added another > > > > > > > > > > > > > > > level of switching to make the numbers more obviously > > > > > > > > > > > > > > > different and show how critical it might be. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * Example 6: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU. > > > > > > > > > > > > > > > Node 1 is a GPU node. > > > > > > > > > > > > > > > Node 2 is a PMEM node. > > > > > > > > > > > > > > > Node 3 is an extremely large, DRAM node without CPU. > > > > > > > > > > > > > > >   (Key point here being that it probably never makes sense > > > > > > > > > > > > > > >    to demote to anywhere else from this memory). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I've redone the timings wrt to example 5. 
> > > > > > > > > > > > > > > Basis for this is 0 and 2 are directly connected > > > > > > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected > > > > > > > > > > > > > > > via a a common switch one switch down switch > > > > > > > > > > > > > > > (each hop via this is 100) > > > > > > > > > > > > > > > All drams cost 10 once you've reached correct node > > > > > > > > > > > > > > > and pmem costs 30 from SoC. > > > > > > > > > > > > > > > Numbers get too large as a result but meh, I'm making > > > > > > > > > > > > > > > a point not providing real numbers :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >          PMEM Node 2 > > > > > > > > > > > > > > >             |(30) > > > > > > > > > > > > > > >         CPU + DRAM Node0 > > > > > > > > > > > > > > >             |(100) > > > > > > > > > > > > > > >          Switch 1 > > > > > > > > > > > > > > >             |(100) > > > > > > > > > > > > > > >           Switch 2 > > > > > > > > > > > > > > >     (100) | |(100) > > > > > > > > > > > > > > > Node 1 GPU Node3 Large memory. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > With one level of s > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >      Node 2 (PMEM) ---- > > > > > > > > > > > > > > >     / | \ > > > > > > > > > > > > > > >    / | 30 \ 330 > > > > > > > > > > > > > > >   | | 310 \ > > > > > > > > > > > > > > >   | Node 0 (DRAM) ---- Node 1 (GPU) > > > > > > > > > > > > > > >    \ \ / > > > > > > > > > > > > > > >      \ \ 310 / 210 > > > > > > > > > > > > > > >    330 \ \ / > > > > > > > > > > > > > > >          --- Node 3 (Extremely large DRAM) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To my mind, we should potentially also take into account > > > > > > > > > > > > > > > the fact that Node3 can be known to never contain CPUs > > > > > > > > > > > > > > > (in at least some architectures we know where the CPUs > > > > > > > > > > > > > > >  might be added later, they can't just magically turn up > > > > > > > > > > > > > > >  anywhere in the topology). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > node 0 1 2 3 > > > > > > > > > > > > > > >     0 10 310 30 310 > > > > > > > > > > > > > > >     1 310 10 330 210 > > > > > > > > > > > > > > >     2 30 330 10 330 > > > > > > > > > > > > > > >     3 310 210 330 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > So, my ideal would treat node 3 different from other dram nodes > > > > > > > > > > > > > > > as we know it can't have CPUs. Trying to come up with an > > > > > > > > > > > > > > > always correct order for nodes 3 and 2 is tricky as to a certain > > > > > > > > > > > > > > > extent depends on capacity. If node 2 was big enough to take > > > > > > > > > > > > > > > any demotion from node 0 and still have lots of room then demoting > > > > > > > > > > > > > > > there form node 3 would make sense and visa versa. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist > > > > > > > > > > > > > > >  1 > > > > > > > > > > > > > > >  0 > > > > > > > > > > > > > > >  2 > > > > > > > > > > > > > > >  3 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier > > > > > > > > > > > > > > >   1 > > > > > > > > > > > > > > >   0 > > > > > > > > > > > > > > >   2 > > > > > > > > > > > > > > >   3 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  Demotion fallback order: > > > > > > > > > > > > > > >  node 0: 2, 3 > > > > > > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3) > > > > > > > > > > > > > > >  node 2: 3 > > > > > > > > > > > > > > >  node 3: empty > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > or as Hesham just pointed out this can be done with 3 tiers > > > > > > > > > > > > > > > because we can put the GPU and CPU in the same tier because > > > > > > > > > > > > > > > their is little reason to demote from one to the other. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for the example. It makes sense to me to have node 3 on its > > > > > > > > > > > > > > own tier. We can have either 3 tiers or 4 tiers in total (assuming > > > > > > > > > > > > > > that the max number of tiers is a config option). > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because > > > > > > > > > > > > > > > of potential need to make more space in tiers lower in number than > > > > > > > > > > > > > > > CPU attached DDR. I rather liked the negative proposal with > > > > > > > > > > > > > > > default as 0 that Huang, Ying made. > > > > > > > > > > > > > > > > > > > > > > > > > > > > It is hard to have negative values as the device IDs. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The current proposal equals the tier device ID to the tier hierarchy > > > > > > > > > > > > > > level, which makes the interface simpler, but less flexible. How > > > > > > > > > > > > > > about the following proposal (which decouples the tier device ID from > > > > > > > > > > > > > > the tier level)? > > > > > > > > > > > > > > > > > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist > > > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/rank > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each memory tier N has two sysfs files: > > > > > > > > > > > > > > - nodelist: the nodes that are in this tier > > > > > > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier > > > > > > > > > > > > > > is in the tier hierarchy (smaller value means faster tier) > > > > > > > > > > > > > > > > > > > > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id > > > > > > > > > > > > > > number N from "memtierN". > > > > > > > > > > > > > > > > > > > > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry > > > > > > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of > > > > > > > > > > > > > > this memtier in the tier hierarchy. 
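[Editorial illustration, not part of the quoted proposal: one way the "rank" idea could be made concrete. Each memtier device keeps a stable device ID plus an opaque rank, and the tier hierarchy is simply the list of tiers kept sorted by rank ("smaller value means faster tier", per the proposal above). struct memory_tier and insert_memory_tier() are made-up names for this sketch, not the eventual implementation.]

#include <linux/list.h>
#include <linux/nodemask.h>

struct memory_tier {
	int id;			/* memtierN device ID, never reused */
	int rank;		/* opaque; only the relative order matters */
	nodemask_t nodes;	/* nodes currently assigned to this tier */
	struct list_head list;
};

static LIST_HEAD(memory_tiers_by_rank);	/* kept sorted, fastest tier first */

/* Insert a new tier without disturbing the IDs of existing tiers. */
static void insert_memory_tier(struct memory_tier *new)
{
	struct list_head *pos = &memory_tiers_by_rank;
	struct memory_tier *t;

	/* Find the first existing tier with a larger (slower) rank. */
	list_for_each_entry(t, &memory_tiers_by_rank, list) {
		if (new->rank < t->rank) {
			pos = &t->list;
			break;
		}
	}
	/* Insert before it, or at the tail if every tier ranks lower. */
	list_add_tail(&new->list, pos);
}
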
> > > > > > > > > > > > > > > > > > > > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID), > > > > > > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3 > > > > > > > > > > > > > > in a 5-tier system. > > > > > > > > > > > > > > > > > > > > > > > > > > > > For the above example (example 6), we can have: > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ ls /sys/devices/system/memtier > > > > > > > > > > > > > > memtier0 > > > > > > > > > > > > > > memtier1 > > > > > > > > > > > > > > memtier2 > > > > > > > > > > > > > > memtier128 > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank > > > > > > > > > > > > > > 50 > > > > > > > > > > > > > > 60 > > > > > > > > > > > > > > 70 > > > > > > > > > > > > > > 10 > > > > > > > > > > > > > > > > > > > > > > > > > > I understand that the device ID cannot be negtive. So we have to use > > > > > > > > > > > > > rank. Can we make it possible to allow "rank" to be negtive? > > > > > > > > > > > > > > > > > > > > > > > > It is possible to allow "rank" to be negative, though I think all > > > > > > > > > > > > positive values should work equally well. > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Another choice is to do some trick on device ID. For example, the CPU- > > > > > > > > > > > > > attached DRAM node are always memtier100 (the device ID). Then we can > > > > > > > > > > > > > have memtier99, memtier100, memtier101, memteri102, .... That's not > > > > > > > > > > > > > perfect too. > > > > > > > > > > > > > > > > > > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs: > > > > > > > > > > > > > > > > > > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g.DRAM) and > > > > > > > > > > > > tier2 (e.g. PMEM). > > > > > > > > > > > > > > > > > > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0, > > > > > > > > > > > > tier1.1, tier2.0, tier2.1. > > > > > > > > > > > > > > > > > > > > > > > > The earlier 4-tier example can be represented as: > > > > > > > > > > > > > > > > > > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1 > > > > > > > > > > > > > > > > > > > > > > > > We can also omit .0 so that the tiers are: > > > > > > > > > > > > > > > > > > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1 > > > > > > > > > > > > > > > > > > > > > > > > This should be flexible enough to support multiple tiers while keeping > > > > > > > > > > > > the tier IDs relatively stable. > > > > > > > > > > > > > > > > > > > > > > > > It is not as flexible as the rank approach. For example, to insert a > > > > > > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign > > > > > > > > > > > > existing nodes to these 3 tiers. Using "rank", we can insert a new > > > > > > > > > > > > tier and only move desired nodes into the new tier. > > > > > > > > > > > > > > > > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > > > > > > > The rank approach looks better for. And if we stick with the device ID > > > > > > > > > > > rule as follows, > > > > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > > 255 GPU > > > > > > > > > > > 0 DRAM > > > > > > > > > > > 1 PMEM > > > > > > > > > > > 2 > > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > > > 255 is -1 for "s8". 
> > > > > > > > > > > > > > > > > > > > > > The device ID should do most tricks at least now. The rank can provide > > > > > > > > > > > more flexibility in the future. We can even go without rank in the > > > > > > > > > > > first version, and introduce it when it's necessary. > > > > > > > > > > > > > > > > > > > > Given that the "rank" approach is generally favored, let's go with > > > > > > > > > > that to avoid compatibility issues that may come from the switch of > > > > > > > > > > device ID tricks to ranks. > > > > > > > > > > > > > > > > > > OK. Just to confirm. Does this mean that we will have fixed device ID, > > > > > > > > > for example, > > > > > > > > > > > > > > > > > > GPU memtier255 > > > > > > > > > DRAM (with CPU) memtier0 > > > > > > > > > PMEM memtier1 > > > > > > > > > > > > > > > > > > When we add a new memtier, it can be memtier254, or memter2? The rank > > > > > > > > > value will determine the real demotion order. > > > > > > > > > > > > > > > > With the rank approach, the device ID numbering should be flexible and > > > > > > > > not mandated by the proposal. > > > > > > > > > > > > > > If so, the rank number will be fixed? For example, > > > > > > > > > > > > > > GPU 100 > > > > > > > DRAM (with CPU) 200 > > > > > > > PMEM 300 > > > > > > > > > > > > > > When we add a new memtier, its rank can be 50, 150, 250, or 400? > > > > > > > > > > > > > > If so, this makes me think why we don't just make this kind of rank the > > > > > > > device ID? Or I missed something? > > > > > > > > > > > > > > Or, both device IDs and rank values are not fixed? Why do we need that > > > > > > > kind of flexibility? Sorry, I may not undersand all requirements. > > > > > > > > > > > > Even though the proposal doesn't mandate a particular device ID > > > > > > numbering, I expect that the device IDs will be relatively stable once > > > > > > a kernel implementation is chosen. For example, it is likely that DRAM > > > > > > nodes with CPUs will always be on memtier1, no matter how many tiers > > > > > > are higher or lower than these nodes. > > > > > > > > > > > > We don't need to mandate a particular way to assign the rank values, > > > > > > either. What matters is the relative order and some reasonable gap > > > > > > between these values. > > > > > > > > > > > > The rank approach allows us to keep memtier device IDs relatively > > > > > > stable even though we may change the tier ordering among them. Its > > > > > > flexibility can have many other uses as well. For example, we can > > > > > > insert a new memtier into the tier hierarchy for a new set of nodes > > > > > > without affecting the node assignment of any existing memtier, > > > > > > provided that there is enough gap in the rank values for the new > > > > > > memtier. > > > > > > > > > > > > Using the rank value directly as the device ID has some disadvantages: > > > > > > - It is kind of unconventional to number devices in this way. > > > > > > - We cannot assign DRAM nodes with CPUs with a specific memtier device > > > > > > ID (even though this is not mandated by the "rank" proposal, I expect > > > > > > the device will likely always be memtier1 in practice). > > > > > > - It is possible that we may eventually allow the rank value to be > > > > > > modified as a way to adjust the tier ordering. We cannot do that > > > > > > easily for device IDs. > > > > > > > > > > OK. 
I can understand that sometimes it's more natural to change the > > > > > order of a set of nodes with same memory types (and data plane path) > > > > > together instead of change that one by one for each node. > > > > > > > > > > It appears that the memtierX device becomes kind of memory types (with > > > > > data plane path considered for latency/throughput too). We can assign a > > > > > memory type for a node, and change the order between memory types. If > > > > > so, we need to allow multiple memtiers have same rank value. > > > > > > > > Jonathan mentioned this feature that multiple memtiers share the same > > > > rank as well. It can be a convenient feature to have. For > > > > simplicity, it should be fine to leave out this feature initially. > > > > > > OK. What do you think about the concept of memory types? You have > > > mentioned that in memtierX directory, we can put latency/throughput, > > > etc. IMHO, these only make sense for one type of memory. And it's > > > natural for all memory nodes onlined by a driver to be same memory type. > > > > I think this is not always true. For example, a dax kmem driver can > > online both pmem and non-pmem dax devices as system memory. > > CXL Type 3 memory driver is also responsible for a memories of different > types with very different characteristics. Would need to assign memory > into at least a few different tiers - potentially many different ones. OK. My original words aren't correct. So I should have said that "the memory types should be determined by the drivers to online them". > > > > > That is, drivers (including firmware drivers) will register memory types > > > and put nodes into it. Base on memory types, "rank" (related to for > > > example latency) determined the real memory tiers. > > > > > > If you think it's a good idea, we can rename memtierX to memory_typeX. > > > But memory type may be not a good name, DRAM in local memory controler > > > and DRAM in remote CXL may have quite different performance metric. Or > > > memory_class to avoid the possible confusion? > > > > Memory types (e.g. GPU, DRAM, PMEM, etc) can be useful information to > > help initialize the memory tiers of NUMA nodes. But I think memory > > type is not a substitute for memory tier. We still need to define > > memory tiers on top of NUMA node groups based on memory types (for > > example, some may want to group GPU and DRAM into the same tier, > > others may want separate tiers for GPU/DRAM). It is simpler to keep > > the sysfs interface to just memory tiers and implement memory types as > > internal device attributes if needed. > > > > To avoid confusion, we can require that the rank value is unique for > > each memtier device. This should make it clear that each memtier > > device represents a distinct memory tier. > > I don't mind that for a first implementation, but can see advantage > in flexibility of being able to have multiple tiers fuse by > giving them the same rank value if we ever make rank writeable after > creation. Given no userspace is going to rely on 'failure' to create > ranks with the same value, the flexibility to make this change later > without ABI compatibility problems is there. IMHO, I don't think it's a good idea to have 2 memory tiers have same rank value. That makes the concept of tier confusing. Best Regards, Huang, Ying > > We can still put > > latency/throughput values into each memtierN directory. 
Such values > > need to be specified as a range to better accommodate possibly varied > > performance of the devices within the same memory tier. > > I'd postpone adding this sort of information to the tiers > until we need it. Most of the info can be established by userspace anyway > so why complicate this interface? If there are strong usecases for the info > we can add it later. > > Thanks, > > Jonathan > > > > > > Best Regards, > > > Huang, Ying > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > Huang, Ying > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think you may need to send v3 to make sure everyone is at the same > > > > > > > > > page. > > > > > > > > > > > > > > > > Will do it shortly. > > > > > > > > > > > > > > Good! Thanks! > > > > > > > > > > > > > > Best Regards, > > > > > > > Huang, Ying > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > Huang, Ying > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > > > Huang, Ying > > > > > > > > > > >   > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > 2 > > > > > > > > > > > > > > 3 > > > > > > > > > > > > > > 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier > > > > > > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0 > > > > > > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128 > > > > > > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1 > > > > > > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > To override the memory tier of a node, we can use a new, write-only, > > > > > > > > > > > > > > per-node interface file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier > > > > > > > > > > > > > > > > > > > > > > > > > > > > e.g. 
> > > > > > > > > > > > > > > > $ echo "memtier128" > /sys/devices/system/node/node1/set_memtier
> > > > > > > > > > > > >
> > > > > > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > > > > > > hold the memtier device ID instead of a link.
> > > > > > > > > > > >
> > > > > > > > > > > > OK. We don't have to use a symlink.
> > > > > > > > > > > >
> > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > Huang, Ying
> > > > > > > > > > > >
> > > > > > > > > > > > > > Any comments?
> > > > > > > > > > > >
> > > > > > > > > > > > > > > Jonathan