From: Wei Xu
Date: Fri, 29 Apr 2022 19:10:45 -0700
Subject: RFC: Memory Tiering Kernel Interfaces
To: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
    Linux MM, Greg Thelen, "Aneesh Kumar K.V", Jagdish Gediya,
    Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
    Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
    Jonathan.Cameron@huawei.com
X-Mailing-List: linux-kernel@vger.kernel.org

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

A tiering relationship between NUMA nodes in the form of a demotion
path is created during kernel initialization and updated when a NUMA
node is hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and then builds the tiering
hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.

The current memory tiering interface needs to be improved to address
several important use cases:

* The current tiering initialization code always initializes each
  memory-only NUMA node into a lower tier.  But a memory-only NUMA node
  may have a high performance memory device (e.g. a DRAM device
  attached via CXL.mem or a DRAM-backed memory-only node on a virtual
  machine) and should be put into the top tier.

* The current tiering hierarchy always puts CPU nodes into the top
  tier.  But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed into the next lower tier.

* Also because the current tiering hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and turns
  a memory node from CPU-less into a CPU node (or vice versa), the
  memory tiering hierarchy gets changed, even though no memory node is
  added or removed.  This can make the tiering hierarchy much less
  stable.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier, not any other node from the next lower tier.  This
  strict, hard-coded demotion order does not work in all use cases
  (e.g. some use cases may want to allow cross-socket demotion to
  another node in the same demotion tier as a fallback when the
  preferred demotion node is out of space), and has resulted in the
  feature request for an interface to override the system-wide,
  per-node demotion order from userspace.

* There are no interfaces for userspace to learn about the memory
  tiering hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tiering kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/


Sysfs Interfaces
================

* /sys/devices/system/node/memory_tiers

  Format: node list (one tier per line, in the tier order)

  When read, list memory nodes by tiers.

  When written (one tier per line), take the user-provided node-tier
  assignment as the new tiering hierarchy and rebuild the per-node
  demotion order.  It is allowed to only override the top tiers, in
  which case the kernel will establish the lower tiers automatically.


Kernel Representation
=====================

* nodemask_t node_states[N_TOPTIER_MEMORY]

  Store all top-tier memory nodes.

* nodemask_t memory_tiers[MAX_TIERS]

  Store memory nodes by tiers.

* struct demotion_nodes node_demotion[]

  where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }

  For a node N:

  node_demotion[N].preferred lists all preferred demotion targets;

  node_demotion[N].allowed lists all allowed demotion targets
  (initialized to be all the nodes in the same demotion tier).


Tiering Hierarchy Initialization
================================

By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).

A device driver can remove its memory nodes from the top tier, e.g.
a dax driver can remove PMEM nodes from the top tier.

The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
best distance nodes in the next lower tier are assigned to
node_demotion[N].preferred and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.

node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.

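The tier-by-tier rule above can be sketched in userspace C.  This is
only an illustration of the "best distance" rule as stated, not the
proposed kernel code: the nodemask typedef, MAX_NODES and
build_demotion_order() are hypothetical stand-ins (the kernel would
use nodemask_t and its own helpers), and a plain unsigned int is used
as the node bitmask.  It does not model whatever extra criterion
leaves node 1's preferred mask empty in Example 2 below.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 8	/* illustration only, not the kernel's limit */

/* Hypothetical stand-in for nodemask_t: bit N set means node N. */
typedef unsigned int nodemask;

struct demotion_nodes {
	nodemask preferred;
	nodemask allowed;
};

/*
 * Fill out[] from an ordered tier list (tiers[0] is the top tier) and
 * a NUMA distance matrix.  For each node N in tier t, "allowed" is
 * every node in tier t+1 and "preferred" is the subset of those nodes
 * with the smallest distance from N.  Nodes in the last tier keep
 * empty masks.
 */
void build_demotion_order(const nodemask *tiers, int nr_tiers,
			  int dist[][MAX_NODES], int nr_nodes,
			  struct demotion_nodes *out)
{
	assert(nr_nodes <= MAX_NODES);

	for (int n = 0; n < nr_nodes; n++)
		out[n].preferred = out[n].allowed = 0;

	for (int t = 0; t + 1 < nr_tiers; t++) {
		nodemask lower = tiers[t + 1];

		for (int n = 0; n < nr_nodes; n++) {
			int best = INT_MAX;

			if (!(tiers[t] & (1u << n)))
				continue;

			/* Find the smallest distance into the lower tier. */
			for (int m = 0; m < nr_nodes; m++)
				if ((lower & (1u << m)) && dist[n][m] < best)
					best = dist[n][m];

			/* Every node at that distance is preferred. */
			for (int m = 0; m < nr_nodes; m++)
				if ((lower & (1u << m)) && dist[n][m] == best)
					out[n].preferred |= 1u << m;

			out[n].allowed = lower;
		}
	}
}
```

Fed the distances and tiers of Example 1 below (tiers 0-1 and 2-3),
this yields preferred = {2}, allowed = {2,3} for node 0 and
preferred = {3}, allowed = {2,3} for node 1, with empty masks for
nodes 2 and 3.
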
If userspace overrides the tiers via the memory_tiers sysfs
interface, the kernel then only rebuilds the per-node demotion order
accordingly.

Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
node.


Memory Allocation for Demotion
==============================

When allocating a new demotion target page, both a preferred node
and the allowed nodemask are provided to the allocation function.
The default kernel allocation fallback order is used to allocate the
page from the specified node and nodemask.

The mempolicy of the cpuset, VMA and owner task of the source page can
be set to refine the demotion nodemask, e.g. to prevent demotion or
select a particular allowed node as the demotion target.


Examples
========

* Example 1:
  Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

  Node 0 has node 2 as the preferred demotion target and can also
  fall back to node 3 for demotion.

  Node 1 has node 3 as the preferred demotion target and can also
  fall back to node 2 for demotion.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

/sys/devices/system/node/memory_tiers
0-1
2-3

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2-3]
  1: [3], [2-3]
  2: [],  []
  3: [],  []

* Example 2:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and closer to node 0.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has no preferred demotion target, but can still demote
  to node 2.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [],  [2]
  2: [],  []

* Example 3:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and has the same distance to node 0 & 1.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has node 2 as the preferred and only demotion target.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [2], [2]
  2: [],  []

* Example 4:
  Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

  All nodes are top-tier.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-2

N_TOPTIER_MEMORY: 0-2

node_demotion[]:
  0: [],  []
  1: [],  []
  2: [],  []

* Example 5:
  Node 0 is a DRAM node with CPU.
  Node 1 is a HBM node.
  Node 2 is a PMEM node.

  With userspace override, node 1 is the top tier and has node 0 as
  the preferred and only demotion target.

  Node 0 is in the second tier, tier 1, and has node 2 as the
  preferred and only demotion target.

  Node 2 is in the lowest tier, tier 2, and has no demotion targets.

node distances:
node   0    1    2
   0  10   21   30
   1  21   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers (userspace override)
1
0
2

N_TOPTIER_MEMORY: 1

node_demotion[]:
  0: [2], [2]
  1: [0], [0]
  2: [],  []

-- Wei