From: Wei Xu
Date: Tue, 12 Jul 2022 23:46:53 -0700
Subject: Re: [PATCH v8 00/12] mm/demotion: Memory tiers and demotion
In-Reply-To: <87edyp67m1.fsf@yhuang6-desk2.ccr.corp.intel.com>
To: "Huang, Ying"
Cc: Aneesh Kumar K V, Johannes Weiner, Linux MM, Andrew Morton, Yang Shi,
    Davidlohr Bueso, Tim C Chen, Michal Hocko,
    Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
    Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss@gmail.com

On Tue, Jul 12, 2022 at 8:03 PM Huang, Ying wrote:
>
> Aneesh Kumar K V writes:
>
> > On 7/12/22 2:18 PM, Huang, Ying wrote:
> >> Aneesh Kumar K V writes:
> >>
> >>> On 7/12/22 12:29 PM, Huang, Ying wrote:
> >>>> Aneesh Kumar K V writes:
> >>>>
> >>>>> On 7/12/22 6:46 AM, Huang, Ying wrote:
> >>>>>> Aneesh Kumar K V writes:
> >>>>>>
> >>>>>>> On 7/5/22 9:59 AM, Huang, Ying wrote:
> >>>>>>>> Hi, Aneesh,
> >>>>>>>>
> >>>>>>>> "Aneesh Kumar K.V" writes:
> >>>>>>>>
> >>>>>>>>> The current kernel has basic memory tiering support: inactive
> >>>>>>>>> pages on a higher-tier NUMA node can be migrated (demoted) to a
> >>>>>>>>> lower-tier NUMA node to make room for new allocations on the
> >>>>>>>>> higher-tier node, and frequently accessed pages on a lower-tier
> >>>>>>>>> NUMA node can be migrated (promoted) to a higher-tier node to
> >>>>>>>>> improve performance.
> >>>>>>>>>
> >>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
> >>>>>>>>> demotion path relationship between NUMA nodes, which is created
> >>>>>>>>> during kernel initialization and updated when a NUMA node is
> >>>>>>>>> hot-added or hot-removed. The current implementation puts all
> >>>>>>>>> nodes with CPUs into the top tier, and builds the tier hierarchy
> >>>>>>>>> tier-by-tier by establishing per-node demotion targets based on
> >>>>>>>>> the distances between nodes.
> >>>>>>>>>
> >>>>>>>>> This memory tier kernel interface needs to be improved for
> >>>>>>>>> several important use cases:
> >>>>>>>>>
> >>>>>>>>> * The current tier initialization code always initializes each
> >>>>>>>>>   memory-only NUMA node into a lower tier. But a memory-only
> >>>>>>>>>   NUMA node may have a high-performance memory device (e.g. a
> >>>>>>>>>   DRAM device attached via CXL.mem, or a DRAM-backed memory-only
> >>>>>>>>>   node on a virtual machine) and should be put into a higher
> >>>>>>>>>   tier.
> >>>>>>>>>
> >>>>>>>>> * The current tier hierarchy always puts CPU nodes into the top
> >>>>>>>>>   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >>>>>>>>>   memory-only HBM NUMA nodes should be in the top tier, and DRAM
> >>>>>>>>>   nodes with CPUs are better placed into the next lower tier.
> >>>>>>>>>
> >>>>>>>>> * Also, because the current tier hierarchy always puts CPU nodes
> >>>>>>>>>   into the top tier, when a CPU is hot-added (or hot-removed) and
> >>>>>>>>>   turns a memory node from CPU-less into a CPU node (or vice
> >>>>>>>>>   versa), the memory tier hierarchy gets changed, even though no
> >>>>>>>>>   memory node is added or removed. This can make the tier
> >>>>>>>>>   hierarchy unstable and make it difficult to support tier-based
> >>>>>>>>>   memory accounting.
> >>>>>>>>>
> >>>>>>>>> * A higher-tier node can only be demoted to selected nodes on the
> >>>>>>>>>   next lower tier as defined by the demotion path, not to any
> >>>>>>>>>   other node from any lower tier.
> >>>>>>>>>   This strict, hard-coded demotion order does not work in all
> >>>>>>>>>   use cases (e.g. some use cases may want to allow cross-socket
> >>>>>>>>>   demotion to another node in the same demotion tier as a
> >>>>>>>>>   fallback when the preferred demotion node is out of space),
> >>>>>>>>>   and it has resulted in a feature request for an interface to
> >>>>>>>>>   override the system-wide, per-node demotion order from
> >>>>>>>>>   userspace. This demotion order is also inconsistent with the
> >>>>>>>>>   page allocation fallback order when all the nodes in a higher
> >>>>>>>>>   tier are out of space: the page allocation can fall back to
> >>>>>>>>>   any node from any lower tier, whereas the demotion order
> >>>>>>>>>   doesn't allow that.
> >>>>>>>>>
> >>>>>>>>> * There are no interfaces for userspace to learn about the
> >>>>>>>>>   memory tier hierarchy in order to optimize its memory
> >>>>>>>>>   allocations.
> >>>>>>>>>
> >>>>>>>>> This patch series makes the creation of memory tiers explicit,
> >>>>>>>>> under the control of userspace or a device driver.
> >>>>>>>>>
> >>>>>>>>> Memory Tier Initialization
> >>>>>>>>> ==========================
> >>>>>>>>>
> >>>>>>>>> By default, all memory nodes are assigned to the default tier,
> >>>>>>>>> with tier ID value 200.
> >>>>>>>>>
> >>>>>>>>> A device driver can move its memory nodes up or down from the
> >>>>>>>>> default tier. For example, PMEM can move its memory nodes below
> >>>>>>>>> the default tier, whereas GPU can move its memory nodes above
> >>>>>>>>> the default tier.
> >>>>>>>>>
> >>>>>>>>> The kernel initialization code decides which exact tier a memory
> >>>>>>>>> node should be assigned to, based on the requests from the
> >>>>>>>>> device drivers as well as the memory device hardware information
> >>>>>>>>> provided by the firmware.
> >>>>>>>>>
> >>>>>>>>> Hot-adding/removing CPUs doesn't affect the memory tier
> >>>>>>>>> hierarchy.
> >>>>>>>>>
> >>>>>>>>> Memory Allocation for Demotion
> >>>>>>>>> ==============================
> >>>>>>>>>
> >>>>>>>>> This patch series keeps the demotion target page allocation
> >>>>>>>>> logic the same. The demotion page allocation picks the closest
> >>>>>>>>> NUMA node in the next lower tier to the NUMA node that is
> >>>>>>>>> currently allocating pages.
> >>>>>>>>>
> >>>>>>>>> This will later be improved to use the same page allocation
> >>>>>>>>> strategy using a fallback list.
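
For illustration only, a minimal sketch of the "closest node in the next
lower tier" selection described above could look like the snippet below.
The next_lower_tier_nodes() helper is an assumption made up for this
sketch (it stands in for whatever the series uses to enumerate the nodes
of the next lower tier); node_distance(), for_each_node_mask() and
NUMA_NO_NODE are existing kernel interfaces.

  #include <linux/limits.h>
  #include <linux/nodemask.h>
  #include <linux/numa.h>
  #include <linux/topology.h>

  /* Hypothetical helper: nodemask of the tier just below @node's tier. */
  extern nodemask_t next_lower_tier_nodes(int node);

  /*
   * Sketch: pick a demotion target for @from_node by scanning the nodes
   * in the next lower memory tier and choosing the one with the smallest
   * NUMA distance to @from_node.
   */
  static int sketch_demotion_target(int from_node)
  {
          nodemask_t lower = next_lower_tier_nodes(from_node);
          int nid, target = NUMA_NO_NODE, best = INT_MAX;

          for_each_node_mask(nid, lower) {
                  int dist = node_distance(from_node, nid);

                  if (dist < best) {
                          best = dist;
                          target = nid;
                  }
          }
          return target;
  }
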
> >>>>>>>>>
> >>>>>>>>> Sysfs Interface:
> >>>>>>>>> ----------------
> >>>>>>>>> Listing the current memory tiers:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/memtier$ ls
> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
> >>>>>>>>> :/sys/devices/system/memtier$ cat default_tier
> >>>>>>>>> memtier200
> >>>>>>>>> :/sys/devices/system/memtier$ cat max_tier
> >>>>>>>>> 400
> >>>>>>>>> :/sys/devices/system/memtier$
> >>>>>>>>>
> >>>>>>>>> Per-node memory tier details:
> >>>>>>>>>
> >>>>>>>>> For a CPU-only NUMA node:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>>>>>> :/sys/devices/system/node# echo 1 > node0/memtier
> >>>>>>>>> :/sys/devices/system/node# cat node0/memtier
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>>
> >>>>>>>>> For a NUMA node with memory:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 1
> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>>>>>> default_tier  max_tier  memtier1  power  uevent
> >>>>>>>>> :/sys/devices/system/node# echo 2 > node1/memtier
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>> :/sys/devices/system/node# ls ../memtier/
> >>>>>>>>> default_tier  max_tier  memtier1  memtier2  power  uevent
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 2
> >>>>>>>>> :/sys/devices/system/node#
> >>>>>>>>>
> >>>>>>>>> Removing a memory tier:
> >>>>>>>>>
> >>>>>>>>> :/sys/devices/system/node# cat node1/memtier
> >>>>>>>>> 2
> >>>>>>>>> :/sys/devices/system/node# echo 1 > node1/memtier
> >>>>>>>>
> >>>>>>>> Thanks a lot for your patchset.
> >>>>>>>>
> >>>>>>>> Per my understanding, we haven't reached consensus on
> >>>>>>>>
> >>>>>>>> - how to create the default memory tiers in the kernel (via
> >>>>>>>>   abstract distance provided by drivers? Or use SLIT as the
> >>>>>>>>   first step?)
> >>>>>>>>
> >>>>>>>> - how to override the default memory tiers from user space
> >>>>>>>>
> >>>>>>>> As discussed in the following thread and email:
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >>>>>>>>
> >>>>>>>> I think that we need to finalize that first.
> >>>>>>>
> >>>>>>> I did list the proposal here:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >>>>>>>
> >>>>>>> Both the kernel default and the driver-specific default tiers now
> >>>>>>> become kernel parameters that can be updated if the user wants a
> >>>>>>> different tier topology.
> >>>>>>>
> >>>>>>> All memory that is not managed by a driver gets added to
> >>>>>>> default_memory_tier, which has a default value of 200.
> >>>>>>>
> >>>>>>> For now, the only driver that is updated is dax kmem, which adds
> >>>>>>> the memory it manages to memory tier 100. Later, as we learn more
> >>>>>>> about the device attributes (HMAT or something similar) that we
> >>>>>>> might want to use to control the tier assignment, this can become
> >>>>>>> a range of memory tiers.
> >>>>>>>
> >>>>>>> Based on the above, I guess we can merge what is posted in this
> >>>>>>> series and later fine-tune/update the memory tier assignment based
> >>>>>>> on device attributes.
> >>>>>>
> >>>>>> Sorry for the late reply.
> >>>>>>
> >>>>>> As a first step, it may be better to skip the parts that we haven't
> >>>>>> reached consensus on yet, for example, the user space interface to
> >>>>>> override the default memory tiers. And we can use 0, 1, 2 as the
> >>>>>> default memory tier IDs. We can refine/revise the in-kernel
> >>>>>> implementation, but we cannot change the user space ABI.
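
For illustration only, a rough sketch of the kernel-parameter approach
Aneesh describes above (a default tier of 200 for memory not claimed by a
driver, with dax kmem placing its memory at tier 100) might look like the
following. All identifiers here are assumptions for this sketch, not code
from the posted series.

  #include <linux/module.h>
  #include <linux/moduleparam.h>
  #include <linux/types.h>

  /* Hypothetical core API: place a node into the given memory tier. */
  extern int node_set_memory_tier(int nid, unsigned int tier);

  /* Tier for memory not claimed by any driver (illustrative default). */
  static unsigned int default_memory_tier = 200;
  module_param(default_memory_tier, uint, 0444);

  /* Tier the dax kmem driver requests for the memory it onlines. */
  static unsigned int dax_kmem_memtier = 100;
  module_param(dax_kmem_memtier, uint, 0444);

  /* Sketch: assign a node to a tier when its memory is onlined. */
  static void sketch_assign_node_tier(int nid, bool driver_managed)
  {
          unsigned int tier = driver_managed ? dax_kmem_memtier
                                             : default_memory_tier;

          node_set_memory_tier(nid, tier);
  }
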
> >>>>>>
> >>>>> Can you help list the use cases that would be broken by using
> >>>>> tierID as outlined in this series? One of the details mentioned
> >>>>> earlier was the need to track top-tier memory usage in a memcg, and
> >>>>> IIUC the patchset posted at
> >>>>> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
> >>>>> can work with tier IDs too. Let me know if you think otherwise. So
> >>>>> at this point I am not sure which area we are still debating w.r.t.
> >>>>> the userspace interface.
> >>>>
> >>>> In
> >>>>
> >>>> https://lore.kernel.org/lkml/YqjZyP11O0yCMmiO@cmpxchg.org/
> >>>>
> >>>> per my understanding, Johannes suggested overriding the kernel
> >>>> default memory tiers with an "abstract distance" provided by the
> >>>> drivers implementing memory devices. As you said in another email,
> >>>> that is related to [7/12] of the series, and we can table it for the
> >>>> future.
> >>>>
> >>>> And per my understanding, he also suggested making memory tier IDs
> >>>> dynamic. For example, after the "abstract distance" of a driver is
> >>>> overridden by users, the total number of memory tiers may change,
> >>>> and the memory tier ID of some nodes may change too. This makes
> >>>> memory tier IDs easier to understand, but less stable. For example,
> >>>> it makes it harder to specify the per-memory-tier memory partition
> >>>> for a cgroup.
> >>>>
> >>>
> >>> With all the approaches we have discussed so far, the memory tier of
> >>> a NUMA node can be changed, i.e. pgdat->memtier can change at any
> >>> time. The per-memcg top-tier memory usage tracking patches posted at
> >>> https://lore.kernel.org/linux-mm/cefeb63173fa0fac7543315a2abbd4b5a1b25af8.1655242024.git.tim.c.chen@linux.intel.com/
> >>> don't consider the movement of a node from one memory tier to
> >>> another. If we need a stable pgdat->memtier, we will have to prevent
> >>> a node memory tier reassignment while we have pages from that memory
> >>> tier charged to a cgroup. This patchset does not prevent adding such
> >>> a restriction.
> >>
> >> Absolute stability doesn't exist even in a "rank"-based solution, but
> >> "rank" can improve the stability to some degree. For example, if we
> >> move the tier of HBM nodes (from below DRAM to above DRAM), the DRAM
> >> nodes can keep their memory tier IDs stable. This may not be a real
> >> issue in the end, but we need to discuss it.
> >>
> >
> > I agree that using ranks gives us the flexibility to change the
> > demotion order without being blocked by cgroup usage. But how
> > frequently do we expect the tier assignment to change? My expectation
> > is that these reassignments are going to be rare and won't happen
> > frequently after a system is up and running. Hence using tierID for
> > the demotion order won't prevent a node reassignment much, because we
> > don't expect to change a node's tierID at runtime. In the rare case we
> > do, we will have to make sure there is no cgroup usage from the
> > specific memory tier.
> >
> > Even if we use ranks, won't we have to avoid a rank update if such an
> > update can change the meaning of the top tier, i.e. if a rank update
> > can result in a node being moved from the top tier to a non-top tier?
> >
> >> Tim has suggested partitioning top-tier(s) memory among cgroups, but
> >> I don't think that has been finalized. We may use per-memory-tier
> >> memory partitions among cgroups. I don't know whether Wei will use
> >> that (it may be implemented in user space).
> >>
> >> And, if we think that stability between nodes and memory tier IDs
> >> isn't important,
> >> why should we use sparse memory tier IDs (that is, 100, 200, 300)?
> >> Why not just 0, 1, 2, ...? That looks more natural.
> >>
> >
> > The range allows us to use the memtier ID for the demotion order,
> > i.e. as we start initializing devices with different attributes via
> > dax kmem, there will be a desire to assign them to different tierIDs.
> > Having the default memtier ID (DRAM) at 200 enables us to put these
> > devices in the range [0 - 200) without updating the node-to-memtier
> > mapping of the existing NUMA nodes (i.e. without updating the default
> > memtier).
>
> I believe that sparse memory tier IDs can make memory tiers more stable
> in some cases. But this is different from the system suggested by
> Johannes. Per my understanding, with Johannes' system, we will have:
>
> - one driver may online different memory types (such as kmem_dax, which
>   may online HBM, PMEM, etc.)
>
> - one memory type manages several memory nodes (NUMA nodes)
>
> - one "abstract distance" for each memory type
>
> - the "abstract distance" can be offset by a user space override knob
>
> - memory tiers are generated dynamically from the different memory
>   types according to the "abstract distance" and the overridden
>   "offset"
>
> - the granularity for grouping several memory types into one memory
>   tier can be overridden via a user space knob
>
> In this way, the memory tiers may change completely after a user space
> override. It may be hard to link the memory tiers before/after the
> override. So we may need to reset all per-memory-tier configuration,
> such as cgroup partition limits or interleave weights.
>
> Personally, I think the system above makes sense. But I think we need
> to make sure whether it satisfies the requirements.
>
> Best Regards,
> Huang, Ying
>

The "memory type" and "abstract distance" concepts sound to me similar
to the memory tier "rank" idea. We can have some well-defined
type/distance/rank values, e.g. HBM, DRAM, CXL_DRAM, PMEM, CXL_PMEM,
which a device can register with. The memory tiers will then be built
from these values. Whether/how to collapse several values into a single
tier can be made configurable (a rough sketch follows below).

Wei
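
A minimal sketch of that idea, with all names and constants being
assumptions for illustration only (not values from any posted patch):
each memory type gets a well-known distance/rank value that devices
register with, and a configurable chunk size controls how aggressively
adjacent values collapse into one tier.

  /*
   * Well-known "abstract distance" (or rank) values per memory type.
   * Lower value == faster memory.  Purely illustrative numbers.
   */
  enum mem_type_distance {
          MEMTYPE_HBM      = 300,
          MEMTYPE_DRAM     = 500,
          MEMTYPE_CXL_DRAM = 600,
          MEMTYPE_PMEM     = 800,
          MEMTYPE_CXL_PMEM = 900,
  };

  /*
   * Collapse a distance value into a tier index.  The chunk size controls
   * how adjacent types are grouped: with a chunk of 100 every type above
   * lands in its own tier, while a chunk of 200 collapses PMEM (800) and
   * CXL_PMEM (900) into the same tier (800/200 == 900/200 == 4).
   */
  static inline int distance_to_tier(int distance, int chunk)
  {
          return distance / chunk;
  }

A device would then register one of these values for the memory nodes it
manages, and the tier list would be rebuilt from the set of registered
values.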