From: Dan Williams
Date: Sun, 1 May 2022 11:35:01 -0700
Subject: Re: RFC: Memory Tiering Kernel Interfaces
To: Yang Shi
Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Linux MM, Greg Thelen,
    "Aneesh Kumar K.V", Jagdish Gediya, Linux Kernel Mailing List,
    Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
    Brice Goglin, Feng Tang, Jonathan Cameron

On Fri, Apr 29, 2022 at 8:59 PM Yang Shi wrote:
>
> Hi Wei,
>
> Thanks for the nice writing. Please see the below inline comments.
>
> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu wrote:
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed. The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier. But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed. This can make the tiering
> >   hierarchy much less stable.
>
> I'd prefer the firmware builds up tiers topology then passes it to
> kernel so that kernel knows what nodes are in what tiers. No matter
> what nodes are hot-removed/hot-added they always stay in their tiers
> defined by the firmware. I think this is important information like
> numa distances. NUMA distance alone can't satisfy all the usecases
> IMHO.

Just want to note here that the platform firmware can only describe the
tiers of static memory present at boot. CXL hotplug breaks this model,
and the kernel is left to dynamically determine the device's performance
characteristics and the performance of the topology to reach that
device. Now, the platform firmware does set expectations for the
performance class of different memory ranges, but there is no way to
know in advance the performance of devices that will be asked to be
physically or logically added to the memory configuration.

That said, it's probably still too early to define ABI for those
exceptional cases where the kernel needs to make a policy decision about
a device that does not fit into the firmware's performance expectations,
but just note that there are limits to the description that platform
firmware can provide.

I agree that NUMA distance alone is inadequate and the kernel needs to
make better use of data like ACPI HMAT to determine the default tiering
order.
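
To make that last point concrete, here is a rough userspace sketch (Python, not kernel code) of ordering nodes into tiers by a performance attribute instead of by CPU presence. Everything in it is an assumption for illustration: the Node type, the assign_tiers() helper, the 25% tolerance, and the latency numbers are invented and are not taken from HMAT or from any existing kernel interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    has_cpu: bool          # recorded, but deliberately not used for tiering
    read_latency_ns: int   # hypothetical figure, in the spirit of HMAT data


def assign_tiers(nodes, tolerance=0.25):
    """Return {node_id: tier}, where tier 0 is the fastest.

    Nodes are visited from lowest to highest latency. A node joins the
    current tier if its latency is within `tolerance` (relative) of the
    fastest node in that tier; otherwise it opens a new, lower tier.
    """
    tiers = []
    for node in sorted(nodes, key=lambda n: (n.read_latency_ns, n.node_id)):
        if tiers and node.read_latency_ns <= tiers[-1][0].read_latency_ns * (1 + tolerance):
            tiers[-1].append(node)
        else:
            tiers.append([node])
    return {n.node_id: tier for tier, group in enumerate(tiers) for n in group}


# Invented example: DRAM with CPUs, a memory-only DRAM-class node, and a
# slower memory-only node.
boot_nodes = [
    Node(0, True, 80),
    Node(1, False, 95),
    Node(2, False, 300),
]
boot_tiers = assign_tiers(boot_nodes)

# A device hot-added later is placed from its own measured figure, which
# firmware could not have described at boot.
hotplug_tiers = assign_tiers(boot_nodes + [Node(3, False, 310)])
```

Since has_cpu never enters the ordering, hot-adding or hot-removing a CPU leaves the tiers alone, and a memory-only node with DRAM-class latency lands in the top tier. What this does not answer is the hard part raised above: where the figure for a hot-added device comes from, and what policy applies when it does not match the firmware's expectations.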