Subject: Re: RFC: Memory Tiering Kernel Interfaces
From: "ying.huang@intel.com"
To: Dan Williams, Yang Shi
Cc: Wei Xu, Andrew Morton, Dave Hansen, Linux MM, Greg Thelen,
    "Aneesh Kumar K.V", Jagdish Gediya, Linux Kernel Mailing List,
    Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
    Brice Goglin, Feng Tang, Jonathan Cameron
Date: Sat, 07 May 2022 15:56:12 +0800

Hi, Dan,

On Sun, 2022-05-01 at 11:35 -0700, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
> >
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed. The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier. But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.
> > >   This can make the tiering hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
>
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the performance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.
>

Does this mean we will eventually need some kind of in-kernel memory
latency measurement mechanism to determine the tier of a memory device?

Best Regards,
Huang, Ying

> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.
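
For reference, the current tier construction described in the quoted RFC
text (all nodes with CPUs in the top tier, then lower tiers built from the
nearest remaining nodes) can be sketched roughly as below. This is a
simplified Python illustration and not the kernel's code: the function name
`build_demotion_tiers`, the distance matrix, and the node numbering are all
assumptions made for the example.

```python
def build_demotion_tiers(distance, cpu_nodes):
    """Simplified model of distance-based tier construction.

    distance[a][b] is the NUMA distance from node a to node b.
    cpu_nodes is the set of node ids that have CPUs (the top tier).
    Returns (tiers, targets): a list of tiers (top first) and a dict
    mapping each node to its demotion target nodes.
    """
    all_nodes = set(range(len(distance)))
    placed = set(cpu_nodes)
    tiers = [sorted(placed)]
    targets = {}

    while placed != all_nodes:
        remaining = all_nodes - placed
        next_tier = set()
        for node in tiers[-1]:
            # Demote to the nearest node(s) that are not yet in any tier.
            best = min(distance[node][c] for c in remaining)
            targets[node] = sorted(
                c for c in remaining if distance[node][c] == best
            )
            next_tier.update(targets[node])
        if not next_tier:
            # No CPU nodes to start from: nothing more can be placed.
            break
        tiers.append(sorted(next_tier))
        placed |= next_tier

    return tiers, targets


# Hypothetical 4-node system: nodes 0 and 1 have CPUs, nodes 2 and 3
# are memory-only. The distances are made up for illustration.
distance = [
    [10, 20, 17, 28],
    [20, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
tiers, targets = build_demotion_tiers(distance, {0, 1})
print(tiers)    # [[0, 1], [2, 3]]
print(targets)  # {0: [2], 1: [3]}
```

In this model a memory-only node can never land in the top tier, whatever
its actual performance, which is the limitation the first two bullets of
the RFC point out.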