Subject: Re: RFC: Memory Tiering Kernel Interfaces
From: "ying.huang@intel.com"
To: Wei Xu
Cc: "Aneesh Kumar K.V", Alistair Popple, Yang Shi, Andrew Morton,
 Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
 Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
 Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen
Date: Thu, 12 May 2022 09:42:24 +0800
Message-ID: <0a92d0040edb3b74ac259062d241b8cd28924edf.camel@intel.com>
References: <87tua3h5r1.fsf@nvdebian.thelocal>
 <875ymerl81.fsf@nvdebian.thelocal>
 <87fslhhb2l.fsf@linux.ibm.com>
 <68333b21a58604f3fd0e660f1a39921ae22849d8.camel@intel.com>
On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> wrote:
> >
> > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > wrote:
> > > >
> > > > Alistair Popple writes:
> > > >
> > > > > Wei Xu writes:
> > > > >
> > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple wrote:
> > > > > > >
> > > > > > > Wei Xu writes:
> > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > `=============================='
> > > > > > > > > >
> > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > >
> > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > >
> > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > >
> > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > commit to a particular user interface now.
> > > > > > >
> > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > for example would a GPU driver make sure it's node is in the top tier? By moving
> > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > be in the top tier.
> > > > > >
> > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > devices from the top-tier, not the other way around.
> > > > >
> > > > > So how would demotion work in the case of accelerators then? In that
> > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > only override available with this proposal would move GPU memory into a
> > > > > lower tier, which is the opposite of what's needed there.
> > > >
> > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > GPU can be enabled to register at tier 1. ?
> > >
> > > This makes sense. I will send an updated RFC based on the discussions so far.
> >
> > Are these tier number fixed? If so, it appears strange that the
> > smallest tier number is 0 on some machines, but 1 on some other
> > machines.
>
> When the kernel is configured to allow 3 tiers, we can always show all
> the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> some machines.

I still think that it's better to have no empty tiers among the memory
tiers auto-generated by the kernel.
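To make that concrete, here is a minimal userspace-style sketch -- not
kernel code and not any existing interface; the raw tier numbers and the
compaction policy are purely assumed for illustration -- of how
auto-generated tier IDs could be compacted so that empty tiers are
skipped, which is also why the resulting numbers cannot stay fixed
across machines:

#include <stdio.h>

/*
 * Illustrative sketch only: compact per-node tier assignments so that
 * empty tiers are skipped.  The "raw" tiers follow the 3-tier proposal
 * above (1 = HBM/GPU, 2 = DRAM, 3 = PMEM); the compacted IDs are what a
 * "no empty tiers" policy would end up exposing.
 */
#define MAX_RAW_TIER	3

static void compact_tiers(const int raw[], int compacted[], int nr_nodes)
{
	int used[MAX_RAW_TIER + 1] = { 0 };
	int remap[MAX_RAW_TIER + 1];
	int next = 0;

	for (int i = 0; i < nr_nodes; i++)
		used[raw[i]] = 1;

	/* hand out new IDs in order, skipping tiers that have no nodes */
	for (int t = 0; t <= MAX_RAW_TIER; t++)
		remap[t] = used[t] ? next++ : -1;

	for (int i = 0; i < nr_nodes; i++)
		compacted[i] = remap[raw[i]];
}

int main(void)
{
	/* a DRAM + PMEM machine: no HBM/GPU nodes, so raw tier 1 is empty */
	int raw[] = { 2, 2, 3, 3 };
	int compacted[4];

	compact_tiers(raw, compacted, 4);
	for (int i = 0; i < 4; i++)
		printf("node %d: raw tier %d -> compacted tier %d\n",
		       i, raw[i], compacted[i]);
	return 0;
}

On this DRAM-plus-PMEM example the DRAM nodes end up in tier 0 and the
PMEM nodes in tier 1, whereas the same DRAM nodes would land in tier 1
on a machine that also had HBM/GPU nodes.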
Yes, the tier numbers will then not be absolutely stable, but in
practice that only happens during system bootup, so it's not a big issue
IMHO.

And I still think it's better to make only N-1 of the total N tiers
writable (or perhaps even readable). Suppose "tier0" is written: how do
we deal with the nodes that were in "tier0" before the write but are not
in it afterwards? One possible way is to put them into "tierN". And
while a user is customizing the tiers, the union of the N tiers may not
be complete.

> BTW, the userspace should not assume a specific meaning of a
> particular tier id because it can change depending on the number of
> tiers that the kernel is configured with. For example, the userspace
> should not assume that tier-2 always means PMEM nodes. In a system
> with 4 tiers, PMEM nodes may be in tier-3, not tier-2.

Yes. This sounds good. (A small illustrative sketch of such a runtime
lookup follows below.)

Best Regards,
Huang, Ying
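As a follow-up to the point above about userspace not hard-coding tier
IDs, here is a minimal sketch of the kind of runtime lookup userspace
could do instead. The sysfs path and per-tier "nodelist" layout used
below are purely assumed for illustration -- the RFC has not settled on
an actual interface -- so treat them as placeholders:

#include <stdio.h>
#include <string.h>

/*
 * Sketch only: find which memory tier a NUMA node belongs to by scanning
 * per-tier node lists instead of assuming "tier 2 == PMEM".  The sysfs
 * layout below is hypothetical; adjust it to whatever interface the
 * final patches expose.
 */
static int tier_of_node(int nid)
{
	for (int tier = 0; tier < 8; tier++) {
		char path[128], buf[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/memtier/memtier%d/nodelist",
			 tier);
		f = fopen(path, "r");
		if (!f)
			continue;	/* tier not present on this machine */

		if (fgets(buf, sizeof(buf), f)) {
			/* naive parse of simple lists such as "2-3" or "0,2" */
			char *tok = strtok(buf, ",\n");

			while (tok) {
				int lo, hi;

				if (sscanf(tok, "%d-%d", &lo, &hi) == 2 &&
				    nid >= lo && nid <= hi) {
					fclose(f);
					return tier;
				}
				if (sscanf(tok, "%d", &lo) == 1 && lo == nid) {
					fclose(f);
					return tier;
				}
				tok = strtok(NULL, ",\n");
			}
		}
		fclose(f);
	}
	return -1;	/* node not listed in any tier */
}

int main(void)
{
	printf("node 2 is in tier %d\n", tier_of_node(2));
	return 0;
}

The same idea applies to whatever attributes the final interface
exposes: discover the node-to-tier mapping at runtime rather than baking
a particular tier number into the application.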