From: Wei Xu
Date: Wed, 11 May 2022 20:37:14 -0700
Subject: Re: RFC: Memory Tiering Kernel Interfaces
To: "ying.huang@intel.com"
Cc: "Aneesh Kumar K.V", Alistair Popple, Yang Shi, Andrew Morton, Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen
On Wed, May 11, 2022 at 8:14 PM ying.huang@intel.com wrote:
>
> On Wed, 2022-05-11 at 19:39 -0700, Wei Xu wrote:
> > On Wed, May 11, 2022 at 6:42 PM ying.huang@intel.com wrote:
> > >
> > > On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> > > > On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com wrote:
> > > > >
> > > > > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > > > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V wrote:
> > > > > > >
> > > > > > > Alistair Popple writes:
> > > > > > >
> > > > > > > > Wei Xu writes:
> > > > > > > >
> > > > > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple wrote:
> > > > > > > > > >
> > > > > > > > > > Wei Xu writes:
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > > > > `=============================='
> > > > > > > > > > > > >
> > > > > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > > > >
> > > > > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > > > >
> > > > > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > > > >
> > > > > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > > > > useful in getting there.
> > > > > > > > > > Therefore a way of overriding the generated hierarchy
> > > > > > > > > > would be good, either via sysfs or a kernel boot parameter if we don't want to
> > > > > > > > > > commit to a particular user interface now.
> > > > > > > > > >
> > > > > > > > > > However, I'm less sure letting device drivers override this is a good idea. How,
> > > > > > > > > > for example, would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > > > > could get messy if, say, there were two drivers both of which wanted their node to
> > > > > > > > > > be in the top tier.
> > > > > > > > >
> > > > > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > > > > devices from the top tier, not the other way around.
> > > > > > > >
> > > > > > > > So how would demotion work in the case of accelerators then? In that
> > > > > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY, and it seems the
> > > > > > > > only override available with this proposal would move GPU memory into a
> > > > > > > > lower tier, which is the opposite of what's needed there.
> > > > > > >
> > > > > > > How about we do 3 tiers now? dax kmem devices can be registered to
> > > > > > > tier 3. By default, all NUMA nodes can be registered at tier 2, and HBM or
> > > > > > > GPU can be enabled to register at tier 1.
> > > > > >
> > > > > > This makes sense. I will send an updated RFC based on the discussions so far.
> > > > >
> > > > > Are these tier numbers fixed? If so, it appears strange that the
> > > > > smallest tier number is 0 on some machines, but 1 on some other
> > > > > machines.
> > > >
> > > > When the kernel is configured to allow 3 tiers, we can always show all
> > > > 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> > > > some machines.
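As a toy illustration of the 3-tier proposal above (the tier numbering and the node-to-tier assignments here are assumptions drawn from this thread, not an actual kernel API), the demotion order implied by numbered tiers might look like:

```python
# Toy model of numbered memory tiers: a lower tier id means faster memory.
# Demotion moves pages from a tier to the next non-empty slower tier.
tiers = {
    0: [],        # e.g. HBM/GPU nodes (may be empty on many machines)
    1: [0, 1],    # DRAM nodes (the default tier)
    2: [2, 3],    # e.g. PMEM nodes registered by the dax/kmem driver
}

def demotion_target(tier_id):
    """Next non-empty slower tier to demote into, or None for the last tier."""
    for t in sorted(tiers):
        if t > tier_id and tiers[t]:
            return t
    return None

print(demotion_target(0))  # 1: a node hot-added to tier 0 would demote to DRAM
print(demotion_target(1))  # 2: DRAM demotes to PMEM
print(demotion_target(2))  # None: the slowest tier has no demotion target
```

This also suggests why an empty tier 0 is harmless in such a scheme: it simply never becomes a demotion target.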
> > >
> > > I still think that it's better to have no empty tiers among the memory
> > > tiers auto-generated by the kernel. Yes, the tier number will not be
> > > absolutely stable, but that only happens during system bootup in
> > > practice, so it's not a big issue IMHO.
> >
> > It should not be hard to hide empty tiers (e.g. tier-0) if we prefer.
> > But even if tier-0 is empty, we should still keep this tier in the
> > kernel and not move DRAM nodes into this tier. One reason is that an
> > HBM node might be hot-added into tier-0 at a later time.
>
> Yes. The in-kernel representation and the user space interface could be
> different.
>
> I have thought of something like below. We always make the main memory
> (DRAM here, CPU-local) tier 0. Then the slower memory will be
> positive: tier 1, 2, 3, ..., and the faster memory will be negative:
> tier -1, -2, -3, .... Then, a GPU driver can register its memory as tier
> -1. And the tier number could be more stable. But I'm not sure whether
> users will be happy with negative tier numbers.

Given that we have agreed that the tier id itself should not carry any
specific meaning to the userspace, and that what matters is the relative
tier order, I think it is better to avoid negative tier numbers.

> > > And, I still think it's better to make only N-1 of N tiers writable
> > > (or even readable). Consider what happens when "tier0" is written:
> > > how do we deal with nodes that were in "tier0" before but not after
> > > the write? One possible way is to put them into "tierN". And while a
> > > user is customizing the tiers, the union of the N tiers may be
> > > incomplete.
> >
> > The sysfs interfaces that I have in mind now are:
> >
> > * /sys/devices/system/memtier/memtierN/nodelist (N=0, 1, 2)
> >
> > This is read-only to list the memory nodes for a specific tier.
> >
> > * /sys/devices/system/node/nodeN/memtier (N=0, 1, ...)
> >
> > This is a read-write interface. When written, the kernel moves the
> > node into the user-specified tier.
> > No other nodes are affected.
> >
> > This interface should be able to avoid the above issue.
>
> Yes. This works too.
>
> Best Regards,
> Huang, Ying
>
> > > > BTW, the userspace should not assume a specific meaning of a
> > > > particular tier id, because it can change depending on the number of
> > > > tiers that the kernel is configured with. For example, the userspace
> > > > should not assume that tier-2 always means PMEM nodes. In a system
> > > > with 4 tiers, PMEM nodes may be in tier-3, not tier-2.
> > >
> > > Yes. This sounds good.
> > >
> > > Best Regards,
> > > Huang, Ying
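The semantics of the two proposed sysfs files can be sketched as a userspace-visible model (illustrative only; the paths come from the proposal above, and the key property is that a per-node write moves just that node):

```python
# Model of the proposed sysfs interface semantics (not kernel code).
# node_tier mirrors what each /sys/devices/system/node/nodeN/memtier
# file would report for an example 4-node, 3-tier machine.
node_tier = {0: 1, 1: 1, 2: 2, 3: 2}

def write_memtier(node, tier):
    """Emulate writing nodeN/memtier: only the named node changes tier."""
    node_tier[node] = tier

def read_nodelist(tier):
    """Emulate reading memtierN/nodelist: all nodes currently in that tier."""
    return sorted(n for n, t in node_tier.items() if t == tier)

write_memtier(3, 1)          # move node 3 from tier 2 into tier 1
print(read_nodelist(1))      # [0, 1, 3]
print(read_nodelist(2))      # [2] -- no other node was affected
```

Because each write names exactly one node, the "union of tiers becomes incomplete" problem raised above for writable per-tier nodelists cannot arise here.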