2022-05-02 10:42:22

by Wei Xu

Subject: RFC: Memory Tiering Kernel Interfaces

The current kernel has basic memory tiering support: inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node. Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

A tiering relationship between NUMA nodes in the form of a demotion path
is created during kernel initialization and updated when a NUMA
node is hot-added or hot-removed. The current implementation puts all
nodes with CPUs into the top tier, and then builds the tiering hierarchy
tier-by-tier by establishing the per-node demotion targets based on
the distances between nodes.

The current memory tiering interface needs to be improved to address
several important use cases:

* The current tiering initialization code always initializes
each memory-only NUMA node into a lower tier. But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into the top tier.

* The current tiering hierarchy always puts CPU nodes into the top
tier. But on a system with HBM (e.g. GPU memory) devices, these
memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
with CPUs are better placed in the next lower tier.

* Also, because the current tiering hierarchy always puts CPU nodes
into the top tier, when a CPU is hot-added (or hot-removed) and
turns a memory node from CPU-less into a CPU node (or vice
versa), the memory tiering hierarchy gets changed, even though no
memory node is added or removed. This can make the tiering
hierarchy much less stable.

* A higher tier node can only be demoted to selected nodes on the
next lower tier, not any other node from the next lower tier. This
strict, hard-coded demotion order does not work in all use cases
(e.g. some use cases may want to allow cross-socket demotion to
another node in the same demotion tier as a fallback when the
preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide,
per-node demotion order from the userspace.

* There are no interfaces for the userspace to learn about the memory
tiering hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tiering kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/[email protected]/T/
- https://lore.kernel.org/linux-mm/[email protected]/t/


Sysfs Interfaces
================

* /sys/devices/system/node/memory_tiers

Format: node list (one tier per line, in the tier order)

When read, list memory nodes by tiers.

When written (one tier per line), take the user-provided node-tier
assignment as the new tiering hierarchy and rebuild the per-node
demotion order. It is allowed to only override the top tiers, in
which case the kernel will establish the lower tiers automatically.
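
As a concrete illustration, here is a minimal userspace sketch of how a
tool could read the proposed file and push a partial override (the path
and the one-tier-per-line format are exactly as proposed above; the
override value "1" and the lack of cleanup are only for illustration):

#include <stdio.h>

#define TIERS_PATH "/sys/devices/system/node/memory_tiers"

int main(void)
{
        char line[256];
        FILE *f = fopen(TIERS_PATH, "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* One line per tier, highest tier first. */
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);

        /* Override only the top tier; per the proposal, the kernel
         * would then establish the lower tiers automatically. */
        f = fopen(TIERS_PATH, "w");
        if (!f) {
                perror("fopen");
                return 1;
        }
        fputs("1\n", f);
        fclose(f);
        return 0;
}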


Kernel Representation
=====================

* nodemask_t node_states[N_TOPTIER_MEMORY]

Store all top-tier memory nodes.

* nodemask_t memory_tiers[MAX_TIERS]

Store memory nodes by tiers.

* struct demotion_nodes node_demotion[]

where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }

For a node N:

node_demotion[N].preferred lists all preferred demotion targets;

node_demotion[N].allowed lists all allowed demotion targets
(initialized to be all the nodes in the same demotion tier).


Tiering Hierarchy Initialization
================================

By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).

A device driver can remove its memory nodes from the top tier, e.g.
a dax driver can remove PMEM nodes from the top tier.

The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
best distance nodes in the next lower tier are assigned to
node_demotion[N].preferred and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.

node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.

If the userspace overrides the tiers via the memory_tiers sysfs
interface, the kernel then only rebuilds the per-node demotion order
accordingly.
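
For illustration, a rough sketch of how the per-node demotion order could
be rebuilt from a given tier assignment (not actual kernel code; it reuses
the structures sketched above and the existing node_distance()/nodemask
helpers, and the preferred-target heuristic is simplified compared to the
current implementation):

static void build_demotion_order(int nr_tiers)
{
        int tier, node, target;

        for (tier = 0; tier < nr_tiers - 1; tier++) {
                nodemask_t *lower = &memory_tiers[tier + 1];

                for_each_node_mask(node, memory_tiers[tier]) {
                        int best = INT_MAX;

                        /* All nodes in the next lower tier are allowed. */
                        node_demotion[node].allowed = *lower;
                        nodes_clear(node_demotion[node].preferred);

                        /* Preferred targets are the best-distance nodes. */
                        for_each_node_mask(target, *lower)
                                best = min(best, node_distance(node, target));
                        for_each_node_mask(target, *lower)
                                if (node_distance(node, target) == best)
                                        node_set(target,
                                                 node_demotion[node].preferred);
                }
        }
}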

Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
node.


Memory Allocation for Demotion
==============================

When allocating a new demotion target page, both a preferred node
and the allowed nodemask are provided to the allocation function.
The default kernel allocation fallback order is used to allocate the
page from the specified node and nodemask.

The mempolicy of the cpuset, VMA and owner task of the source page can
be set to refine the demotion nodemask, e.g. to prevent demotion or
select a particular allowed node as the demotion target.
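
A minimal sketch of the demotion allocation described above (illustrative;
the function name and GFP mask are assumptions, node_demotion[] is the
sketch from earlier, while __alloc_pages() and first_node() are existing
kernel interfaces that already take a preferred node and a nodemask):

static struct page *alloc_demotion_page(int src_nid)
{
        gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_NOWARN;
        int preferred = first_node(node_demotion[src_nid].preferred);

        /* No preferred target: fall back within the allowed mask only. */
        if (preferred >= MAX_NUMNODES)
                preferred = src_nid;

        return __alloc_pages(gfp, 0, preferred,
                             &node_demotion[src_nid].allowed);
}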


Examples
========

* Example 1:
Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

Node 0 has node 2 as the preferred demotion target and can also
fallback demotion to node 3.

Node 1 has node 3 as the preferred demotion target and can also
fallback demotion to node 2.

Set mempolicy to prevent cross-socket demotion and memory access,
e.g. cpuset.mems=0,2

node distances:
node   0   1   2   3
   0  10  20  30  40
   1  20  10  40  30
   2  30  40  10  40
   3  40  30  40  10

/sys/devices/system/node/memory_tiers
0-1
2-3

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
0: [2], [2-3]
1: [3], [2-3]
2: [], []
3: [], []

* Example 2:
Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

Node 0 has node 2 as the preferred and only demotion target.

Node 1 has no preferred demotion target, but can still demote
to node 2.

Set mempolicy to prevent cross-socket demotion and memory access,
e.g. cpuset.mems=0,2

node distances:
node   0   1   2
   0  10  20  30
   1  20  10  40
   2  30  40  10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
0: [2], [2]
1: [], [2]
2: [], []


* Example 3:
Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and has the same distance to node 0 & 1.

Node 0 has node 2 as the preferred and only demotion target.

Node 1 has node 2 as the preferred and only demotion target.

node distances:
node   0   1   2
   0  10  20  30
   1  20  10  30
   2  30  30  10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
0: [2], [2]
1: [2], [2]
2: [], []


* Example 4:
Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

All nodes are top-tier.

node distances:
node   0   1   2
   0  10  20  30
   1  20  10  30
   2  30  30  10

/sys/devices/system/node/memory_tiers
0-2

N_TOPTIER_MEMORY: 0-2

node_demotion[]:
0: [], []
1: [], []
2: [], []


* Example 5:
Node 0 is a DRAM node with CPU.
Node 1 is a HBM node.
Node 2 is a PMEM node.

With userspace override, node 1 is the top tier and has node 0 as
the preferred and only demotion target.

Node 0 is in the second tier, tier 1, and has node 2 as the
preferred and only demotion target.

Node 2 is in the lowest tier, tier 2, and has no demotion targets.

node distances:
node   0   1   2
   0  10  21  30
   1  21  10  40
   2  30  40  10

/sys/devices/system/node/memory_tiers (userspace override)
1
0
2

N_TOPTIER_MEMORY: 1

node_demotion[]:
0: [2], [2]
1: [0], [0]
2: [], []

-- Wei


2022-05-02 23:35:49

by David Rientjes

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Sun, 1 May 2022, Davidlohr Bueso wrote:

> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
>

Hi Davidlohr,

I tend to agree with this and we've been discussing potential hardware
assistance for page heatmaps as well, but not as an extension of sampling
techniques that rely on the page table Accessed bit.

Have you thought about what hardware could give us here that would allow
us to identify the set of hottest (or coldest) pages over a range so that
we don't need to iterate through it?

Adding Yuanchu Xie <[email protected]> who has been looking into this
recently.

> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed. The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> > each memory-only NUMA node into a lower tier. But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
>
> > Tiering Hierarchy Initialization
> > ================================
> >
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
> >
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> >
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
>
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.
>
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr
>
>

2022-05-02 23:52:46

by Aneesh Kumar K.V

Subject: Re: RFC: Memory Tiering Kernel Interfaces

Davidlohr Bueso <[email protected]> writes:

> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.

Will there be an online option this time? If so, I would like to
participate in this discussion. I have not closely followed the LSF/MM
details this year, so I am not sure how to get the online attendance
request out.

>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>>The current kernel has the basic memory tiering support: Inactive
>>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>tier NUMA node to make room for new allocations on the higher tier
>>NUMA node. Frequently accessed pages on a lower tier NUMA node can be
>>migrated (promoted) to a higher tier NUMA node to improve the
>>performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.


Power10 hardware can do this. Right now we are looking at integrating
this with Multi-Gen LRU, but we haven't got it to work yet. One of the
challenges is how to estimate the relative hotness of a page compared to
the rest of the pages in the system. I am looking at random sampling of
the oldest generation pages (the page list in shrink_page_list) and using
the hottest and coldest pages in that random sample to determine the
hotness of a specific page and whether to reclaim it or not.
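
For illustration, a rough sketch of that sampling comparison (not the
actual Power10/MGLRU code; access_count stands in for whatever hotness
metric the hardware reports):

#include <stdbool.h>

struct page_sample {
        unsigned long access_count;
};

/*
 * Decide whether a page looks cold relative to a random sample taken
 * from the oldest generation.  Returns true if it should be reclaimed.
 */
static bool looks_cold(unsigned long page_count,
                       const struct page_sample *sample, int n)
{
        unsigned long hottest = 0, coldest = ~0UL;
        int i;

        if (n <= 0)
                return true;    /* nothing to compare against */

        for (i = 0; i < n; i++) {
                if (sample[i].access_count > hottest)
                        hottest = sample[i].access_count;
                if (sample[i].access_count < coldest)
                        coldest = sample[i].access_count;
        }

        /* Reclaim pages in the colder half of the sampled range. */
        return page_count <= coldest + (hottest - coldest) / 2;
}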

-aneesh

2022-05-02 23:52:55

by Aneesh Kumar K.V

Subject: Re: RFC: Memory Tiering Kernel Interfaces

Wei Xu <[email protected]> writes:

....

>
> Tiering Hierarchy Initialization
> ================================
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.

Should we make the tier in which to place the memory an option that
device drivers like the dax driver can select? Or should the dax driver
just express the desire to mark a specific memory-only NUMA node as a
demotion target, without explicitly specifying the tier in which it
should be placed? I would like to go for the latter and choose the tier
details based on the current memory tiers and the NUMA distance values
(even HMAT at some point in the future). The challenge with NUMA
distance, though, is which distance value we will pick. For example, in
your example 1:

node   0   1   2   3
   0  10  20  30  40
   1  20  10  40  30
   2  30  40  10  40
   3  40  30  40  10

When Node3 is registered, how do we decide whether to create a Tier2 or
add it to Tier1? We could say that devices that wish to be placed in the
same tier will have the same distance as the existing tier device, i.e.,
for the above case,

node_distance[2][2] == node_distance[2][3]? Can we expect the firmware
to have distance values like that?

>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.
>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> ==============================
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> ========
>
> * Example 1:
> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
> Node 0 has node 2 as the preferred demotion target and can also
> fallback demotion to node 3.
>
> Node 1 has node 3 as the preferred demotion target and can also
> fallback demotion to node 2.
>
> Set mempolicy to prevent cross-socket demotion and memory access,
> e.g. cpuset.mems=0,2
>
> node distances:
> node 0 1 2 3
> 0 10 20 30 40
> 1 20 10 40 30
> 2 30 40 10 40
> 3 40 30 40 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3

How can I make Node3 the demotion target for Node2 in this case? Can
we have one file for each tier? I.e., we start with
/sys/devices/system/node/memory_tier0. Removing a node with memory from
the above file/list results in the creation of new tiers.

/sys/devices/system/node/memory_tier0
0-1
/sys/devices/system/node/memory_tier1
2-3

echo 2 > /sys/devices/system/node/memory_tier1
/sys/devices/system/node/memory_tier1
2
/sys/devices/system/node/memory_tier2
3

>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2-3]
> 1: [3], [2-3]
> 2: [], []
> 3: [], []
>
> * Example 2:
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and closer to node 0.
>
> Node 0 has node 2 as the preferred and only demotion target.
>
> Node 1 has no preferred demotion target, but can still demote
> to node 2.
>
> Set mempolicy to prevent cross-socket demotion and memory access,
> e.g. cpuset.mems=0,2
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 40
> 2 30 40 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [], [2]
> 2: [], []
>
>
> * Example 3:
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
> Node 0 has node 2 as the preferred and only demotion target.
>
> Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 30
> 2 30 30 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [2], [2]
> 2: [], []
>
>
> * Example 4:
> Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
> All nodes are top-tier.
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 30
> 2 30 30 10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
> 0: [], []
> 1: [], []
> 2: [], []
>
>
> * Example 5:
> Node 0 is a DRAM node with CPU.
> Node 1 is a HBM node.
> Node 2 is a PMEM node.
>
> With userspace override, node 1 is the top tier and has node 0 as
> the preferred and only demotion target.
>
> Node 0 is in the second tier, tier 1, and has node 2 as the
> preferred and only demotion target.
>
> Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node 0 1 2
> 0 10 21 30
> 1 21 10 40
> 2 30 40 10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [0], [0]
> 2: [], []
>
> -- Wei

2022-05-03 00:29:50

by Dan Williams

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <[email protected]> wrote:
>
> Hi Wei,
>
> Thanks for the nice writing. Please see the below inline comments.
>
> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <[email protected]> wrote:
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed. The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> > each memory-only NUMA node into a lower tier. But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM (e.g. GPU memory) devices, these
> > memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> > into the top tier, when a CPU is hot-added (or hot-removed) and
> > triggers a memory node from CPU-less into a CPU node (or vice
> > versa), the memory tiering hierarchy gets changed, even though no
> > memory node is added or removed. This can make the tiering
> > hierarchy much less stable.
>
> I'd prefer the firmware builds up tiers topology then passes it to
> kernel so that kernel knows what nodes are in what tiers. No matter
> what nodes are hot-removed/hot-added they always stay in their tiers
> defined by the firmware. I think this is important information like
> numa distances. NUMA distance alone can't satisfy all the usecases
> IMHO.

Just want to note here that the platform firmware can only describe
the tiers of static memory present at boot. CXL hotplug breaks this
model and the kernel is left to dynamically determine the device's
performance characteristics and the performance of the topology to
reach that device. Now, the platform firmware does set expectations
for the performance class of different memory ranges, but there is no
way to know in advance the performance of devices that will be asked
to be physically or logically added to the memory configuration. That
said, it's probably still too early to define ABI for those
exceptional cases where the kernel needs to make a policy decision
about a device that does not fit into the firmware's performance
expectations, but just note that there are limits to the description
that platform firmware can provide.

I agree that NUMA distance alone is inadequate and the kernel needs to
make better use of data like ACPI HMAT to determine the default
tiering order.

2022-05-03 00:37:04

by Dave Hansen

Subject: Re: RFC: Memory Tiering Kernel Interfaces

> The current memory tiering interface needs to be improved to address
> several important use cases:

FWIW, I totally agree. We knew when that code went in that the default
ordering was feeble. There were patches to export the demotion order
and allow it to be modified from userspace, but they were jettisoned at
some point.

> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.

Yeah, this would be a welcome improvement if we can get there.

> * /sys/devices/system/node/memory_tiers
>
> Format: node list (one tier per line, in the tier order)
>
> When read, list memory nodes by tiers.

Nit: this would seem to violate the one-value-per-file sysfs guideline.
It can be fixed by making tiers actual objects, which would have some
other nice benefits too.

2022-05-03 00:56:17

by Davidlohr Bueso

Subject: Re: RFC: Memory Tiering Kernel Interfaces

Nice summary, thanks. I don't know which of the interested parties will be
at LSF/MM, but FYI we have a couple of sessions on memory tiering Tuesday
at 14:00 and 15:00.

On Fri, 29 Apr 2022, Wei Xu wrote:

>The current kernel has the basic memory tiering support: Inactive
>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>tier NUMA node to make room for new allocations on the higher tier
>NUMA node. Frequently accessed pages on a lower tier NUMA node can be
>migrated (promoted) to a higher tier NUMA node to improve the
>performance.

Regardless of the promotion algorithm, at some point I see the NUMA hinting
fault mechanism being in the way of performance. It would be nice if hardware
began giving us page "heatmaps" instead of having to rely on faulting or
sampling based ways to identify hot memory.

>A tiering relationship between NUMA nodes in the form of demotion path
>is created during the kernel initialization and updated when a NUMA
>node is hot-added or hot-removed. The current implementation puts all
>nodes with CPU into the top tier, and then builds the tiering hierarchy
>tier-by-tier by establishing the per-node demotion targets based on
>the distances between nodes.
>
>The current memory tiering interface needs to be improved to address
>several important use cases:
>
>* The current tiering initialization code always initializes
> each memory-only NUMA node into a lower tier. But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into the top tier.

At least the CXL memory (volatile or not) will still be slower than
regular DRAM, so I think that we'd not want this to be top-tier. But
in general, yes, I agree that defining top tier as whether or not the
node has a CPU is a bit limiting, as you've detailed here.

>Tiering Hierarchy Initialization
>================================
>
>By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
>A device driver can remove its memory nodes from the top tier, e.g.
>a dax driver can remove PMEM nodes from the top tier.
>
>The kernel builds the memory tiering hierarchy and per-node demotion
>order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
>best distance nodes in the next lower tier are assigned to
>node_demotion[N].preferred and all the nodes in the next lower tier
>are assigned to node_demotion[N].allowed.
>
>node_demotion[N].preferred can be empty if no preferred demotion node
>is available for node N.

In cases where there is more than one possible demotion node (with equal
cost), I'm wondering if we want to do something better than choosing
randomly, like we do now - perhaps round robin? Of course anything
like this will require actual performance data, something I have seen
very little of.

>Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>node.

I think this makes sense.

Thanks,
Davidlohr

2022-05-03 01:08:24

by Yang Shi

Subject: Re: RFC: Memory Tiering Kernel Interfaces

Hi Wei,

Thanks for the nice writing. Please see the below inline comments.

On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <[email protected]> wrote:
>
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed. The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.
>
> The current memory tiering interface needs to be improved to address
> several important use cases:
>
> * The current tiering initialization code always initializes
> each memory-only NUMA node into a lower tier. But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into the top tier.
>
> * The current tiering hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM (e.g. GPU memory) devices, these
> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tiering hierarchy always puts CPU nodes
> into the top tier, when a CPU is hot-added (or hot-removed) and
> triggers a memory node from CPU-less into a CPU node (or vice
> versa), the memory tiering hierarchy gets changed, even though no
> memory node is added or removed. This can make the tiering
> hierarchy much less stable.

I'd prefer that the firmware build up the tier topology and then pass it
to the kernel so that the kernel knows which nodes are in which tiers. No
matter which nodes are hot-removed/hot-added, they always stay in the
tiers defined by the firmware. I think this is important information,
like NUMA distances. NUMA distance alone can't satisfy all the use cases
IMHO.

>
> * A higher tier node can only be demoted to selected nodes on the
> next lower tier, not any other node from the next lower tier. This
> strict, hard-coded demotion order does not work in all use cases
> (e.g. some use cases may want to allow cross-socket demotion to
> another node in the same demotion tier as a fallback when the
> preferred demotion node is out of space), and has resulted in the
> feature request for an interface to override the system-wide,
> per-node demotion order from the userspace.
>
> * There are no interfaces for the userspace to learn about the memory
> tiering hierarchy in order to optimize its memory allocations.
>
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
>
> - https://lore.kernel.org/lkml/[email protected]/T/
> - https://lore.kernel.org/linux-mm/[email protected]/t/
>
>
> Sysfs Interfaces
> ================
>
> * /sys/devices/system/node/memory_tiers
>
> Format: node list (one tier per line, in the tier order)
>
> When read, list memory nodes by tiers.
>
> When written (one tier per line), take the user-provided node-tier
> assignment as the new tiering hierarchy and rebuild the per-node
> demotion order. It is allowed to only override the top tiers, in
> which cases, the kernel will establish the lower tiers automatically.

TBH I still think it is too soon to define proper user visible
interfaces for now, particularly for override.

>
>
> Kernel Representation
> =====================
>
> * nodemask_t node_states[N_TOPTIER_MEMORY]
>
> Store all top-tier memory nodes.
>
> * nodemask_t memory_tiers[MAX_TIERS]
>
> Store memory nodes by tiers.

I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
top tier. The kernel could build this with the topology built by
firmware.

>
> * struct demotion_nodes node_demotion[]
>
> where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>
> For a node N:
>
> node_demotion[N].preferred lists all preferred demotion targets;
>
> node_demotion[N].allowed lists all allowed demotion targets
> (initialized to be all the nodes in the same demotion tier).

It seems unnecessary to define preferred and allowed IMHO. Why not
just use something like the allocation fallback list? The first node
in the list is the preferred one. When allocating memory for demotion,
convert the list to a nodemask, then call __alloc_pages(gfp, order,
first_node, nodemask). So the allocation could fall back to the allowed
nodes automatically.
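
A minimal sketch of that idea (illustrative; demotion_list[]/demotion_len[]
are assumed here just to show the shape, while __alloc_pages(), node_set()
and NODE_MASK_NONE are existing kernel interfaces):

/* Assumed per-node demotion lists, first entry = preferred target. */
static int demotion_list[MAX_NUMNODES][MAX_NUMNODES];
static int demotion_len[MAX_NUMNODES];

static struct page *demote_alloc(int src_nid, gfp_t gfp)
{
        nodemask_t allowed = NODE_MASK_NONE;
        int i, first = NUMA_NO_NODE;

        for (i = 0; i < demotion_len[src_nid]; i++) {
                int nid = demotion_list[src_nid][i];

                if (first == NUMA_NO_NODE)
                        first = nid;    /* preferred target */
                node_set(nid, allowed);
        }
        if (first == NUMA_NO_NODE)
                return NULL;            /* no demotion target */

        /* Allocation falls back through the allowed nodes automatically. */
        return __alloc_pages(gfp, 0, first, &allowed);
}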

>
>
> Tiering Hierarchy Initialization
> ================================
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.

With the topology built by firmware we should not need this.

>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.

I'm not sure whether it should be allowed to demote to multiple lower
tiers. But it is totally fine to *NOT* allow it at the moment. Once we
figure out a good way to define demotion targets, it could be extended
to support this easily.

>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> ==============================
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> ========
>
> * Example 1:
> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
> Node 0 has node 2 as the preferred demotion target and can also
> fallback demotion to node 3.
>
> Node 1 has node 3 as the preferred demotion target and can also
> fallback demotion to node 2.
>
> Set mempolicy to prevent cross-socket demotion and memory access,
> e.g. cpuset.mems=0,2
>
> node distances:
> node 0 1 2 3
> 0 10 20 30 40
> 1 20 10 40 30
> 2 30 40 10 40
> 3 40 30 40 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2-3]
> 1: [3], [2-3]
> 2: [], []
> 3: [], []
>
> * Example 2:
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and closer to node 0.
>
> Node 0 has node 2 as the preferred and only demotion target.
>
> Node 1 has no preferred demotion target, but can still demote
> to node 2.
>
> Set mempolicy to prevent cross-socket demotion and memory access,
> e.g. cpuset.mems=0,2
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 40
> 2 30 40 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [], [2]
> 2: [], []
>
>
> * Example 3:
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
> Node 0 has node 2 as the preferred and only demotion target.
>
> Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 30
> 2 30 30 10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [2], [2]
> 2: [], []
>
>
> * Example 4:
> Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
> All nodes are top-tier.
>
> node distances:
> node 0 1 2
> 0 10 20 30
> 1 20 10 30
> 2 30 30 10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
> 0: [], []
> 1: [], []
> 2: [], []
>
>
> * Example 5:
> Node 0 is a DRAM node with CPU.
> Node 1 is a HBM node.
> Node 2 is a PMEM node.
>
> With userspace override, node 1 is the top tier and has node 0 as
> the preferred and only demotion target.
>
> Node 0 is in the second tier, tier 1, and has node 2 as the
> preferred and only demotion target.
>
> Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node 0 1 2
> 0 10 21 30
> 1 21 10 40
> 2 30 40 10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
> 0: [2], [2]
> 1: [0], [0]
> 2: [], []
>
> -- Wei

2022-05-03 02:06:42

by Baolin Wang

Subject: Re: RFC: Memory Tiering Kernel Interfaces



On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if
> hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
>
>> A tiering relationship between NUMA nodes in the form of demotion path
>> is created during the kernel initialization and updated when a NUMA
>> node is hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>> tier-by-tier by establishing the per-node demotion targets based on
>> the distances between nodes.
>>
>> The current memory tiering interface needs to be improved to address
>> several important use cases:
>>
>> * The current tiering initialization code always initializes
>>  each memory-only NUMA node into a lower tier.  But a memory-only
>>  NUMA node may have a high performance memory device (e.g. a DRAM
>>  device attached via CXL.mem or a DRAM-backed memory-only node on
>>  a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
>
>> Tiering Hierarchy Initialization
>> ================================
>>
>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>> A device driver can remove its memory nodes from the top tier, e.g.
>> a dax driver can remove PMEM nodes from the top tier.
>>
>> The kernel builds the memory tiering hierarchy and per-node demotion
>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> best distance nodes in the next lower tier are assigned to
>> node_demotion[N].preferred and all the nodes in the next lower tier
>> are assigned to node_demotion[N].allowed.
>>
>> node_demotion[N].preferred can be empty if no preferred demotion node
>> is available for node N.
>
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I've tried to use round robin [1] to select a target demotion node when
there are multiple demotion nodes; however, I did not see any obvious
performance gain with MySQL testing. Maybe use other test suites?

[1] https://lore.kernel.org/all/c02b[email protected]linux.alibaba.com/

2022-05-03 06:08:30

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Sun, May 1, 2022 at 11:09 AM Davidlohr Bueso <[email protected]> wrote:
>
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
> >The current kernel has the basic memory tiering support: Inactive
> >pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >tier NUMA node to make room for new allocations on the higher tier
> >NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> >migrated (promoted) to a higher tier NUMA node to improve the
> >performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.

I agree with your comments on both NUMA hinting faults and
hardware-assisted "heatmaps".


> >A tiering relationship between NUMA nodes in the form of demotion path
> >is created during the kernel initialization and updated when a NUMA
> >node is hot-added or hot-removed. The current implementation puts all
> >nodes with CPU into the top tier, and then builds the tiering hierarchy
> >tier-by-tier by establishing the per-node demotion targets based on
> >the distances between nodes.
> >
> >The current memory tiering interface needs to be improved to address
> >several important use cases:
> >
> >* The current tiering initialization code always initializes
> > each memory-only NUMA node into a lower tier. But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
>
> >Tiering Hierarchy Initialization
> >================================
> >
> >By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> >A device driver can remove its memory nodes from the top tier, e.g.
> >a dax driver can remove PMEM nodes from the top tier.
> >
> >The kernel builds the memory tiering hierarchy and per-node demotion
> >order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
> >best distance nodes in the next lower tier are assigned to
> >node_demotion[N].preferred and all the nodes in the next lower tier
> >are assigned to node_demotion[N].allowed.
> >
> >node_demotion[N].preferred can be empty if no preferred demotion node
> >is available for node N.
>
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I'd prefer that the demotion node selection follow the way the
kernel selects the node/zone for normal allocations. If we want to
group several demotion nodes with equal cost together (e.g. to better
utilize the bandwidth from these nodes), we'd better improve such
an optimization in __alloc_pages_nodemask() to benefit normal
allocations as well.

> >Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr

2022-05-03 06:47:48

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Sun, May 1, 2022 at 11:35 AM Dan Williams <[email protected]> wrote:
>
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <[email protected]> wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
> >
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <[email protected]> wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed. The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > > each memory-only NUMA node into a lower tier. But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > > into the top tier, when a CPU is hot-added (or hot-removed) and
> > > triggers a memory node from CPU-less into a CPU node (or vice
> > > versa), the memory tiering hierarchy gets changed, even though no
> > > memory node is added or removed. This can make the tiering
> > > hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
>
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the perfomance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.
>
> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.

Very useful clarification. It should be fine for the kernel to
dynamically determine the memory tier of each node. I expect that it
can also be fine even if a node gets attached to a different memory
device and needs to be assigned into a different tier after another
round of hot-remove/hot-add.

What can be problematic is that a hot-added node not only changes its
own tier, but also causes other existing nodes to change their tiers.
This can mess up any tier-based memory accounting.

One approach to address this is to:

- have tiers be well-defined and stable, e.g. HBM is always in
tier-0, direct-attached DRAM and high-performance CXL.mem devices are
always in tier-1, slower CXL.mem devices are always in tier-2, and
PMEM is always in tier-3. The tier definition is based on the device
performance, something similar to the class rating of storage devices
(e.g. SD cards).

- allow tiers to be absent in the system, e.g. a machine may have only
tier-1 and tier-3, but have neither tier-0 nor tier-2.

- allow demotion not only to the immediate next lower tier, but to all
lower tiers. The actual selection of demotion order follows the
allocation fallback order. This allows tier-1 to directly demote to
tier-3 without requiring the presence of tier-2.

This approach can ensure that the tiers of existing nodes are stable
and permit that the tier of a hot-plugged node is determined
dynamically.
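
To make the idea a bit more concrete, a sketch of what such a fixed,
device-class-based tier scale could look like (names and the specific
class-to-tier assignments are purely illustrative, not a proposed ABI):

enum memory_tier_id {
        MEMTIER_HBM  = 0,       /* e.g. HBM / GPU memory */
        MEMTIER_DRAM = 1,       /* direct-attached DRAM, fast CXL.mem */
        MEMTIER_SLOW = 2,       /* slower CXL.mem devices */
        MEMTIER_PMEM = 3,       /* PMEM */
        MEMTIER_MAX,
};

/* Tiers absent in a given system simply have an empty nodemask. */
static nodemask_t memtier_nodes[MEMTIER_MAX];

/*
 * A node's allowed demotion targets are all lower tiers, so e.g. a
 * tier-1 node can demote directly to tier-3 when tier-2 is absent.
 */
static void allowed_demotion_mask(enum memory_tier_id tier, nodemask_t *out)
{
        int t;

        nodes_clear(*out);
        for (t = tier + 1; t < MEMTIER_MAX; t++)
                nodes_or(*out, *out, memtier_nodes[t]);
}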

2022-05-03 09:21:23

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Mon, May 2, 2022 at 8:20 AM Dave Hansen <[email protected]> wrote:
>
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
>
> FWIW, I totally agree. We knew when that code went in that the default
> ordering was feeble. There were patches to export the demotion order
> and allow it to be modified from userspace, but they were jettisoned at
> some point.
>
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
>
> Yeah, this would be a welcome improvement if we can get there.
>
> > * /sys/devices/system/node/memory_tiers
> >
> > Format: node list (one tier per line, in the tier order)
> >
> > When read, list memory nodes by tiers.
>
> Nit: this would seems to violate the one-value-per-file sysfs guideline.
> It can be fixed by making tiers actual objects, which would have some
> other nice benefits too.
>

Good point. One tier per file should work as well. It can be even
better to have a separate tier sub-tree.
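
For example, a hypothetical tier sub-tree (purely illustrative, not a
proposed ABI) could expose one directory per tier with one value per file:

/sys/devices/system/memtier/memtier0/nodelist    -> 0-1
/sys/devices/system/memtier/memtier1/nodelist    -> 2-3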

2022-05-03 09:47:15

by Wei Xu

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Sun, May 1, 2022 at 11:25 PM Aneesh Kumar K.V
<[email protected]> wrote:
>
> Wei Xu <[email protected]> writes:
>
> ....
>
> >
> > Tiering Hierarchy Initialization
> > ================================
> >
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
>
> Should we look at the tier in which to place the memory an option that
> device drivers like dax driver can select? Or dax driver just selects
> the desire to mark a specific memory only numa node as demotion target
> and won't explicity specify the tier in which it should be placed. I
> would like to go for the later and choose the tier details based on the
> current memory tiers and the NUMA distance value (even HMAT at some
> point in the future).

This is what has been proposed here. The driver doesn't determine
which particular tier the node should be placed in. It just removes
the node from the top-tier (i.e. making the node a demotion target).
The actual tier of the node is determined based on all the nodes and
their NUMA distance values.

> The challenge with NUMA distance though is which
> distance value we will pick. For example, in your example1.
>
> node 0 1 2 3
> 0 10 20 30 40
> 1 20 10 40 30
> 2 30 40 10 40
> 3 40 30 40 10
>
> When Node3 is registered, how do we decide to create a Tier2 or add it
> to Tier1? .

This proposal assumes a breadth-first search in tier construction,
which is also how the current implementation works. In this example,
the top-tier nodes are [0,1]. We then find a best demotion node for
each of [0,1] and get [0->2, 1->3]. Now we have two tiers: [0,1],
[2,3], and the search terminates.

But this algorithm doesn't work if there is no node 1 and we still
want node 2 & 3 in the same tier. Without the additional hardware
information such as HMAT, we will need a way to override the default
tier definition.

> We could say devices that wish to be placed in the same tier
> will have same distance as the existing tier device ie, for the above
> case,
>
> node_distance[2][2] == node_distance[2][3] ? Can we expect the firmware
> to have distance value like that?

node_distance[2][2] is local, which should be smaller than
node_distance[2][3]. I expect that this should be the case in normal
firmware.

> >
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> >
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> >
> > If the userspace overrides the tiers via the memory_tiers sysfs
> > interface, the kernel then only rebuilds the per-node demotion order
> > accordingly.
> >
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> >
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > When allocating a new demotion target page, both a preferred node
> > and the allowed nodemask are provided to the allocation function.
> > The default kernel allocation fallback order is used to allocate the
> > page from the specified node and nodemask.
> >
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion nodemask, e.g. to prevent demotion or
> > select a particular allowed node as the demotion target.
> >
> >
> > Examples
> > ========
> >
> > * Example 1:
> > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >
> > Node 0 has node 2 as the preferred demotion target and can also
> > fallback demotion to node 3.
> >
> > Node 1 has node 3 as the preferred demotion target and can also
> > fallback demotion to node 2.
> >
> > Set mempolicy to prevent cross-socket demotion and memory access,
> > e.g. cpuset.mems=0,2
> >
> > node distances:
> > node 0 1 2 3
> > 0 10 20 30 40
> > 1 20 10 40 30
> > 2 30 40 10 40
> > 3 40 30 40 10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2-3
>
> How can I make Node3 the demotion target for Node2 in this case? Can
> we have one file for each tier? ie, we start with
> /sys/devices/system/node/memory_tier0. Removing a node with memory from
> the above file/list results in the creation of new tiers.
>
> /sys/devices/system/node/memory_tier0
> 0-1
> /sys/devices/system/node/memory_tier1
> 2-3
>
> echo 2 > /sys/devices/system/node/memory_tier1
> /sys/devices/system/node/memory_tier1
> 2
> /sys/devices/system/node/memory_tier2
> 3

The proposal does something similar, except using a single file: memory_tiers.

Another idea is to pass the tier override from a kernel boot argument,
though it is challenging to deal with hot-plugged nodes.

> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> > 0: [2], [2-3]
> > 1: [3], [2-3]
> > 2: [], []
> > 3: [], []
> >
> > * Example 2:
> > Node 0 & 1 are DRAM nodes.
> > Node 2 is a PMEM node and closer to node 0.
> >
> > Node 0 has node 2 as the preferred and only demotion target.
> >
> > Node 1 has no preferred demotion target, but can still demote
> > to node 2.
> >
> > Set mempolicy to prevent cross-socket demotion and memory access,
> > e.g. cpuset.mems=0,2
> >
> > node distances:
> > node 0 1 2
> > 0 10 20 30
> > 1 20 10 40
> > 2 30 40 10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> > 0: [2], [2]
> > 1: [], [2]
> > 2: [], []
> >
> >
> > * Example 3:
> > Node 0 & 1 are DRAM nodes.
> > Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >
> > Node 0 has node 2 as the preferred and only demotion target.
> >
> > Node 1 has node 2 as the preferred and only demotion target.
> >
> > node distances:
> > node 0 1 2
> > 0 10 20 30
> > 1 20 10 30
> > 2 30 30 10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> > 0: [2], [2]
> > 1: [2], [2]
> > 2: [], []
> >
> >
> > * Example 4:
> > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >
> > All nodes are top-tier.
> >
> > node distances:
> > node 0 1 2
> > 0 10 20 30
> > 1 20 10 30
> > 2 30 30 10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-2
> >
> > N_TOPTIER_MEMORY: 0-2
> >
> > node_demotion[]:
> > 0: [], []
> > 1: [], []
> > 2: [], []
> >
> >
> > * Example 5:
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a HBM node.
> > Node 2 is a PMEM node.
> >
> > With userspace override, node 1 is the top tier and has node 0 as
> > the preferred and only demotion target.
> >
> > Node 0 is in the second tier, tier 1, and has node 2 as the
> > preferred and only demotion target.
> >
> > Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >
> > node distances:
> > node 0 1 2
> > 0 10 21 30
> > 1 21 10 40
> > 2 30 40 10
> >
> > /sys/devices/system/node/memory_tiers (userspace override)
> > 1
> > 0
> > 2
> >
> > N_TOPTIER_MEMORY: 1
> >
> > node_demotion[]:
> > 0: [2], [2]
> > 1: [0], [0]
> > 2: [], []
> >
> > -- Wei

2022-05-03 22:51:52

by Tim Chen

Subject: Re: RFC: Memory Tiering Kernel Interfaces

On Fri, 2022-04-29 at 19:10 -0700, Wei Xu wrote:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node. Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed. The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.

Thanks for making this proposal. It has many of the elements needed
for tiering support.

>
> The current memory tiering interface needs to be improved to address
> several important use cases:
>
> * The current tiering initialization code always initializes
> each memory-only NUMA node into a lower tier. But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into the top tier.
>
> * The current tiering hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM (e.g. GPU memory) devices, these
> memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tiering hierarchy always puts CPU nodes
> into the top tier, when a CPU is hot-added (or hot-removed) and
> triggers a memory node from CPU-less into a CPU node (or vice
> versa), the memory tiering hierarchy gets changed, even though no
> memory node is added or removed. This can make the tiering
> hierarchy much less stable.
>
> * A higher tier node can only be demoted to selected nodes on the
> next lower tier, not any other node from the next lower tier. This
> strict, hard-coded demotion order does not work in all use cases
> (e.g. some use cases may want to allow cross-socket demotion to
> another node in the same demotion tier as a fallback when the
> preferred demotion node is out of space), and has resulted in the
> feature request for an interface to override the system-wide,
> per-node demotion order from the userspace.
>
> * There are no interfaces for the userspace to learn about the memory
> tiering hierarchy in order to optimize its memory allocations.
>
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
>
> - https://lore.kernel.org/lkml/[email protected]/T/
> - https://lore.kernel.org/linux-mm/[email protected]/t/
>
>
> Sysfs Interfaces
> ================
>
> * /sys/devices/system/node/memory_tiers
>
> Format: node list (one tier per line, in the tier order)
>
> When read, list memory nodes by tiers.
>
> When written (one tier per line), take the user-provided node-tier
> assignment as the new tiering hierarchy and rebuild the per-node
> demotion order. It is allowed to only override the top tiers, in
> which cases, the kernel will establish the lower tiers automatically.
>
>
> Kernel Representation
> =====================
>
> * nodemask_t node_states[N_TOPTIER_MEMORY]
>
> Store all top-tier memory nodes.
>
> * nodemask_t memory_tiers[MAX_TIERS]
>
> Store memory nodes by tiers.
>
> * struct demotion_nodes node_demotion[]
>
> where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>
> For a node N:
>
> node_demotion[N].preferred lists all preferred demotion targets;
>
> node_demotion[N].allowed lists all allowed demotion targets
> (initialized to be all the nodes in the same demotion tier).
>

I assume that the preferred list is auto-configured/initialized based on
NUMA distances. I am not sure why the "allowed" list is restricted to a single
demotion tier, though. For example, I think the default should be that tier 0
is allowed to demote to both tier 1 and tier 2, not just to tier 1, so that if
we fail to demote to tier 1, we can still demote to tier 2.
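
For illustration, a rough sketch (not actual kernel code; the helper name and
the max_tier parameter are made up here) of what such a default could look
like on top of the memory_tiers[] and node_demotion[] arrays from the
proposal:

#include <linux/nodemask.h>

/*
 * Sketch only: make node_demotion[n].allowed the union of *all* lower
 * tiers instead of just the next one, so demotion can fall back from
 * tier 1 to tier 2 and beyond.
 */
static void build_allowed_demotion_targets(int max_tier)
{
	int tier, lower, node;

	for (tier = 0; tier < max_tier; tier++) {
		nodemask_t allowed = NODE_MASK_NONE;

		/* Collect every node from every tier below this one. */
		for (lower = tier + 1; lower <= max_tier; lower++)
			nodes_or(allowed, allowed, memory_tiers[lower]);

		for_each_node_mask(node, memory_tiers[tier])
			node_demotion[node].allowed = allowed;
	}
}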

Do you also expose the preferred demotion node and the allowed list via
/sys/devices/system/node/memory_tiers, as you have done in the examples?

> Examples
> ========
>
> * Example 2:
> Node 0 & 1 are DRAM nodes.
> Node 2 is a PMEM node and closer to node 0.
>
> Node 0 has node 2 as the preferred and only demotion target.
>
> Node 1 has no preferred demotion target, but can still demote
> to node 2.
>
> Set mempolicy to prevent cross-socket demotion and memory access,
> e.g. cpuset.mems=0,2

Do we expect to later allow configuration of the demotion list explicitly?
Something like:

echo "demotion 0 1 1-3" > /sys/devices/system/node/memory_tiers

to set the demotion list for node 0, where the preferred demotion node is 1
and the allowed demotion node list is 1-3.
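
Purely as an illustration of that hypothetical directive (the function name,
exact format and bounds checks here are assumptions, not part of the
proposal), the store-side parsing could look roughly like:

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/nodemask.h>

/* Sketch: parse "demotion <node> <preferred> <allowed-list>". */
static int parse_demotion_directive(const char *buf)
{
	unsigned int node, preferred;
	char allowed_buf[64];
	nodemask_t allowed;

	if (sscanf(buf, "demotion %u %u %63s", &node, &preferred, allowed_buf) != 3)
		return -EINVAL;

	if (node >= MAX_NUMNODES || preferred >= MAX_NUMNODES)
		return -EINVAL;

	if (nodelist_parse(allowed_buf, allowed))
		return -EINVAL;

	if (!node_online(node) || !node_online(preferred))
		return -EINVAL;

	nodes_clear(node_demotion[node].preferred);
	node_set(preferred, node_demotion[node].preferred);
	node_demotion[node].allowed = allowed;
	return 0;
}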

Thanks.

Tim


2022-05-04 00:24:26

by Dave Hansen

[permalink] [raw]
Subject: Re: RFC: Memory Tiering Kernel Interfaces

On 5/3/22 15:35, Alistair Popple wrote:
> Not entirely true. The GPUs on POWER9 have performance counters capable of
> collecting this kind of information for memory accessed from the GPU. I will
> admit though that sadly most people probably don't have a P9 sitting under their
> desk :)

Well, x86 CPUs have performance monitoring hardware that can
theoretically collect physical access information too. But this
performance monitoring hardware wasn't designed with this specific use
case in mind. So, in practice, these events (PEBS) weren't very useful
for driving memory tiering.

Are you saying that the GPUs on POWER9 have performance counters that
can drive memory tiering in practice? I'd be curious if there's working
code to show how they get used. Maybe the hardware is better than the
x86 PMU or the software consuming it is more clever than what we did.
But, I'd love to see it either way.

2022-05-04 05:45:25

by Alistair Popple

[permalink] [raw]
Subject: Re: RFC: Memory Tiering Kernel Interfaces

Davidlohr Bueso <[email protected]> writes:

> Nice summary, thanks. I don't know which of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>>The current kernel has the basic memory tiering support: Inactive
>>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>tier NUMA node to make room for new allocations on the higher tier
>>NUMA node. Frequently accessed pages on a lower tier NUMA node can be
>>migrated (promoted) to a higher tier NUMA node to improve the
>>performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.

Agreed. The existing NUMA faulting mechanism is already in the way of
performance on something like POWER9+Coherent GPUs. In that case, enabling the
NUMA faulting mechanism results in a performance decrease of multiple orders
of magnitude, to the point that the only reasonable configuration for that
system was to disable NUMA balancing for anything using the GPU.

I would certainly be interested in figuring out how HW could provide some sort
of heatmap to identify which pages are hot and which processing unit is using
them. Currently, users of these systems have to manually assign memory
policies to get any reasonable performance, both to disable NUMA balancing and
to make sure memory is allocated on the right node.
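
As an example of the kind of manual placement this currently requires (the
node number and size below are placeholders, not taken from the original
mail), something along these lines with libnuma, combined with turning off
NUMA balancing system-wide via /proc/sys/kernel/numa_balancing:

/*
 * Sketch: pin an allocation to a specific (e.g. GPU-attached) NUMA
 * node with libnuma.  Node 1 is a placeholder.  Build with -lnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t sz = 1 << 20;
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not supported on this system\n");
		return 1;
	}

	buf = numa_alloc_onnode(sz, 1);		/* bind the range to node 1 */
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 0, sz);			/* touch so pages land on node 1 */
	numa_free(buf, sz);
	return 0;
}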

- Alistair

>>A tiering relationship between NUMA nodes in the form of demotion path
>>is created during the kernel initialization and updated when a NUMA
>>node is hot-added or hot-removed. The current implementation puts all
>>nodes with CPU into the top tier, and then builds the tiering hierarchy
>>tier-by-tier by establishing the per-node demotion targets based on
>>the distances between nodes.
>>
>>The current memory tiering interface needs to be improved to address
>>several important use cases:
>>
>>* The current tiering initialization code always initializes
>> each memory-only NUMA node into a lower tier. But a memory-only
>> NUMA node may have a high performance memory device (e.g. a DRAM
>> device attached via CXL.mem or a DRAM-backed memory-only node on
>> a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes, I agree that defining the top tier by whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
>
>>Tiering Hierarchy Initialization
>>================================
>>
>>By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>>A device driver can remove its memory nodes from the top tier, e.g.
>>a dax driver can remove PMEM nodes from the top tier.
>>
>>The kernel builds the memory tiering hierarchy and per-node demotion
>>order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
>>best distance nodes in the next lower tier are assigned to
>>node_demotion[N].preferred and all the nodes in the next lower tier
>>are assigned to node_demotion[N].allowed.
>>
>>node_demotion[N].preferred can be empty if no preferred demotion node
>>is available for node N.
>
> In cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course, anything
> like this will require actual performance data, something I have seen
> very little of.
>
>>Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>>memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>>node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr

