2023-12-13 17:54:16

by Srinivasulu Opensrc

[permalink] [raw]
Subject: [RFC PATCH v2 0/2] Node migration between memory tiers

From: Srinivasulu Thanneeru <[email protected]>

The memory tiers feature allows nodes with similar memory types
or performance characteristics to be grouped together in a
memory tier. However, there is currently no provision for
moving a node from one tier to another on demand.

This patch series aims to support node migration between tiers
on demand by sysadmin/root user using the provided sysfs for
node migration.

To migrate a node to a tier, the corresponding node’s sysfs
memtier_override is written with target tier id.

Example: Move node2 to memory tier2 from its default tier(i.e 4)

1. To check current memtier of node2
$cat /sys/devices/system/node/node2/memtier_override
memory_tier4

2. To migrate node2 to memory_tier2
$echo 2 > /sys/devices/system/node/node2/memtier_override
$cat /sys/devices/system/node/node2/memtier_override
memory_tier2

Usecases:

1. Useful to move cxl nodes to the right tiers from userspace, when
the hardware fails to assign the tiers correctly based on
memorytypes.

On some platforms we have observed cxl memory being assigned to
the same tier as DDR memory. This is arguably a system firmware
bug, but it is true that tiers represent *ranges* of performance
and we believe it's important for the system operator to have
the ability to override bad firmware or OS decisions about tier
assignment as a fail-safe against potential bad outcomes.

2. Useful if we want interleave weights to be applied on memory tiers
instead of nodes.
In a previous thread, Huang Ying <[email protected]> thought
this feature might be useful to overcome limitations of systems
where nodes with different bandwidth characteristics are grouped
in a single tier.
https://lore.kernel.org/lkml/[email protected]/

=============
Version Notes:

V2 : Changed interface to memtier_override from adistance_offset.
memtier_override was recommended by
1. John Groves <[email protected]>
2. Ravi Shankar <[email protected]>
3. Brice Goglin <[email protected]>

V1 : Introduced adistance_offset sysfs.

=============

Srinivasulu Thanneeru (2):
base/node: Add sysfs for memtier_override
memory tier: Support node migration between tiers

Documentation/ABI/stable/sysfs-devices-node | 7 ++
drivers/base/node.c | 47 ++++++++++++
include/linux/memory-tiers.h | 11 +++
include/linux/node.h | 11 +++
mm/memory-tiers.c | 85 ++++++++++++---------
5 files changed, 125 insertions(+), 36 deletions(-)

--
2.25.1


2023-12-13 17:54:25

by Srinivasulu Opensrc

[permalink] [raw]
Subject: [PATCH 1/2] base/node: Add sysfs for memtier_override

From: Srinivasulu Thanneeru <[email protected]>

This patch introduces a new memtier_override sysfs.

memtier_override is the current memory tier of the node.
To migrate, replace it with the id of the desired memory tier.

adistance_offset is the required offset from memtype to move
the node to the target memory tier(i.e, memtier_override).

Signed-off-by: Srinivasulu Thanneeru <[email protected]>
Signed-off-by: Ravi Jonnalagadda <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 7 ++++
drivers/base/node.c | 41 +++++++++++++++++++++
include/linux/memory-tiers.h | 6 +++
include/linux/node.h | 6 +++
mm/memory-tiers.c | 19 +++++++++-
5 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 402af4b2b905..447a599cc536 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -70,6 +70,13 @@ Description:
Distance between the node and all the other nodes
in the system.

+What: /sys/devices/system/node/nodeX/memtier_overwrite
+Date: December 2023
+Contact: Srinivasulu Thanneeru <[email protected]>
+Description:
+ The current memory tier of the node.
+ To migrate, replace it with the id of the desired memory tier.
+
What: /sys/devices/system/node/nodeX/vmstat
Date: October 2002
Contact: Linux Memory Management list <[email protected]>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..788176b3585a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/memory.h>
+#include <linux/memory-tiers.h>
#include <linux/vmstat.h>
#include <linux/notifier.h>
#include <linux/node.h>
@@ -569,11 +570,49 @@ static ssize_t node_read_distance(struct device *dev,
}
static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);

+static ssize_t memtier_override_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ int len = 0;
+
+ len += sysfs_emit(buf, "memory_tier%d\n", node_devices[nid]->memtier);
+ return len;
+}
+
+static ssize_t memtier_override_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ int nid = dev->id;
+ int ret, memtier;
+
+ ret = kstrtoint(buf, 0, &memtier);
+
+ if (ret)
+ return ret;
+ if (memtier < 0 || memtier > MAX_MEMTIERID)
+ return -EINVAL;
+ if (node_devices[nid]->memtier == memtier)
+ return size;
+ ret = get_memtier_adistance_offset(nid, memtier);
+ node_devices[nid]->adistance_offset = ret;
+
+ return size;
+}
+static DEVICE_ATTR_RW(memtier_override);
+
+void set_node_memtierid(int node, int memtierid)
+{
+ node_devices[node]->memtier = memtierid;
+}
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+ &dev_attr_memtier_override.attr,
NULL
};

@@ -883,6 +922,8 @@ int __register_one_node(int nid)

INIT_LIST_HEAD(&node_devices[nid]->access_list);
node_init_caches(nid);
+ node_devices[nid]->memtier = 0;
+ node_devices[nid]->adistance_offset = 0;

return error;
}
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 1e39d27bee41..0dba8027e785 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -20,6 +20,11 @@
*/
#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

+/*
+ * Memory tier id is derived from abstract distance(signed 32bits)
+ */
+#define MAX_MEMTIERID (0xFFFFFFFF >> (MEMTIER_CHUNK_BITS + 1))
+
struct memory_tier;
struct memory_dev_type {
/* list of memory types that are part of same tier as this type */
@@ -48,6 +53,7 @@ int mt_calc_adistance(int node, int *adist);
int mt_set_default_dram_perf(int nid, struct node_hmem_attrs *perf,
const char *source);
int mt_perf_to_adistance(struct node_hmem_attrs *perf, int *adist);
+int get_memtier_adistance_offset(int node, int memtier);
#ifdef CONFIG_MIGRATION
int next_demotion_node(int node);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
diff --git a/include/linux/node.h b/include/linux/node.h
index 427a5975cf40..1c4f4be39db4 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -83,6 +83,8 @@ static inline void node_set_perf_attrs(unsigned int nid,
struct node {
struct device dev;
struct list_head access_list;
+ int memtier;
+ int adistance_offset;
#ifdef CONFIG_HMEM_REPORTING
struct list_head cache_attrs;
struct device *cache_dev;
@@ -138,6 +140,7 @@ extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
extern int register_memory_node_under_compute_node(unsigned int mem_nid,
unsigned int cpu_nid,
unsigned access);
+extern void set_node_memtierid(int node, int memtierid);
#else
static inline void node_dev_init(void)
{
@@ -165,6 +168,9 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
{
}
+static inline void set_node_memtierid(int node, int memtierid)
+{
+}
#endif

#define to_node(device) container_of(device, struct node, dev)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 8d5291add2bc..31ed3c577836 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -167,6 +167,21 @@ static const struct attribute_group *memtier_dev_groups[] = {
NULL
};

+int get_memtier_adistance_offset(int node, int memtier)
+{
+ struct memory_dev_type *memtype;
+ int adistance_offset;
+
+ memtype = node_memory_types[node].memtype;
+ /*
+ * Calculate the adistance offset required from memtype
+ * to move node to target memory tier.
+ */
+ adistance_offset = (memtier << MEMTIER_CHUNK_BITS) -
+ memtype->adistance;
+ return adistance_offset;
+}
+
static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
{
int ret;
@@ -497,8 +512,10 @@ static struct memory_tier *set_node_memory_tier(int node)
memtype = node_memory_types[node].memtype;
node_set(node, memtype->nodes);
memtier = find_create_memory_tier(memtype);
- if (!IS_ERR(memtier))
+ if (!IS_ERR(memtier)) {
rcu_assign_pointer(pgdat->memtier, memtier);
+ set_node_memtierid(node, memtier->dev.id);
+ }
return memtier;
}

--
2.25.1

2023-12-13 17:54:49

by Srinivasulu Opensrc

[permalink] [raw]
Subject: [PATCH 2/2] memory tier: Support node migration between tiers

From: Srinivasulu Thanneeru <[email protected]>

Node migration enables the grouping or migration of nodes
between tiers based on nodes' latencies and bandwidth characteristics.
Since nodes of the same memory-type can exist in different tiers and
can migrate from one tier to another, it is necessary to maintain
nodes per tier instead of maintaining a list of nodes grouped using
memory type(siblings) within the tier.

Signed-off-by: Srinivasulu Thanneeru <[email protected]>
---
drivers/base/node.c | 6 ++++
include/linux/memory-tiers.h | 5 +++
include/linux/node.h | 5 +++
mm/memory-tiers.c | 66 +++++++++++++++++-------------------
4 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 788176b3585a..179d9004e4f3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -597,6 +597,7 @@ static ssize_t memtier_override_store(struct device *dev,
return size;
ret = get_memtier_adistance_offset(nid, memtier);
node_devices[nid]->adistance_offset = ret;
+ node_memtier_change(nid);

return size;
}
@@ -607,6 +608,11 @@ void set_node_memtierid(int node, int memtierid)
node_devices[node]->memtier = memtierid;
}

+int get_node_adistance_offset(int node)
+{
+ return node_devices[node]->adistance_offset;
+}
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dba8027e785..b323c2e2e417 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -54,6 +54,7 @@ int mt_set_default_dram_perf(int nid, struct node_hmem_attrs *perf,
const char *source);
int mt_perf_to_adistance(struct node_hmem_attrs *perf, int *adist);
int get_memtier_adistance_offset(int node, int memtier);
+void node_memtier_change(int node);
#ifdef CONFIG_MIGRATION
int next_demotion_node(int node);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -142,5 +143,9 @@ static inline int mt_perf_to_adistance(struct node_hmem_attrs *perf, int *adist)
{
return -EIO;
}
+
+static inline void node_memtier_change(int node)
+{
+}
#endif /* CONFIG_NUMA */
#endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 1c4f4be39db4..da679577a271 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -141,6 +141,7 @@ extern int register_memory_node_under_compute_node(unsigned int mem_nid,
unsigned int cpu_nid,
unsigned access);
extern void set_node_memtierid(int node, int memtierid);
+extern int get_node_adistance_offset(int nid);
#else
static inline void node_dev_init(void)
{
@@ -171,6 +172,10 @@ static inline void unregister_memory_block_under_nodes(struct memory_block *mem_
static inline void set_node_memtierid(int node, int memtierid)
{
}
+static inline int get_node_adistance_offset(int nid)
+{
+ return 0;
+}
#endif

#define to_node(device) container_of(device, struct node, dev)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 31ed3c577836..66e1eae97e47 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -23,6 +23,8 @@ struct memory_tier {
struct device dev;
/* All the nodes that are part of all the lower memory tiers. */
nodemask_t lower_tier_mask;
+ /* Nodes linked to this tier*/
+ nodemask_t nodes;
};

struct demotion_nodes {
@@ -120,13 +122,7 @@ static inline struct memory_tier *to_memory_tier(struct device *device)

static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
{
- nodemask_t nodes = NODE_MASK_NONE;
- struct memory_dev_type *memtype;
-
- list_for_each_entry(memtype, &memtier->memory_types, tier_sibling)
- nodes_or(nodes, nodes, memtype->nodes);
-
- return nodes;
+ return memtier->nodes;
}

static void memory_tier_device_release(struct device *dev)
@@ -182,33 +178,22 @@ int get_memtier_adistance_offset(int node, int memtier)
return adistance_offset;
}

-static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype,
+ int tier_adistance)
{
int ret;
bool found_slot = false;
struct memory_tier *memtier, *new_memtier;
- int adistance = memtype->adistance;
+ int adistance;
unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;

lockdep_assert_held_once(&memory_tier_lock);

- adistance = round_down(adistance, memtier_adistance_chunk_size);
- /*
- * If the memtype is already part of a memory tier,
- * just return that.
- */
- if (!list_empty(&memtype->tier_sibling)) {
- list_for_each_entry(memtier, &memory_tiers, list) {
- if (adistance == memtier->adistance_start)
- return memtier;
- }
- WARN_ON(1);
- return ERR_PTR(-EINVAL);
- }
+ adistance = round_down(tier_adistance, memtier_adistance_chunk_size);

list_for_each_entry(memtier, &memory_tiers, list) {
if (adistance == memtier->adistance_start) {
- goto link_memtype;
+ return memtier;
} else if (adistance < memtier->adistance_start) {
found_slot = true;
break;
@@ -238,11 +223,8 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
put_device(&new_memtier->dev);
return ERR_PTR(ret);
}
- memtier = new_memtier;

-link_memtype:
- list_add(&memtype->tier_sibling, &memtier->memory_types);
- return memtier;
+ return new_memtier;
}

static struct memory_tier *__node_get_memory_tier(int node)
@@ -500,7 +482,7 @@ static struct memory_tier *set_node_memory_tier(int node)
struct memory_tier *memtier;
struct memory_dev_type *memtype;
pg_data_t *pgdat = NODE_DATA(node);
-
+ int tier_adistance;

lockdep_assert_held_once(&memory_tier_lock);

@@ -511,11 +493,15 @@ static struct memory_tier *set_node_memory_tier(int node)

memtype = node_memory_types[node].memtype;
node_set(node, memtype->nodes);
- memtier = find_create_memory_tier(memtype);
+ tier_adistance = get_node_adistance_offset(node);
+ tier_adistance = memtype->adistance + tier_adistance;
+
+ memtier = find_create_memory_tier(memtype, tier_adistance);
if (!IS_ERR(memtier)) {
rcu_assign_pointer(pgdat->memtier, memtier);
set_node_memtierid(node, memtier->dev.id);
}
+ node_set(node, memtier->nodes);
return memtier;
}

@@ -551,11 +537,9 @@ static bool clear_node_memory_tier(int node)
synchronize_rcu();
memtype = node_memory_types[node].memtype;
node_clear(node, memtype->nodes);
- if (nodes_empty(memtype->nodes)) {
- list_del_init(&memtype->tier_sibling);
- if (list_empty(&memtier->memory_types))
- destroy_memory_tier(memtier);
- }
+ node_clear(node, memtier->nodes);
+ if (nodes_empty(memtier->nodes))
+ destroy_memory_tier(memtier);
cleared = true;
}
return cleared;
@@ -578,7 +562,6 @@ struct memory_dev_type *alloc_memory_type(int adistance)
return ERR_PTR(-ENOMEM);

memtype->adistance = adistance;
- INIT_LIST_HEAD(&memtype->tier_sibling);
memtype->nodes = NODE_MASK_NONE;
kref_init(&memtype->kref);
return memtype;
@@ -618,6 +601,19 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
}
EXPORT_SYMBOL_GPL(clear_node_memory_type);

+void node_memtier_change(int node)
+{
+ struct memory_tier *memtier;
+
+ mutex_lock(&memory_tier_lock);
+ if (clear_node_memory_tier(node))
+ establish_demotion_targets();
+ memtier = set_node_memory_tier(node);
+ if (!IS_ERR(memtier))
+ establish_demotion_targets();
+ mutex_unlock(&memory_tier_lock);
+}
+
static void dump_hmem_attrs(struct node_hmem_attrs *attrs, const char *prefix)
{
pr_info(
--
2.25.1

2023-12-15 05:05:26

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] Node migration between memory tiers

<[email protected]> writes:

> From: Srinivasulu Thanneeru <[email protected]>
>
> The memory tiers feature allows nodes with similar memory types
> or performance characteristics to be grouped together in a
> memory tier. However, there is currently no provision for
> moving a node from one tier to another on demand.
>
> This patch series aims to support node migration between tiers
> on demand by sysadmin/root user using the provided sysfs for
> node migration.
>
> To migrate a node to a tier, the corresponding node’s sysfs
> memtier_override is written with target tier id.
>
> Example: Move node2 to memory tier2 from its default tier(i.e 4)
>
> 1. To check current memtier of node2
> $cat /sys/devices/system/node/node2/memtier_override
> memory_tier4
>
> 2. To migrate node2 to memory_tier2
> $echo 2 > /sys/devices/system/node/node2/memtier_override
> $cat /sys/devices/system/node/node2/memtier_override
> memory_tier2
>
> Usecases:
>
> 1. Useful to move cxl nodes to the right tiers from userspace, when
> the hardware fails to assign the tiers correctly based on
> memorytypes.
>
> On some platforms we have observed cxl memory being assigned to
> the same tier as DDR memory. This is arguably a system firmware
> bug, but it is true that tiers represent *ranges* of performance
> and we believe it's important for the system operator to have
> the ability to override bad firmware or OS decisions about tier
> assignment as a fail-safe against potential bad outcomes.
>
> 2. Useful if we want interleave weights to be applied on memory tiers
> instead of nodes.
> In a previous thread, Huang Ying <[email protected]> thought
> this feature might be useful to overcome limitations of systems
> where nodes with different bandwidth characteristics are grouped
> in a single tier.
> https://lore.kernel.org/lkml/[email protected]/
>
> =============
> Version Notes:
>
> V2 : Changed interface to memtier_override from adistance_offset.
> memtier_override was recommended by
> 1. John Groves <[email protected]>
> 2. Ravi Shankar <[email protected]>
> 3. Brice Goglin <[email protected]>

It appears that you ignored my comments for V1 as follows ...

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

--
Best Regards,
Huang, Ying

> V1 : Introduced adistance_offset sysfs.
>
> =============
>
> Srinivasulu Thanneeru (2):
> base/node: Add sysfs for memtier_override
> memory tier: Support node migration between tiers
>
> Documentation/ABI/stable/sysfs-devices-node | 7 ++
> drivers/base/node.c | 47 ++++++++++++
> include/linux/memory-tiers.h | 11 +++
> include/linux/node.h | 11 +++
> mm/memory-tiers.c | 85 ++++++++++++---------
> 5 files changed, 125 insertions(+), 36 deletions(-)

2023-12-15 17:49:41

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> <[email protected]> writes:
>
> > =============
> > Version Notes:
> >
> > V2 : Changed interface to memtier_override from adistance_offset.
> > memtier_override was recommended by
> > 1. John Groves <[email protected]>
> > 2. Ravi Shankar <[email protected]>
> > 3. Brice Goglin <[email protected]>
>
> It appears that you ignored my comments for V1 as follows ...
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>

Not speaking for the group, just chiming in because i'd discussed it
with them.

"Memory Type" is a bit nebulous. Is a Micron Type-3 with performance X
and an SK Hynix Type-3 with performance Y a "Different type", or are
they the "Same Type" given that they're both Type 3 backed by some form
of DDR? Is socket placement of those devices relevant for determining
"Type"? Is whether they are behind a switch relevant for determining
"Type"? "Type" is frustrating when everything we're talking about
managing is "Type-3" with difference performance.

A concrete example:
To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
exactly the same as an standard SLD. I may want to have some
combination of local memory expansion devices on the majority of my
expansion slots, but reserve 1 slot on each socket for a connection to
the MH-SLD. As of right now: There is no good way to differentiate the
devices in terms of "Type" - and even if you had that, the tiering
system would still lump them together.

Similarly, an initial run of switches may or may not allow enumeration
of devices behind it (depends on the configuration), so you may end up
with a static numa node that "looks like" another SLD - despite it being
some definition of "GFAM". Do number of hops matter in determining
"Type"?

So I really don't think "Type" is useful for determining tier placement.

As of right now, the system lumps DRAM nodes as one tier, and pretty
much everything else as "the other tier". To me, this patch set is an
initial pass meant to allow user-control over tier composition while
the internal mechanism is sussed out and the environment develops.

In general, a release valve that lets you redefine tiers is very welcome
for testing and validation of different setups while the industry evolves.

Just my two cents.

~Gregory

> --
> Best Regards,
> Huang, Ying
>

2023-12-18 05:57:51

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Gregory Price <[email protected]> writes:

> On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> <[email protected]> writes:
>>
>> > =============
>> > Version Notes:
>> >
>> > V2 : Changed interface to memtier_override from adistance_offset.
>> > memtier_override was recommended by
>> > 1. John Groves <[email protected]>
>> > 2. Ravi Shankar <[email protected]>
>> > 3. Brice Goglin <[email protected]>
>>
>> It appears that you ignored my comments for V1 as follows ...
>>
>> https://lore.kernel.org/lkml/[email protected]/
>> https://lore.kernel.org/lkml/[email protected]/
>> https://lore.kernel.org/lkml/[email protected]/
>>
>
> Not speaking for the group, just chiming in because i'd discussed it
> with them.
>
> "Memory Type" is a bit nebulous. Is a Micron Type-3 with performance X
> and an SK Hynix Type-3 with performance Y a "Different type", or are
> they the "Same Type" given that they're both Type 3 backed by some form
> of DDR? Is socket placement of those devices relevant for determining
> "Type"? Is whether they are behind a switch relevant for determining
> "Type"? "Type" is frustrating when everything we're talking about
> managing is "Type-3" with difference performance.
>
> A concrete example:
> To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
> exactly the same as an standard SLD. I may want to have some
> combination of local memory expansion devices on the majority of my
> expansion slots, but reserve 1 slot on each socket for a connection to
> the MH-SLD. As of right now: There is no good way to differentiate the
> devices in terms of "Type" - and even if you had that, the tiering
> system would still lump them together.
>
> Similarly, an initial run of switches may or may not allow enumeration
> of devices behind it (depends on the configuration), so you may end up
> with a static numa node that "looks like" another SLD - despite it being
> some definition of "GFAM". Do number of hops matter in determining
> "Type"?

In the original design, the memory devices of same memory type are
managed by the same device driver, linked with system in same way
(including switches), built with same media. So, the performance is
same too. And, same as memory tiers, memory types are orthogonal to
sockets. Do you think the definition itself is clear enough?

I admit "memory type" is a confusing name. Do you have some better
suggestion?

> So I really don't think "Type" is useful for determining tier placement.
>
> As of right now, the system lumps DRAM nodes as one tier, and pretty
> much everything else as "the other tier". To me, this patch set is an
> initial pass meant to allow user-control over tier composition while
> the internal mechanism is sussed out and the environment develops.

The patchset to identify the performance of memory devices and put them
in proper "memory types" and memory tiers via HMAT has been merged by
v6.7-rc1.

07a8bdd4120c (memory tiering: add abstract distance calculation algorithms management, 2023-09-26)
d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(), 2023-09-26)
3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-26)
6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general interface, 2023-09-26)

> In general, a release valve that lets you redefine tiers is very welcome
> for testing and validation of different setups while the industry evolves.
>
> Just my two cents.

--
Best Regards,
Huang, Ying

2023-12-18 08:56:18

by Srinivasulu Thanneeru

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers


Micron Confidential



Micron Confidential
+AF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8-
From: Huang, Ying +ADw-ying.huang+AEA-intel.com+AD4-
Sent: Friday, December 15, 2023 10:32 AM
To: Srinivasulu Opensrc
Cc: linux-cxl+AEA-vger.kernel.org+ADs- linux-mm+AEA-kvack.org+ADs- Srinivasulu Thanneeru+ADs- aneesh.kumar+AEA-linux.ibm.com+ADs- dan.j.williams+AEA-intel.com+ADs- gregory.price+ADs- mhocko+AEA-suse.com+ADs- tj+AEA-kernel.org+ADs- john+AEA-jagalactic.com+ADs- Eishan Mirakhur+ADs- Vinicius Tavares Petrucci+ADs- Ravis OpenSrc+ADs- Jonathan.Cameron+AEA-huawei.com+ADs- linux-kernel+AEA-vger.kernel.org
Subject: +AFs-EXT+AF0- Re: +AFs-RFC PATCH v2 0/2+AF0- Node migration between memory tiers

CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless you recognize the sender and were expecting this message.


+ADw-sthanneeru.opensrc+AEA-micron.com+AD4- writes:

+AD4- From: Srinivasulu Thanneeru +ADw-sthanneeru.opensrc+AEA-micron.com+AD4-
+AD4-
+AD4- The memory tiers feature allows nodes with similar memory types
+AD4- or performance characteristics to be grouped together in a
+AD4- memory tier. However, there is currently no provision for
+AD4- moving a node from one tier to another on demand.
+AD4-
+AD4- This patch series aims to support node migration between tiers
+AD4- on demand by sysadmin/root user using the provided sysfs for
+AD4- node migration.
+AD4-
+AD4- To migrate a node to a tier, the corresponding node+IBk-s sysfs
+AD4- memtier+AF8-override is written with target tier id.
+AD4-
+AD4- Example: Move node2 to memory tier2 from its default tier(i.e 4)
+AD4-
+AD4- 1. To check current memtier of node2
+AD4- +ACQ-cat /sys/devices/system/node/node2/memtier+AF8-override
+AD4- memory+AF8-tier4
+AD4-
+AD4- 2. To migrate node2 to memory+AF8-tier2
+AD4- +ACQ-echo 2 +AD4- /sys/devices/system/node/node2/memtier+AF8-override
+AD4- +ACQ-cat /sys/devices/system/node/node2/memtier+AF8-override
+AD4- memory+AF8-tier2
+AD4-
+AD4- Usecases:
+AD4-
+AD4- 1. Useful to move cxl nodes to the right tiers from userspace, when
+AD4- the hardware fails to assign the tiers correctly based on
+AD4- memorytypes.
+AD4-
+AD4- On some platforms we have observed cxl memory being assigned to
+AD4- the same tier as DDR memory. This is arguably a system firmware
+AD4- bug, but it is true that tiers represent +ACo-ranges+ACo- of performance
+AD4- and we believe it's important for the system operator to have
+AD4- the ability to override bad firmware or OS decisions about tier
+AD4- assignment as a fail-safe against potential bad outcomes.
+AD4-
+AD4- 2. Useful if we want interleave weights to be applied on memory tiers
+AD4- instead of nodes.
+AD4- In a previous thread, Huang Ying +ADw-ying.huang+AEA-intel.com+AD4- thought
+AD4- this feature might be useful to overcome limitations of systems
+AD4- where nodes with different bandwidth characteristics are grouped
+AD4- in a single tier.
+AD4- https://lore.kernel.org/lkml/87a5rw1wu8.fsf+AEA-yhuang6-desk2.ccr.corp.intel.com/
+AD4-
+AD4- +AD0APQA9AD0APQA9AD0APQA9AD0APQA9AD0-
+AD4- Version Notes:
+AD4-
+AD4- V2 : Changed interface to memtier+AF8-override from adistance+AF8-offset.
+AD4- memtier+AF8-override was recommended by
+AD4- 1. John Groves +ADw-john+AEA-jagalactic.com+AD4-
+AD4- 2. Ravi Shankar +ADw-ravis.opensrc+AEA-micron.com+AD4-
+AD4- 3. Brice Goglin +ADw-Brice.Goglin+AEA-inria.fr+AD4-

It appears that you ignored my comments for V1 as follows ...

https://lore.kernel.org/lkml/87o7f62vur.fsf+AEA-yhuang6-desk2.ccr.corp.intel.com/

Thank you Huang, Ying for pointing to this.

https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live+ACU-20In+ACU-20a+ACU-20World+ACU-20With+ACU-20Multiple+ACU-20Memory+ACU-20Types.pdf

In the presentation above, the adistance+AF8-offsets are per memtype.
We believe that adistance+AF8-offset per node is more suitable and flexible
since we can change it per node. If we keep adistance+AF8-offset per memtype,
then we cannot change it for a specific node of a given memtype.


https://lore.kernel.org/lkml/87jzpt2ft5.fsf+AEA-yhuang6-desk2.ccr.corp.intel.com/

I guess that you need to move all NUMA nodes with same performance
metrics together? If so, That is why we previously proposed to place
the knob in +ACI-memory+AF8-type+ACI-? (From: Huang, Ying )

Yes, memory+AF8-type would be group the related memories togather as single tier.
We should also have a flexibility to move nodes between tiers, to address the issues described in usecases above.

https://lore.kernel.org/lkml/87a5qp2et0.fsf+AEA-yhuang6-desk2.ccr.corp.intel.com/

This patch provides a way to move a node to the correct tier.
We observed in test setups where DRAM and CXL are put under the same
tier (memory+AF8-tier4).
By using this patch, we can move the CXL node away from the DRAM-linked
tier4 and put it in the desired tier.

Regards,
Srini

--
Best Regards,
Huang, Ying

+AD4- V1 : Introduced adistance+AF8-offset sysfs.
+AD4-
+AD4- +AD0APQA9AD0APQA9AD0APQA9AD0APQA9AD0-
+AD4-
+AD4- Srinivasulu Thanneeru (2):
+AD4- base/node: Add sysfs for memtier+AF8-override
+AD4- memory tier: Support node migration between tiers
+AD4-
+AD4- Documentation/ABI/stable/sysfs-devices-node +AHw- 7 +-+-
+AD4- drivers/base/node.c +AHw- 47 +-+-+-+-+-+-+-+-+-+-+-+-
+AD4- include/linux/memory-tiers.h +AHw- 11 +-+-+-
+AD4- include/linux/node.h +AHw- 11 +-+-+-
+AD4- mm/memory-tiers.c +AHw- 85 +-+-+-+-+-+-+-+-+-+-+-+----------
+AD4- 5 files changed, 125 insertions(+-), 36 deletions(-)

2023-12-19 03:59:44

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Hi, Srinivasulu,

Please use a email client that works for kernel patch review. Your
email is hard to read. It's hard to identify which part is your text
and which part is my text. Please refer to,

https://www.kernel.org/doc/html/latest/process/email-clients.html

Or something similar, for example,

https://elinux.org/Mail_client_tips

Srinivasulu Thanneeru <[email protected]> writes:

> Micron Confidential
>
>
>
> Micron Confidential
> ________________________________________
> From: Huang, Ying <[email protected]>
> Sent: Friday, December 15, 2023 10:32 AM
> To: Srinivasulu Opensrc
> Cc: [email protected]; [email protected]; Srinivasulu
> Thanneeru; [email protected]; [email protected];
> gregory.price; [email protected]; [email protected]; [email protected];
> Eishan Mirakhur; Vinicius Tavares Petrucci; Ravis OpenSrc;
> [email protected]; [email protected]
> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
>
> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless you recognize the sender and were expecting this message.
>
>
> <[email protected]> writes:
>
>> From: Srinivasulu Thanneeru <[email protected]>
>>
>> The memory tiers feature allows nodes with similar memory types
>> or performance characteristics to be grouped together in a
>> memory tier. However, there is currently no provision for
>> moving a node from one tier to another on demand.
>>
>> This patch series aims to support node migration between tiers
>> on demand by sysadmin/root user using the provided sysfs for
>> node migration.
>>
>> To migrate a node to a tier, the corresponding node’s sysfs
>> memtier_override is written with target tier id.
>>
>> Example: Move node2 to memory tier2 from its default tier(i.e 4)
>>
>> 1. To check current memtier of node2
>> $cat /sys/devices/system/node/node2/memtier_override
>> memory_tier4
>>
>> 2. To migrate node2 to memory_tier2
>> $echo 2 > /sys/devices/system/node/node2/memtier_override
>> $cat /sys/devices/system/node/node2/memtier_override
>> memory_tier2
>>
>> Usecases:
>>
>> 1. Useful to move cxl nodes to the right tiers from userspace, when
>> the hardware fails to assign the tiers correctly based on
>> memorytypes.
>>
>> On some platforms we have observed cxl memory being assigned to
>> the same tier as DDR memory. This is arguably a system firmware
>> bug, but it is true that tiers represent *ranges* of performance
>> and we believe it's important for the system operator to have
>> the ability to override bad firmware or OS decisions about tier
>> assignment as a fail-safe against potential bad outcomes.
>>
>> 2. Useful if we want interleave weights to be applied on memory tiers
>> instead of nodes.
>> In a previous thread, Huang Ying <[email protected]> thought
>> this feature might be useful to overcome limitations of systems
>> where nodes with different bandwidth characteristics are grouped
>> in a single tier.
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> =============
>> Version Notes:
>>
>> V2 : Changed interface to memtier_override from adistance_offset.
>> memtier_override was recommended by
>> 1. John Groves <[email protected]>
>> 2. Ravi Shankar <[email protected]>
>> 3. Brice Goglin <[email protected]>
>
> It appears that you ignored my comments for V1 as follows ...
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Thank you Huang, Ying for pointing to this.
>
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that adistance_offset per node is more suitable and flexible
> since we can change it per node. If we keep adistance_offset per memtype,
> then we cannot change it for a specific node of a given memtype.

Why do you need to change it for a specific node? Why do you needn't to
chagne it for all nodes of a given memtype?

> https://lore.kernel.org/lkml/[email protected]/
>
> I guess that you need to move all NUMA nodes with same performance
> metrics together? If so, That is why we previously proposed to place
> the knob in "memory_type"? (From: Huang, Ying )
>
> Yes, memory_type would be group the related memories togather as single tier.
> We should also have a flexibility to move nodes between tiers, to address the issues described in usecases above.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> This patch provides a way to move a node to the correct tier.
> We observed in test setups where DRAM and CXL are put under the same
> tier (memory_tier4).
> By using this patch, we can move the CXL node away from the DRAM-linked
> tier4 and put it in the desired tier.

Good! Can you give more details? So I can resend the patch with your
supporting data.

--
Best Regards,
Huang, Ying

> Regards,
> Srini
>
> --
> Best Regards,
> Huang, Ying
>
>> V1 : Introduced adistance_offset sysfs.
>>
>> =============
>>
>> Srinivasulu Thanneeru (2):
>> base/node: Add sysfs for memtier_override
>> memory tier: Support node migration between tiers
>>
>> Documentation/ABI/stable/sysfs-devices-node | 7 ++
>> drivers/base/node.c | 47 ++++++++++++
>> include/linux/memory-tiers.h | 11 +++
>> include/linux/node.h | 11 +++
>> mm/memory-tiers.c | 85 ++++++++++++---------
>> 5 files changed, 125 insertions(+), 36 deletions(-)

2024-01-03 05:26:52

by Srinivasulu Thanneeru

[permalink] [raw]
Subject: RE: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Micron Confidential

Hi Huang, Ying,

My apologies for wrong mail reply format, my mail client settings got changed on my PC.
Please find comments bellow inline.

Regards,
Srini


Micron Confidential
+AD4- -----Original Message-----
+AD4- From: Huang, Ying +ADw-ying.huang+AEA-intel.com+AD4-
+AD4- Sent: Monday, December 18, 2023 11:26 AM
+AD4- To: gregory.price +ADw-gregory.price+AEA-memverge.com+AD4-
+AD4- Cc: Srinivasulu Opensrc +ADw-sthanneeru.opensrc+AEA-micron.com+AD4AOw- linux-
+AD4- cxl+AEA-vger.kernel.org+ADs- linux-mm+AEA-kvack.org+ADs- Srinivasulu Thanneeru
+AD4- +ADw-sthanneeru+AEA-micron.com+AD4AOw- aneesh.kumar+AEA-linux.ibm.com+ADs-
+AD4- dan.j.williams+AEA-intel.com+ADs- mhocko+AEA-suse.com+ADs- tj+AEA-kernel.org+ADs-
+AD4- john+AEA-jagalactic.com+ADs- Eishan Mirakhur +ADw-emirakhur+AEA-micron.com+AD4AOw- Vinicius
+AD4- Tavares Petrucci +ADw-vtavarespetr+AEA-micron.com+AD4AOw- Ravis OpenSrc
+AD4- +ADw-Ravis.OpenSrc+AEA-micron.com+AD4AOw- Jonathan.Cameron+AEA-huawei.com+ADs- linux-
+AD4- kernel+AEA-vger.kernel.org+ADs- Johannes Weiner +ADw-hannes+AEA-cmpxchg.org+AD4AOw- Wei Xu
+AD4- +ADw-weixugc+AEA-google.com+AD4-
+AD4- Subject: +AFs-EXT+AF0- Re: +AFs-RFC PATCH v2 0/2+AF0- Node migration between memory tiers
+AD4-
+AD4- CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
+AD4- you recognize the sender and were expecting this message.
+AD4-
+AD4-
+AD4- Gregory Price +ADw-gregory.price+AEA-memverge.com+AD4- writes:
+AD4-
+AD4- +AD4- On Fri, Dec 15, 2023 at 01:02:59PM +-0800, Huang, Ying wrote:
+AD4- +AD4APg- +ADw-sthanneeru.opensrc+AEA-micron.com+AD4- writes:
+AD4- +AD4APg-
+AD4- +AD4APg- +AD4- +AD0APQA9AD0APQA9AD0APQA9AD0APQA9AD0-
+AD4- +AD4APg- +AD4- Version Notes:
+AD4- +AD4APg- +AD4-
+AD4- +AD4APg- +AD4- V2 : Changed interface to memtier+AF8-override from adistance+AF8-offset.
+AD4- +AD4APg- +AD4- memtier+AF8-override was recommended by
+AD4- +AD4APg- +AD4- 1. John Groves +ADw-john+AEA-jagalactic.com+AD4-
+AD4- +AD4APg- +AD4- 2. Ravi Shankar +ADw-ravis.opensrc+AEA-micron.com+AD4-
+AD4- +AD4APg- +AD4- 3. Brice Goglin +ADw-Brice.Goglin+AEA-inria.fr+AD4-
+AD4- +AD4APg-
+AD4- +AD4APg- It appears that you ignored my comments for V1 as follows ...
+AD4- +AD4APg-
+AD4- +AD4APg-
+AD4- https://lore.k/
+AD4- ernel.org+ACU-2Flkml+ACU-2F87o7f62vur.fsf+ACU-40yhuang6-
+AD4- desk2.ccr.corp.intel.com+ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com
+AD4- +ACU-7C5e614e5f028342b6b59c08dbff8e3e37+ACU-7Cf38a5ecd28134862b11bac1d56
+AD4- 3c806f+ACU-7C0+ACU-7C0+ACU-7C638384758666895965+ACU-7CUnknown+ACU-7CTWFpbGZsb3d
+AD4- 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3
+AD4- D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-OpMkYCar+ACU-2Fv8uHb7AvXbmaNltnXeTvcNUTi
+AD4- bLhwV12Fg+ACU-3D+ACY-reserved+AD0-0

Thank you, Huang, Ying for pointing to this.
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live+ACU-20In+ACU-20a+ACU-20World+ACU-20With+ACU-20Multiple+ACU-20Memory+ACU-20Types.pdf

In the presentation above, the adistance+AF8-offsets are per memtype.
We believe that adistance+AF8-offset per node is more suitable and flexible.
since we can change it per node. If we keep adistance+AF8-offset per memtype,
then we cannot change it for a specific node of a given memtype.

+AD4- +AD4APg-
+AD4- https://lore.k/
+AD4- ernel.org+ACU-2Flkml+ACU-2F87jzpt2ft5.fsf+ACU-40yhuang6-
+AD4- desk2.ccr.corp.intel.com+ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com
+AD4- +ACU-7C5e614e5f028342b6b59c08dbff8e3e37+ACU-7Cf38a5ecd28134862b11bac1d56
+AD4- 3c806f+ACU-7C0+ACU-7C0+ACU-7C638384758666895965+ACU-7CUnknown+ACU-7CTWFpbGZsb3d
+AD4- 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3
+AD4- D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-O0+ACU-2B6T+ACU-2FgU0TicCEYBac+ACU-2FAyjOLwAeouh
+AD4- D+ACU-2BcMI+ACU-2BflOsI1M+ACU-3D+ACY-reserved+AD0-0

Yes, memory+AF8-type would be grouping the related memories together as single tier.
We should also have a flexibility to move nodes between tiers, to address the issues.
described in use cases above.

+AD4- +AD4APg-
+AD4- https://lore.k/
+AD4- ernel.org+ACU-2Flkml+ACU-2F87a5qp2et0.fsf+ACU-40yhuang6-
+AD4- desk2.ccr.corp.intel.com+ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com
+AD4- +ACU-7C5e614e5f028342b6b59c08dbff8e3e37+ACU-7Cf38a5ecd28134862b11bac1d56
+AD4- 3c806f+ACU-7C0+ACU-7C0+ACU-7C638384758666895965+ACU-7CUnknown+ACU-7CTWFpbGZsb3d
+AD4- 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3
+AD4- D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-W+ACU-2FWcAD4b9od+ACU-2BS0zIak+ACU-2Bv5hkjFG1Xcf
+AD4- 6p8q3xwmspUiI+ACU-3D+ACY-reserved+AD0-0

This patch provides a way to move a node to the correct tier.
We observed in test setups where DRAM and CXL are put under the same.
tier (memory+AF8-tier4).
By using this patch, we can move the CXL node away from the DRAM-linked (memory+AF8-tier4)
and put it in the desired tier.

+AD4- +AD4APg-
+AD4- +AD4-
+AD4- +AD4- Not speaking for the group, just chiming in because i'd discussed it
+AD4- +AD4- with them.
+AD4- +AD4-
+AD4- +AD4- +ACI-Memory Type+ACI- is a bit nebulous. Is a Micron Type-3 with performance X
+AD4- +AD4- and an SK Hynix Type-3 with performance Y a +ACI-Different type+ACI-, or are
+AD4- +AD4- they the +ACI-Same Type+ACI- given that they're both Type 3 backed by some form
+AD4- +AD4- of DDR? Is socket placement of those devices relevant for determining
+AD4- +AD4- +ACI-Type+ACI-? Is whether they are behind a switch relevant for determining
+AD4- +AD4- +ACI-Type+ACI-? +ACI-Type+ACI- is frustrating when everything we're talking about
+AD4- +AD4- managing is +ACI-Type-3+ACI- with difference performance.
+AD4- +AD4-
+AD4- +AD4- A concrete example:
+AD4- +AD4- To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
+AD4- +AD4- exactly the same as an standard SLD. I may want to have some
+AD4- +AD4- combination of local memory expansion devices on the majority of my
+AD4- +AD4- expansion slots, but reserve 1 slot on each socket for a connection to
+AD4- +AD4- the MH-SLD. As of right now: There is no good way to differentiate the
+AD4- +AD4- devices in terms of +ACI-Type+ACI- - and even if you had that, the tiering
+AD4- +AD4- system would still lump them together.
+AD4- +AD4-
+AD4- +AD4- Similarly, an initial run of switches may or may not allow enumeration
+AD4- +AD4- of devices behind it (depends on the configuration), so you may end up
+AD4- +AD4- with a static numa node that +ACI-looks like+ACI- another SLD - despite it being
+AD4- +AD4- some definition of +ACI-GFAM+ACI-. Do number of hops matter in determining
+AD4- +AD4- +ACI-Type+ACI-?
+AD4-
+AD4- In the original design, the memory devices of same memory type are
+AD4- managed by the same device driver, linked with system in same way
+AD4- (including switches), built with same media. So, the performance is
+AD4- same too. And, same as memory tiers, memory types are orthogonal to
+AD4- sockets. Do you think the definition itself is clear enough?
+AD4-
+AD4- I admit +ACI-memory type+ACI- is a confusing name. Do you have some better
+AD4- suggestion?
+AD4-
+AD4- +AD4- So I really don't think +ACI-Type+ACI- is useful for determining tier placement.
+AD4- +AD4-
+AD4- +AD4- As of right now, the system lumps DRAM nodes as one tier, and pretty
+AD4- +AD4- much everything else as +ACI-the other tier+ACI-. To me, this patch set is an
+AD4- +AD4- initial pass meant to allow user-control over tier composition while
+AD4- +AD4- the internal mechanism is sussed out and the environment develops.
+AD4-
+AD4- The patchset to identify the performance of memory devices and put them
+AD4- in proper +ACI-memory types+ACI- and memory tiers via HMAT has been merged by
+AD4- v6.7-rc1.
+AD4-
+AD4- 07a8bdd4120c (memory tiering: add abstract distance calculation
+AD4- algorithms management, 2023-09-26)
+AD4- d0376aac59a1 (acpi, hmat: refactor hmat+AF8-register+AF8-target+AF8-initiators(),
+AD4- 2023-09-26)
+AD4- 3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-
+AD4- 26)
+AD4- 6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general
+AD4- interface, 2023-09-26)
+AD4-
+AD4- +AD4- In general, a release valve that lets you redefine tiers is very welcome
+AD4- +AD4- for testing and validation of different setups while the industry evolves.
+AD4- +AD4-
+AD4- +AD4- Just my two cents.
+AD4-
+AD4- --
+AD4- Best Regards,
+AD4- Huang, Ying

2024-01-03 06:10:07

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Srinivasulu Thanneeru <[email protected]> writes:

> Micron Confidential
>
> Hi Huang, Ying,
>
> My apologies for wrong mail reply format, my mail client settings got changed on my PC.
> Please find comments bellow inline.
>
> Regards,
> Srini
>
>
> Micron Confidential
>> -----Original Message-----
>> From: Huang, Ying <[email protected]>
>> Sent: Monday, December 18, 2023 11:26 AM
>> To: gregory.price <[email protected]>
>> Cc: Srinivasulu Opensrc <[email protected]>; linux-
>> [email protected]; [email protected]; Srinivasulu Thanneeru
>> <[email protected]>; [email protected];
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Eishan Mirakhur <[email protected]>; Vinicius
>> Tavares Petrucci <[email protected]>; Ravis OpenSrc
>> <[email protected]>; [email protected]; linux-
>> [email protected]; Johannes Weiner <[email protected]>; Wei Xu
>> <[email protected]>
>> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
>>
>> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> you recognize the sender and were expecting this message.
>>
>>
>> Gregory Price <[email protected]> writes:
>>
>> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> <[email protected]> writes:
>> >>
>> >> > =============
>> >> > Version Notes:
>> >> >
>> >> > V2 : Changed interface to memtier_override from adistance_offset.
>> >> > memtier_override was recommended by
>> >> > 1. John Groves <[email protected]>
>> >> > 2. Ravi Shankar <[email protected]>
>> >> > 3. Brice Goglin <[email protected]>
>> >>
>> >> It appears that you ignored my comments for V1 as follows ...
>> >>
>> >>
>> https://lore.k/
>> ernel.org%2Flkml%2F87o7f62vur.fsf%40yhuang6-
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> D%7C3000%7C%7C%7C&sdata=OpMkYCar%2Fv8uHb7AvXbmaNltnXeTvcNUTi
>> bLhwV12Fg%3D&reserved=0
>
> Thank you, Huang, Ying for pointing to this.
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that adistance_offset per node is more suitable and flexible.
> since we can change it per node. If we keep adistance_offset per memtype,
> then we cannot change it for a specific node of a given memtype.
>
>> >>
>> https://lore.k/
>> ernel.org%2Flkml%2F87jzpt2ft5.fsf%40yhuang6-
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> D%7C3000%7C%7C%7C&sdata=O0%2B6T%2FgU0TicCEYBac%2FAyjOLwAeouh
>> D%2BcMI%2BflOsI1M%3D&reserved=0
>
> Yes, memory_type would be grouping the related memories together as single tier.
> We should also have a flexibility to move nodes between tiers, to address the issues.
> described in use cases above.

We don't pursue absolute flexibility. We add necessary flexibility
only. Why do you need this kind of flexibility? Can you provide some
use cases where memory_type based "adistance_offset" doesn't work?

--
Best Regards,
Huang, Ying

2024-01-03 07:56:58

by Srinivasulu Thanneeru

[permalink] [raw]
Subject: RE: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers


Micron Confidential



Micron Confidential
+AD4- -----Original Message-----
+AD4- From: Huang, Ying +ADw-ying.huang+AEA-intel.com+AD4-
+AD4- Sent: Wednesday, January 3, 2024 11:38 AM
+AD4- To: Srinivasulu Thanneeru +ADw-sthanneeru+AEA-micron.com+AD4-
+AD4- Cc: gregory.price +ADw-gregory.price+AEA-memverge.com+AD4AOw- Srinivasulu Opensrc
+AD4- +ADw-sthanneeru.opensrc+AEA-micron.com+AD4AOw- linux-cxl+AEA-vger.kernel.org+ADs- linux-
+AD4- mm+AEA-kvack.org+ADs- aneesh.kumar+AEA-linux.ibm.com+ADs- dan.j.williams+AEA-intel.com+ADs-
+AD4- mhocko+AEA-suse.com+ADs- tj+AEA-kernel.org+ADs- john+AEA-jagalactic.com+ADs- Eishan Mirakhur
+AD4- +ADw-emirakhur+AEA-micron.com+AD4AOw- Vinicius Tavares Petrucci
+AD4- +ADw-vtavarespetr+AEA-micron.com+AD4AOw- Ravis OpenSrc +ADw-Ravis.OpenSrc+AEA-micron.com+AD4AOw-
+AD4- Jonathan.Cameron+AEA-huawei.com+ADs- linux-kernel+AEA-vger.kernel.org+ADs- Johannes
+AD4- Weiner +ADw-hannes+AEA-cmpxchg.org+AD4AOw- Wei Xu +ADw-weixugc+AEA-google.com+AD4-
+AD4- Subject: Re: +AFs-EXT+AF0- Re: +AFs-RFC PATCH v2 0/2+AF0- Node migration between memory
+AD4- tiers
+AD4-
+AD4- CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
+AD4- you recognize the sender and were expecting this message.
+AD4-
+AD4-
+AD4- Srinivasulu Thanneeru +ADw-sthanneeru+AEA-micron.com+AD4- writes:
+AD4-
+AD4- +AD4- Micron Confidential
+AD4- +AD4-
+AD4- +AD4- Hi Huang, Ying,
+AD4- +AD4-
+AD4- +AD4- My apologies for wrong mail reply format, my mail client settings got
+AD4- changed on my PC.
+AD4- +AD4- Please find comments bellow inline.
+AD4- +AD4-
+AD4- +AD4- Regards,
+AD4- +AD4- Srini
+AD4- +AD4-
+AD4- +AD4-
+AD4- +AD4- Micron Confidential
+AD4- +AD4APg- -----Original Message-----
+AD4- +AD4APg- From: Huang, Ying +ADw-ying.huang+AEA-intel.com+AD4-
+AD4- +AD4APg- Sent: Monday, December 18, 2023 11:26 AM
+AD4- +AD4APg- To: gregory.price +ADw-gregory.price+AEA-memverge.com+AD4-
+AD4- +AD4APg- Cc: Srinivasulu Opensrc +ADw-sthanneeru.opensrc+AEA-micron.com+AD4AOw- linux-
+AD4- +AD4APg- cxl+AEA-vger.kernel.org+ADs- linux-mm+AEA-kvack.org+ADs- Srinivasulu Thanneeru
+AD4- +AD4APg- +ADw-sthanneeru+AEA-micron.com+AD4AOw- aneesh.kumar+AEA-linux.ibm.com+ADs-
+AD4- +AD4APg- dan.j.williams+AEA-intel.com+ADs- mhocko+AEA-suse.com+ADs- tj+AEA-kernel.org+ADs-
+AD4- +AD4APg- john+AEA-jagalactic.com+ADs- Eishan Mirakhur +ADw-emirakhur+AEA-micron.com+AD4AOw- Vinicius
+AD4- +AD4APg- Tavares Petrucci +ADw-vtavarespetr+AEA-micron.com+AD4AOw- Ravis OpenSrc
+AD4- +AD4APg- +ADw-Ravis.OpenSrc+AEA-micron.com+AD4AOw- Jonathan.Cameron+AEA-huawei.com+ADs- linux-
+AD4- +AD4APg- kernel+AEA-vger.kernel.org+ADs- Johannes Weiner +ADw-hannes+AEA-cmpxchg.org+AD4AOw- Wei Xu
+AD4- +AD4APg- +ADw-weixugc+AEA-google.com+AD4-
+AD4- +AD4APg- Subject: +AFs-EXT+AF0- Re: +AFs-RFC PATCH v2 0/2+AF0- Node migration between memory
+AD4- tiers
+AD4- +AD4APg-
+AD4- +AD4APg- CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
+AD4- +AD4APg- you recognize the sender and were expecting this message.
+AD4- +AD4APg-
+AD4- +AD4APg-
+AD4- +AD4APg- Gregory Price +ADw-gregory.price+AEA-memverge.com+AD4- writes:
+AD4- +AD4APg-
+AD4- +AD4APg- +AD4- On Fri, Dec 15, 2023 at 01:02:59PM +-0800, Huang, Ying wrote:
+AD4- +AD4APg- +AD4APg- +ADw-sthanneeru.opensrc+AEA-micron.com+AD4- writes:
+AD4- +AD4APg- +AD4APg-
+AD4- +AD4APg- +AD4APg- +AD4- +AD0APQA9AD0APQA9AD0APQA9AD0APQA9AD0-
+AD4- +AD4APg- +AD4APg- +AD4- Version Notes:
+AD4- +AD4APg- +AD4APg- +AD4-
+AD4- +AD4APg- +AD4APg- +AD4- V2 : Changed interface to memtier+AF8-override from adistance+AF8-offset.
+AD4- +AD4APg- +AD4APg- +AD4- memtier+AF8-override was recommended by
+AD4- +AD4APg- +AD4APg- +AD4- 1. John Groves +ADw-john+AEA-jagalactic.com+AD4-
+AD4- +AD4APg- +AD4APg- +AD4- 2. Ravi Shankar +ADw-ravis.opensrc+AEA-micron.com+AD4-
+AD4- +AD4APg- +AD4APg- +AD4- 3. Brice Goglin +ADw-Brice.Goglin+AEA-inria.fr+AD4-
+AD4- +AD4APg- +AD4APg-
+AD4- +AD4APg- +AD4APg- It appears that you ignored my comments for V1 as follows ...
+AD4- +AD4APg- +AD4APg-
+AD4- +AD4APg- +AD4APg-
+AD4- +AD4APg-
+AD4- https://lore.k/
+AD4- +ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com+ACU-7C3e5d38eb47be463c2
+AD4- 95c08dc0c229d22+ACU-7Cf38a5ecd28134862b11bac1d563c806f+ACU-7C0+ACU-7C0+ACU-7C63
+AD4- 8398590664228240+ACU-7CUnknown+ACU-7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
+AD4- AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C
+AD4- +ACY-sdata+AD0-7fPxb1YYR2tZ0v2FB1vlXnMJFcI+ACU-2Fr9HT2+ACU-2BUD1MNUd+ACU-2FI+ACU-3D+ACY-re
+AD4- served+AD0-0
+AD4- +AD4APg- ernel.org+ACU-2Flkml+ACU-2F87o7f62vur.fsf+ACU-40yhuang6-
+AD4- +AD4APg-
+AD4- desk2.ccr.corp.intel.com+ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com
+AD4- +AD4APg-
+AD4- +ACU-7C5e614e5f028342b6b59c08dbff8e3e37+ACU-7Cf38a5ecd28134862b11bac1d56
+AD4- +AD4APg-
+AD4- 3c806f+ACU-7C0+ACU-7C0+ACU-7C638384758666895965+ACU-7CUnknown+ACU-7CTWFpbGZsb3d
+AD4- +AD4APg-
+AD4- 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3
+AD4- +AD4APg-
+AD4- D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-OpMkYCar+ACU-2Fv8uHb7AvXbmaNltnXeTvcNUTi
+AD4- +AD4APg- bLhwV12Fg+ACU-3D+ACY-reserved+AD0-0
+AD4- +AD4-
+AD4- +AD4- Thank you, Huang, Ying for pointing to this.
+AD4- +AD4-
+AD4- https://lpc.ev/
+AD4- ents+ACU-2Fevent+ACU-2F16+ACU-2Fcontributions+ACU-2F1209+ACU-2Fattachments+ACU-2F1042+ACU-2F1
+AD4- 995+ACU-2FLive+ACU-2520In+ACU-2520a+ACU-2520World+ACU-2520With+ACU-2520Multiple+ACU-2520Me
+AD4- mory+ACU-2520Types.pdf+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com+ACU-7C3e
+AD4- 5d38eb47be463c295c08dc0c229d22+ACU-7Cf38a5ecd28134862b11bac1d563c806
+AD4- f+ACU-7C0+ACU-7C0+ACU-7C638398590664228240+ACU-7CUnknown+ACU-7CTWFpbGZsb3d8eyJW
+AD4- IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3D+ACU-7C3
+AD4- 000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-1fGraxff7+ACU-2F1hNaE0an0xEudSKSUvaF3HgClMkmdC7
+AD4- n8+ACU-3D+ACY-reserved+AD0-0
+AD4- +AD4-
+AD4- +AD4- In the presentation above, the adistance+AF8-offsets are per memtype.
+AD4- +AD4- We believe that adistance+AF8-offset per node is more suitable and flexible.
+AD4- +AD4- since we can change it per node. If we keep adistance+AF8-offset per memtype,
+AD4- +AD4- then we cannot change it for a specific node of a given memtype.
+AD4- +AD4-
+AD4- +AD4APg- +AD4APg-
+AD4- +AD4APg-
+AD4- https://lore.k/
+AD4- +ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com+ACU-7C3e5d38eb47be463c2
+AD4- 95c08dc0c229d22+ACU-7Cf38a5ecd28134862b11bac1d563c806f+ACU-7C0+ACU-7C0+ACU-7C63
+AD4- 8398590664228240+ACU-7CUnknown+ACU-7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
+AD4- AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C
+AD4- +ACY-sdata+AD0-7fPxb1YYR2tZ0v2FB1vlXnMJFcI+ACU-2Fr9HT2+ACU-2BUD1MNUd+ACU-2FI+ACU-3D+ACY-re
+AD4- served+AD0-0
+AD4- +AD4APg- ernel.org+ACU-2Flkml+ACU-2F87jzpt2ft5.fsf+ACU-40yhuang6-
+AD4- +AD4APg-
+AD4- desk2.ccr.corp.intel.com+ACU-2F+ACY-data+AD0-05+ACU-7C02+ACU-7Csthanneeru+ACU-40micron.com
+AD4- +AD4APg-
+AD4- +ACU-7C5e614e5f028342b6b59c08dbff8e3e37+ACU-7Cf38a5ecd28134862b11bac1d56
+AD4- +AD4APg-
+AD4- 3c806f+ACU-7C0+ACU-7C0+ACU-7C638384758666895965+ACU-7CUnknown+ACU-7CTWFpbGZsb3d
+AD4- +AD4APg-
+AD4- 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0+ACU-3
+AD4- +AD4APg-
+AD4- D+ACU-7C3000+ACU-7C+ACU-7C+ACU-7C+ACY-sdata+AD0-O0+ACU-2B6T+ACU-2FgU0TicCEYBac+ACU-2FAyjOLwAeouh
+AD4- +AD4APg- D+ACU-2BcMI+ACU-2BflOsI1M+ACU-3D+ACY-reserved+AD0-0
+AD4- +AD4-
+AD4- +AD4- Yes, memory+AF8-type would be grouping the related memories together as
+AD4- single tier.
+AD4- +AD4- We should also have a flexibility to move nodes between tiers, to address
+AD4- the issues.
+AD4- +AD4- described in use cases above.
+AD4-
+AD4- We don't pursue absolute flexibility. We add necessary flexibility
+AD4- only. Why do you need this kind of flexibility? Can you provide some
+AD4- use cases where memory+AF8-type based +ACI-adistance+AF8-offset+ACI- doesn't work?

- /sys/devices/virtual/memory+AF8-type/memory+AF8-type/ adistance+AF8-offset
memory+AF8-type based +ACI-adistance+AF8-offset will provide a way to move all nodes of same memory+AF8-type (e.g. all cxl nodes)
to different tier.

Whereas /sys/devices/system/node/node2/memtier+AF8-override provide a way migrate a node from one tier to another.
Considering a case where we would like to move two cxl nodes into two different tiers in future.
So, I thought it would be good to have flexibility at node level instead of at memory+AF8-type.

+AD4-
+AD4- --
+AD4- Best Regards,
+AD4- Huang, Ying

2024-01-03 08:32:00

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Srinivasulu Thanneeru <[email protected]> writes:

> Micron Confidential
>
>
>
> Micron Confidential
>> -----Original Message-----
>> From: Huang, Ying <[email protected]>
>> Sent: Wednesday, January 3, 2024 11:38 AM
>> To: Srinivasulu Thanneeru <[email protected]>
>> Cc: gregory.price <[email protected]>; Srinivasulu Opensrc
>> <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; [email protected]; Eishan Mirakhur
>> <[email protected]>; Vinicius Tavares Petrucci
>> <[email protected]>; Ravis OpenSrc <[email protected]>;
>> [email protected]; [email protected]; Johannes
>> Weiner <[email protected]>; Wei Xu <[email protected]>
>> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>>
>> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> you recognize the sender and were expecting this message.
>>
>>
>> Srinivasulu Thanneeru <[email protected]> writes:
>>
>> > Micron Confidential
>> >
>> > Hi Huang, Ying,
>> >
>> > My apologies for wrong mail reply format, my mail client settings got
>> changed on my PC.
>> > Please find comments bellow inline.
>> >
>> > Regards,
>> > Srini
>> >
>> >
>> > Micron Confidential
>> >> -----Original Message-----
>> >> From: Huang, Ying <[email protected]>
>> >> Sent: Monday, December 18, 2023 11:26 AM
>> >> To: gregory.price <[email protected]>
>> >> Cc: Srinivasulu Opensrc <[email protected]>; linux-
>> >> [email protected]; [email protected]; Srinivasulu Thanneeru
>> >> <[email protected]>; [email protected];
>> >> [email protected]; [email protected]; [email protected];
>> >> [email protected]; Eishan Mirakhur <[email protected]>; Vinicius
>> >> Tavares Petrucci <[email protected]>; Ravis OpenSrc
>> >> <[email protected]>; [email protected]; linux-
>> >> [email protected]; Johannes Weiner <[email protected]>; Wei Xu
>> >> <[email protected]>
>> >> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>> >>
>> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> >> you recognize the sender and were expecting this message.
>> >>
>> >>
>> >> Gregory Price <[email protected]> writes:
>> >>
>> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> >> <[email protected]> writes:
>> >> >>
>> >> >> > =============
>> >> >> > Version Notes:
>> >> >> >
>> >> >> > V2 : Changed interface to memtier_override from adistance_offset.
>> >> >> > memtier_override was recommended by
>> >> >> > 1. John Groves <[email protected]>
>> >> >> > 2. Ravi Shankar <[email protected]>
>> >> >> > 3. Brice Goglin <[email protected]>
>> >> >>
>> >> >> It appears that you ignored my comments for V1 as follows ...
>> >> >>
>> >> >>
>> >>
>> https://lore.k/
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
>> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
>> served=0
>> >> ernel.org%2Flkml%2F87o7f62vur.fsf%40yhuang6-
>> >>
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> >>
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> >>
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> >>
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> >>
>> D%7C3000%7C%7C%7C&sdata=OpMkYCar%2Fv8uHb7AvXbmaNltnXeTvcNUTi
>> >> bLhwV12Fg%3D&reserved=0
>> >
>> > Thank you, Huang, Ying for pointing to this.
>> >
>> https://lpc.ev/
>> ents%2Fevent%2F16%2Fcontributions%2F1209%2Fattachments%2F1042%2F1
>> 995%2FLive%2520In%2520a%2520World%2520With%2520Multiple%2520Me
>> mory%2520Types.pdf&data=05%7C02%7Csthanneeru%40micron.com%7C3e
>> 5d38eb47be463c295c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806
>> f%7C0%7C0%7C638398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJW
>> IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3
>> 000%7C%7C%7C&sdata=1fGraxff7%2F1hNaE0an0xEudSKSUvaF3HgClMkmdC7
>> n8%3D&reserved=0
>> >
>> > In the presentation above, the adistance_offsets are per memtype.
>> > We believe that adistance_offset per node is more suitable and flexible.
>> > since we can change it per node. If we keep adistance_offset per memtype,
>> > then we cannot change it for a specific node of a given memtype.
>> >
>> >> >>
>> >>
>> https://lore.k/
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
>> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
>> served=0
>> >> ernel.org%2Flkml%2F87jzpt2ft5.fsf%40yhuang6-
>> >>
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> >>
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> >>
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> >>
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> >>
>> D%7C3000%7C%7C%7C&sdata=O0%2B6T%2FgU0TicCEYBac%2FAyjOLwAeouh
>> >> D%2BcMI%2BflOsI1M%3D&reserved=0
>> >
>> > Yes, memory_type would be grouping the related memories together as
>> single tier.
>> > We should also have a flexibility to move nodes between tiers, to address
>> the issues.
>> > described in use cases above.
>>
>> We don't pursue absolute flexibility. We add necessary flexibility
>> only. Why do you need this kind of flexibility? Can you provide some
>> use cases where memory_type based "adistance_offset" doesn't work?
>
> - /sys/devices/virtual/memory_type/memory_type/ adistance_offset
> memory_type based "adistance_offset will provide a way to move all nodes of same memory_type (e.g. all cxl nodes)
> to different tier.

We will not put the CXL nodes with different performance metrics in one
memory_type. If so, do you still need to move one of them?

> Whereas /sys/devices/system/node/node2/memtier_override provide a way migrate a node from one tier to another.
> Considering a case where we would like to move two cxl nodes into two different tiers in future.
> So, I thought it would be good to have flexibility at node level instead of at memory_type.

--
Best Regards,
Huang, Ying

2024-01-03 08:48:16

by Srinivasulu Thanneeru

[permalink] [raw]
Subject: RE: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers



> -----Original Message-----
> From: Huang, Ying <[email protected]>
> Sent: Wednesday, January 3, 2024 2:00 PM
> To: Srinivasulu Thanneeru <[email protected]>
> Cc: gregory.price <[email protected]>; Srinivasulu Opensrc
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Eishan Mirakhur
> <[email protected]>; Vinicius Tavares Petrucci
> <[email protected]>; Ravis OpenSrc <[email protected]>;
> [email protected]; [email protected]; Johannes
> Weiner <[email protected]>; Wei Xu <[email protected]>
> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
> tiers
>
> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
> you recognize the sender and were expecting this message.
>
>
> Srinivasulu Thanneeru <[email protected]> writes:
>
> > Micron Confidential
> >
> >
> >
> > Micron Confidential
> >> -----Original Message-----
> >> From: Huang, Ying <[email protected]>
> >> Sent: Wednesday, January 3, 2024 11:38 AM
> >> To: Srinivasulu Thanneeru <[email protected]>
> >> Cc: gregory.price <[email protected]>; Srinivasulu Opensrc
> >> <[email protected]>; [email protected]; linux-
> >> [email protected]; [email protected];
> [email protected];
> >> [email protected]; [email protected]; [email protected]; Eishan Mirakhur
> >> <[email protected]>; Vinicius Tavares Petrucci
> >> <[email protected]>; Ravis OpenSrc
> <[email protected]>;
> >> [email protected]; [email protected]; Johannes
> >> Weiner <[email protected]>; Wei Xu <[email protected]>
> >> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between
> memory
> >> tiers
> >>
> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
> >> you recognize the sender and were expecting this message.
> >>
> >>
> >> Srinivasulu Thanneeru <[email protected]> writes:
> >>
> >> > Micron Confidential
> >> >
> >> > Hi Huang, Ying,
> >> >
> >> > My apologies for wrong mail reply format, my mail client settings got
> >> changed on my PC.
> >> > Please find comments bellow inline.
> >> >
> >> > Regards,
> >> > Srini
> >> >
> >> >
> >> > Micron Confidential
> >> >> -----Original Message-----
> >> >> From: Huang, Ying <[email protected]>
> >> >> Sent: Monday, December 18, 2023 11:26 AM
> >> >> To: gregory.price <[email protected]>
> >> >> Cc: Srinivasulu Opensrc <[email protected]>; linux-
> >> >> [email protected]; [email protected]; Srinivasulu Thanneeru
> >> >> <[email protected]>; [email protected];
> >> >> [email protected]; [email protected]; [email protected];
> >> >> [email protected]; Eishan Mirakhur <[email protected]>;
> Vinicius
> >> >> Tavares Petrucci <[email protected]>; Ravis OpenSrc
> >> >> <[email protected]>; [email protected];
> linux-
> >> >> [email protected]; Johannes Weiner <[email protected]>; Wei
> Xu
> >> >> <[email protected]>
> >> >> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
> >> tiers
> >> >>
> >> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments
> unless
> >> >> you recognize the sender and were expecting this message.
> >> >>
> >> >>
> >> >> Gregory Price <[email protected]> writes:
> >> >>
> >> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> >> >> <[email protected]> writes:
> >> >> >>
> >> >> >> > =============
> >> >> >> > Version Notes:
> >> >> >> >
> >> >> >> > V2 : Changed interface to memtier_override from adistance_offset.
> >> >> >> > memtier_override was recommended by
> >> >> >> > 1. John Groves <[email protected]>
> >> >> >> > 2. Ravi Shankar <[email protected]>
> >> >> >> > 3. Brice Goglin <[email protected]>
> >> >> >>
> >> >> >> It appears that you ignored my comments for V1 as follows ...
> >> >> >>
> >> >> >>
> >> >>
> >>
> https://lore.k/
> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> &sdata=k6J1wxcuHTwR9eoD9Yz137bkn6wt1L9zpf5YaOjoIqA%3D&reserved=0
> >>
> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
> >>
> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
> >>
> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> >>
> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> >>
> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
> >> served=0
> >> >> ernel.org%2Flkml%2F87o7f62vur.fsf%40yhuang6-
> >> >>
> >>
> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
> >> >>
> >>
> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
> >> >>
> >>
> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
> >> >>
> >>
> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> >> >>
> >>
> D%7C3000%7C%7C%7C&sdata=OpMkYCar%2Fv8uHb7AvXbmaNltnXeTvcNUTi
> >> >> bLhwV12Fg%3D&reserved=0
> >> >
> >> > Thank you, Huang, Ying for pointing to this.
> >> >
> >>
> https://lpc.ev/
> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> &sdata=%2F0AW8RYpTIa7%2FiScnkzmmTeAE9TYqjsuWWjTuxBPptk%3D&rese
> rved=0
> >>
> ents%2Fevent%2F16%2Fcontributions%2F1209%2Fattachments%2F1042%2F1
> >>
> 995%2FLive%2520In%2520a%2520World%2520With%2520Multiple%2520Me
> >>
> mory%2520Types.pdf&data=05%7C02%7Csthanneeru%40micron.com%7C3e
> >>
> 5d38eb47be463c295c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806
> >>
> f%7C0%7C0%7C638398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJW
> >>
> IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3
> >>
> 000%7C%7C%7C&sdata=1fGraxff7%2F1hNaE0an0xEudSKSUvaF3HgClMkmdC7
> >> n8%3D&reserved=0
> >> >
> >> > In the presentation above, the adistance_offsets are per memtype.
> >> > We believe that adistance_offset per node is more suitable and flexible.
> >> > since we can change it per node. If we keep adistance_offset per
> memtype,
> >> > then we cannot change it for a specific node of a given memtype.
> >> >
> >> >> >>
> >> >>
> >>
> https://lore.k/
> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> &sdata=k6J1wxcuHTwR9eoD9Yz137bkn6wt1L9zpf5YaOjoIqA%3D&reserved=0
> >>
> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
> >>
> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
> >>
> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
> >>
> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
> >>
> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
> >> served=0
> >> >> ernel.org%2Flkml%2F87jzpt2ft5.fsf%40yhuang6-
> >> >>
> >>
> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
> >> >>
> >>
> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
> >> >>
> >>
> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
> >> >>
> >>
> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> >> >>
> >>
> D%7C3000%7C%7C%7C&sdata=O0%2B6T%2FgU0TicCEYBac%2FAyjOLwAeouh
> >> >> D%2BcMI%2BflOsI1M%3D&reserved=0
> >> >
> >> > Yes, memory_type would be grouping the related memories together as
> >> single tier.
> >> > We should also have a flexibility to move nodes between tiers, to
> address
> >> the issues.
> >> > described in use cases above.
> >>
> >> We don't pursue absolute flexibility. We add necessary flexibility
> >> only. Why do you need this kind of flexibility? Can you provide some
> >> use cases where memory_type based "adistance_offset" doesn't work?
> >
> > - /sys/devices/virtual/memory_type/memory_type/ adistance_offset
> > memory_type based "adistance_offset will provide a way to move all nodes
> of same memory_type (e.g. all cxl nodes)
> > to different tier.
>
> We will not put the CXL nodes with different performance metrics in one
> memory_type. If so, do you still need to move one of them?

From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
abstract_distance_offset: override by users to deal with firmware issue.

say firmware can configure the cxl node into wrong tiers, similar to that it may also configure all cxl nodes into single memtype, hence all these nodes can fall into a single wrong tier.
In this case, per node adistance_offset would be good to have ?

--
Srini
> > Whereas /sys/devices/system/node/node2/memtier_override provide a
> way migrate a node from one tier to another.
> > Considering a case where we would like to move two cxl nodes into two
> different tiers in future.
> > So, I thought it would be good to have flexibility at node level instead of at
> memory_type.
>
> --
> Best Regards,
> Huang, Ying

2024-01-04 06:07:50

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Srinivasulu Thanneeru <[email protected]> writes:

>> -----Original Message-----
>> From: Huang, Ying <[email protected]>
>> Sent: Wednesday, January 3, 2024 2:00 PM
>> To: Srinivasulu Thanneeru <[email protected]>
>> Cc: gregory.price <[email protected]>; Srinivasulu Opensrc
>> <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; [email protected]; Eishan Mirakhur
>> <[email protected]>; Vinicius Tavares Petrucci
>> <[email protected]>; Ravis OpenSrc <[email protected]>;
>> [email protected]; [email protected]; Johannes
>> Weiner <[email protected]>; Wei Xu <[email protected]>
>> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>>
>> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> you recognize the sender and were expecting this message.
>>
>>
>> Srinivasulu Thanneeru <[email protected]> writes:
>>
>> > Micron Confidential
>> >
>> >
>> >
>> > Micron Confidential
>> >> -----Original Message-----
>> >> From: Huang, Ying <[email protected]>
>> >> Sent: Wednesday, January 3, 2024 11:38 AM
>> >> To: Srinivasulu Thanneeru <[email protected]>
>> >> Cc: gregory.price <[email protected]>; Srinivasulu Opensrc
>> >> <[email protected]>; [email protected]; linux-
>> >> [email protected]; [email protected];
>> [email protected];
>> >> [email protected]; [email protected]; [email protected]; Eishan Mirakhur
>> >> <[email protected]>; Vinicius Tavares Petrucci
>> >> <[email protected]>; Ravis OpenSrc
>> <[email protected]>;
>> >> [email protected]; [email protected]; Johannes
>> >> Weiner <[email protected]>; Wei Xu <[email protected]>
>> >> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between
>> memory
>> >> tiers
>> >>
>> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> >> you recognize the sender and were expecting this message.
>> >>
>> >>
>> >> Srinivasulu Thanneeru <[email protected]> writes:
>> >>
>> >> > Micron Confidential
>> >> >
>> >> > Hi Huang, Ying,
>> >> >
>> >> > My apologies for wrong mail reply format, my mail client settings got
>> >> changed on my PC.
>> >> > Please find comments bellow inline.
>> >> >
>> >> > Regards,
>> >> > Srini
>> >> >
>> >> >
>> >> > Micron Confidential
>> >> >> -----Original Message-----
>> >> >> From: Huang, Ying <[email protected]>
>> >> >> Sent: Monday, December 18, 2023 11:26 AM
>> >> >> To: gregory.price <[email protected]>
>> >> >> Cc: Srinivasulu Opensrc <[email protected]>; linux-
>> >> >> [email protected]; [email protected]; Srinivasulu Thanneeru
>> >> >> <[email protected]>; [email protected];
>> >> >> [email protected]; [email protected]; [email protected];
>> >> >> [email protected]; Eishan Mirakhur <[email protected]>;
>> Vinicius
>> >> >> Tavares Petrucci <[email protected]>; Ravis OpenSrc
>> >> >> <[email protected]>; [email protected];
>> linux-
>> >> >> [email protected]; Johannes Weiner <[email protected]>; Wei
>> Xu
>> >> >> <[email protected]>
>> >> >> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> >> tiers
>> >> >>
>> >> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments
>> unless
>> >> >> you recognize the sender and were expecting this message.
>> >> >>
>> >> >>
>> >> >> Gregory Price <[email protected]> writes:
>> >> >>
>> >> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> >> >> <[email protected]> writes:
>> >> >> >>
>> >> >> >> > =============
>> >> >> >> > Version Notes:
>> >> >> >> >
>> >> >> >> > V2 : Changed interface to memtier_override from adistance_offset.
>> >> >> >> > memtier_override was recommended by
>> >> >> >> > 1. John Groves <[email protected]>
>> >> >> >> > 2. Ravi Shankar <[email protected]>
>> >> >> >> > 3. Brice Goglin <[email protected]>
>> >> >> >>
>> >> >> >> It appears that you ignored my comments for V1 as follows ...
>> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> https://lore.k/
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
>> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> &sdata=k6J1wxcuHTwR9eoD9Yz137bkn6wt1L9zpf5YaOjoIqA%3D&reserved=0
>> >>
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
>> >>
>> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> >>
>> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> >>
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> >>
>> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
>> >> served=0
>> >> >> ernel.org%2Flkml%2F87o7f62vur.fsf%40yhuang6-
>> >> >>
>> >>
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> >> >>
>> >>
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> >> >>
>> >>
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> >> >>
>> >>
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> >> >>
>> >>
>> D%7C3000%7C%7C%7C&sdata=OpMkYCar%2Fv8uHb7AvXbmaNltnXeTvcNUTi
>> >> >> bLhwV12Fg%3D&reserved=0
>> >> >
>> >> > Thank you, Huang, Ying for pointing to this.
>> >> >
>> >>
>> https://lpc.ev/
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
>> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> &sdata=%2F0AW8RYpTIa7%2FiScnkzmmTeAE9TYqjsuWWjTuxBPptk%3D&rese
>> rved=0
>> >>
>> ents%2Fevent%2F16%2Fcontributions%2F1209%2Fattachments%2F1042%2F1
>> >>
>> 995%2FLive%2520In%2520a%2520World%2520With%2520Multiple%2520Me
>> >>
>> mory%2520Types.pdf&data=05%7C02%7Csthanneeru%40micron.com%7C3e
>> >>
>> 5d38eb47be463c295c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806
>> >>
>> f%7C0%7C0%7C638398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJW
>> >>
>> IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3
>> >>
>> 000%7C%7C%7C&sdata=1fGraxff7%2F1hNaE0an0xEudSKSUvaF3HgClMkmdC7
>> >> n8%3D&reserved=0
>> >> >
>> >> > In the presentation above, the adistance_offsets are per memtype.
>> >> > We believe that adistance_offset per node is more suitable and flexible.
>> >> > since we can change it per node. If we keep adistance_offset per
>> memtype,
>> >> > then we cannot change it for a specific node of a given memtype.
>> >> >
>> >> >> >>
>> >> >>
>> >>
>> https://lore.k/
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7Ce9e04d25ea7540100
>> cf308dc0c366eb1%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> 8398675187014390%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> &sdata=k6J1wxcuHTwR9eoD9Yz137bkn6wt1L9zpf5YaOjoIqA%3D&reserved=0
>> >>
>> %2F&data=05%7C02%7Csthanneeru%40micron.com%7C3e5d38eb47be463c2
>> >>
>> 95c08dc0c229d22%7Cf38a5ecd28134862b11bac1d563c806f%7C0%7C0%7C63
>> >>
>> 8398590664228240%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMD
>> >>
>> AiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C
>> >>
>> &sdata=7fPxb1YYR2tZ0v2FB1vlXnMJFcI%2Fr9HT2%2BUD1MNUd%2FI%3D&re
>> >> served=0
>> >> >> ernel.org%2Flkml%2F87jzpt2ft5.fsf%40yhuang6-
>> >> >>
>> >>
>> desk2.ccr.corp.intel.com%2F&data=05%7C02%7Csthanneeru%40micron.com
>> >> >>
>> >>
>> %7C5e614e5f028342b6b59c08dbff8e3e37%7Cf38a5ecd28134862b11bac1d56
>> >> >>
>> >>
>> 3c806f%7C0%7C0%7C638384758666895965%7CUnknown%7CTWFpbGZsb3d
>> >> >>
>> >>
>> 8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
>> >> >>
>> >>
>> D%7C3000%7C%7C%7C&sdata=O0%2B6T%2FgU0TicCEYBac%2FAyjOLwAeouh
>> >> >> D%2BcMI%2BflOsI1M%3D&reserved=0
>> >> >
>> >> > Yes, memory_type would be grouping the related memories together as
>> >> single tier.
>> >> > We should also have a flexibility to move nodes between tiers, to
>> address
>> >> the issues.
>> >> > described in use cases above.
>> >>
>> >> We don't pursue absolute flexibility. We add necessary flexibility
>> >> only. Why do you need this kind of flexibility? Can you provide some
>> >> use cases where memory_type based "adistance_offset" doesn't work?
>> >
>> > - /sys/devices/virtual/memory_type/memory_type/ adistance_offset
>> > memory_type based "adistance_offset will provide a way to move all nodes
>> of same memory_type (e.g. all cxl nodes)
>> > to different tier.
>>
>> We will not put the CXL nodes with different performance metrics in one
>> memory_type. If so, do you still need to move one of them?
>
> From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> abstract_distance_offset: override by users to deal with firmware issue.
>
> say firmware can configure the cxl node into wrong tiers, similar to
> that it may also configure all cxl nodes into single memtype, hence
> all these nodes can fall into a single wrong tier.
> In this case, per node adistance_offset would be good to have ?

I think that it's better to fix the error firmware if possible. And
these are only theoretical, not practical issues. Do you have some
practical issues?

I understand that users may want to move nodes between memory tiers for
different policy choices. For that, memory_type based adistance_offset
should be good.

> --
> Srini
>> > Whereas /sys/devices/system/node/node2/memtier_override provide a
>> way migrate a node from one tier to another.
>> > Considering a case where we would like to move two cxl nodes into two
>> different tiers in future.
>> > So, I thought it would be good to have flexibility at node level instead of at
>> memory_type.

--
Best Regards,
Huang, Ying

2024-01-08 17:04:59

by Gregory Price

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >
> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> > abstract_distance_offset: override by users to deal with firmware issue.
> >
> > say firmware can configure the cxl node into wrong tiers, similar to
> > that it may also configure all cxl nodes into single memtype, hence
> > all these nodes can fall into a single wrong tier.
> > In this case, per node adistance_offset would be good to have ?
>
> I think that it's better to fix the error firmware if possible. And
> these are only theoretical, not practical issues. Do you have some
> practical issues?
>
> I understand that users may want to move nodes between memory tiers for
> different policy choices. For that, memory_type based adistance_offset
> should be good.
>

There's actually an affirmative case to change memory tiering to allow
either movement of nodes between tiers, or at least base placement on
HMAT information. Preferably, membership would be changable to allow
hotplug/DCD to be managed (there's no guarantee that the memory passed
through will always be what HMAT says on initial boot).

https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/

This group wants to enable passing CXL memory through to KVM/QEMU
(i.e. host CXL expander memory passed through to the guest), and
allow the guest to apply memory tiering.

There are multiple issues with this, presently:

1. The QEMU CXL virtual device is not and probably never will be
performant enough to be a commodity class virtualization. The
reason is that the virtual CXL device is built off the I/O
virtualization stack, which treats memory accesses as I/O accesses.

KVM also seems incompatible with the design of the CXL memory device
in general, but this problem may or may not be a blocker.

As a result, access to virtual CXL memory device leads to QEMU
crawling to a halt - and this is unlikely to change.

There is presently no good way forward to create a performant virtual
CXL device in QEMU. This means the memory tiering component in the
kernel is functionally useless for virtual CXL memory, because...

2. When passing memory through as an explicit NUMA node, but not as
part of a CXL memory device, the nodes are lumped together in the
DRAM tier.

None of this has to do with firmware.

Memory-type is an awful way of denoting membership of a tier, but we
have HMAT information that can be passed through via QEMU:

-object memory-backend-ram,size=4G,id=ram-node0 \
-object memory-backend-ram,size=4G,id=ram-node1 \
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

Not only would it be nice if we could change tier membership based on
this data, it's realistically the only way to allow guests to accomplish
memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.

~Gregory

2024-01-09 03:43:35

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Gregory Price <[email protected]> writes:

> On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >
>> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> > abstract_distance_offset: override by users to deal with firmware issue.
>> >
>> > say firmware can configure the cxl node into wrong tiers, similar to
>> > that it may also configure all cxl nodes into single memtype, hence
>> > all these nodes can fall into a single wrong tier.
>> > In this case, per node adistance_offset would be good to have ?
>>
>> I think that it's better to fix the error firmware if possible. And
>> these are only theoretical, not practical issues. Do you have some
>> practical issues?
>>
>> I understand that users may want to move nodes between memory tiers for
>> different policy choices. For that, memory_type based adistance_offset
>> should be good.
>>
>
> There's actually an affirmative case to change memory tiering to allow
> either movement of nodes between tiers, or at least base placement on
> HMAT information. Preferably, membership would be changable to allow
> hotplug/DCD to be managed (there's no guarantee that the memory passed
> through will always be what HMAT says on initial boot).

IIUC, from Jonathan Cameron as below, the performance of memory
shouldn't change even for DCD devices.

https://lore.kernel.org/linux-mm/[email protected]/

It's possible to change the performance of a NUMA node changed, if we
hot-remove a memory device, then hot-add another different memory
device. It's hoped that the CDAT changes too.

So, all in all, HMAT + CDAT can help us to put the memory device in
appropriate memory tiers. Now, we have HMAT support in upstream. We
will working on CDAT support.

--
Best Regards,
Huang, Ying

> https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>
> This group wants to enable passing CXL memory through to KVM/QEMU
> (i.e. host CXL expander memory passed through to the guest), and
> allow the guest to apply memory tiering.
>
> There are multiple issues with this, presently:
>
> 1. The QEMU CXL virtual device is not and probably never will be
> performant enough to be a commodity class virtualization. The
> reason is that the virtual CXL device is built off the I/O
> virtualization stack, which treats memory accesses as I/O accesses.
>
> KVM also seems incompatible with the design of the CXL memory device
> in general, but this problem may or may not be a blocker.
>
> As a result, access to virtual CXL memory device leads to QEMU
> crawling to a halt - and this is unlikely to change.
>
> There is presently no good way forward to create a performant virtual
> CXL device in QEMU. This means the memory tiering component in the
> kernel is functionally useless for virtual CXL memory, because...
>
> 2. When passing memory through as an explicit NUMA node, but not as
> part of a CXL memory device, the nodes are lumped together in the
> DRAM tier.
>
> None of this has to do with firmware.
>
> Memory-type is an awful way of denoting membership of a tier, but we
> have HMAT information that can be passed through via QEMU:
>
> -object memory-backend-ram,size=4G,id=ram-node0 \
> -object memory-backend-ram,size=4G,id=ram-node1 \
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> Not only would it be nice if we could change tier membership based on
> this data, it's realistically the only way to allow guests to accomplish
> memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> ~Gregory

2024-01-09 15:53:36

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, 09 Jan 2024 11:41:11 +0800
"Huang, Ying" <[email protected]> wrote:

> Gregory Price <[email protected]> writes:
>
> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> >
> >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >> > abstract_distance_offset: override by users to deal with firmware issue.
> >> >
> >> > say firmware can configure the cxl node into wrong tiers, similar to
> >> > that it may also configure all cxl nodes into single memtype, hence
> >> > all these nodes can fall into a single wrong tier.
> >> > In this case, per node adistance_offset would be good to have ?
> >>
> >> I think that it's better to fix the error firmware if possible. And
> >> these are only theoretical, not practical issues. Do you have some
> >> practical issues?
> >>
> >> I understand that users may want to move nodes between memory tiers for
> >> different policy choices. For that, memory_type based adistance_offset
> >> should be good.
> >>
> >
> > There's actually an affirmative case to change memory tiering to allow
> > either movement of nodes between tiers, or at least base placement on
> > HMAT information. Preferably, membership would be changable to allow
> > hotplug/DCD to be managed (there's no guarantee that the memory passed
> > through will always be what HMAT says on initial boot).
>
> IIUC, from Jonathan Cameron as below, the performance of memory
> shouldn't change even for DCD devices.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> It's possible to change the performance of a NUMA node changed, if we
> hot-remove a memory device, then hot-add another different memory
> device. It's hoped that the CDAT changes too.

Not supported, but ACPI has _HMA methods to in theory allow changing
HMAT values based on firmware notifications... So we 'could' make
it work for HMAT based description.

Ultimately my current thinking is we'll end up emulating CXL type3
devices (hiding topology complexity) and you can update CDAT but
IIRC that is only meant to be for degraded situations - so if you
want multiple performance regions, CDAT should describe them form the start.

>
> So, all in all, HMAT + CDAT can help us to put the memory device in
> appropriate memory tiers. Now, we have HMAT support in upstream. We
> will working on CDAT support.
>
> --
> Best Regards,
> Huang, Ying
>
> > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >
> > This group wants to enable passing CXL memory through to KVM/QEMU
> > (i.e. host CXL expander memory passed through to the guest), and
> > allow the guest to apply memory tiering.
> >
> > There are multiple issues with this, presently:
> >
> > 1. The QEMU CXL virtual device is not and probably never will be
> > performant enough to be a commodity class virtualization.

I'd flex that a bit - we will end up with a solution for virtualization but
it isn't the emulation that is there today because it's not possible to
emulate some of the topology in a peformant manner (interleaving with sub
page granularity / interleaving at all (to a lesser degree)). There are
ways to do better than we are today, but they start to look like
software dissagregated memory setups (think lots of page faults in the host).

> > The
> > reason is that the virtual CXL device is built off the I/O
> > virtualization stack, which treats memory accesses as I/O accesses.

That will remain true for complex emulation, but it needn't always be
the case.
I'm not 100% sure we can make it work but my current thinking is:

When decoders are set up: Check if there is any interleaving going on.
interleaving happening: Current functionally correct path.
no interleaving: More conventional memory access path.

> >
> > KVM also seems incompatible with the design of the CXL memory device
> > in general, but this problem may or may not be a blocker.

That's true if we are doing fine grained routing but as above we can
probably avoid that.

> >
> > As a result, access to virtual CXL memory device leads to QEMU
> > crawling to a halt - and this is unlikely to change.

In general yes, but hopefully not for carefully configured cases (the
simple one of direct connect single device, no host interleaving for example).

> >
> > There is presently no good way forward to create a performant virtual
> > CXL device in QEMU. This means the memory tiering component in the
> > kernel is functionally useless for virtual CXL memory, because...

Agreed - nothing there yet and I don't think the question of CXL virtualization
in general is anywhere near solved... Maybe emulating a CXL device doesn't
make sense, maybe we end up extending virtio-mem instead.
Needs some PoC work to flesh this out. (it's about number 3 on my list of
stuff to look at this year)

> >
> > 2. When passing memory through as an explicit NUMA node, but not as
> > part of a CXL memory device, the nodes are lumped together in the
> > DRAM tier.
> >
> > None of this has to do with firmware.
> >
> > Memory-type is an awful way of denoting membership of a tier, but we
> > have HMAT information that can be passed through via QEMU:
> >
> > -object memory-backend-ram,size=4G,id=ram-node0 \
> > -object memory-backend-ram,size=4G,id=ram-node1 \
> > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >
> > Not only would it be nice if we could change tier membership based on
> > this data, it's realistically the only way to allow guests to accomplish
> > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.

This I fully agree with. There will be systems with a bunch of normal DDR with different
access characteristics irrespective of CXL. + likely HMAT solutions will be used
before we get anything more complex in place for CXL.

Jonathan

p.s. I'd love to see _HMA handling implemented in the kernel.. Would trail blaze what
we will probably need to do for fiddly CXL cases where performance degrades on old devices
etc.

> >
> > ~Gregory
>


2024-01-09 17:34:44

by Gregory Price

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, Jan 09, 2024 at 11:41:11AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> >
> >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >> > abstract_distance_offset: override by users to deal with firmware issue.
> >> >
> >> > say firmware can configure the cxl node into wrong tiers, similar to
> >> > that it may also configure all cxl nodes into single memtype, hence
> >> > all these nodes can fall into a single wrong tier.
> >> > In this case, per node adistance_offset would be good to have ?
> >>
> >> I think that it's better to fix the error firmware if possible. And
> >> these are only theoretical, not practical issues. Do you have some
> >> practical issues?
> >>
> >> I understand that users may want to move nodes between memory tiers for
> >> different policy choices. For that, memory_type based adistance_offset
> >> should be good.
> >>
> >
> > There's actually an affirmative case to change memory tiering to allow
> > either movement of nodes between tiers, or at least base placement on
> > HMAT information. Preferably, membership would be changable to allow
> > hotplug/DCD to be managed (there's no guarantee that the memory passed
> > through will always be what HMAT says on initial boot).
>
> IIUC, from Jonathan Cameron as below, the performance of memory
> shouldn't change even for DCD devices.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> It's possible to change the performance of a NUMA node changed, if we
> hot-remove a memory device, then hot-add another different memory
> device. It's hoped that the CDAT changes too.
>
> So, all in all, HMAT + CDAT can help us to put the memory device in
> appropriate memory tiers. Now, we have HMAT support in upstream. We
> will working on CDAT support.

That should be sufficient assuming the `-numa hmat-lb` setting in QEMU
does the right thing. I suppose we also need to figure out a way to set
CDAT information for a memory device that isn't related to CXL (from the
perspective of the guest). I'll take a look if I get cycles.

~Gregory

2024-01-09 18:01:28

by Gregory Price

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> On Tue, 09 Jan 2024 11:41:11 +0800
> "Huang, Ying" <[email protected]> wrote:
> > Gregory Price <[email protected]> writes:
> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > It's possible to change the performance of a NUMA node changed, if we
> > hot-remove a memory device, then hot-add another different memory
> > device. It's hoped that the CDAT changes too.
>
> Not supported, but ACPI has _HMA methods to in theory allow changing
> HMAT values based on firmware notifications... So we 'could' make
> it work for HMAT based description.
>
> Ultimately my current thinking is we'll end up emulating CXL type3
> devices (hiding topology complexity) and you can update CDAT but
> IIRC that is only meant to be for degraded situations - so if you
> want multiple performance regions, CDAT should describe them form the start.
>

That was my thought. I don't think it's particularly *realistic* for
HMAT/CDAT values to change at runtime, but I can imagine a case where
it could be valuable.

> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > >
> > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > (i.e. host CXL expander memory passed through to the guest), and
> > > allow the guest to apply memory tiering.
> > >
> > > There are multiple issues with this, presently:
> > >
> > > 1. The QEMU CXL virtual device is not and probably never will be
> > > performant enough to be a commodity class virtualization.
>
> I'd flex that a bit - we will end up with a solution for virtualization but
> it isn't the emulation that is there today because it's not possible to
> emulate some of the topology in a peformant manner (interleaving with sub
> page granularity / interleaving at all (to a lesser degree)). There are
> ways to do better than we are today, but they start to look like
> software dissagregated memory setups (think lots of page faults in the host).
>

Agreed, the emulated device as-is can't be the virtualization device,
but it doesn't mean it can't be the basis for it.

My thought is, if you want to pass host CXL *memory* through to the
guest, you don't actually care to pass CXL *control* through to the
guest. That control lies pretty squarely with the host/hypervisor.

So, at least in theory, you can just cut the type3 device out of the
QEMU configuration entirely and just pass it through as a distinct numa
node with specific hmat qualities.

Barring that, if we must go through the type3 device, the question is
how difficult would it be to just make a stripped down type3 device
to provide the informational components, but hack off anything
topology/interleave related? Then you just do direct passthrough as you
described below.

qemu/kvm would report errors if you tried to touch the naughty bits.

The second question is... is that device "compliant" or does it need
super special handling from the kernel driver :D? If what i described
is not "compliant", then it's probably a bad idea, and KVM/QEMU should
just hide the CXL device entirely from the guest (for this use case)
and just pass the memory through as a numa node.

Which gets us back to: The memory-tiering component needs a way to
place nodes in different tiers based on HMAT/CDAT/User Whim. All three
of those seem like totally valid ways to go about it.

> > >
> > > 2. When passing memory through as an explicit NUMA node, but not as
> > > part of a CXL memory device, the nodes are lumped together in the
> > > DRAM tier.
> > >
> > > None of this has to do with firmware.
> > >
> > > Memory-type is an awful way of denoting membership of a tier, but we
> > > have HMAT information that can be passed through via QEMU:
> > >
> > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > >
> > > Not only would it be nice if we could change tier membership based on
> > > this data, it's realistically the only way to allow guests to accomplish
> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> This I fully agree with. There will be systems with a bunch of normal DDR with different
> access characteristics irrespective of CXL. + likely HMAT solutions will be used
> before we get anything more complex in place for CXL.
>

Had not even considered this, but that's completely accurate as well.

And more discretely: What of devices that don't provide HMAT/CDAT? That
isn't necessarily a violation of any standard. There probably could be
a release valve for us to still make those devices useful.

The concern I have with not implementing a movement mechanism *at all*
is that a one-size-fits-all initial-placement heuristic feels gross
when we're, at least ideologically, moving toward "software defined memory".

Personally I think the movement mechanism is a good idea that gets folks
where they're going sooner, and it doesn't hurt anything by existing. We
can change the initial placement mechanism too.

</2cents>

~Gregory

2024-01-10 00:29:04

by Hao Xiang

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
>
> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > On Tue, 09 Jan 2024 11:41:11 +0800
> > "Huang, Ying" <[email protected]> wrote:
> > > Gregory Price <[email protected]> writes:
> > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > It's possible to change the performance of a NUMA node changed, if we
> > > hot-remove a memory device, then hot-add another different memory
> > > device. It's hoped that the CDAT changes too.
> >
> > Not supported, but ACPI has _HMA methods to in theory allow changing
> > HMAT values based on firmware notifications... So we 'could' make
> > it work for HMAT based description.
> >
> > Ultimately my current thinking is we'll end up emulating CXL type3
> > devices (hiding topology complexity) and you can update CDAT but
> > IIRC that is only meant to be for degraded situations - so if you
> > want multiple performance regions, CDAT should describe them form the start.
> >
>
> That was my thought. I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.
>
> > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > >
> > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > allow the guest to apply memory tiering.
> > > >
> > > > There are multiple issues with this, presently:
> > > >
> > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > performant enough to be a commodity class virtualization.
> >
> > I'd flex that a bit - we will end up with a solution for virtualization but
> > it isn't the emulation that is there today because it's not possible to
> > emulate some of the topology in a peformant manner (interleaving with sub
> > page granularity / interleaving at all (to a lesser degree)). There are
> > ways to do better than we are today, but they start to look like
> > software dissagregated memory setups (think lots of page faults in the host).
> >
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest. That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.
>
> qemu/kvm would report errors if you tried to touch the naughty bits.
>
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D? If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
> > > >
> > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > part of a CXL memory device, the nodes are lumped together in the
> > > > DRAM tier.
> > > >
> > > > None of this has to do with firmware.
> > > >
> > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > have HMAT information that can be passed through via QEMU:
> > > >
> > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > >
> > > > Not only would it be nice if we could change tier membership based on
> > > > this data, it's realistically the only way to allow guests to accomplish
> > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >
> > This I fully agree with. There will be systems with a bunch of normal DDR with different
> > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > before we get anything more complex in place for CXL.
> >
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard. There probably could be
> a release valve for us to still make those devices useful.
>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.

I think providing users a way to "FIX" the memory tiering is a backup
option. Given that DDRs with different access characteristics provide
the relevant CDAT/HMAT information, the kernel should be able to
correctly establish memory tiering on boot.
Current memory tiering code has
1) memory_tier_init() to iterate through all boot onlined memory
nodes. All nodes are assumed to be fast tier (adistance
MEMTIER_ADISTANCE_DRAM is used).
2) dev_dax_kmem_probe to iterate through all devdax controlled memory
nodes. This is the place the kernel reads the memory attributes from
HMAT and recognizes the memory nodes into the correct tier (devdax
controlled CXL, pmem, etc).
If we want DDRs with different memory characteristics to be put into
the correct tier (as in the guest VM memory tiering case), we probably
need a third path to iterate the boot onlined memory nodes and also be
able to read their memory attributes. I don't think we can do that in
1) because the ACPI subsystem is not yet initialized.

>
> </2cents>
>
> ~Gregory

2024-01-10 05:50:46

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Gregory Price <[email protected]> writes:

> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> On Tue, 09 Jan 2024 11:41:11 +0800
>> "Huang, Ying" <[email protected]> wrote:
>> > Gregory Price <[email protected]> writes:
>> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> > It's possible to change the performance of a NUMA node changed, if we
>> > hot-remove a memory device, then hot-add another different memory
>> > device. It's hoped that the CDAT changes too.
>>
>> Not supported, but ACPI has _HMA methods to in theory allow changing
>> HMAT values based on firmware notifications... So we 'could' make
>> it work for HMAT based description.
>>
>> Ultimately my current thinking is we'll end up emulating CXL type3
>> devices (hiding topology complexity) and you can update CDAT but
>> IIRC that is only meant to be for degraded situations - so if you
>> want multiple performance regions, CDAT should describe them form the start.
>>
>
> That was my thought. I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.
>
>> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> > >
>> > > This group wants to enable passing CXL memory through to KVM/QEMU
>> > > (i.e. host CXL expander memory passed through to the guest), and
>> > > allow the guest to apply memory tiering.
>> > >
>> > > There are multiple issues with this, presently:
>> > >
>> > > 1. The QEMU CXL virtual device is not and probably never will be
>> > > performant enough to be a commodity class virtualization.
>>
>> I'd flex that a bit - we will end up with a solution for virtualization but
>> it isn't the emulation that is there today because it's not possible to
>> emulate some of the topology in a peformant manner (interleaving with sub
>> page granularity / interleaving at all (to a lesser degree)). There are
>> ways to do better than we are today, but they start to look like
>> software dissagregated memory setups (think lots of page faults in the host).
>>
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest. That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.
>
> qemu/kvm would report errors if you tried to touch the naughty bits.
>
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D? If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
>> > >
>> > > 2. When passing memory through as an explicit NUMA node, but not as
>> > > part of a CXL memory device, the nodes are lumped together in the
>> > > DRAM tier.
>> > >
>> > > None of this has to do with firmware.
>> > >
>> > > Memory-type is an awful way of denoting membership of a tier, but we
>> > > have HMAT information that can be passed through via QEMU:
>> > >
>> > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> > >
>> > > Not only would it be nice if we could change tier membership based on
>> > > this data, it's realistically the only way to allow guests to accomplish
>> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>>
>> This I fully agree with. There will be systems with a bunch of normal DDR with different
>> access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> before we get anything more complex in place for CXL.
>>
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard. There probably could be
> a release valve for us to still make those devices useful.
>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.
>
> </2cents>

It's the last resort to provide hardware information from user space.
We should try to avoid that if possible.

Per my understanding, per-memory-type abstract distance overriding is to
apply specific policy. While, per-memory-node abstract distance
overriding is to provide missing hardware information.

--
Best Regards,
Huang, Ying

2024-01-10 06:09:02

by Huang, Ying

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Jonathan Cameron <[email protected]> writes:

> On Tue, 09 Jan 2024 11:41:11 +0800
> "Huang, Ying" <[email protected]> wrote:
>
>> Gregory Price <[email protected]> writes:
>>
>> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >> >
>> >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> >> > abstract_distance_offset: override by users to deal with firmware issue.
>> >> >
>> >> > say firmware can configure the cxl node into wrong tiers, similar to
>> >> > that it may also configure all cxl nodes into single memtype, hence
>> >> > all these nodes can fall into a single wrong tier.
>> >> > In this case, per node adistance_offset would be good to have ?
>> >>
>> >> I think that it's better to fix the error firmware if possible. And
>> >> these are only theoretical, not practical issues. Do you have some
>> >> practical issues?
>> >>
>> >> I understand that users may want to move nodes between memory tiers for
>> >> different policy choices. For that, memory_type based adistance_offset
>> >> should be good.
>> >>
>> >
>> > There's actually an affirmative case to change memory tiering to allow
>> > either movement of nodes between tiers, or at least base placement on
>> > HMAT information. Preferably, membership would be changable to allow
>> > hotplug/DCD to be managed (there's no guarantee that the memory passed
>> > through will always be what HMAT says on initial boot).
>>
>> IIUC, from Jonathan Cameron as below, the performance of memory
>> shouldn't change even for DCD devices.
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> It's possible to change the performance of a NUMA node changed, if we
>> hot-remove a memory device, then hot-add another different memory
>> device. It's hoped that the CDAT changes too.
>
> Not supported, but ACPI has _HMA methods to in theory allow changing
> HMAT values based on firmware notifications... So we 'could' make
> it work for HMAT based description.
>
> Ultimately my current thinking is we'll end up emulating CXL type3
> devices (hiding topology complexity) and you can update CDAT but
> IIRC that is only meant to be for degraded situations - so if you
> want multiple performance regions, CDAT should describe them form the start.

Thank you very much for input! So, to support degraded performance, we
will need to move a NUMA node between memory tiers. And, per my
understanding, we should do that in kernel.

>>
>> So, all in all, HMAT + CDAT can help us to put the memory device in
>> appropriate memory tiers. Now, we have HMAT support in upstream. We
>> will working on CDAT support.
>>

--
Best Regards,
Huang, Ying

2024-01-10 14:14:51

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, 9 Jan 2024 12:59:19 -0500
Gregory Price <[email protected]> wrote:

> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > On Tue, 09 Jan 2024 11:41:11 +0800
> > "Huang, Ying" <[email protected]> wrote:
> > > Gregory Price <[email protected]> writes:
> > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > It's possible to change the performance of a NUMA node changed, if we
> > > hot-remove a memory device, then hot-add another different memory
> > > device. It's hoped that the CDAT changes too.
> >
> > Not supported, but ACPI has _HMA methods to in theory allow changing
> > HMAT values based on firmware notifications... So we 'could' make
> > it work for HMAT based description.
> >
> > Ultimately my current thinking is we'll end up emulating CXL type3
> > devices (hiding topology complexity) and you can update CDAT but
> > IIRC that is only meant to be for degraded situations - so if you
> > want multiple performance regions, CDAT should describe them form the start.
> >
>
> That was my thought. I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.

For now I'm thinking we might spit that CDAT info via a tracepoint if
it happens, but given it's degraded perf only maybe we don't care.

HMAT is more interesting because it may be used by a firmware first
model to paper over some weird hardware being hotplugged, or for giggles
a hypervisor moving memory around under the hood (think powering down
whole DRAM controllers etc).

Anyhow, that's highly speculative and whoever cares about it can
make it work! :)

>
> > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > >
> > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > allow the guest to apply memory tiering.
> > > >
> > > > There are multiple issues with this, presently:
> > > >
> > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > performant enough to be a commodity class virtualization.
> >
> > I'd flex that a bit - we will end up with a solution for virtualization but
> > it isn't the emulation that is there today because it's not possible to
> > emulate some of the topology in a peformant manner (interleaving with sub
> > page granularity / interleaving at all (to a lesser degree)). There are
> > ways to do better than we are today, but they start to look like
> > software dissagregated memory setups (think lots of page faults in the host).
> >
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest. That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.

Not stripped down as such, just lock the decoders as if a firmware had
configured it (in reality the config will be really really simple).
The kernel stack handles that fine today. The only dynamic bit
would be the DC related part. Not sure our lockdown support in the
emulated device is complete (some of it is there but might have missed
some registers).

>
> qemu/kvm would report errors if you tried to touch the naughty bits.

Might do that a temporary step along way to enabling thing but given
CXL assumes that the host firmware 'might' have configured everything and
locked it (kernel may be booting out of CXL memory for instance) it should
'just work' without needing this.

> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D? If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
Would need to be compliant or very nearly so - I can see we might advertise
no interleave support even though not setting any of the interleave address
bits is technically a spec violation. However, don't think we need to
do that because of decoder locking. We advertise interleave options but
don't allow current setting to be changed.

If someone manually resets the bus they are on their own though :(
(that will clear the lock registers as it's the same as removing power).

>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
> > > >
> > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > part of a CXL memory device, the nodes are lumped together in the
> > > > DRAM tier.
> > > >
> > > > None of this has to do with firmware.
> > > >
> > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > have HMAT information that can be passed through via QEMU:
> > > >
> > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > >
> > > > Not only would it be nice if we could change tier membership based on
> > > > this data, it's realistically the only way to allow guests to accomplish
> > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >
> > This I fully agree with. There will be systems with a bunch of normal DDR with different
> > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > before we get anything more complex in place for CXL.
> >
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard. There probably could be
> a release valve for us to still make those devices useful.

I'd argue any such device needs some driver support. Release valve is they
provide the info from that driver, just like the CDAT solution is doing.

If they don't then meh, their system is borked so they'll will add it
fairly quickly!

>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.

I've no problem with a movement mechanism. Hopefully in the long run it
never gets used though! Maybe in short term it's out of tree code.

Jonathan

>
> </2cents>
>
> ~Gregory


2024-01-10 14:21:12

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Tue, 9 Jan 2024 16:28:15 -0800
Hao Xiang <[email protected]> wrote:

> On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
> >
> > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > > On Tue, 09 Jan 2024 11:41:11 +0800
> > > "Huang, Ying" <[email protected]> wrote:
> > > > Gregory Price <[email protected]> writes:
> > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > > It's possible to change the performance of a NUMA node changed, if we
> > > > hot-remove a memory device, then hot-add another different memory
> > > > device. It's hoped that the CDAT changes too.
> > >
> > > Not supported, but ACPI has _HMA methods to in theory allow changing
> > > HMAT values based on firmware notifications... So we 'could' make
> > > it work for HMAT based description.
> > >
> > > Ultimately my current thinking is we'll end up emulating CXL type3
> > > devices (hiding topology complexity) and you can update CDAT but
> > > IIRC that is only meant to be for degraded situations - so if you
> > > want multiple performance regions, CDAT should describe them form the start.
> > >
> >
> > That was my thought. I don't think it's particularly *realistic* for
> > HMAT/CDAT values to change at runtime, but I can imagine a case where
> > it could be valuable.
> >
> > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > > >
> > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > > allow the guest to apply memory tiering.
> > > > >
> > > > > There are multiple issues with this, presently:
> > > > >
> > > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > > performant enough to be a commodity class virtualization.
> > >
> > > I'd flex that a bit - we will end up with a solution for virtualization but
> > > it isn't the emulation that is there today because it's not possible to
> > > emulate some of the topology in a peformant manner (interleaving with sub
> > > page granularity / interleaving at all (to a lesser degree)). There are
> > > ways to do better than we are today, but they start to look like
> > > software dissagregated memory setups (think lots of page faults in the host).
> > >
> >
> > Agreed, the emulated device as-is can't be the virtualization device,
> > but it doesn't mean it can't be the basis for it.
> >
> > My thought is, if you want to pass host CXL *memory* through to the
> > guest, you don't actually care to pass CXL *control* through to the
> > guest. That control lies pretty squarely with the host/hypervisor.
> >
> > So, at least in theory, you can just cut the type3 device out of the
> > QEMU configuration entirely and just pass it through as a distinct numa
> > node with specific hmat qualities.
> >
> > Barring that, if we must go through the type3 device, the question is
> > how difficult would it be to just make a stripped down type3 device
> > to provide the informational components, but hack off anything
> > topology/interleave related? Then you just do direct passthrough as you
> > described below.
> >
> > qemu/kvm would report errors if you tried to touch the naughty bits.
> >
> > The second question is... is that device "compliant" or does it need
> > super special handling from the kernel driver :D? If what i described
> > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> > just hide the CXL device entirely from the guest (for this use case)
> > and just pass the memory through as a numa node.
> >
> > Which gets us back to: The memory-tiering component needs a way to
> > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> > of those seem like totally valid ways to go about it.
> >
> > > > >
> > > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > > part of a CXL memory device, the nodes are lumped together in the
> > > > > DRAM tier.
> > > > >
> > > > > None of this has to do with firmware.
> > > > >
> > > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > > have HMAT information that can be passed through via QEMU:
> > > > >
> > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > > >
> > > > > Not only would it be nice if we could change tier membership based on
> > > > > this data, it's realistically the only way to allow guests to accomplish
> > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> > >
> > > This I fully agree with. There will be systems with a bunch of normal DDR with different
> > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > > before we get anything more complex in place for CXL.
> > >
> >
> > Had not even considered this, but that's completely accurate as well.
> >
> > And more discretely: What of devices that don't provide HMAT/CDAT? That
> > isn't necessarily a violation of any standard. There probably could be
> > a release valve for us to still make those devices useful.
> >
> > The concern I have with not implementing a movement mechanism *at all*
> > is that a one-size-fits-all initial-placement heuristic feels gross
> > when we're, at least ideologically, moving toward "software defined memory".
> >
> > Personally I think the movement mechanism is a good idea that gets folks
> > where they're going sooner, and it doesn't hurt anything by existing. We
> > can change the initial placement mechanism too.
>
> I think providing users a way to "FIX" the memory tiering is a backup
> option. Given that DDRs with different access characteristics provide
> the relevant CDAT/HMAT information, the kernel should be able to
> correctly establish memory tiering on boot.

Include hotplug and I'll be happier! I know that's messy though.

> Current memory tiering code has
> 1) memory_tier_init() to iterate through all boot onlined memory
> nodes. All nodes are assumed to be fast tier (adistance
> MEMTIER_ADISTANCE_DRAM is used).
> 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> nodes. This is the place the kernel reads the memory attributes from
> HMAT and recognizes the memory nodes into the correct tier (devdax
> controlled CXL, pmem, etc).
> If we want DDRs with different memory characteristics to be put into
> the correct tier (as in the guest VM memory tiering case), we probably
> need a third path to iterate the boot onlined memory nodes and also be
> able to read their memory attributes. I don't think we can do that in
> 1) because the ACPI subsystem is not yet initialized.

Can we move it later in general? Or drag HMAT parsing earlier?
ACPI table availability is pretty early, it's just that we don't bother
with HMAT because nothing early uses it.
IIRC SRAT parsing occurs way before memory_tier_init() will be called.

Jonathan



>
> >
> > </2cents>
> >
> > ~Gregory


2024-01-10 19:29:52

by Hao Xiang

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
<[email protected]> wrote:
>
> On Tue, 9 Jan 2024 16:28:15 -0800
> Hao Xiang <[email protected]> wrote:
>
> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
> > >
> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> > > > "Huang, Ying" <[email protected]> wrote:
> > > > > Gregory Price <[email protected]> writes:
> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > > > It's possible to change the performance of a NUMA node changed, if we
> > > > > hot-remove a memory device, then hot-add another different memory
> > > > > device. It's hoped that the CDAT changes too.
> > > >
> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> > > > HMAT values based on firmware notifications... So we 'could' make
> > > > it work for HMAT based description.
> > > >
> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> > > > devices (hiding topology complexity) and you can update CDAT but
> > > > IIRC that is only meant to be for degraded situations - so if you
> > > > want multiple performance regions, CDAT should describe them form the start.
> > > >
> > >
> > > That was my thought. I don't think it's particularly *realistic* for
> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> > > it could be valuable.
> > >
> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > > > >
> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > > > allow the guest to apply memory tiering.
> > > > > >
> > > > > > There are multiple issues with this, presently:
> > > > > >
> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > > > performant enough to be a commodity class virtualization.
> > > >
> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> > > > it isn't the emulation that is there today because it's not possible to
> > > > emulate some of the topology in a peformant manner (interleaving with sub
> > > > page granularity / interleaving at all (to a lesser degree)). There are
> > > > ways to do better than we are today, but they start to look like
> > > > software dissagregated memory setups (think lots of page faults in the host).
> > > >
> > >
> > > Agreed, the emulated device as-is can't be the virtualization device,
> > > but it doesn't mean it can't be the basis for it.
> > >
> > > My thought is, if you want to pass host CXL *memory* through to the
> > > guest, you don't actually care to pass CXL *control* through to the
> > > guest. That control lies pretty squarely with the host/hypervisor.
> > >
> > > So, at least in theory, you can just cut the type3 device out of the
> > > QEMU configuration entirely and just pass it through as a distinct numa
> > > node with specific hmat qualities.
> > >
> > > Barring that, if we must go through the type3 device, the question is
> > > how difficult would it be to just make a stripped down type3 device
> > > to provide the informational components, but hack off anything
> > > topology/interleave related? Then you just do direct passthrough as you
> > > described below.
> > >
> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> > >
> > > The second question is... is that device "compliant" or does it need
> > > super special handling from the kernel driver :D? If what i described
> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> > > just hide the CXL device entirely from the guest (for this use case)
> > > and just pass the memory through as a numa node.
> > >
> > > Which gets us back to: The memory-tiering component needs a way to
> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> > > of those seem like totally valid ways to go about it.
> > >
> > > > > >
> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > > > part of a CXL memory device, the nodes are lumped together in the
> > > > > > DRAM tier.
> > > > > >
> > > > > > None of this has to do with firmware.
> > > > > >
> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > > > have HMAT information that can be passed through via QEMU:
> > > > > >
> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > > > >
> > > > > > Not only would it be nice if we could change tier membership based on
> > > > > > this data, it's realistically the only way to allow guests to accomplish
> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> > > >
> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > > > before we get anything more complex in place for CXL.
> > > >
> > >
> > > Had not even considered this, but that's completely accurate as well.
> > >
> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> > > isn't necessarily a violation of any standard. There probably could be
> > > a release valve for us to still make those devices useful.
> > >
> > > The concern I have with not implementing a movement mechanism *at all*
> > > is that a one-size-fits-all initial-placement heuristic feels gross
> > > when we're, at least ideologically, moving toward "software defined memory".
> > >
> > > Personally I think the movement mechanism is a good idea that gets folks
> > > where they're going sooner, and it doesn't hurt anything by existing. We
> > > can change the initial placement mechanism too.
> >
> > I think providing users a way to "FIX" the memory tiering is a backup
> > option. Given that DDRs with different access characteristics provide
> > the relevant CDAT/HMAT information, the kernel should be able to
> > correctly establish memory tiering on boot.
>
> Include hotplug and I'll be happier! I know that's messy though.
>
> > Current memory tiering code has
> > 1) memory_tier_init() to iterate through all boot onlined memory
> > nodes. All nodes are assumed to be fast tier (adistance
> > MEMTIER_ADISTANCE_DRAM is used).
> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> > nodes. This is the place the kernel reads the memory attributes from
> > HMAT and recognizes the memory nodes into the correct tier (devdax
> > controlled CXL, pmem, etc).
> > If we want DDRs with different memory characteristics to be put into
> > the correct tier (as in the guest VM memory tiering case), we probably
> > need a third path to iterate the boot onlined memory nodes and also be
> > able to read their memory attributes. I don't think we can do that in
> > 1) because the ACPI subsystem is not yet initialized.
>
> Can we move it later in general? Or drag HMAT parsing earlier?
> ACPI table availability is pretty early, it's just that we don't bother
> with HMAT because nothing early uses it.
> IIRC SRAT parsing occurs way before memory_tier_init() will be called.

I tested the call sequence under a debugger earlier. hmat_init() is
called after memory_tier_init(). Let me poke around and see what our
options are.

>
> Jonathan
>
>
>
> >
> > >
> > > </2cents>
> > >
> > > ~Gregory
>

2024-01-12 07:02:26

by Huang, Ying

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Hao Xiang <[email protected]> writes:

> On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> <[email protected]> wrote:
>>
>> On Tue, 9 Jan 2024 16:28:15 -0800
>> Hao Xiang <[email protected]> wrote:
>>
>> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
>> > >
>> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> > > > "Huang, Ying" <[email protected]> wrote:
>> > > > > Gregory Price <[email protected]> writes:
>> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> > > > > It's possible to change the performance of a NUMA node changed, if we
>> > > > > hot-remove a memory device, then hot-add another different memory
>> > > > > device. It's hoped that the CDAT changes too.
>> > > >
>> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
>> > > > HMAT values based on firmware notifications... So we 'could' make
>> > > > it work for HMAT based description.
>> > > >
>> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> > > > devices (hiding topology complexity) and you can update CDAT but
>> > > > IIRC that is only meant to be for degraded situations - so if you
>> > > > want multiple performance regions, CDAT should describe them form the start.
>> > > >
>> > >
>> > > That was my thought. I don't think it's particularly *realistic* for
>> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> > > it could be valuable.
>> > >
>> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> > > > > >
>> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> > > > > > allow the guest to apply memory tiering.
>> > > > > >
>> > > > > > There are multiple issues with this, presently:
>> > > > > >
>> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> > > > > > performant enough to be a commodity class virtualization.
>> > > >
>> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> > > > it isn't the emulation that is there today because it's not possible to
>> > > > emulate some of the topology in a peformant manner (interleaving with sub
>> > > > page granularity / interleaving at all (to a lesser degree)). There are
>> > > > ways to do better than we are today, but they start to look like
>> > > > software dissagregated memory setups (think lots of page faults in the host).
>> > > >
>> > >
>> > > Agreed, the emulated device as-is can't be the virtualization device,
>> > > but it doesn't mean it can't be the basis for it.
>> > >
>> > > My thought is, if you want to pass host CXL *memory* through to the
>> > > guest, you don't actually care to pass CXL *control* through to the
>> > > guest. That control lies pretty squarely with the host/hypervisor.
>> > >
>> > > So, at least in theory, you can just cut the type3 device out of the
>> > > QEMU configuration entirely and just pass it through as a distinct numa
>> > > node with specific hmat qualities.
>> > >
>> > > Barring that, if we must go through the type3 device, the question is
>> > > how difficult would it be to just make a stripped down type3 device
>> > > to provide the informational components, but hack off anything
>> > > topology/interleave related? Then you just do direct passthrough as you
>> > > described below.
>> > >
>> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> > >
>> > > The second question is... is that device "compliant" or does it need
>> > > super special handling from the kernel driver :D? If what i described
>> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> > > just hide the CXL device entirely from the guest (for this use case)
>> > > and just pass the memory through as a numa node.
>> > >
>> > > Which gets us back to: The memory-tiering component needs a way to
>> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> > > of those seem like totally valid ways to go about it.
>> > >
>> > > > > >
>> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> > > > > > part of a CXL memory device, the nodes are lumped together in the
>> > > > > > DRAM tier.
>> > > > > >
>> > > > > > None of this has to do with firmware.
>> > > > > >
>> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> > > > > > have HMAT information that can be passed through via QEMU:
>> > > > > >
>> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> > > > > >
>> > > > > > Not only would it be nice if we could change tier membership based on
>> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> > > >
>> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
>> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> > > > before we get anything more complex in place for CXL.
>> > > >
>> > >
>> > > Had not even considered this, but that's completely accurate as well.
>> > >
>> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> > > isn't necessarily a violation of any standard. There probably could be
>> > > a release valve for us to still make those devices useful.
>> > >
>> > > The concern I have with not implementing a movement mechanism *at all*
>> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> > > when we're, at least ideologically, moving toward "software defined memory".
>> > >
>> > > Personally I think the movement mechanism is a good idea that gets folks
>> > > where they're going sooner, and it doesn't hurt anything by existing We
>> > > can change the initial placement mechanism too.
>> >
>> > I think providing users a way to "FIX" the memory tiering is a backup
>> > option. Given that DDRs with different access characteristics provide
>> > the relevant CDAT/HMAT information, the kernel should be able to
>> > correctly establish memory tiering on boot.
>>
>> Include hotplug and I'll be happier! I know that's messy though.
>>
>> > Current memory tiering code has
>> > 1) memory_tier_init() to iterate through all boot onlined memory
>> > nodes. All nodes are assumed to be fast tier (adistance
>> > MEMTIER_ADISTANCE_DRAM is used).
>> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> > nodes. This is the place the kernel reads the memory attributes from
>> > HMAT and recognizes the memory nodes into the correct tier (devdax
>> > controlled CXL, pmem, etc).
>> > If we want DDRs with different memory characteristics to be put into
>> > the correct tier (as in the guest VM memory tiering case), we probably
>> > need a third path to iterate the boot onlined memory nodes and also be
>> > able to read their memory attributes. I don't think we can do that in
>> > 1) because the ACPI subsystem is not yet initialized.
>>
>> Can we move it later in general? Or drag HMAT parsing earlier?
>> ACPI table availability is pretty early, it's just that we don't bother
>> with HMAT because nothing early uses it.
>> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>
> I tested the call sequence under a debugger earlier. hmat_init() is
> called after memory_tier_init(). Let me poke around and see what our
> options are.

This sounds reasonable.

Please keep in mind that we need a way to identify the base line memory
type(default_dram_type). A simple method is to use NUMA nodes with CPU
attached. But I remember that Aneesh said that some NUMA nodes without
CPU will need to be put in default_dram_type too on their systems. We
need a way to identify that.

--
Best Regards,
Huang, Ying

2024-01-12 08:14:27

by Hao Xiang

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <[email protected]> wrote:
>
> Hao Xiang <[email protected]> writes:
>
> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> > <[email protected]> wrote:
> >>
> >> On Tue, 9 Jan 2024 16:28:15 -0800
> >> Hao Xiang <[email protected]> wrote:
> >>
> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
> >> > >
> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> >> > > > "Huang, Ying" <[email protected]> wrote:
> >> > > > > Gregory Price <[email protected]> writes:
> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> > > > > It's possible to change the performance of a NUMA node changed, if we
> >> > > > > hot-remove a memory device, then hot-add another different memory
> >> > > > > device. It's hoped that the CDAT changes too.
> >> > > >
> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> >> > > > HMAT values based on firmware notifications... So we 'could' make
> >> > > > it work for HMAT based description.
> >> > > >
> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> >> > > > devices (hiding topology complexity) and you can update CDAT but
> >> > > > IIRC that is only meant to be for degraded situations - so if you
> >> > > > want multiple performance regions, CDAT should describe them form the start.
> >> > > >
> >> > >
> >> > > That was my thought. I don't think it's particularly *realistic* for
> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> >> > > it could be valuable.
> >> > >
> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >> > > > > >
> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> >> > > > > > allow the guest to apply memory tiering.
> >> > > > > >
> >> > > > > > There are multiple issues with this, presently:
> >> > > > > >
> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> >> > > > > > performant enough to be a commodity class virtualization.
> >> > > >
> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> >> > > > it isn't the emulation that is there today because it's not possible to
> >> > > > emulate some of the topology in a peformant manner (interleaving with sub
> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
> >> > > > ways to do better than we are today, but they start to look like
> >> > > > software dissagregated memory setups (think lots of page faults in the host).
> >> > > >
> >> > >
> >> > > Agreed, the emulated device as-is can't be the virtualization device,
> >> > > but it doesn't mean it can't be the basis for it.
> >> > >
> >> > > My thought is, if you want to pass host CXL *memory* through to the
> >> > > guest, you don't actually care to pass CXL *control* through to the
> >> > > guest. That control lies pretty squarely with the host/hypervisor.
> >> > >
> >> > > So, at least in theory, you can just cut the type3 device out of the
> >> > > QEMU configuration entirely and just pass it through as a distinct numa
> >> > > node with specific hmat qualities.
> >> > >
> >> > > Barring that, if we must go through the type3 device, the question is
> >> > > how difficult would it be to just make a stripped down type3 device
> >> > > to provide the informational components, but hack off anything
> >> > > topology/interleave related? Then you just do direct passthrough as you
> >> > > described below.
> >> > >
> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> >> > >
> >> > > The second question is... is that device "compliant" or does it need
> >> > > super special handling from the kernel driver :D? If what i described
> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> >> > > just hide the CXL device entirely from the guest (for this use case)
> >> > > and just pass the memory through as a numa node.
> >> > >
> >> > > Which gets us back to: The memory-tiering component needs a way to
> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> >> > > of those seem like totally valid ways to go about it.
> >> > >
> >> > > > > >
> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> >> > > > > > part of a CXL memory device, the nodes are lumped together in the
> >> > > > > > DRAM tier.
> >> > > > > >
> >> > > > > > None of this has to do with firmware.
> >> > > > > >
> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> >> > > > > > have HMAT information that can be passed through via QEMU:
> >> > > > > >
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >> > > > > >
> >> > > > > > Not only would it be nice if we could change tier membership based on
> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >> > > >
> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> >> > > > before we get anything more complex in place for CXL.
> >> > > >
> >> > >
> >> > > Had not even considered this, but that's completely accurate as well.
> >> > >
> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> >> > > isn't necessarily a violation of any standard. There probably could be
> >> > > a release valve for us to still make those devices useful.
> >> > >
> >> > > The concern I have with not implementing a movement mechanism *at all*
> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
> >> > > when we're, at least ideologically, moving toward "software defined memory".
> >> > >
> >> > > Personally I think the movement mechanism is a good idea that gets folks
> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
> >> > > can change the initial placement mechanism too.
> >> >
> >> > I think providing users a way to "FIX" the memory tiering is a backup
> >> > option. Given that DDRs with different access characteristics provide
> >> > the relevant CDAT/HMAT information, the kernel should be able to
> >> > correctly establish memory tiering on boot.
> >>
> >> Include hotplug and I'll be happier! I know that's messy though.
> >>
> >> > Current memory tiering code has
> >> > 1) memory_tier_init() to iterate through all boot onlined memory
> >> > nodes. All nodes are assumed to be fast tier (adistance
> >> > MEMTIER_ADISTANCE_DRAM is used).
> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> >> > nodes. This is the place the kernel reads the memory attributes from
> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
> >> > controlled CXL, pmem, etc).
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general? Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the base line memory
> type(default_dram_type). A simple method is to use NUMA nodes with CPU
> attached. But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems. We
> need a way to identify that.

Yes, I am doing some prototyping the way you described. In
memory_tier_init(), we will just set the memory tier for the NUMA
nodes with CPU. In hmat_init(), I am trying to call back to mm to
finish the memory tier initialization for the CPUless NUMA nodes. If a
CPUless numa node can't get the effective adistance from
mt_calc_adistance(), we will fallback to add that node to
default_dram_type.
The other thing I want to experiment is to call mt_calc_adistance() on
a memory node with CPU and see what kind of adistance will be
returned.

>
> --
> Best Regards,
> Huang, Ying

2024-01-15 01:26:35

by Huang, Ying

[permalink] [raw]
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

Hao Xiang <[email protected]> writes:

> On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <[email protected]> wrote:
>>
>> Hao Xiang <[email protected]> writes:
>>
>> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
>> > <[email protected]> wrote:
>> >>
>> >> On Tue, 9 Jan 2024 16:28:15 -0800
>> >> Hao Xiang <[email protected]> wrote:
>> >>
>> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <[email protected]> wrote:
>> >> > >
>> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> >> > > > "Huang, Ying" <[email protected]> wrote:
>> >> > > > > Gregory Price <[email protected]> writes:
>> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >> > > > > It's possible to change the performance of a NUMA node changed, if we
>> >> > > > > hot-remove a memory device, then hot-add another different memory
>> >> > > > > device. It's hoped that the CDAT changes too.
>> >> > > >
>> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
>> >> > > > HMAT values based on firmware notifications... So we 'could' make
>> >> > > > it work for HMAT based description.
>> >> > > >
>> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> >> > > > devices (hiding topology complexity) and you can update CDAT but
>> >> > > > IIRC that is only meant to be for degraded situations - so if you
>> >> > > > want multiple performance regions, CDAT should describe them form the start.
>> >> > > >
>> >> > >
>> >> > > That was my thought. I don't think it's particularly *realistic* for
>> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> >> > > it could be valuable.
>> >> > >
>> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> >> > > > > >
>> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> >> > > > > > allow the guest to apply memory tiering.
>> >> > > > > >
>> >> > > > > > There are multiple issues with this, presently:
>> >> > > > > >
>> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> >> > > > > > performant enough to be a commodity class virtualization.
>> >> > > >
>> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> >> > > > it isn't the emulation that is there today because it's not possible to
>> >> > > > emulate some of the topology in a peformant manner (interleaving with sub
>> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
>> >> > > > ways to do better than we are today, but they start to look like
>> >> > > > software dissagregated memory setups (think lots of page faults in the host).
>> >> > > >
>> >> > >
>> >> > > Agreed, the emulated device as-is can't be the virtualization device,
>> >> > > but it doesn't mean it can't be the basis for it.
>> >> > >
>> >> > > My thought is, if you want to pass host CXL *memory* through to the
>> >> > > guest, you don't actually care to pass CXL *control* through to the
>> >> > > guest. That control lies pretty squarely with the host/hypervisor.
>> >> > >
>> >> > > So, at least in theory, you can just cut the type3 device out of the
>> >> > > QEMU configuration entirely and just pass it through as a distinct numa
>> >> > > node with specific hmat qualities.
>> >> > >
>> >> > > Barring that, if we must go through the type3 device, the question is
>> >> > > how difficult would it be to just make a stripped down type3 device
>> >> > > to provide the informational components, but hack off anything
>> >> > > topology/interleave related? Then you just do direct passthrough as you
>> >> > > described below.
>> >> > >
>> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> >> > >
>> >> > > The second question is... is that device "compliant" or does it need
>> >> > > super special handling from the kernel driver :D? If what i described
>> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> >> > > just hide the CXL device entirely from the guest (for this use case)
>> >> > > and just pass the memory through as a numa node.
>> >> > >
>> >> > > Which gets us back to: The memory-tiering component needs a way to
>> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> >> > > of those seem like totally valid ways to go about it.
>> >> > >
>> >> > > > > >
>> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> >> > > > > > part of a CXL memory device, the nodes are lumped together in the
>> >> > > > > > DRAM tier.
>> >> > > > > >
>> >> > > > > > None of this has to do with firmware.
>> >> > > > > >
>> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> >> > > > > > have HMAT information that can be passed through via QEMU:
>> >> > > > > >
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> >> > > > > >
>> >> > > > > > Not only would it be nice if we could change tier membership based on
>> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> >> > > >
>> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
>> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> >> > > > before we get anything more complex in place for CXL.
>> >> > > >
>> >> > >
>> >> > > Had not even considered this, but that's completely accurate as well.
>> >> > >
>> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> >> > > isn't necessarily a violation of any standard. There probably could be
>> >> > > a release valve for us to still make those devices useful.
>> >> > >
>> >> > > The concern I have with not implementing a movement mechanism *at all*
>> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> >> > > when we're, at least ideologically, moving toward "software defined memory".
>> >> > >
>> >> > > Personally I think the movement mechanism is a good idea that gets folks
>> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
>> >> > > can change the initial placement mechanism too.
>> >> >
>> >> > I think providing users a way to "FIX" the memory tiering is a backup
>> >> > option. Given that DDRs with different access characteristics provide
>> >> > the relevant CDAT/HMAT information, the kernel should be able to
>> >> > correctly establish memory tiering on boot.
>> >>
>> >> Include hotplug and I'll be happier! I know that's messy though.
>> >>
>> >> > Current memory tiering code has
>> >> > 1) memory_tier_init() to iterate through all boot onlined memory
>> >> > nodes. All nodes are assumed to be fast tier (adistance
>> >> > MEMTIER_ADISTANCE_DRAM is used).
>> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> >> > nodes. This is the place the kernel reads the memory attributes from
>> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
>> >> > controlled CXL, pmem, etc).
>> >> > If we want DDRs with different memory characteristics to be put into
>> >> > the correct tier (as in the guest VM memory tiering case), we probably
>> >> > need a third path to iterate the boot onlined memory nodes and also be
>> >> > able to read their memory attributes. I don't think we can do that in
>> >> > 1) because the ACPI subsystem is not yet initialized.
>> >>
>> >> Can we move it later in general? Or drag HMAT parsing earlier?
>> >> ACPI table availability is pretty early, it's just that we don't bother
>> >> with HMAT because nothing early uses it.
>> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>> >
>> > I tested the call sequence under a debugger earlier. hmat_init() is
>> > called after memory_tier_init(). Let me poke around and see what our
>> > options are.
>>
>> This sounds reasonable.
>>
>> Please keep in mind that we need a way to identify the base line memory
>> type(default_dram_type). A simple method is to use NUMA nodes with CPU
>> attached. But I remember that Aneesh said that some NUMA nodes without
>> CPU will need to be put in default_dram_type too on their systems. We
>> need a way to identify that.
>
> Yes, I am doing some prototyping the way you described. In
> memory_tier_init(), we will just set the memory tier for the NUMA
> nodes with CPU. In hmat_init(), I am trying to call back to mm to
> finish the memory tier initialization for the CPUless NUMA nodes. If a
> CPUless numa node can't get the effective adistance from
> mt_calc_adistance(), we will fallback to add that node to
> default_dram_type.

Sound reasonable for me.

> The other thing I want to experiment is to call mt_calc_adistance() on
> a memory node with CPU and see what kind of adistance will be
> returned.

Anyway, we need a base line to start. The abstract distance is
calculated based on the ratio of the performance of a node to that of
default DRAM node.

--
Best Regards,
Huang, Ying