2024-04-05 06:11:59

by Honggyu Kim

Subject: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It noted that no implementation of the demote/promote DAMOS actions had
been made. This RFC provides that implementation for the physical
address space.


Changes from RFC v2:
1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
2. Create 'target_nid' to set the migration target node instead of
depending on node distance based information.
3. Instead of having page level access check in this patch series,
delegate the job to a new DAMOS filter type YOUNG[2].
4. Introduce vmstat counters "damon_migrate_{hot,cold}".
5. Rebase from v6.7 to v6.8.

Changes from RFC:
1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c.
2. Simplify some functions of vmscan.c and use them in paddr.c; these
still need more in-depth review.
3. Refactor most functions for common usage for both promote and
demote actions and introduce an enum migration_mode for its control.
4. Add "target_nid" sysfs knob for migration destination node for both
promote and demote actions.
5. Move DAMOS_PROMOTE before DAMOS_DEMOTE, and move them both above
DAMOS_STAT.


Introduction
============

With the advent of CXL/PCIe attached DRAM, hereafter simply called CXL
memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and
bandwidth characteristics. These are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

On this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on their access frequency. Otherwise, some
frequently accessed pages might reside on slow tiers, causing
unexpected performance degradation. Moreover, memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature. The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) are useful features for monitoring and migrating
pages. DAMOS provides multiple actions based on DAMON monitoring
results and can already be used for proactive reclaim, i.e. swapping
cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from
fast tiers. This prevents hot pages from being stuck on slow tiers,
which would degrade performance, and proactively demotes cold pages to
slow tiers so that the system has a better chance of allocating more
hot pages on fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of
this patch series, but a rough idea of it helps explain the evaluation
results.

DAMON provides many knobs for fine tuning, and our configuration file
is generated by HMSDK[3]. It includes a gen_config.py script that
generates a json file with the full config of DAMON knobs, creating
multiple kdamonds, one for each NUMA node, when DAMON is enabled so
that it can run hot/cold based migration for tiered memory.


Evaluation Workload
===================

The performance evaluation is done with redis[4], a widely used
in-memory database, with memory access patterns generated via YCSB[5].
We measured two different workloads, with zipfian and latest
distributions, whose configs are slightly modified to raise memory
usage and lengthen execution time for better evaluation.

The evaluation with these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages
of a single workload. The default memory allocation policy places
pages on the fast tier DRAM node first, then allocates newly created
pages on the slow tier CXL node when the DRAM node has insufficient
free space. Once a page is allocated, it never moves between NUMA
nodes. (This does not hold when NUMA balancing is used, but that is
outside the scope of this DAMON based tiered memory management
support.)

If the working set of redis fits fully into the DRAM node, redis
accesses only the fast DRAM. Since DRAM-only is faster than partially
accessing CXL memory in slow tiers, that environment is not useful for
evaluating this patch series.

To distribute the pages of redis across the fast DRAM node and the
slow CXL node for evaluating our migrate_{hot,cold} actions, we
pre-allocate some cold memory externally using mmap and memset before
launching redis-server. We assume there is a sufficient amount of cold
memory in datacenters, as the TMO[6] and TPP[7] papers mention.
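
For reference, the following is a minimal sketch of such a cold-memory
pre-allocator (this program is not part of the series; the name and
the GiB argument handling are illustrative):

  /* cold_alloc.c: fault a large anonymous mapping in once, then keep
   * it mapped but untouched so that it stays cold. */
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
          size_t sz = (argc > 1 ? (size_t)atoll(argv[1]) : 440) << 30;
          void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          memset(p, 1, sz);       /* populate every page once */
          pause();                /* sleep; never touch the memory again */
          return 0;
  }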

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_MIGRATE_COLD action for the DRAM node
   and the DAMOS_MIGRATE_HOT action for the CXL node. It demotes cold
   pages on the DRAM node and promotes hot pages on the CXL node at a
   regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset on
   the fast tier DRAM node, then make the process sleep so that the
   fast tier has insufficient space for redis-server.
3. Launch redis-server and load a prebaked snapshot image, dump.rdb.
   The redis-server consumes 52GB of anon pages and 33GB of file
   pages, but due to the cold memory allocated at step 2, it cannot
   allocate the entire memory of redis-server on the fast tier DRAM
   node, so it allocates the remainder on the slow tier CXL node. The
   DRAM:CXL ratio depends on the size of the pre-allocated cold
   memory.
4. Run YCSB to generate a zipfian or latest distribution of memory
   accesses to redis-server, then measure its execution time when it
   completes.
5. Repeat step 4 over 50 times to measure the average execution time
   of each run.
6. Increase the cold memory size, then go back to step 2.

Each test at step 4 took about a minute, so repeating it 50 times took
about an hour for each specific cold memory size, which ranges from
440GB to 500GB in 10GB increments. In total, it took more than 10
hours to get the entire evaluation results for both the zipfian and
latest workloads. Repeating the same test set multiple times doesn't
show much variance, so I think this is enough to make the results
reliable.


Evaluation Results
==================

All the result values are normalized to the DRAM-only execution time
because the workload cannot be faster than DRAM-only unless it hits
the peak bandwidth, and our redis test doesn't go beyond the bandwidth
limit.

So the DRAM-only execution time is the ideal result, unaffected by the
performance gap between DRAM and CXL. The NUMA node environment is as
follows.

node0 - local DRAM, 512GB with a CPU socket (fast tier)
node1 - disabled
node2 - CXL DRAM, 96GB, no CPU attached (slow tier)

The following is the result of generating a zipfian distribution to
redis-server; the numbers are averaged over 50 executions.

1. YCSB zipfian distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
=============+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
=============+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only | 1.00 - - - - - - - | 1.00
CXL-only | 1.22 - - - - - - - | 1.22
default | - 1.12 1.13 1.14 1.16 1.19 1.21 1.21 | 1.17
DAMON tiered | - 1.04 1.03 1.04 1.06 1.05 1.05 1.05 | 1.05
=============+================================================+=========
CXL usage of redis-server in GB | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only | 0.0 - - - - - - - | 0.0
CXL-only | 52.6 - - - - - - - | 52.6
default | - 20.4 27.0 33.1 39.5 45.6 50.5 50.3 | 38.1
DAMON tiered | - 0.1 0.3 0.8 0.6 0.7 1.3 0.9 | 0.7
=============+================================================+=========

Each test result is based on the execution environment as follows.

DRAM-only : redis-server uses only local DRAM memory.
CXL-only : redis-server uses only CXL memory.
default : default memory policy (MPOL_DEFAULT).
numa balancing disabled.
DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
DAMOS_MIGRATE_HOT for CXL nodes.

The above result shows that the "default" execution time goes up as
the size of cold memory increases from 440G to 500G: the more cold
memory is used, the more CXL memory is used for the target redis
workload, which increases the execution time.

However, "DAMON tiered" result shows less slowdown because the
DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
cold memory to CXL node and this free space at DRAM increases more
chance to allocate hot or warm pages of redis-server to fast DRAM node.
Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
of redis-server to DRAM node actively.

As a result, more memory of redis-server stays on the DRAM node
compared to the "default" memory policy, and this yields the
performance improvement.

The following result of latest distribution workload shows similar data.

2. YCSB latest distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
=============+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
=============+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only | 1.00 - - - - - - - | 1.00
CXL-only | 1.18 - - - - - - - | 1.18
default | - 1.18 1.19 1.18 1.18 1.17 1.19 1.18 | 1.18
DAMON tiered | - 1.04 1.04 1.04 1.05 1.04 1.05 1.05 | 1.04
=============+================================================+=========
CXL usage of redis-server in GB | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only | 0.0 - - - - - - - | 0.0
CXL-only | 52.6 - - - - - - - | 52.6
default | - 20.5 27.1 33.2 39.5 45.5 50.4 50.5 | 38.1
DAMON tiered | - 0.2 0.4 0.7 1.6 1.2 1.1 3.4 | 1.2
=============+================================================+=========

In summary, both results show that "DAMON tiered" memory management
reduces the performance slowdown compared to the "default" memory
policy from 17~18% to 4~5% when the system runs under high memory
pressure on its fast tier DRAM nodes.

Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
tiered memory systems run more efficiently under high memory pressures.

Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>

[1] https://lore.kernel.org/damon/[email protected]
[2] https://lore.kernel.org/damon/[email protected]
[3] https://github.com/skhynix/hmsdk
[4] https://github.com/redis/redis/tree/7.0.0
[5] https://github.com/brianfrankcooper/YCSB/tree/0.17.0
[6] https://dl.acm.org/doi/10.1145/3503222.3507731
[7] https://dl.acm.org/doi/10.1145/3582016.3582063


Honggyu Kim (5):
mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode
mm: make alloc_demote_folio externally invokable for migration
mm/migrate: add MR_DAMON to migrate_reason
mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
mm/damon: Add "damon_migrate_{hot,cold}" vmstat

Hyeongtak Ji (2):
mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

include/linux/damon.h | 15 ++-
include/linux/migrate_mode.h | 1 +
include/linux/mmzone.h | 4 +
include/trace/events/migrate.h | 3 +-
mm/damon/core.c | 5 +-
mm/damon/dbgfs.c | 2 +-
mm/damon/lru_sort.c | 3 +-
mm/damon/paddr.c | 191 +++++++++++++++++++++++++++++++--
mm/damon/reclaim.c | 3 +-
mm/damon/sysfs-schemes.c | 39 ++++++-
mm/internal.h | 1 +
mm/vmscan.c | 10 +-
mm/vmstat.c | 4 +
13 files changed, 265 insertions(+), 16 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
--
2.34.1



2024-04-05 06:12:48

by Honggyu Kim

Subject: [RFC PATCH v3 3/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes

From: Hyeongtak Ji <[email protected]>

This patch adds target_nid under
/sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow-up patches.
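
For in-kernel users, here is a hedged sketch of how a scheme could be
created with the extended signature shown in the diff below (the
pattern/quota/wmarks setup is omitted, the DAMOS_MIGRATE_COLD action
comes from a later patch in this series, and the node id is
illustrative):

  struct damos *s;

  /* pattern, quota and wmarks are assumed to be set up elsewhere */
  s = damon_new_scheme(&pattern, DAMOS_MIGRATE_COLD,
                       0 /* apply_interval_us */, &quota, &wmarks,
                       2 /* target_nid, e.g. a CXL node */);
  if (!s)
          return -ENOMEM;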

Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
---
include/linux/damon.h | 11 ++++++++++-
mm/damon/core.c | 5 ++++-
mm/damon/dbgfs.c | 2 +-
mm/damon/lru_sort.c | 3 ++-
mm/damon/reclaim.c | 3 ++-
mm/damon/sysfs-schemes.c | 33 ++++++++++++++++++++++++++++++++-
6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5881e4ac30be..24ea33a03d5d 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -337,6 +337,7 @@ struct damos_access_pattern {
* @apply_interval_us: The time between applying the @action.
* @quota: Control the aggressiveness of this scheme.
* @wmarks: Watermarks for automated (in)activation of this scheme.
+ * @target_nid: Destination node if @action is "migrate_{hot,cold}".
* @filters: Additional set of &struct damos_filter for &action.
* @stat: Statistics of this scheme.
* @list: List head for siblings.
@@ -352,6 +353,10 @@ struct damos_access_pattern {
* monitoring context are inactive, DAMON stops monitoring either, and just
* repeatedly checks the watermarks.
*
+ * @target_nid is used to set the migration target node for migrate_hot or
+ * migrate_cold actions, which means it's only meaningful when @action is either
+ * "migrate_hot" or "migrate_cold".
+ *
* Before applying the &action to a memory region, &struct damon_operations
* implementation could check pages of the region and skip &action to respect
* &filters
@@ -373,6 +378,9 @@ struct damos {
/* public: */
struct damos_quota quota;
struct damos_watermarks wmarks;
+ union {
+ int target_nid;
+ };
struct list_head filters;
struct damos_stat stat;
struct list_head list;
@@ -677,7 +685,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
- struct damos_watermarks *wmarks);
+ struct damos_watermarks *wmarks,
+ int target_nid);
void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
void damon_destroy_scheme(struct damos *s);

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 5b325749fc12..7ff0259d9fa6 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -316,7 +316,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
- struct damos_watermarks *wmarks)
+ struct damos_watermarks *wmarks,
+ int target_nid)
{
struct damos *scheme;

@@ -341,6 +342,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
scheme->wmarks = *wmarks;
scheme->wmarks.activated = true;

+ scheme->target_nid = target_nid;
+
return scheme;
}

diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 7dac24e69e3b..d04fdccfa65b 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -279,7 +279,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,

pos += parsed;
scheme = damon_new_scheme(&pattern, action, 0, &quota,
- &wmarks);
+ &wmarks, NUMA_NO_NODE);
if (!scheme)
goto fail;

diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 3de2916a65c3..3775f0f2743d 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
/* under the quota. */
&quota,
/* (De)activate this according to the watermarks. */
- &damon_lru_sort_wmarks);
+ &damon_lru_sort_wmarks,
+ NUMA_NO_NODE);
}

/* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 66e190f0374a..84e6e96b5dcc 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -147,7 +147,8 @@ static struct damos *damon_reclaim_new_scheme(void)
/* under the quota. */
&damon_reclaim_quota,
/* (De)activate this according to the watermarks. */
- &damon_reclaim_wmarks);
+ &damon_reclaim_wmarks,
+ NUMA_NO_NODE);
}

static void damon_reclaim_copy_quota_status(struct damos_quota *dst,
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index ae0f0b314f3a..1a30ea82c890 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <linux/numa.h>

#include "sysfs-common.h"

@@ -1393,6 +1394,7 @@ struct damon_sysfs_scheme {
struct damon_sysfs_scheme_filters *filters;
struct damon_sysfs_stats *stats;
struct damon_sysfs_scheme_regions *tried_regions;
+ int target_nid;
};

/* This should match with enum damos_action */
@@ -1418,6 +1420,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
scheme->kobj = (struct kobject){};
scheme->action = action;
scheme->apply_interval_us = apply_interval_us;
+ scheme->target_nid = NUMA_NO_NODE;
return scheme;
}

@@ -1640,6 +1643,28 @@ static ssize_t apply_interval_us_store(struct kobject *kobj,
return err ? err : count;
}

+static ssize_t target_nid_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct damon_sysfs_scheme *scheme = container_of(kobj,
+ struct damon_sysfs_scheme, kobj);
+
+ return sysfs_emit(buf, "%d\n", scheme->target_nid);
+}
+
+static ssize_t target_nid_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ struct damon_sysfs_scheme *scheme = container_of(kobj,
+ struct damon_sysfs_scheme, kobj);
+ int err = 0;
+
+ /* TODO: error handling for target_nid range. */
+ err = kstrtoint(buf, 0, &scheme->target_nid);
+
+ return err ? err : count;
+}
+
static void damon_sysfs_scheme_release(struct kobject *kobj)
{
kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
@@ -1651,9 +1676,13 @@ static struct kobj_attribute damon_sysfs_scheme_action_attr =
static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr =
__ATTR_RW_MODE(apply_interval_us, 0600);

+static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
+ __ATTR_RW_MODE(target_nid, 0600);
+
static struct attribute *damon_sysfs_scheme_attrs[] = {
&damon_sysfs_scheme_action_attr.attr,
&damon_sysfs_scheme_apply_interval_us_attr.attr,
+ &damon_sysfs_scheme_target_nid_attr.attr,
NULL,
};
ATTRIBUTE_GROUPS(damon_sysfs_scheme);
@@ -1956,7 +1985,8 @@ static struct damos *damon_sysfs_mk_scheme(
damos_sysfs_set_quota_score(sysfs_quotas->goals, &quota);

scheme = damon_new_scheme(&pattern, sysfs_scheme->action,
- sysfs_scheme->apply_interval_us, &quota, &wmarks);
+ sysfs_scheme->apply_interval_us, &quota, &wmarks,
+ sysfs_scheme->target_nid);
if (!scheme)
return NULL;

@@ -1987,6 +2017,7 @@ static void damon_sysfs_update_scheme(struct damos *scheme,

scheme->action = sysfs_scheme->action;
scheme->apply_interval_us = sysfs_scheme->apply_interval_us;
+ scheme->target_nid = sysfs_scheme->target_nid;

scheme->quota.ms = sysfs_quotas->ms;
scheme->quota.sz = sysfs_quotas->sz;
--
2.34.1


2024-04-05 06:13:01

by Honggyu Kim

Subject: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

This patch introduces the DAMOS_MIGRATE_COLD action, which is similar
to DAMOS_PAGEOUT, but migrates folios to the 'target_nid' given via
sysfs instead of swapping them out.

The 'target_nid' sysfs knob is created by this patch to inform the
migration target node ID.

Here is an example usage of this 'migrate_cold' action.

$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_cold
$ echo 2 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 0 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold

Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101
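
The 'hot_cold' test program used above is not included in this series.
A rough sketch of what it could look like, assuming the first argument
is the hot size and the second is the cold size in MiB:

  /* hot_cold.c: map a "hot" buffer that is touched in a loop and a
   * "cold" buffer that is touched once and then left alone. */
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  static char *alloc_touched(size_t sz)
  {
          char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  exit(1);
          memset(p, 1, sz);       /* fault every page in */
          return p;
  }

  int main(int argc, char *argv[])
  {
          /* atoll() stops at the 'M' suffix, so "500M" parses as 500 */
          size_t hot_sz = (argc > 1 ? (size_t)atoll(argv[1]) : 500) << 20;
          size_t cold_sz = (argc > 2 ? (size_t)atoll(argv[2]) : 600) << 20;
          char *hot = alloc_touched(hot_sz);
          size_t i;

          alloc_touched(cold_sz); /* cold: never accessed again */
          for (;;)                /* hot: keep re-touching every page */
                  for (i = 0; i < hot_sz; i += 4096)
                          hot[i]++;
          return 0;
  }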

Since there are some common routines with pageout, many functions have
similar logic between pageout and migrate cold.

damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list(), but it's minified only for demotion.

Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
---
include/linux/damon.h | 2 +
mm/damon/paddr.c | 146 ++++++++++++++++++++++++++++++++++++++-
mm/damon/sysfs-schemes.c | 4 ++
3 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 24ea33a03d5d..df8671e69a70 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
* @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
* @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
* @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
* @DAMOS_STAT: Do nothing but count the stat.
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions
*
@@ -122,6 +123,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+ DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
};
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 277a1c4d833c..fe217a26f788 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
#include <linux/pagemap.h>
#include <linux/rmap.h>
#include <linux/swap.h>
+#include <linux/memory-tiers.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>

#include "../internal.h"
#include "ops-common.h"
@@ -226,8 +229,137 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)

enum migration_mode {
MIG_PAGEOUT,
+ MIG_MIGRATE_COLD,
};

+static unsigned int migrate_folio_list(struct list_head *migrate_folios,
+ struct pglist_data *pgdat,
+ int target_nid)
+{
+ unsigned int nr_succeeded;
+ nodemask_t allowed_mask = NODE_MASK_NONE;
+
+ struct migration_target_control mtc = {
+ /*
+ * Allocate from 'node', or fail quickly and quietly.
+ * When this happens, 'page' will likely just be discarded
+ * instead of migrated.
+ */
+ .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+ __GFP_NOMEMALLOC | GFP_NOWAIT,
+ .nid = target_nid,
+ .nmask = &allowed_mask
+ };
+
+ if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+ return 0;
+
+ if (list_empty(migrate_folios))
+ return 0;
+
+ /* Migration ignores all cpuset and mempolicy settings */
+ migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+ (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
+ &nr_succeeded);
+
+ return nr_succeeded;
+}
+
+static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
+ struct pglist_data *pgdat,
+ enum migration_mode mm,
+ int target_nid)
+{
+ unsigned int nr_migrated = 0;
+ struct folio *folio;
+ LIST_HEAD(ret_folios);
+ LIST_HEAD(migrate_folios);
+
+ cond_resched();
+
+ while (!list_empty(folio_list)) {
+ struct folio *folio;
+
+ cond_resched();
+
+ folio = lru_to_folio(folio_list);
+ list_del(&folio->lru);
+
+ if (!folio_trylock(folio))
+ goto keep;
+
+ VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+
+ /* Relocate its contents to another node. */
+ list_add(&folio->lru, &migrate_folios);
+ folio_unlock(folio);
+ continue;
+keep:
+ list_add(&folio->lru, &ret_folios);
+ VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+ }
+ /* 'folio_list' is always empty here */
+
+ /* Migrate folios selected for migration */
+ nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
+ /* Folios that could not be migrated are still in @migrate_folios */
+ if (!list_empty(&migrate_folios)) {
+ /* Folios which weren't migrated go back on @folio_list */
+ list_splice_init(&migrate_folios, folio_list);
+ }
+
+ try_to_unmap_flush();
+
+ list_splice(&ret_folios, folio_list);
+
+ while (!list_empty(folio_list)) {
+ folio = lru_to_folio(folio_list);
+ list_del(&folio->lru);
+ folio_putback_lru(folio);
+ }
+
+ return nr_migrated;
+}
+
+static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
+ enum migration_mode mm,
+ int target_nid)
+{
+ int nid;
+ unsigned int nr_migrated = 0;
+ LIST_HEAD(node_folio_list);
+ unsigned int noreclaim_flag;
+
+ if (list_empty(folio_list))
+ return nr_migrated;
+
+ noreclaim_flag = memalloc_noreclaim_save();
+
+ nid = folio_nid(lru_to_folio(folio_list));
+ do {
+ struct folio *folio = lru_to_folio(folio_list);
+
+ if (nid == folio_nid(folio)) {
+ folio_clear_active(folio);
+ list_move(&folio->lru, &node_folio_list);
+ continue;
+ }
+
+ nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
+ NODE_DATA(nid), mm,
+ target_nid);
+ nid = folio_nid(lru_to_folio(folio_list));
+ } while (!list_empty(folio_list));
+
+ nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
+ NODE_DATA(nid), mm,
+ target_nid);
+
+ memalloc_noreclaim_restore(noreclaim_flag);
+
+ return nr_migrated;
+}
+
static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
enum migration_mode mm)
{
@@ -247,7 +379,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
folio_test_clear_young(folio);
if (!folio_isolate_lru(folio))
goto put_folio;
- if (folio_test_unevictable(folio))
+ /*
+ * Since unevictable folios can be demoted or promoted,
+ * unevictable test is needed only for pageout.
+ */
+ if (mm == MIG_PAGEOUT && folio_test_unevictable(folio))
folio_putback_lru(folio);
else
list_add(&folio->lru, &folio_list);
@@ -258,6 +394,10 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
case MIG_PAGEOUT:
applied = reclaim_pages(&folio_list);
break;
+ case MIG_MIGRATE_COLD:
+ applied = damon_pa_migrate_pages(&folio_list, mm,
+ s->target_nid);
+ break;
default:
/* Unexpected migration mode. */
return 0;
@@ -314,6 +454,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_pa_deactivate_pages(r, scheme);
+ case DAMOS_MIGRATE_COLD:
+ return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
case DAMOS_STAT:
break;
default:
@@ -334,6 +476,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
return damon_hot_score(context, r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_cold_score(context, r, scheme);
+ case DAMOS_MIGRATE_COLD:
+ return damon_cold_score(context, r, scheme);
default:
break;
}
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 1a30ea82c890..18b7d054c748 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
"nohugepage",
"lru_prio",
"lru_deprio",
+ "migrate_cold",
"stat",
};

@@ -1659,6 +1660,9 @@ static ssize_t target_nid_store(struct kobject *kobj,
struct damon_sysfs_scheme, kobj);
int err = 0;

+ if (scheme->action != DAMOS_MIGRATE_COLD)
+ return -EINVAL;
+
/* TODO: error handling for target_nid range. */
err = kstrtoint(buf, 0, &scheme->target_nid);

--
2.34.1


2024-04-05 06:13:45

by Honggyu Kim

Subject: [RFC PATCH v3 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

From: Hyeongtak Ji <[email protected]>

This patch introduces the DAMOS_MIGRATE_HOT action, which is similar
to DAMOS_MIGRATE_COLD, but targets the migration of hot pages.

It migrates pages within the given region to the 'target_nid' NUMA
node given via sysfs.

Here is an example usage of this 'migrate_hot' action.

$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_hot
$ echo 0 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 2 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold

Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101

Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
---
include/linux/damon.h | 2 ++
mm/damon/paddr.c | 12 ++++++++++--
mm/damon/sysfs-schemes.c | 4 +++-
3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index df8671e69a70..934c95a7c042 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
* @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
* @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
* @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_HOT: Migrate for the given hot region.
* @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
* @DAMOS_STAT: Do nothing but count the stat.
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+ DAMOS_MIGRATE_HOT,
DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index fe217a26f788..fd9d35b5cc83 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -229,6 +229,7 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)

enum migration_mode {
MIG_PAGEOUT,
+ MIG_MIGRATE_HOT,
MIG_MIGRATE_COLD,
};

@@ -375,8 +376,10 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
if (damos_pa_filter_out(s, folio))
goto put_folio;

- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
+ if (mm != MIG_MIGRATE_HOT) {
+ folio_clear_referenced(folio);
+ folio_test_clear_young(folio);
+ }
if (!folio_isolate_lru(folio))
goto put_folio;
/*
@@ -394,6 +397,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
case MIG_PAGEOUT:
applied = reclaim_pages(&folio_list);
break;
+ case MIG_MIGRATE_HOT:
case MIG_MIGRATE_COLD:
applied = damon_pa_migrate_pages(&folio_list, mm,
s->target_nid);
@@ -454,6 +458,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_pa_deactivate_pages(r, scheme);
+ case DAMOS_MIGRATE_HOT:
+ return damon_pa_migrate(r, scheme, MIG_MIGRATE_HOT);
case DAMOS_MIGRATE_COLD:
return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
case DAMOS_STAT:
@@ -476,6 +482,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
return damon_hot_score(context, r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_cold_score(context, r, scheme);
+ case DAMOS_MIGRATE_HOT:
+ return damon_hot_score(context, r, scheme);
case DAMOS_MIGRATE_COLD:
return damon_cold_score(context, r, scheme);
default:
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 18b7d054c748..1d2f62aa79ca 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
"nohugepage",
"lru_prio",
"lru_deprio",
+ "migrate_hot",
"migrate_cold",
"stat",
};
@@ -1660,7 +1661,8 @@ static ssize_t target_nid_store(struct kobject *kobj,
struct damon_sysfs_scheme, kobj);
int err = 0;

- if (scheme->action != DAMOS_MIGRATE_COLD)
+ if (scheme->action != DAMOS_MIGRATE_HOT &&
+ scheme->action != DAMOS_MIGRATE_COLD)
return -EINVAL;

/* TODO: error handling for target_nid range. */
--
2.34.1


2024-04-05 06:13:47

by Honggyu Kim

Subject: [RFC PATCH v3 7/7] mm/damon: Add "damon_migrate_{hot,cold}" vmstat

This patch adds "damon_migrate_{hot,cold}" under node specific vmstat
counters at the following location.

/sys/devices/system/node/node*/vmstat

The counted values are accumulcated to the global vmstat so it also
introduces the same counter at /proc/vmstat as well.

Signed-off-by: Honggyu Kim <[email protected]>
---
include/linux/mmzone.h | 4 ++++
mm/damon/paddr.c | 17 ++++++++++++++++-
mm/vmstat.c | 4 ++++
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..0005372c5503 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -214,6 +214,10 @@ enum node_stat_item {
PGDEMOTE_KSWAPD,
PGDEMOTE_DIRECT,
PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_DAMON_PADDR
+ DAMON_MIGRATE_HOT,
+ DAMON_MIGRATE_COLD,
+#endif
NR_VM_NODE_STAT_ITEMS
};

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index fd9d35b5cc83..d559c242d151 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -235,10 +235,23 @@ enum migration_mode {

static unsigned int migrate_folio_list(struct list_head *migrate_folios,
struct pglist_data *pgdat,
+ enum migration_mode mm,
int target_nid)
{
unsigned int nr_succeeded;
nodemask_t allowed_mask = NODE_MASK_NONE;
+ enum node_stat_item node_stat;
+
+ switch (mm) {
+ case MIG_MIGRATE_HOT:
+ node_stat = DAMON_MIGRATE_HOT;
+ break;
+ case MIG_MIGRATE_COLD:
+ node_stat = DAMON_MIGRATE_COLD;
+ break;
+ default:
+ return 0;
+ }

struct migration_target_control mtc = {
/*
@@ -263,6 +276,8 @@ static unsigned int migrate_folio_list(struct list_head *migrate_folios,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
&nr_succeeded);

+ mod_node_page_state(pgdat, node_stat, nr_succeeded);
+
return nr_succeeded;
}

@@ -302,7 +317,7 @@ static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */

/* Migrate folios selected for migration */
- nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
+ nr_migrated += migrate_folio_list(&migrate_folios, pgdat, mm, target_nid);
/* Folios that could not be migrated are still in @migrate_folios */
if (!list_empty(&migrate_folios)) {
/* Folios which weren't migrated go back on @folio_list */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..be9ba89fede1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1252,6 +1252,10 @@ const char * const vmstat_text[] = {
"pgdemote_kswapd",
"pgdemote_direct",
"pgdemote_khugepaged",
+#ifdef CONFIG_DAMON_PADDR
+ "damon_migrate_hot",
+ "damon_migrate_cold",
+#endif

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.34.1


2024-04-05 06:15:52

by Honggyu Kim

Subject: [RFC PATCH v3 2/7] mm: make alloc_demote_folio externally invokable for migration

alloc_demote_folio can be used outside of vmscan.c, so it'd be better
to remove the static keyword from it.

This function can also be used for both demotion and promotion, so
it'd be better to rename it from alloc_demote_folio to
alloc_migrate_folio.

Signed-off-by: Honggyu Kim <[email protected]>
---
mm/internal.h | 1 +
mm/vmscan.c | 10 +++++++---
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..c96ff9bc82d0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -866,6 +866,7 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long);

extern void set_pageblock_order(void);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
unsigned long reclaim_pages(struct list_head *folio_list);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..9e456cac03b4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -910,8 +910,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
}

-static struct folio *alloc_demote_folio(struct folio *src,
- unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
{
struct folio *dst;
nodemask_t *allowed_mask;
@@ -935,6 +934,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
if (dst)
return dst;

+ /*
+ * Allocation failed from the target node so try to allocate from
+ * fallback nodes based on allowed_mask.
+ * See fallback_alloc() at mm/slab.c.
+ */
mtc->gfp_mask &= ~__GFP_THISNODE;
mtc->nmask = allowed_mask;

@@ -973,7 +977,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);

/* Demotion ignores all cpuset and mempolicy settings */
- migrate_pages(demote_folios, alloc_demote_folio, NULL,
+ migrate_pages(demote_folios, alloc_migrate_folio, NULL,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
&nr_succeeded);

--
2.34.1


2024-04-05 06:16:11

by Honggyu Kim

Subject: [RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

This is a preparation patch that introduces migration modes.

damon_pa_pageout is renamed to damon_pa_migrate, and it receives an
extra argument for the migration_mode.

No functional changes applied.

Signed-off-by: Honggyu Kim <[email protected]>
---
mm/damon/paddr.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 081e2a325778..277a1c4d833c 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
return false;
}

-static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
+enum migration_mode {
+ MIG_PAGEOUT,
+};
+
+static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
+ enum migration_mode mm)
{
unsigned long addr, applied;
LIST_HEAD(folio_list);
@@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
put_folio:
folio_put(folio);
}
- applied = reclaim_pages(&folio_list);
+ switch (mm) {
+ case MIG_PAGEOUT:
+ applied = reclaim_pages(&folio_list);
+ break;
+ default:
+ /* Unexpected migration mode. */
+ return 0;
+ }
cond_resched();
return applied * PAGE_SIZE;
}
@@ -297,7 +309,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
{
switch (scheme->action) {
case DAMOS_PAGEOUT:
- return damon_pa_pageout(r, scheme);
+ return damon_pa_migrate(r, scheme, MIG_PAGEOUT);
case DAMOS_LRU_PRIO:
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
--
2.34.1


2024-04-05 06:17:30

by Honggyu Kim

Subject: [RFC PATCH v3 4/7] mm/migrate: add MR_DAMON to migrate_reason

The current patch series introduces DAMON based migration across NUMA
nodes so it'd be better to have a new migrate_reason in trace events.

Signed-off-by: Honggyu Kim <[email protected]>
---
include/linux/migrate_mode.h | 1 +
include/trace/events/migrate.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..cec36b7e7ced 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+ MR_DAMON,
MR_TYPES
};

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..cd01dd7b3640 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
EM( MR_CONTIG_RANGE, "contig_range") \
EM( MR_LONGTERM_PIN, "longterm_pin") \
- EMe(MR_DEMOTION, "demotion")
+ EM( MR_DEMOTION, "demotion") \
+ EMe(MR_DAMON, "damon")

/*
* First define the enums in the above macros to be exported to userspace
--
2.34.1


2024-04-05 07:56:52

by Hyeongtak Ji

Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:

..snip...

> +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> + enum migration_mode mm,
> + int target_nid)
> +{
> + int nid;
> + unsigned int nr_migrated = 0;
> + LIST_HEAD(node_folio_list);
> + unsigned int noreclaim_flag;
> +
> + if (list_empty(folio_list))
> + return nr_migrated;

How about checking if `target_nid` is `NUMA_NO_NODE` or not earlier,

> +
> + noreclaim_flag = memalloc_noreclaim_save();
> +
> + nid = folio_nid(lru_to_folio(folio_list));
> + do {
> + struct folio *folio = lru_to_folio(folio_list);
> +
> + if (nid == folio_nid(folio)) {
> + folio_clear_active(folio);
> + list_move(&folio->lru, &node_folio_list);
> + continue;
> + }
> +
> + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> + NODE_DATA(nid), mm,
> + target_nid);
> + nid = folio_nid(lru_to_folio(folio_list));
> + } while (!list_empty(folio_list));
> +
> + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> + NODE_DATA(nid), mm,
> + target_nid);
> +
> + memalloc_noreclaim_restore(noreclaim_flag);
> +
> + return nr_migrated;
> +}
> +

..snip...

> +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> + struct pglist_data *pgdat,
> + int target_nid)
> +{
> + unsigned int nr_succeeded;
> + nodemask_t allowed_mask = NODE_MASK_NONE;
> +
> + struct migration_target_control mtc = {
> + /*
> + * Allocate from 'node', or fail quickly and quietly.
> + * When this happens, 'page' will likely just be discarded
> + * instead of migrated.
> + */
> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> + __GFP_NOMEMALLOC | GFP_NOWAIT,
> + .nid = target_nid,
> + .nmask = &allowed_mask
> + };
> +
> + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> + return 0;

instead of here.

> +
> + if (list_empty(migrate_folios))
> + return 0;
> +
> + /* Migration ignores all cpuset and mempolicy settings */
> + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> + &nr_succeeded);
> +
> + return nr_succeeded;
> +}
> +

..snip...

Kind regards,
Hyeongtak

2024-04-05 16:56:37

by Gregory Price

Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> 1. YCSB zipfian distribution read only workload
> memory pressure with cold memory on node0 with 512GB of local DRAM.
> =============+================================================+=========
> | cold memory occupied by mmap and memset |
> | 0G 440G 450G 460G 470G 480G 490G 500G |
> =============+================================================+=========
> Execution time normalized to DRAM-only values | GEOMEAN
> -------------+------------------------------------------------+---------
> DRAM-only | 1.00 - - - - - - - | 1.00
> CXL-only | 1.22 - - - - - - - | 1.22
> default | - 1.12 1.13 1.14 1.16 1.19 1.21 1.21 | 1.17
> DAMON tiered | - 1.04 1.03 1.04 1.06 1.05 1.05 1.05 | 1.05
> =============+================================================+=========
> CXL usage of redis-server in GB | AVERAGE
> -------------+------------------------------------------------+---------
> DRAM-only | 0.0 - - - - - - - | 0.0
> CXL-only | 52.6 - - - - - - - | 52.6
> default | - 20.4 27.0 33.1 39.5 45.6 50.5 50.3 | 38.1
> DAMON tiered | - 0.1 0.3 0.8 0.6 0.7 1.3 0.9 | 0.7
> =============+================================================+=========
>
> Each test result is based on the exeuction environment as follows.
>
> DRAM-only : redis-server uses only local DRAM memory.
> CXL-only : redis-server uses only CXL memory.
> default : default memory policy(MPOL_DEFAULT).
> numa balancing disabled.
> DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> DAMOS_MIGRATE_HOT for CXL nodes.
>
> The above result shows the "default" execution time goes up as the size
> of cold memory is increased from 440G to 500G because the more cold
> memory used, the more CXL memory is used for the target redis workload
> and this makes the execution time increase.
>
> However, "DAMON tiered" result shows less slowdown because the
> DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> cold memory to CXL node and this free space at DRAM increases more
> chance to allocate hot or warm pages of redis-server to fast DRAM node.
> Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> of redis-server to DRAM node actively.
>
> As a result, it makes more memory of redis-server stay in DRAM node
> compared to "default" memory policy and this makes the performance
> improvement.
>
> The following result of latest distribution workload shows similar data.
>
> 2. YCSB latest distribution read only workload
> memory pressure with cold memory on node0 with 512GB of local DRAM.
> =============+================================================+=========
> | cold memory occupied by mmap and memset |
> | 0G 440G 450G 460G 470G 480G 490G 500G |
> =============+================================================+=========
> Execution time normalized to DRAM-only values | GEOMEAN
> -------------+------------------------------------------------+---------
> DRAM-only | 1.00 - - - - - - - | 1.00
> CXL-only | 1.18 - - - - - - - | 1.18
> default | - 1.18 1.19 1.18 1.18 1.17 1.19 1.18 | 1.18
> DAMON tiered | - 1.04 1.04 1.04 1.05 1.04 1.05 1.05 | 1.04
> =============+================================================+=========
> CXL usage of redis-server in GB | AVERAGE
> -------------+------------------------------------------------+---------
> DRAM-only | 0.0 - - - - - - - | 0.0
> CXL-only | 52.6 - - - - - - - | 52.6
> default | - 20.5 27.1 33.2 39.5 45.5 50.4 50.5 | 38.1
> DAMON tiered | - 0.2 0.4 0.7 1.6 1.2 1.1 3.4 | 1.2
> =============+================================================+=========
>
> In summary of both results, our evaluation shows that "DAMON tiered"
> memory management reduces the performance slowdown compared to the
> "default" memory policy from 17~18% to 4~5% when the system runs with
> high memory pressure on its fast tier DRAM nodes.
>
> Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> tiered memory systems run more efficiently under high memory pressures.
>

Hi,

It's hard to determine from your results whether the performance
mitigation is being caused primarily by MIGRATE_COLD freeing up space
for new allocations, or from some combination of HOT/COLD actions
occurring during execution but after the database has already been
warmed up.

Do you have test results which enable only DAMOS_MIGRATE_COLD actions
but not DAMOS_MIGRATE_HOT actions? (and vice versa)

The question I have is exactly how often is MIGRATE_HOT actually being
utilized, and how much data is being moved. Testing MIGRATE_COLD only
would at least give a rough approximation of that.


Additionally, do you have any data on workloads that exceed the capacity
of the DRAM tier? Here you say you have 512GB of local DRAM, but only
test a workload that caps out at 500G. Have you run a test of, say,
550GB to see the effect of DAMON HOT/COLD migration actions when DRAM
capacity is exceeded?

Can you also provide the DRAM-only results for each test? Presumably,
as workload size increases from 440G to 500G, the system probably starts
using some amount of swap/zswap/whatever. It would be good to know how
this system compares to swap small amounts of overflow.

~Gregory

2024-04-05 19:19:26

by SeongJae Park

Subject: Re: [RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

On Fri, 5 Apr 2024 15:08:50 +0900 Honggyu Kim <[email protected]> wrote:

> This is a preparation patch that introduces migration modes.
>
> The damon_pa_pageout is renamed to damon_pa_migrate and it receives an
> extra argument for migration_mode.

I personally think keeping damon_pa_pageout() as is and adding a new
function (damon_pa_migrate()) with some duplicated code is also ok,
but this approach also looks fine to me. So I have no strong opinion
here, but just letting you know I would have no objection to either
approach.

>
> No functional changes applied.
>
> Signed-off-by: Honggyu Kim <[email protected]>
> ---
> mm/damon/paddr.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 081e2a325778..277a1c4d833c 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
> return false;
> }
>
> -static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
> +enum migration_mode {
> + MIG_PAGEOUT,
> +};

To avoid name conflicts, what about renaming to 'damos_migration_mode' and
'DAMOS_MIG_PAGEOUT'?

> +
> +static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> + enum migration_mode mm)

My poor brain gets a bit confused by the name. What about calling it 'mode'?

> {
> unsigned long addr, applied;
> LIST_HEAD(folio_list);
> @@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
> put_folio:
> folio_put(folio);
> }
> - applied = reclaim_pages(&folio_list);
> + switch (mm) {
> + case MIG_PAGEOUT:
> + applied = reclaim_pages(&folio_list);
> + break;
> + default:
> + /* Unexpected migration mode. */
> + return 0;
> + }
> cond_resched();
> return applied * PAGE_SIZE;
> }
> @@ -297,7 +309,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
> {
> switch (scheme->action) {
> case DAMOS_PAGEOUT:
> - return damon_pa_pageout(r, scheme);
> + return damon_pa_migrate(r, scheme, MIG_PAGEOUT);
> case DAMOS_LRU_PRIO:
> return damon_pa_mark_accessed(r, scheme);
> case DAMOS_LRU_DEPRIO:
> --
> 2.34.1


Thanks,
SJ

2024-04-05 19:21:01

by SeongJae Park

Subject: Re: [RFC PATCH v3 4/7] mm/migrate: add MR_DAMON to migrate_reason

On Fri, 5 Apr 2024 15:08:53 +0900 Honggyu Kim <[email protected]> wrote:

> The current patch series introduces DAMON based migration across NUMA
> nodes so it'd be better to have a new migrate_reason in trace events.
>
> Signed-off-by: Honggyu Kim <[email protected]>

Reviewed-by: SeongJae Park <[email protected]>


Thanks,
SJ

> ---
> include/linux/migrate_mode.h | 1 +
> include/trace/events/migrate.h | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index f37cc03f9369..cec36b7e7ced 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -29,6 +29,7 @@ enum migrate_reason {
> MR_CONTIG_RANGE,
> MR_LONGTERM_PIN,
> MR_DEMOTION,
> + MR_DAMON,
> MR_TYPES
> };
>
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 0190ef725b43..cd01dd7b3640 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -22,7 +22,8 @@
> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> EM( MR_CONTIG_RANGE, "contig_range") \
> EM( MR_LONGTERM_PIN, "longterm_pin") \
> - EMe(MR_DEMOTION, "demotion")
> + EM( MR_DEMOTION, "demotion") \
> + EMe(MR_DAMON, "damon")
>
> /*
> * First define the enums in the above macros to be exported to userspace
> --
> 2.34.1
>
>

2024-04-05 19:23:01

by SeongJae Park

Subject: Re: [RFC PATCH v3 2/7] mm: make alloc_demote_folio externally invokable for migration

On Fri, 5 Apr 2024 15:08:51 +0900 Honggyu Kim <[email protected]> wrote:

> The alloc_demote_folio can be used out of vmscan.c so it'd be better to
> remove static keyword from it.
>
> This function can also be used for both demotion and promotion so it'd
> be better to rename it from alloc_demote_folio to alloc_migrate_folio.

I'm not sure if renaming is really needed, but I have no strong opinion.

>
> Signed-off-by: Honggyu Kim <[email protected]>

I have one more trivial comment below, but I find no blocker.

Reviewed-by: SeongJae Park <[email protected]>

> ---
> mm/internal.h | 1 +
> mm/vmscan.c | 10 +++++++---
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index f309a010d50f..c96ff9bc82d0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -866,6 +866,7 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
> unsigned long, unsigned long);
>
> extern void set_pageblock_order(void);
> +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
> unsigned long reclaim_pages(struct list_head *folio_list);
> unsigned int reclaim_clean_pages_from_list(struct zone *zone,
> struct list_head *folio_list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..9e456cac03b4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -910,8 +910,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
> mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
> }
>
> -static struct folio *alloc_demote_folio(struct folio *src,
> - unsigned long private)
> +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
> {
> struct folio *dst;
> nodemask_t *allowed_mask;
> @@ -935,6 +934,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
> if (dst)
> return dst;
>
> + /*
> + * Allocation failed from the target node so try to allocate from
> + * fallback nodes based on allowed_mask.
> + * See fallback_alloc() at mm/slab.c.
> + */

I think this might be better as a separate cleanup patch, but given
its small size, I have no strong opinion.

> mtc->gfp_mask &= ~__GFP_THISNODE;
> mtc->nmask = allowed_mask;
>
> @@ -973,7 +977,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> node_get_allowed_targets(pgdat, &allowed_mask);
>
> /* Demotion ignores all cpuset and mempolicy settings */
> - migrate_pages(demote_folios, alloc_demote_folio, NULL,
> + migrate_pages(demote_folios, alloc_migrate_folio, NULL,
> (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> &nr_succeeded);
>
> --
> 2.34.1


Thanks,
SJ

2024-04-05 19:25:00

by SeongJae Park

Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:

> This patch introduces DAMOS_MIGRATE_COLD action, which is similar to
> DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs
> instead of swapping them out.
>
> The 'target_nid' sysfs knob is created by this patch to inform the
> migration target node ID.

Isn't it created by the previous patch?

>
> Here is one of the example usage of this 'migrate_cold' action.
>
> $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
> $ cat contexts/<N>/schemes/<N>/action
> migrate_cold
> $ echo 2 > contexts/<N>/schemes/<N>/target_nid
> $ echo commit > state
> $ numactl -p 0 ./hot_cold 500M 600M &
> $ numastat -c -p hot_cold
>
> Per-node process memory usage (in MBs)
> PID Node 0 Node 1 Node 2 Total
> -------------- ------ ------ ------ -----
> 701 (hot_cold) 501 0 601 1101
>
> Since there are some common routines with pageout, many functions have
> similar logics between pageout and migrate cold.
>
> damon_pa_migrate_folio_list() is a minimized version of
> shrink_folio_list(), but it's minified only for demotion.

MIGRATE_COLD is not only for demotion, right? I think the last two
words are better removed to avoid unnecessary confusion.

>
> Signed-off-by: Honggyu Kim <[email protected]>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> ---
> include/linux/damon.h | 2 +
> mm/damon/paddr.c | 146 ++++++++++++++++++++++++++++++++++++++-
> mm/damon/sysfs-schemes.c | 4 ++
> 3 files changed, 151 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 24ea33a03d5d..df8671e69a70 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -105,6 +105,7 @@ struct damon_target {
> * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
> * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
> * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
> + * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.

Whether it will be applied to a cold region or not depends on the
target access pattern. What about 'Migrate the regions in a
coldest-regions-first manner'? Or, simply 'Migrate the regions
(prioritize cold)' here, and explain the prioritization under quota in
the detailed comments part?

Also, let's use tabs consistently.

> * @DAMOS_STAT: Do nothing but count the stat.
> * @NR_DAMOS_ACTIONS: Total number of DAMOS actions
> *
> @@ -122,6 +123,7 @@ enum damos_action {
> DAMOS_NOHUGEPAGE,
> DAMOS_LRU_PRIO,
> DAMOS_LRU_DEPRIO,
> + DAMOS_MIGRATE_COLD,
> DAMOS_STAT, /* Do nothing but only record the stat */
> NR_DAMOS_ACTIONS,
> };
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 277a1c4d833c..fe217a26f788 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -12,6 +12,9 @@
> #include <linux/pagemap.h>
> #include <linux/rmap.h>
> #include <linux/swap.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_inline.h>
>
> #include "../internal.h"
> #include "ops-common.h"
> @@ -226,8 +229,137 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
>
> enum migration_mode {
> MIG_PAGEOUT,
> + MIG_MIGRATE_COLD,
> };
>
> +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> + struct pglist_data *pgdat,
> + int target_nid)

To avoid name collisions, I'd prefer having a damon_pa_ prefix. I see this
patch is defining damon_pa_migrate_folio_list() below, though. What about
__damon_pa_migrate_folio_list()?

> +{
> + unsigned int nr_succeeded;
> + nodemask_t allowed_mask = NODE_MASK_NONE;
> +

I personally prefer not having empty lines in the middle of variable
declarations/definitions. Could we remove this empty line?

> + struct migration_target_control mtc = {
> + /*
> + * Allocate from 'node', or fail quickly and quietly.
> + * When this happens, 'page' will likely just be discarded
> + * instead of migrated.
> + */
> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> + __GFP_NOMEMALLOC | GFP_NOWAIT,
> + .nid = target_nid,
> + .nmask = &allowed_mask
> + };
> +
> + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> + return 0;
> +
> + if (list_empty(migrate_folios))
> + return 0;

Can't these checks be done by the caller?
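For example, something like the below could work (just an untested sketch of
one option; the splice-back of leftover folios stays as is):

    /* in the caller, guard the call to the migration helper */
    if (target_nid != NUMA_NO_NODE && pgdat->node_id != target_nid &&
        !list_empty(&migrate_folios))
            nr_migrated += migrate_folio_list(&migrate_folios, pgdat,
                                              target_nid);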

> +
> + /* Migration ignores all cpuset and mempolicy settings */
> + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> + &nr_succeeded);
> +
> + return nr_succeeded;
> +}
> +
> +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
> + struct pglist_data *pgdat,
> + enum migration_mode mm,

Again, 'mm' makes my poor brain a bit confused. What about 'mode'?
And, seems this is not used at all in this function? Can we just drop this?

> + int target_nid)
> +{
> + unsigned int nr_migrated = 0;
> + struct folio *folio;
> + LIST_HEAD(ret_folios);
> + LIST_HEAD(migrate_folios);
> +
> + cond_resched();

We will do this again at the beginning of the loop. Do we need this here?

> +
> + while (!list_empty(folio_list)) {
> + struct folio *folio;
> +
> + cond_resched();
> +
> + folio = lru_to_folio(folio_list);
> + list_del(&folio->lru);
> +
> + if (!folio_trylock(folio))
> + goto keep;
> +
> + VM_BUG_ON_FOLIO(folio_test_active(folio), folio);

Why? I think we could want to migrate active pages in some use case, e.g., to
reduce memory bandwidth?

> +
> + /* Relocate its contents to another node. */
> + list_add(&folio->lru, &migrate_folios);
> + folio_unlock(folio);
> + continue;
> +keep:
> + list_add(&folio->lru, &ret_folios);
> + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

Can this happen? I think this could be too much of a test? checkpatch.pl
also warns.

> + }
> + /* 'folio_list' is always empty here */
> +
> + /* Migrate folios selected for migration */
> + nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
> + /* Folios that could not be migrated are still in @migrate_folios */
> + if (!list_empty(&migrate_folios)) {
> + /* Folios which weren't migrated go back on @folio_list */
> + list_splice_init(&migrate_folios, folio_list);
> + }

Let's not use braces for single statement
(https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).
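That is, something like below (with the inner comment moved above the
condition, so nothing is left hanging inside the if):

    /*
     * Folios that could not be migrated are still in @migrate_folios;
     * they go back on @folio_list.
     */
    if (!list_empty(&migrate_folios))
            list_splice_init(&migrate_folios, folio_list);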

> +
> + try_to_unmap_flush();
> +
> + list_splice(&ret_folios, folio_list);

Can't we move remaining folios in migrate_folios to ret_folios at once?
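For example (an untested sketch), splicing the leftovers into ret_folios
first would let the final splice and the putback loop below handle
everything in one pass:

    /* folios that failed to migrate join the folios to be put back */
    list_splice_init(&migrate_folios, &ret_folios);

    try_to_unmap_flush();

    list_splice(&ret_folios, folio_list);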

> +
> + while (!list_empty(folio_list)) {
> + folio = lru_to_folio(folio_list);
> + list_del(&folio->lru);
> + folio_putback_lru(folio);
> + }
> +
> + return nr_migrated;
> +}
> +
> +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> + enum migration_mode mm,

Again, I'd prefer calling this 'mode' or something other than 'mm'.
And, seems 'mm' is not really used in this function. It is passed to
'damon_pa_migrate_folio_list()' but it doesn't really use it. Can we drop
this?

> + int target_nid)
> +{
> + int nid;
> + unsigned int nr_migrated = 0;

Let's make this match the return type of this function.

> + LIST_HEAD(node_folio_list);
> + unsigned int noreclaim_flag;
> +
> + if (list_empty(folio_list))
> + return nr_migrated;
> +
> + noreclaim_flag = memalloc_noreclaim_save();
> +
> + nid = folio_nid(lru_to_folio(folio_list));
> + do {
> + struct folio *folio = lru_to_folio(folio_list);
> +
> + if (nid == folio_nid(folio)) {
> + folio_clear_active(folio);

I think this was necessary for demotion, but now this should be removed since
this function is no longer for demotion but for migrating random pages, right?

> + list_move(&folio->lru, &node_folio_list);
> + continue;
> + }
> +
> + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> + NODE_DATA(nid), mm,
> + target_nid);
> + nid = folio_nid(lru_to_folio(folio_list));
> + } while (!list_empty(folio_list));
> +
> + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> + NODE_DATA(nid), mm,
> + target_nid);
> +
> + memalloc_noreclaim_restore(noreclaim_flag);
> +
> + return nr_migrated;
> +}
> +
> static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> enum migration_mode mm)
> {
> @@ -247,7 +379,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> folio_test_clear_young(folio);
> if (!folio_isolate_lru(folio))
> goto put_folio;
> - if (folio_test_unevictable(folio))
> + /*
> + * Since unevictable folios can be demoted or promoted,

Let's use the term 'migrated' instead of 'demoted' or 'promoted'.

> + * unevictable test is needed only for pageout.
> + */
> + if (mm == MIG_PAGEOUT && folio_test_unevictable(folio))
> folio_putback_lru(folio);
> else
> list_add(&folio->lru, &folio_list);
> @@ -258,6 +394,10 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> case MIG_PAGEOUT:
> applied = reclaim_pages(&folio_list);
> break;
> + case MIG_MIGRATE_COLD:
> + applied = damon_pa_migrate_pages(&folio_list, mm,
> + s->target_nid);
> + break;
> default:
> /* Unexpected migration mode. */
> return 0;
> @@ -314,6 +454,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
> return damon_pa_mark_accessed(r, scheme);
> case DAMOS_LRU_DEPRIO:
> return damon_pa_deactivate_pages(r, scheme);
> + case DAMOS_MIGRATE_COLD:
> + return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
> case DAMOS_STAT:
> break;
> default:
> @@ -334,6 +476,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
> return damon_hot_score(context, r, scheme);
> case DAMOS_LRU_DEPRIO:
> return damon_cold_score(context, r, scheme);
> + case DAMOS_MIGRATE_COLD:
> + return damon_cold_score(context, r, scheme);
> default:
> break;
> }
> diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> index 1a30ea82c890..18b7d054c748 100644
> --- a/mm/damon/sysfs-schemes.c
> +++ b/mm/damon/sysfs-schemes.c
> @@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
> "nohugepage",
> "lru_prio",
> "lru_deprio",
> + "migrate_cold",
> "stat",
> };
>
> @@ -1659,6 +1660,9 @@ static ssize_t target_nid_store(struct kobject *kobj,
> struct damon_sysfs_scheme, kobj);
> int err = 0;
>
> + if (scheme->action != DAMOS_MIGRATE_COLD)
> + return -EINVAL;
> +

I think a user could set target_nid first, and then the action. So I think
this should not return an error?

> /* TODO: error handling for target_nid range. */
> err = kstrtoint(buf, 0, &scheme->target_nid);
>
> --
> 2.34.1
>
>


Thanks,
SJ

2024-04-05 19:25:14

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

On Fri, 5 Apr 2024 16:55:57 +0900 Hyeongtak Ji <[email protected]> wrote:

> On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:
>
> ...snip...
>
> > +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> > + enum migration_mode mm,
> > + int target_nid)
> > +{
> > + int nid;
> > + unsigned int nr_migrated = 0;
> > + LIST_HEAD(node_folio_list);
> > + unsigned int noreclaim_flag;
> > +
> > + if (list_empty(folio_list))
> > + return nr_migrated;
>
> How about checking if `target_nid` is `NUMA_NO_NODE` or not earlier,
>
> > +
> > + noreclaim_flag = memalloc_noreclaim_save();
> > +
> > + nid = folio_nid(lru_to_folio(folio_list));
> > + do {
> > + struct folio *folio = lru_to_folio(folio_list);
> > +
> > + if (nid == folio_nid(folio)) {
> > + folio_clear_active(folio);
> > + list_move(&folio->lru, &node_folio_list);
> > + continue;
> > + }
> > +
> > + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > + NODE_DATA(nid), mm,
> > + target_nid);
> > + nid = folio_nid(lru_to_folio(folio_list));
> > + } while (!list_empty(folio_list));
> > +
> > + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > + NODE_DATA(nid), mm,
> > + target_nid);
> > +
> > + memalloc_noreclaim_restore(noreclaim_flag);
> > +
> > + return nr_migrated;
> > +}
> > +
>
> ...snip...
>
> > +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> > + struct pglist_data *pgdat,
> > + int target_nid)
> > +{
> > + unsigned int nr_succeeded;
> > + nodemask_t allowed_mask = NODE_MASK_NONE;
> > +
> > + struct migration_target_control mtc = {
> > + /*
> > + * Allocate from 'node', or fail quickly and quietly.
> > + * When this happens, 'page' will likely just be discarded
> > + * instead of migrated.
> > + */
> > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > + __GFP_NOMEMALLOC | GFP_NOWAIT,
> > + .nid = target_nid,
> > + .nmask = &allowed_mask
> > + };
> > +
> > + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> > + return 0;
>
> instead of here.

Agree. As I said in the previous reply, I think this check can be done by
the caller (or the caller of the caller) of this function.

>
> > +
> > + if (list_empty(migrate_folios))
> > + return 0;

Same for this.

> > +
> > + /* Migration ignores all cpuset and mempolicy settings */
> > + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> > + &nr_succeeded);
> > +
> > + return nr_succeeded;
> > +}
> > +
>
> ...snip...
>
> Kind regards,
> Hyeongtak
>

Thanks,
SJ

2024-04-05 19:27:44

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 7/7] mm/damon: Add "damon_migrate_{hot,cold}" vmstat

On Fri, 5 Apr 2024 15:08:56 +0900 Honggyu Kim <[email protected]> wrote:

> This patch adds "damon_migrate_{hot,cold}" under node specific vmstat
> counters at the following location.
>
> /sys/devices/system/node/node*/vmstat
>
> The counted values are accumulated into the global vmstat, so it also
> introduces the same counters at /proc/vmstat as well.

DAMON provides its own DAMOS stats via the DAMON sysfs interface. Do we
really need this change?
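For example, the existing per-scheme stats can already be read like below:

    $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>
    $ cat stats/nr_tried stats/sz_tried stats/nr_applied stats/sz_applied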


Thanks,
SJ

[...]

2024-04-05 19:28:32

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

On Fri, 5 Apr 2024 15:08:55 +0900 Honggyu Kim <[email protected]> wrote:

> From: Hyeongtak Ji <[email protected]>
>
> This patch introduces DAMOS_MIGRATE_HOT action, which is similar to
> DAMOS_MIGRATE_COLD, but it is targeted to migrate hot pages.

My understanding of our last discussion was that 'HOT/COLD' here is only for
the prioritization score function. If I'm not wrong, this is not for
targeting, but just prioritizes migrating hot pages first under the quota.

>
> It migrates pages inside the given region to the 'target_nid' NUMA node
> in the sysfs.
>
> Here is an example usage of this 'migrate_hot' action.
>
> $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
> $ cat contexts/<N>/schemes/<N>/action
> migrate_hot
> $ echo 0 > contexts/<N>/schemes/<N>/target_nid
> $ echo commit > state
> $ numactl -p 2 ./hot_cold 500M 600M &
> $ numastat -c -p hot_cold
>
> Per-node process memory usage (in MBs)
> PID Node 0 Node 1 Node 2 Total
> -------------- ------ ------ ------ -----
> 701 (hot_cold) 501 0 601 1101
>
> Signed-off-by: Hyeongtak Ji <[email protected]>
> Signed-off-by: Honggyu Kim <[email protected]>
> ---
> include/linux/damon.h | 2 ++
> mm/damon/paddr.c | 12 ++++++++++--
> mm/damon/sysfs-schemes.c | 4 +++-
> 3 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index df8671e69a70..934c95a7c042 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -105,6 +105,7 @@ struct damon_target {
> * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
> * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
> * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
> + * @DAMOS_MIGRATE_HOT: Migrate for the given hot region.

As commented on the previous patch, this could be bit re-phrased.
Also, let's use tabs consistently.

> * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
> * @DAMOS_STAT: Do nothing but count the stat.
> * @NR_DAMOS_ACTIONS: Total number of DAMOS actions
> @@ -123,6 +124,7 @@ enum damos_action {
> DAMOS_NOHUGEPAGE,
> DAMOS_LRU_PRIO,
> DAMOS_LRU_DEPRIO,
> + DAMOS_MIGRATE_HOT,
> DAMOS_MIGRATE_COLD,
> DAMOS_STAT, /* Do nothing but only record the stat */
> NR_DAMOS_ACTIONS,
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index fe217a26f788..fd9d35b5cc83 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -229,6 +229,7 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
>
> enum migration_mode {
> MIG_PAGEOUT,
> + MIG_MIGRATE_HOT,
> MIG_MIGRATE_COLD,
> };

It looks like we don't really need both MIG_MIGRATE_HOT and MIG_MIGRATE_COLD,
but just one, say, MIG_MIGRATE, since the code can know which prioritization
score function it should use from the DAMOS action?

Also, as I commented on the previous one, I'd prefer having a DAMOS_ prefix.
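In other words, something like the below sketch:

    enum migration_mode {
            MIG_PAGEOUT,
            MIG_MIGRATE,    /* hot/cold only selects the score function */
    };

    /* in damon_pa_apply_scheme() */
    case DAMOS_MIGRATE_HOT:
    case DAMOS_MIGRATE_COLD:
            return damon_pa_migrate(r, scheme, MIG_MIGRATE);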

>
> @@ -375,8 +376,10 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> if (damos_pa_filter_out(s, folio))
> goto put_folio;
>
> - folio_clear_referenced(folio);
> - folio_test_clear_young(folio);
> + if (mm != MIG_MIGRATE_HOT) {
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + }

We agreed to do this check via the 'young' page type DAMOS filter, and to
let this code not care about it, right? If I'm not wrong, I think this
should be removed?
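For example, assuming the 'young' filter type is available, the per-page
recheck could be configured from user space like below (paths follow the
DAMOS filters sysfs layout):

    $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>
    $ echo 1 > filters/nr_filters
    $ echo young > filters/0/type
    $ echo Y > filters/0/matching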

> if (!folio_isolate_lru(folio))
> goto put_folio;
> /*
> @@ -394,6 +397,7 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> case MIG_PAGEOUT:
> applied = reclaim_pages(&folio_list);
> break;
> + case MIG_MIGRATE_HOT:
> case MIG_MIGRATE_COLD:
> applied = damon_pa_migrate_pages(&folio_list, mm,
> s->target_nid);
> @@ -454,6 +458,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
> return damon_pa_mark_accessed(r, scheme);
> case DAMOS_LRU_DEPRIO:
> return damon_pa_deactivate_pages(r, scheme);
> + case DAMOS_MIGRATE_HOT:
> + return damon_pa_migrate(r, scheme, MIG_MIGRATE_HOT);
> case DAMOS_MIGRATE_COLD:
> return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
> case DAMOS_STAT:
> @@ -476,6 +482,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
> return damon_hot_score(context, r, scheme);
> case DAMOS_LRU_DEPRIO:
> return damon_cold_score(context, r, scheme);
> + case DAMOS_MIGRATE_HOT:
> + return damon_hot_score(context, r, scheme);
> case DAMOS_MIGRATE_COLD:
> return damon_cold_score(context, r, scheme);
> default:
> diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> index 18b7d054c748..1d2f62aa79ca 100644
> --- a/mm/damon/sysfs-schemes.c
> +++ b/mm/damon/sysfs-schemes.c
> @@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
> "nohugepage",
> "lru_prio",
> "lru_deprio",
> + "migrate_hot",
> "migrate_cold",
> "stat",
> };
> @@ -1660,7 +1661,8 @@ static ssize_t target_nid_store(struct kobject *kobj,
> struct damon_sysfs_scheme, kobj);
> int err = 0;
>
> - if (scheme->action != DAMOS_MIGRATE_COLD)
> + if (scheme->action != DAMOS_MIGRATE_HOT &&
> + scheme->action != DAMOS_MIGRATE_COLD)
> return -EINVAL;
>
> /* TODO: error handling for target_nid range. */
> --
> 2.34.1
>
>


Thanks,
SJ

2024-04-05 19:29:51

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

Hello Honggyu,

On Fri, 5 Apr 2024 15:08:49 +0900 Honggyu Kim <[email protected]> wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> It says there is no implementation of the demote/promote DAMOS action
> are made. This RFC is about its implementation for physical address
> space.
>
>
> Changes from RFC v2:
> 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> 2. Create 'target_nid' to set the migration target node instead of
> depending on node distance based information.
> 3. Instead of having page level access check in this patch series,
> delegate the job to a new DAMOS filter type YOUNG[2].
> 4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> 5. Rebase from v6.7 to v6.8.

Thank you for patiently keeping the discussion going and making this great
version! I left comments on each patch, but found no special concerns. The
per-page access recheck for MIGRATE_HOT and the vmstat change are catching
my eye, though. I doubt those are really needed. It would be nice if you
could answer the comments.

Once my comments on this version are addressed, I would have no reason to
object to dropping the RFC tag from this patchset.

Nonetheless, I see some warnings and errors from checkpatch.pl. I don't
really care about those for RFC patches, so no problem at all. But if you
agree with my opinion about dropping the RFC tag, and will therefore send
the next version without it, please make sure you also run checkpatch.pl
before posting.


Thanks,
SJ

[...]

2024-04-08 10:59:08

by Honggyu Kim

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

Hi SeongJae,

On Fri, 5 Apr 2024 12:28:00 -0700 SeongJae Park <[email protected]> wrote:
> Hello Honggyu,
>
> On Fri, 5 Apr 2024 15:08:49 +0900 Honggyu Kim <[email protected]> wrote:
>
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It says there is no implementation of the demote/promote DAMOS action
> > are made. This RFC is about its implementation for physical address
> > space.
> >
> >
> > Changes from RFC v2:
> > 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> > 2. Create 'target_nid' to set the migration target node instead of
> > depending on node distance based information.
> > 3. Instead of having page level access check in this patch series,
> > delegate the job to a new DAMOS filter type YOUNG[2].
> > 4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> > 5. Rebase from v6.7 to v6.8.
>
> Thank you for patiently keeping the discussion going and making this great
> version! I left comments on each patch, but found no special concerns. The
> per-page access recheck for MIGRATE_HOT and the vmstat change are catching
> my eye, though. I doubt those are really needed. It would be nice if you
> could answer the comments.

I will answer them where you made the comments.

> Once my comments on this version are addressed, I would have no reason to
> object to dropping the RFC tag from this patchset.

Thanks. I will drop the RFC after addressing your comments.

> Nonetheless, I see some warnings and errors from checkpatch.pl. I don't
> really care about those for RFC patches, so no problem at all. But if you
> agree with my opinion about dropping the RFC tag, and will therefore send
> the next version without it, please make sure you also run checkpatch.pl
> before posting.

Sure. I will run checkpatch.pl from the next revision.

Thanks,
Honggyu

>
> Thanks,
> SJ
>
> [...]

2024-04-08 12:07:14

by Honggyu Kim

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park <[email protected]> wrote:
> On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:
>
> > This patch introduces DAMOS_MIGRATE_COLD action, which is similar to
> > DAMOS_PAGEOUT, but migrates folios to the given 'target_nid' in the sysfs
> > instead of swapping them out.
> >
> > The 'target_nid' sysfs knob is created by this patch to inform the
> > migration target node ID.
>
> Isn't it created by the previous patch?

Right. I didn't fix the commit message after splitting this patch. I will
fix it.

> >
> > Here is an example usage of this 'migrate_cold' action.
> >
> > $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
> > $ cat contexts/<N>/schemes/<N>/action
> > migrate_cold
> > $ echo 2 > contexts/<N>/schemes/<N>/target_nid
> > $ echo commit > state
> > $ numactl -p 0 ./hot_cold 500M 600M &
> > $ numastat -c -p hot_cold
> >
> > Per-node process memory usage (in MBs)
> > PID Node 0 Node 1 Node 2 Total
> > -------------- ------ ------ ------ -----
> > 701 (hot_cold) 501 0 601 1101
> >
> > Since there are some common routines with pageout, many functions have
> > similar logic between pageout and migrate cold.
> >
> > damon_pa_migrate_folio_list() is a minimized version of
> > shrink_folio_list(), but it's minified only for demotion.
>
> MIGRATE_COLD is not only for demotion, right? I think the last two words are
> better to be removed to reduce unnecessary confusion.

You mean the last two sentences? I will remove them if you feel it's
confusing.

> >
> > Signed-off-by: Honggyu Kim <[email protected]>
> > Signed-off-by: Hyeongtak Ji <[email protected]>
> > ---
> > include/linux/damon.h | 2 +
> > mm/damon/paddr.c | 146 ++++++++++++++++++++++++++++++++++++++-
> > mm/damon/sysfs-schemes.c | 4 ++
> > 3 files changed, 151 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/damon.h b/include/linux/damon.h
> > index 24ea33a03d5d..df8671e69a70 100644
> > --- a/include/linux/damon.h
> > +++ b/include/linux/damon.h
> > @@ -105,6 +105,7 @@ struct damon_target {
> > * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
> > * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
> > * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
> > + * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
>
> Whether it will be for a cold region or not depends on the target access
> pattern. What about 'Migrate the regions in coldest regions first manner.'?
> Or, simply 'Migrate the regions (prioritize cold)' here, and explain the
> prioritization under quota in the detailed comments part?

"Migrate the regions in coldest regions first manner under quota" sounds
better. I will change it.

> Also, let's use tabs consistently.

Yeah, it's a mistake. Will fix it.

> > * @DAMOS_STAT: Do nothing but count the stat.
> > * @NR_DAMOS_ACTIONS: Total number of DAMOS actions
> > *
> > @@ -122,6 +123,7 @@ enum damos_action {
> > DAMOS_NOHUGEPAGE,
> > DAMOS_LRU_PRIO,
> > DAMOS_LRU_DEPRIO,
> > + DAMOS_MIGRATE_COLD,
> > DAMOS_STAT, /* Do nothing but only record the stat */
> > NR_DAMOS_ACTIONS,
> > };
> > diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> > index 277a1c4d833c..fe217a26f788 100644
> > --- a/mm/damon/paddr.c
> > +++ b/mm/damon/paddr.c
> > @@ -12,6 +12,9 @@
> > #include <linux/pagemap.h>
> > #include <linux/rmap.h>
> > #include <linux/swap.h>
> > +#include <linux/memory-tiers.h>
> > +#include <linux/migrate.h>
> > +#include <linux/mm_inline.h>
> >
> > #include "../internal.h"
> > #include "ops-common.h"
> > @@ -226,8 +229,137 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
> >
> > enum migration_mode {
> > MIG_PAGEOUT,
> > + MIG_MIGRATE_COLD,
> > };
> >
> > +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> > + struct pglist_data *pgdat,
> > + int target_nid)
>
> To avoid name collisions, I'd prefer having a damon_pa_ prefix. I see this
> patch is defining damon_pa_migrate_folio_list() below, though. What about
> __damon_pa_migrate_folio_list()?

Ack. I will change it to __damon_pa_migrate_folio_list().

> > +{
> > + unsigned int nr_succeeded;
> > + nodemask_t allowed_mask = NODE_MASK_NONE;
> > +
>
> I personally prefer not having empty lines in the middle of variable
> declarations/definitions. Could we remove this empty line?

I can remove it, but I would like to have more discussion about this
issue. The current implementation allows only a single migration
target with "target_nid", but users might want to provide fall back
migration target nids.

For example, if more than two CXL nodes exist in the system, users might
want to migrate cold pages to any CXL nodes. In such cases, we might
have to make "target_nid" accept comma separated node IDs. nodemask can
be better but we should provide a way to change the scanning order.

I would like to hear how you think about this.
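If we go with comma-separated node IDs, the parsing could be a small helper
like below (a hypothetical sketch; the helper name and the caller-provided
array are made up for illustration):

    /* parse e.g. "2,3" into an ordered list of fallback target nids */
    static int damon_sysfs_parse_target_nids(const char *buf, int *nids,
                                             int max_nids)
    {
            char *copy, *s, *tok;
            int nr = 0;

            copy = kstrdup(buf, GFP_KERNEL);
            if (!copy)
                    return -ENOMEM;
            s = copy;
            while ((tok = strsep(&s, ",")) && nr < max_nids) {
                    if (kstrtoint(tok, 0, &nids[nr]))
                            break;
                    nr++;
            }
            kfree(copy);
            return nr;
    }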

> > + struct migration_target_control mtc = {
> > + /*
> > + * Allocate from 'node', or fail quickly and quietly.
> > + * When this happens, 'page' will likely just be discarded
> > + * instead of migrated.
> > + */
> > + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > + __GFP_NOMEMALLOC | GFP_NOWAIT,
> > + .nid = target_nid,
> > + .nmask = &allowed_mask
> > + };
> > +
> > + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> > + return 0;
> > +
> > + if (list_empty(migrate_folios))
> > + return 0;
>
> Can't these checks be done by the caller?

Sure. I will move them to the caller.

> > +
> > + /* Migration ignores all cpuset and mempolicy settings */
> > + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> > + &nr_succeeded);
> > +
> > + return nr_succeeded;
> > +}
> > +
> > +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
> > + struct pglist_data *pgdat,
> > + enum migration_mode mm,
>
> Again, 'mm' makes my poor brain a bit confused. What about 'mode'?
> And, seems this is not used at all in this function? Can we just drop this?

Ack. I will remove it in this patch and introduce it in the patch where
it's used.

> > + int target_nid)
> > +{
> > + unsigned int nr_migrated = 0;
> > + struct folio *folio;
> > + LIST_HEAD(ret_folios);
> > + LIST_HEAD(migrate_folios);
> > +
> > + cond_resched();
>
> We will do this again at the beginning of the loop. Do we need this here?

This comes from shrink_folio_list() but this function is way simpler so
it can be removed.

> > +
> > + while (!list_empty(folio_list)) {
> > + struct folio *folio;
> > +
> > + cond_resched();
> > +
> > + folio = lru_to_folio(folio_list);
> > + list_del(&folio->lru);
> > +
> > + if (!folio_trylock(folio))
> > + goto keep;
> > +
> > + VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>
> Why? I think we could want to migrate active pages in some use case, e.g., to
> reduce memory bandwidth?

Yeah, I will remove it.

> > +
> > + /* Relocate its contents to another node. */
> > + list_add(&folio->lru, &migrate_folios);
> > + folio_unlock(folio);
> > + continue;
> > +keep:
> > + list_add(&folio->lru, &ret_folios);
> > + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> Can this happen? I think this could be too much of a test? checkpatch.pl
> also warns.

Likewise, the current shrink_folio_list() does so, so I brought it into
this patch as well, but I think we can remove it here.

> > + }
> > + /* 'folio_list' is always empty here */
> > +
> > + /* Migrate folios selected for migration */
> > + nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
> > + /* Folios that could not be migrated are still in @migrate_folios */
> > + if (!list_empty(&migrate_folios)) {
> > + /* Folios which weren't migrated go back on @folio_list */
> > + list_splice_init(&migrate_folios, folio_list);
> > + }
>
> Let's not use braces for single statement
> (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).

Hmm.. I know the convention but left it as is because of the comment.
If I remove the braces, it would leave a weird alignment between the
comment and the statement lines.

> > +
> > + try_to_unmap_flush();
> > +
> > + list_splice(&ret_folios, folio_list);
>
> Can't we move remaining folios in migrate_folios to ret_folios at once?

I will see if it's possible.

> > +
> > + while (!list_empty(folio_list)) {
> > + folio = lru_to_folio(folio_list);
> > + list_del(&folio->lru);
> > + folio_putback_lru(folio);
> > + }
> > +
> > + return nr_migrated;
> > +}
> > +
> > +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> > + enum migration_mode mm,
>
> Again, I'd prefer calling this 'mode' or something other than 'mm'.
> And, seems 'mm' is not really used in this function. It is passed to
> 'damon_pa_migrate_folio_list()' but it doesn't really use it. Can we drop
> this?

Sure. I will drop it here and rename it to "mode" where it's used.

> > + int target_nid)
> > +{
> > + int nid;
> > + unsigned int nr_migrated = 0;
>
> Let's make this match the return type of this function.

Ack. Will change it to unsigned long.

> > + LIST_HEAD(node_folio_list);
> > + unsigned int noreclaim_flag;
> > +
> > + if (list_empty(folio_list))
> > + return nr_migrated;
> > +
> > + noreclaim_flag = memalloc_noreclaim_save();
> > +
> > + nid = folio_nid(lru_to_folio(folio_list));
> > + do {
> > + struct folio *folio = lru_to_folio(folio_list);
> > +
> > + if (nid == folio_nid(folio)) {
> > + folio_clear_active(folio);
>
> I think this was necessary for demotion, but now this should be removed since
> this function is no longer for demotion but for migrating random pages, right?

Yeah, it can be removed because we do migration instead of demotion,
but I need to make sure it doesn't change the performance evaluation
results.

> > + list_move(&folio->lru, &node_folio_list);
> > + continue;
> > + }
> > +
> > + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > + NODE_DATA(nid), mm,
> > + target_nid);
> > + nid = folio_nid(lru_to_folio(folio_list));
> > + } while (!list_empty(folio_list));
> > +
> > + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > + NODE_DATA(nid), mm,
> > + target_nid);
> > +
> > + memalloc_noreclaim_restore(noreclaim_flag);
> > +
> > + return nr_migrated;
> > +}
> > +
> > static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > enum migration_mode mm)
> > {
> > @@ -247,7 +379,11 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > folio_test_clear_young(folio);
> > if (!folio_isolate_lru(folio))
> > goto put_folio;
> > - if (folio_test_unevictable(folio))
> > + /*
> > + * Since unevictable folios can be demoted or promoted,
>
> Let's use the term 'migrated' instead of 'demoted' or 'promoted'.

Ack.

> > + * unevictable test is needed only for pageout.
> > + */
> > + if (mm == MIG_PAGEOUT && folio_test_unevictable(folio))
> > folio_putback_lru(folio);
> > else
> > list_add(&folio->lru, &folio_list);
> > @@ -258,6 +394,10 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > case MIG_PAGEOUT:
> > applied = reclaim_pages(&folio_list);
> > break;
> > + case MIG_MIGRATE_COLD:
> > + applied = damon_pa_migrate_pages(&folio_list, mm,
> > + s->target_nid);
> > + break;
> > default:
> > /* Unexpected migration mode. */
> > return 0;
> > @@ -314,6 +454,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
> > return damon_pa_mark_accessed(r, scheme);
> > case DAMOS_LRU_DEPRIO:
> > return damon_pa_deactivate_pages(r, scheme);
> > + case DAMOS_MIGRATE_COLD:
> > + return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
> > case DAMOS_STAT:
> > break;
> > default:
> > @@ -334,6 +476,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
> > return damon_hot_score(context, r, scheme);
> > case DAMOS_LRU_DEPRIO:
> > return damon_cold_score(context, r, scheme);
> > + case DAMOS_MIGRATE_COLD:
> > + return damon_cold_score(context, r, scheme);
> > default:
> > break;
> > }
> > diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
> > index 1a30ea82c890..18b7d054c748 100644
> > --- a/mm/damon/sysfs-schemes.c
> > +++ b/mm/damon/sysfs-schemes.c
> > @@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
> > "nohugepage",
> > "lru_prio",
> > "lru_deprio",
> > + "migrate_cold",
> > "stat",
> > };
> >
> > @@ -1659,6 +1660,9 @@ static ssize_t target_nid_store(struct kobject *kobj,
> > struct damon_sysfs_scheme, kobj);
> > int err = 0;
> >
> > + if (scheme->action != DAMOS_MIGRATE_COLD)
> > + return -EINVAL;
> > +
>
> I think a user could set target_nid first, and then the action. So I think
> this should not return an error?

Makes sense. I will drop this check.
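The store callback would then become something like below (a sketch; the
signature and error handling are assumed from the quoted diff context):

    static ssize_t target_nid_store(struct kobject *kobj,
                    struct kobj_attribute *attr, const char *buf,
                    size_t count)
    {
            struct damon_sysfs_scheme *scheme = container_of(kobj,
                            struct damon_sysfs_scheme, kobj);
            int err;

            /* TODO: error handling for target_nid range. */
            err = kstrtoint(buf, 0, &scheme->target_nid);

            return err ? err : count;
    }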

Thanks,
Honggyu

> > /* TODO: error handling for target_nid range. */
> > err = kstrtoint(buf, 0, &scheme->target_nid);
> >
> > --
> > 2.34.1
> >
> >
>
>
> Thanks,
> SJ

2024-04-08 13:46:18

by Honggyu Kim

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

Hi Gregory,

On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <[email protected]> wrote:
> On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > 1. YCSB zipfian distribution read only workload
> > memory pressure with cold memory on node0 with 512GB of local DRAM.
> > =============+================================================+=========
> > | cold memory occupied by mmap and memset |
> > | 0G 440G 450G 460G 470G 480G 490G 500G |
> > =============+================================================+=========
> > Execution time normalized to DRAM-only values | GEOMEAN
> > -------------+------------------------------------------------+---------
> > DRAM-only | 1.00 - - - - - - - | 1.00
> > CXL-only | 1.22 - - - - - - - | 1.22
> > default | - 1.12 1.13 1.14 1.16 1.19 1.21 1.21 | 1.17
> > DAMON tiered | - 1.04 1.03 1.04 1.06 1.05 1.05 1.05 | 1.05
> > =============+================================================+=========
> > CXL usage of redis-server in GB | AVERAGE
> > -------------+------------------------------------------------+---------
> > DRAM-only | 0.0 - - - - - - - | 0.0
> > CXL-only | 52.6 - - - - - - - | 52.6
> > default | - 20.4 27.0 33.1 39.5 45.6 50.5 50.3 | 38.1
> > DAMON tiered | - 0.1 0.3 0.8 0.6 0.7 1.3 0.9 | 0.7
> > =============+================================================+=========
> >
> > Each test result is based on the execution environment as follows.
> >
> > DRAM-only : redis-server uses only local DRAM memory.
> > CXL-only : redis-server uses only CXL memory.
> > default : default memory policy(MPOL_DEFAULT).
> > numa balancing disabled.
> > DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> > DAMOS_MIGRATE_HOT for CXL nodes.
> >
> > The above result shows the "default" execution time goes up as the size
> > of cold memory is increased from 440G to 500G because the more cold
> > memory used, the more CXL memory is used for the target redis workload
> > and this makes the execution time increase.
> >
> > However, "DAMON tiered" result shows less slowdown because the
> > DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> > cold memory to CXL node and this free space at DRAM increases more
> > chance to allocate hot or warm pages of redis-server to fast DRAM node.
> > Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> > of redis-server to DRAM node actively.
> >
> > As a result, it makes more memory of redis-server stay in DRAM node
> > compared to "default" memory policy and this makes the performance
> > improvement.
> >
> > The following result of latest distribution workload shows similar data.
> >
> > 2. YCSB latest distribution read only workload
> > memory pressure with cold memory on node0 with 512GB of local DRAM.
> > =============+================================================+=========
> > | cold memory occupied by mmap and memset |
> > | 0G 440G 450G 460G 470G 480G 490G 500G |
> > =============+================================================+=========
> > Execution time normalized to DRAM-only values | GEOMEAN
> > -------------+------------------------------------------------+---------
> > DRAM-only | 1.00 - - - - - - - | 1.00
> > CXL-only | 1.18 - - - - - - - | 1.18
> > default | - 1.18 1.19 1.18 1.18 1.17 1.19 1.18 | 1.18
> > DAMON tiered | - 1.04 1.04 1.04 1.05 1.04 1.05 1.05 | 1.04
> > =============+================================================+=========
> > CXL usage of redis-server in GB | AVERAGE
> > -------------+------------------------------------------------+---------
> > DRAM-only | 0.0 - - - - - - - | 0.0
> > CXL-only | 52.6 - - - - - - - | 52.6
> > default | - 20.5 27.1 33.2 39.5 45.5 50.4 50.5 | 38.1
> > DAMON tiered | - 0.2 0.4 0.7 1.6 1.2 1.1 3.4 | 1.2
> > =============+================================================+=========
> >
> > In summary of both results, our evaluation shows that "DAMON tiered"
> > memory management reduces the performance slowdown compared to the
> > "default" memory policy from 17~18% to 4~5% when the system runs with
> > high memory pressure on its fast tier DRAM nodes.
> >
> > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> > tiered memory systems run more efficiently under high memory pressures.
> >
>
> Hi,
>
> It's hard to determine from your results whether the performance
> mitigation is being caused primarily by MIGRATE_COLD freeing up space
> for new allocations, or from some combination of HOT/COLD actions
> occurring during execution but after the database has already been
> warmed up.

Thanks for the question. I didn't include all the details of the
evaluation result, but this is a chance to share more of them.

I would say the mitigation comes from both. DAMOS_MIGRATE_COLD demotes
some cold data to CXL so redis can allocate more data on the fast DRAM
during launch time, as the mmap+memset and the redis launch take several
minutes. But it also promotes some redis data while running.

> Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> but not DAMOS_MIGRATE_HOT actions? (and vice versa)
>
> The question I have is exactly how often is MIGRATE_HOT actually being
> utilized, and how much data is being moved. Testing MIGRATE_COLD only
> would at least give a rough approximation of that.

To explain this, I better share more test results. In the section of
"Evaluation Workload", the test sequence can be summarized as follows.

*. "Turn on DAMON."
1. Allocate cold memory(mmap+memset) at DRAM node, then make the
process sleep.
2. Launch redis-server and load prebaked snapshot image, dump.rdb.
(85GB consumed: 52GB for anon and 33GB for file cache)
3. Run YCSB to make zipfian distribution of memory accesses to
redis-server, then measure execution time.
4. Repeat 4 over 50 times to measure the average execution time for
each run.
5. Increase the cold memory size, then the sequence repeats from step 2.

I didn't want to make the evaluation too long in the cover letter, but
I have also evaluated another scenario, which lazily enabled DAMON just
before the YCSB run at step 4. I will call this test "DAMON lazy". This
part was missing from the cover letter.

1. Allocate cold memory(mmap+memset) at DRAM node, then make the
process sleep.
2. Launch redis-server and load prebaked snapshot image, dump.rdb.
(85GB consumed: 52GB for anon and 33GB for file cache)
*. "Turn on DAMON."
4. Run YCSB to make zipfian distribution of memory accesses to
redis-server, then measure execution time.
5. Repeat 4 over 50 times to measure the average execution time for
each run.
6. Increase the cold memory size, then the sequence repeats from step 2.

In the "DAMON lazy" senario, DAMON started monitoring late so the
initial redis-server placement is same as "default", but started to
demote cold data and promote redis data just before YCSB run.

The full test result is as follows.

1. YCSB zipfian distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
=============+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
=============+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only | 1.00 - - - - - - - | 1.00
CXL-only | 1.22 - - - - - - - | 1.22
default | - 1.12 1.13 1.14 1.16 1.19 1.21 1.21 | 1.17
DAMON tiered | - 1.04 1.03 1.04 1.06 1.05 1.05 1.05 | 1.05
DAMON lazy | - 1.04 1.05 1.05 1.06 1.06 1.07 1.07 | 1.06
=============+================================================+=========
CXL usage of redis-server in GB | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only | 0.0 - - - - - - - | 0.0
CXL-only | 52.6 - - - - - - - | 52.6
default | - 20.4 27.0 33.1 39.5 45.6 50.5 50.3 | 38.1
DAMON tiered | - 0.1 0.3 0.8 0.6 0.7 1.3 0.9 | 0.7
DAMON lazy | - 2.9 3.1 3.7 4.7 6.6 8.2 9.7 | 5.6
=============+================================================+=========
Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and |
DAMOS_MIGRATE_HOT(promotion) | AVERAGE
-------------+------------------------------------------------+---------
DAMON tiered | |
- demotion | - 522 510 523 520 513 558 558 | 529
- promotion | - 0.1 1.3 6.2 8.1 7.2 22 17 | 8.8
DAMON lazy | |
- demotion | - 288 277 322 343 315 312 320 | 311
- promotion | - 33 44 41 55 73 89 101 | 62
=============+================================================+=========

I have included "DAMON lazy" result and also the migration size by new
DAMOS migrate actions. Please note that demotion size is way higher
than promotion because promotion target is only for redis data, but
demotion target includes huge cold memory allocated by mmap + memset.
(there could be some ping-pong issue though.)

As you mentioned, "DAMON tiered" case gets more benefit because new
redis allocations go to DRAM more than "default", but it also gets
benefit from promotion when it is under higher memory pressure as shown
in 490G and 500G cases. It promotes 22GB and 17GB of redis data to DRAM
from CXL.

In the case of "DAMON lazy", it shows more promotion size as expected
and it gets increases as memory pressure goes higher from left to right.

I will share "latest" workload result as well and it shows similar
tendency.

2. YCSB latest distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
=============+================================================+=========
| cold memory occupied by mmap and memset |
| 0G 440G 450G 460G 470G 480G 490G 500G |
=============+================================================+=========
Execution time normalized to DRAM-only values | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only | 1.00 - - - - - - - | 1.00
CXL-only | 1.18 - - - - - - - | 1.18
default | - 1.18 1.19 1.18 1.18 1.17 1.19 1.18 | 1.18
DAMON tiered | - 1.04 1.04 1.04 1.05 1.04 1.05 1.05 | 1.04
DAMON lazy | - 1.05 1.05 1.06 1.06 1.07 1.06 1.07 | 1.06
=============+================================================+=========
CXL usage of redis-server in GB | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only | 0.0 - - - - - - - | 0.0
CXL-only | 52.6 - - - - - - - | 52.6
default | - 20.5 27.1 33.2 39.5 45.5 50.4 50.5 | 38.1
DAMON tiered | - 0.2 0.4 0.7 1.6 1.2 1.1 3.4 | 1.2
DAMON lazy | - 5.3 4.1 3.9 6.4 8.8 10.1 11.3 | 7.1
=============+================================================+=========
Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and |
DAMOS_MIGRATE_HOT(promotion) | AVERAGE
-------------+------------------------------------------------+---------
DAMON tiered | |
- demotion | - 493 478 487 516 510 540 512 | 505
- promotion | - 0.1 0.2 8.2 5.6 4.0 5.9 29 | 7.5
DAMON lazy | |
- demotion | - 315 318 293 290 308 322 286 | 305
- promotion | - 36 45 38 56 74 91 99 | 63
=============+================================================+=========

> Additionally, do you have any data on workloads that exceed the capacity
> of the DRAM tier? Here you say you have 512GB of local DRAM, but only
> test a workload that caps out at 500G. Have you run a test of, say,
> 550GB to see the effect of DAMON HOT/COLD migration actions when DRAM
> capacity is exceeded?

I didn't want to remove DRAM from my server, so I kept using 512GB of
DRAM, but I couldn't make a single workload that consumes more than the
DRAM size.

I wanted to use a more realistic workload rather than micro benchmarks,
and the core concept of this test is to cover realistic scenarios with a
system-wide view. If the system has 512GB of local DRAM, it wouldn't be
possible to make the entire 512GB of DRAM hot; it'd have some amount of
cold memory, which can be the target of demotion. Then we can find some
workload that is actively used and promote it as much as possible. That's
why I made the promotion policy aggressive.

> Can you also provide the DRAM-only results for each test? Presumably,
> as workload size increases from 440G to 500G, the system probably starts
> using some amount of swap/zswap/whatever. It would be good to know how
> this system compares to swap small amounts of overflow.

It looks like my explanation wasn't clear enough. The sizes from 440GB to
500GB are for pre-allocated cold data that puts memory pressure on the
system, so that redis-server cannot be fully allocated on fast DRAM and is
partially allocated on CXL memory as well.

And my evaluation environment doesn't have swap space to focus on
migration rather than swap.

>
> ~Gregory

I hope my explanation helps you understand. Please let me know if you have
more questions.

Thanks,
Honggyu


2024-04-08 18:37:33

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

On Mon, 8 Apr 2024 21:06:44 +0900 Honggyu Kim <[email protected]> wrote:

> On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park <[email protected]> wrote:
> > On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:
[...]
> > > Here is an example usage of this 'migrate_cold' action.
> > >
> > > $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
> > > $ cat contexts/<N>/schemes/<N>/action
> > > migrate_cold
> > > $ echo 2 > contexts/<N>/schemes/<N>/target_nid
> > > $ echo commit > state
> > > $ numactl -p 0 ./hot_cold 500M 600M &
> > > $ numastat -c -p hot_cold
> > >
> > > Per-node process memory usage (in MBs)
> > > PID Node 0 Node 1 Node 2 Total
> > > -------------- ------ ------ ------ -----
> > > 701 (hot_cold) 501 0 601 1101
> > >
> > > Since there are some common routines with pageout, many functions have
> > > similar logic between pageout and migrate cold.
> > >
> > > damon_pa_migrate_folio_list() is a minimized version of
> > > shrink_folio_list(), but it's minified only for demotion.
> >
> > MIGRATE_COLD is not only for demotion, right? I think the last two words are
> > better to be removed to reduce unnecessary confusion.
>
> You mean the last two sentences? I will remove them if you feel it's
> confusing.

Yes. My real intended suggestion was 's/only for demotion/only for
migration/', but entirely removing the sentences is also ok for me.

>
> > >
> > > Signed-off-by: Honggyu Kim <[email protected]>
> > > Signed-off-by: Hyeongtak Ji <[email protected]>
> > > ---
> > > include/linux/damon.h | 2 +
> > > mm/damon/paddr.c | 146 ++++++++++++++++++++++++++++++++++++++-
> > > mm/damon/sysfs-schemes.c | 4 ++
> > > 3 files changed, 151 insertions(+), 1 deletion(-)
[...]
> > > --- a/mm/damon/paddr.c
> > > +++ b/mm/damon/paddr.c
[...]
> > > +{
> > > + unsigned int nr_succeeded;
> > > + nodemask_t allowed_mask = NODE_MASK_NONE;
> > > +
> >
> > I personally prefer not having empty lines in the middle of variable
> > declarations/definitions. Could we remove this empty line?
>
> I can remove it, but I would like to have more discussion about this
> issue. The current implementation allows only a single migration
> target with "target_nid", but users might want to provide fall back
> migration target nids.
>
> For example, if more than two CXL nodes exist in the system, users might
> want to migrate cold pages to any CXL nodes. In such cases, we might
> have to make "target_nid" accept comma separated node IDs. nodemask can
> be better but we should provide a way to change the scanning order.
>
> I would like to hear how you think about this.

Good point. I think we could later extend the sysfs file to receive the
comma-separated numbers, or even mask. For simplicity, adding sysfs files
dedicated for the different format of inputs could also be an option (e.g.,
target_nids_list, target_nids_mask). But starting from this single node as is
now looks ok to me.
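For example (hypothetical knob names following the idea above):

    $ echo 2,3 > target_nids_list     # ordered fallback list
    $ echo 0xc > target_nids_mask     # or a node mask for nodes 2 and 3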

[...]
> > > + /* 'folio_list' is always empty here */
> > > +
> > > + /* Migrate folios selected for migration */
> > > + nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
> > > + /* Folios that could not be migrated are still in @migrate_folios */
> > > + if (!list_empty(&migrate_folios)) {
> > > + /* Folios which weren't migrated go back on @folio_list */
> > > + list_splice_init(&migrate_folios, folio_list);
> > > + }
> >
> > Let's not use braces for single statement
> > (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).
>
> Hmm.. I know the convention but left it as is because of the comment.
> If I remove the braces, it would leave a weird alignment between the
> comment and the statement lines.

I don't really hate such alignment. But if you don't like it, how about moving
the comment out of the if statement? Having one comment for one-line if
statement looks not bad to me.

>
> > > +
> > > + try_to_unmap_flush();
> > > +
> > > + list_splice(&ret_folios, folio_list);
> >
> > Can't we move remaining folios in migrate_folios to ret_folios at once?
>
> I will see if it's possible.

Thank you. Not a strict request, though.

[...]
> > > + nid = folio_nid(lru_to_folio(folio_list));
> > > + do {
> > > + struct folio *folio = lru_to_folio(folio_list);
> > > +
> > > + if (nid == folio_nid(folio)) {
> > > + folio_clear_active(folio);
> >
> > I think this was necessary for demotion, but now this should be removed since
> > this function is no longer for demotion but for migrating random pages, right?
>
> Yeah, it can be removed because we do migration instead of demotion,
> but I need to make sure it doesn't change the performance evaluation
> results.

Yes, please ensure the test results are valid :)


Thanks,
SJ

[...]

2024-04-09 09:57:20

by Honggyu Kim

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

Hi SeongJae,

On Mon, 8 Apr 2024 10:52:28 -0700 SeongJae Park <[email protected]> wrote:
> On Mon, 8 Apr 2024 21:06:44 +0900 Honggyu Kim <[email protected]> wrote:
>
> > On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park <[email protected]> wrote:
> > > On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:
> [...]
> > > > Here is an example usage of this 'migrate_cold' action.
> > > >
> > > > $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
> > > > $ cat contexts/<N>/schemes/<N>/action
> > > > migrate_cold
> > > > $ echo 2 > contexts/<N>/schemes/<N>/target_nid
> > > > $ echo commit > state
> > > > $ numactl -p 0 ./hot_cold 500M 600M &
> > > > $ numastat -c -p hot_cold
> > > >
> > > > Per-node process memory usage (in MBs)
> > > > PID Node 0 Node 1 Node 2 Total
> > > > -------------- ------ ------ ------ -----
> > > > 701 (hot_cold) 501 0 601 1101
> > > >
> > > > Since there are some common routines with pageout, many functions have
> > > > similar logic between pageout and migrate cold.
> > > >
> > > > damon_pa_migrate_folio_list() is a minimized version of
> > > > shrink_folio_list(), but it's minified only for demotion.
> > >
> > > MIGRATE_COLD is not only for demotion, right? I think the last two words are
> > > better to be removed to reduce unnecessary confusion.
> >
> > You mean the last two sentences? I will remove them if you feel it's
> > confusing.
>
> Yes. My real intended suggestion was 's/only for demotion/only for
> migration/', but entirely removing the sentences is also ok for me.

Ack.

> >
> > > >
> > > > Signed-off-by: Honggyu Kim <[email protected]>
> > > > Signed-off-by: Hyeongtak Ji <[email protected]>
> > > > ---
> > > > include/linux/damon.h | 2 +
> > > > mm/damon/paddr.c | 146 ++++++++++++++++++++++++++++++++++++++-
> > > > mm/damon/sysfs-schemes.c | 4 ++
> > > > 3 files changed, 151 insertions(+), 1 deletion(-)
> [...]
> > > > --- a/mm/damon/paddr.c
> > > > +++ b/mm/damon/paddr.c
> [...]
> > > > +{
> > > > + unsigned int nr_succeeded;
> > > > + nodemask_t allowed_mask = NODE_MASK_NONE;
> > > > +
> > >
> > > I personally prefer not having empty lines in the middle of variable
> > > declarations/definitions. Could we remove this empty line?
> >
> > I can remove it, but I would like to have more discussion about this
> > issue. The current implementation allows only a single migration
> > target with "target_nid", but users might want to provide fall back
> > migration target nids.
> >
> > For example, if more than two CXL nodes exist in the system, users might
> > want to migrate cold pages to any CXL nodes. In such cases, we might
> > have to make "target_nid" accept comma separated node IDs. nodemask can
> > be better but we should provide a way to change the scanning order.
> >
> > I would like to hear how you think about this.
>
> Good point. I think we could later extend the sysfs file to receive the
> comma-separated numbers, or even mask. For simplicity, adding sysfs files
> dedicated for the different format of inputs could also be an option (e.g.,
> target_nids_list, target_nids_mask). But starting from this single node as is
> now looks ok to me.

If you think we can start from a single node, then I will keep it as is.
But are you okay if I change the current 'target_nid' to accept
comma-separated numbers later? Or do you want to introduce another knob
such as 'target_nids_list'? What about renaming 'target_nid' to
'target_nids' in the first place?

> [...]
> > > > + /* 'folio_list' is always empty here */
> > > > +
> > > > + /* Migrate folios selected for migration */
> > > > + nr_migrated += migrate_folio_list(&migrate_folios, pgdat, target_nid);
> > > > + /* Folios that could not be migrated are still in @migrate_folios */
> > > > + if (!list_empty(&migrate_folios)) {
> > > > + /* Folios which weren't migrated go back on @folio_list */
> > > > + list_splice_init(&migrate_folios, folio_list);
> > > > + }
> > >
> > > Let's not use braces for single statement
> > > (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).
> >
> > Hmm.. I know the convention but left it as is because of the comment.
> > If I remove the braces, it would leave a weird alignment between the
> > comment and the statement lines.
>
> I don't really hate such alignment. But if you don't like it, how about moving
> the comment out of the if statement? Having one comment for one-line if
> statement looks not bad to me.

Ack. I will manage this in the next revision.

> >
> > > > +
> > > > + try_to_unmap_flush();
> > > > +
> > > > + list_splice(&ret_folios, folio_list);
> > >
> > > Can't we move remaining folios in migrate_folios to ret_folios at once?
> >
> > I will see if it's possible.
>
> Thank you. Not a strict request, though.
>
> [...]
> > > > + nid = folio_nid(lru_to_folio(folio_list));
> > > > + do {
> > > > + struct folio *folio = lru_to_folio(folio_list);
> > > > +
> > > > + if (nid == folio_nid(folio)) {
> > > > + folio_clear_active(folio);
> > >
> > > I think this was necessary for demotion, but now this should be removed since
> > > this function is no longer for demotion but for migrating random pages, right?
> >
> > Yeah, it can be removed because we do migration instead of demotion,
> > but I need to make sure it doesn't change the performance evaluation
> > results.
>
> Yes, please ensure the test results are valid :)

Sure. Thanks for your detailed review!
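
On moving the remaining folios into ret_folios at once, one possible
shape is the following untested sketch built from the quoted snippet,
so that only a single splice back to @folio_list remains:

	/* Folios which weren't migrated join ret_folios in one go */
	list_splice_init(&migrate_folios, &ret_folios);

	try_to_unmap_flush();

	/* hand everything back to the caller's @folio_list */
	list_splice(&ret_folios, folio_list);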

Please note that I will be out of office this week, so I won't be able to
answer quickly.

Thanks,
Honggyu

>
> Thanks,
> SJ
>
> [...]
>

2024-04-09 10:00:14

by Honggyu Kim

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

On Mon, 8 Apr 2024 22:41:04 +0900 Honggyu Kim <[email protected]> wrote:
[...]
> To explain this, I'd better share more test results. In the "Evaluation
> Workload" section, the test sequence can be summarized as follows.
>
> *. "Turn on DAMON."
> 1. Allocate cold memory (mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)
> 3. Run YCSB to generate a zipfian distribution of memory accesses to
> redis-server, then measure the execution time.
> 4. Repeat 4 over 50 times to measure the average execution time for
> each run.

Sorry, "Repeat 4 over 50 times" is incorrect. This should be "Repeat 3
over 50 times".

> 5. Increase the cold memory size, then go back to step 2 and repeat.
>
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4. I will call this test "DAMON lazy". This
> part is missing from the cover letter.
>
> 1. Allocate cold memory (mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)
> *. "Turn on DAMON."
> 4. Run YCSB to generate a zipfian distribution of memory accesses to
> redis-server, then measure the execution time.
> 5. Repeat 4 over 50 times to measure the average execution time for
> each run.
> 6. Increase the cold memory size, then go back to step 2 and repeat.
>
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
[...]

Thanks,
Honggyu

2024-04-09 16:44:24

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

Hi Honggyu,

On Tue, 9 Apr 2024 18:54:14 +0900 Honggyu Kim <[email protected]> wrote:
> On Mon, 8 Apr 2024 10:52:28 -0700 SeongJae Park <[email protected]> wrote:
> > On Mon, 8 Apr 2024 21:06:44 +0900 Honggyu Kim <[email protected]> wrote:
> > > On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park <[email protected]> wrote:
> > > > On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim <[email protected]> wrote:
[...]
> > > I can remove it, but I would like to have more discussion about this
> > > issue. The current implementation allows only a single migration
> > > target with "target_nid", but users might want to provide fallback
> > > migration target nids.
> > >
> > > For example, if more than two CXL nodes exist in the system, users might
> > > want to migrate cold pages to any CXL node. In such cases, we might
> > > have to make "target_nid" accept comma-separated node IDs. A nodemask
> > > could be better, but we should provide a way to change the scanning order.
> > >
> > > I would like to hear what you think about this.
> >
> > Good point. I think we could later extend the sysfs file to receive
> > comma-separated numbers, or even a mask. For simplicity, adding sysfs files
> > dedicated to the different input formats could also be an option (e.g.,
> > target_nids_list, target_nids_mask). But starting from this single node as is
> > now looks ok to me.
>
> If you think we can start from a single node, then I will keep it as is.
> But are you okay if I change the current 'target_nid' to accept
> comma-separated numbers later? Or do you want to introduce another knob
> such as 'target_nids_list'? What about renaming 'target_nid' to
> 'target_nids' in the first place?

I have no strong concern or opinion about this at the moment. Please feel free
to rename it to 'target_nids' if you think that's better.

[...]
> Please note that I will be out of office this week, so I won't be able to
> answer quickly.

No problem, I hope you take and enjoy your time :)


Thanks,
SJ

[...]

2024-04-10 00:00:37

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
>
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <[email protected]> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> >
> > The question I have is exactly how often is MIGRATE_HOT actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
>
> To explain this, I'd better share more test results. In the "Evaluation
> Workload" section, the test sequence can be summarized as follows.
>
> *. "Turn on DAMON."
> 1. Allocate cold memory (mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now, you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measuring the overhead in different scenarios.
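
For concreteness, I read step 1 as something like this minimal
userspace sketch; the exact size and the DRAM-node binding (via numactl
or mbind()) are assumptions on my part:

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical size; the tests sweep 440GB to 500GB of cold data */
	size_t sz = 440UL << 30;

	char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, sz);	/* fault every page in once, then... */
	pause();		/* ...sleep so the pages stay cold */
	return 0;
}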

I would still love to know what results a demote-only system would
produce, mostly because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured.

Given the additional results below, it looks like a demote-only system
would likely trend toward CXL-only, and so this affirmatively supports
the promotion logic.

Just another datum that is useful and paints a more complete picture.

> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4. I will call this test "DAMON lazy". This
> part is missing from the cover letter.
>
> 1. Allocate cold memory (mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)
> *. "Turn on DAMON."
>
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
>

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

>
> I have included the "DAMON lazy" result and also the migration sizes from
> the new DAMOS migrate actions. Please note that the demotion size is way
> higher than the promotion size because the promotion target is only redis
> data, while the demotion target includes the huge cold memory allocated by
> mmap + memset. (there could be some ping-pong issue though.)
>
> As you mentioned, the "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than in "default", but it also benefits
> from promotion when it is under higher memory pressure, as shown in the
> 490G and 500G cases. It promotes 22GB and 17GB of redis data to DRAM
> from CXL.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory pressure on the faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there who would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater memory migration is an excellent result.

> > Can you also provide the DRAM-only results for each test? Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever. It would be good to know how
> > this system compares to swapping small amounts of overflow.
>
> It looks like my explanation wasn't clear enough. The sizes from 440GB
> to 500GB are for preallocated cold data that puts memory pressure on the
> system, so that redis-server cannot be fully allocated at fast DRAM and
> is partially allocated at CXL memory as well.
>

Yes, sorry for the misunderstanding. This makes it much clearer.

>
> I hope my explanation is helpful. Please let me know if you have more
> questions.
>

Excellent work, exciting results! Thank you for the additional answers
:]

~Gregory

2024-05-11 20:16:31

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

On Fri, 5 Apr 2024 12:19:07 -0700 SeongJae Park <[email protected]> wrote:

> On Fri, 5 Apr 2024 15:08:50 +0900 Honggyu Kim <[email protected]> wrote:
>
> > This is a preparation patch that introduces migration modes.
> >
> > damon_pa_pageout() is renamed to damon_pa_migrate() and receives an
> > extra argument for migration_mode.
>
> I personally think keeping damon_pa_pageout() as is and adding a new function
> (damon_pa_migrate()) with some duplicated code is also ok, but this approach
> also looks fine to me. So I have no strong opinion here, but just letting you
> know I would have no objection to either approach.

Meanwhile, we added more logic in damon_pa_pageout() for doing its own page
idleness double check[1]. That makes reusing damon_pa_pageout() for multiple
purposes a bit complex. I think the complexity introduced a problem in this
patch that I also missed before. See the comment inline below. Hence I now
think it would be better to go the suggested way.

If we take that approach, this patch is no longer necessary, and can
therefore be dropped.

[1] https://lore.kernel.org/[email protected]


Thanks,
SJ

[...]
>
> >
> > No functional changes applied.
> >
> > Signed-off-by: Honggyu Kim <[email protected]>
> > ---
> > mm/damon/paddr.c | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> > index 081e2a325778..277a1c4d833c 100644
> > --- a/mm/damon/paddr.c
> > +++ b/mm/damon/paddr.c
> > @@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
> > return false;
> > }
> >
> > -static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
> > +enum migration_mode {
> > + MIG_PAGEOUT,
> > +};
>
> To avoid name conflicts, what about renaming to 'damos_migration_mode' and
> 'DAMOS_MIG_PAGEOUT'?
>
> > +
> > +static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > + enum migration_mode mm)
>
> My poor brain got a bit confused by the name. What about calling it 'mode'?
>
> > {
> > unsigned long addr, applied;
> > LIST_HEAD(folio_list);
> > @@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)

Before this line, damon_pa_pageout() calls folio_clear_referenced() and
folio_test_clear_young() for the folio, because this is pageout code. The
changed function, damon_pa_migrate(), is not only for cold pages but for
general migrations. Hence this should also be handled based on the
migration mode, but it is not.

I think this problem came from the increased complexity of this function.
Hence I think it is better to keep damon_pa_pageout() as is and add a new
function for migration.
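
For illustration only, keeping the single function would mean gating
those calls on the migration mode, roughly as in the sketch below
(using the 'mode' and 'DAMOS_MIG_PAGEOUT' renames suggested earlier in
this thread); the v4 direction is to add a separate function instead:

	/* only the pageout path should clear the folio's access bits */
	if (mode == DAMOS_MIG_PAGEOUT) {
		folio_clear_referenced(folio);
		folio_test_clear_young(folio);
	}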


Thanks,
SJ

[...]

2024-05-12 18:00:57

by SeongJae Park

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

On Sat, 11 May 2024 13:16:17 -0700 SeongJae Park <[email protected]> wrote:

> On Fri, 5 Apr 2024 12:19:07 -0700 SeongJae Park <[email protected]> wrote:
>
> > On Fri, 5 Apr 2024 15:08:50 +0900 Honggyu Kim <[email protected]> wrote:
> >
> > > This is a preparation patch that introduces migration modes.
> > >
> > > damon_pa_pageout() is renamed to damon_pa_migrate() and receives an
> > > extra argument for migration_mode.
> >
> > I personally think keeping damon_pa_pageout() as is and adding a new function
> > (damon_pa_migrate()) with some duplicated code is also ok, but this approach
> > also looks fine to me. So I have no strong opinion here, but just letting you
> > know I would have no objection to either approach.
>
> Meanwhile, we added more logic in damon_pa_pageout() for doing its own page
> idleness double check[1]. That makes reusing damon_pa_pageout() for multiple
> purposes a bit complex. I think the complexity introduced a problem in this
> patch that I also missed before. See the comment inline below. Hence I now
> think it would be better to go the suggested way.
>
> If we take that approach, this patch is no longer necessary, and can
> therefore be dropped.
>
> [1] https://lore.kernel.org/[email protected]

I updated this patchset to address the comments on this thread, and posted it
as RFC patchset v4 on behalf of Honggyu with his approval:
https://lore.kernel.org/[email protected]


Thanks,
SJ

[...]