
2024-06-14 03:00:34

Subject: [PATCH v6 0/7] DAMON based tiered memory management for CXL memory

There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says that no implementation of the demote/promote DAMOS actions has
been made. This patch series implements them for the physical address
space so that the scheme can be applied system-wide.

Changes from v5:
https://lore.kernel.org/[email protected]
1. Remove new actions in usage document as it's for debugfs
2. Apply minor fixes on cover letter

Changes from RFC v4:
https://lore.kernel.org/[email protected]
1. Add usage and design documents
2. Rename alloc_demote_folio to alloc_migrate_folio
3. Add evaluation results with "demotion_enabled" true
4. Rebase based on v6.10-rc3

Changes from RFC v3:
https://lore.kernel.org/[email protected]
0. updated from v3 and posted by SJ on behalf of Honggyu under his
approval.
1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
2. Drop vmstat change
3. Drop unnecessary page reference check

Changes from RFC v2:
https://lore.kernel.org/[email protected]
1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
2. Create 'target_nid' to set the migration target node instead of
depending on node distance based information.
3. Instead of having page level access check in this patch series,
delegate the job to a new DAMOS filter type YOUNG[2].
4. Introduce vmstat counters "damon_migrate_{hot,cold}".
5. Rebase from v6.7 to v6.8.

Changes from RFC:
https://lore.kernel.org/[email protected]
1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c.
2. Simplify some functions of vmscan.c and used in paddr.c, but need
to be reviewed more in depth.
3. Refactor most functions for common usage for both promote and
demote actions and introduce an enum migration_mode for its control.
4. Add "target_nid" sysfs knob for migration destination node for both
promote and demote actions.
5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move them even above
DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, simply called CXL memory in
this cover letter, some systems are becoming more heterogeneous, with
memory that has different latency and bandwidth characteristics. Such
memory is usually exposed as separate NUMA nodes in separate memory
tiers, and CXL memory is used as a slow tier because of its protocol
overhead compared to local DRAM.

On such systems, memory pages must be placed carefully on the proper
NUMA nodes based on their access frequency. Otherwise, frequently
accessed pages might reside on slow tiers and degrade performance
unexpectedly. Moreover, memory access patterns can change at runtime.

To handle this problem, we need a way to monitor memory access patterns
and migrate pages based on their access temperature. The DAMON (Data
Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes)
are useful for monitoring and migrating pages. DAMOS provides multiple
actions based on DAMON monitoring results and can be used for proactive
reclaim, which means swapping cold pages out with the DAMOS_PAGEOUT
action, but it doesn't support migration actions such as demotion and
promotion between tiered memory nodes.

This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for promotion
from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers.
This prevents hot pages from being stuck on slow tiers, which degrades
performance, and lets cold pages be proactively demoted to slow tiers so
that the system has a better chance of allocating hot pages on fast
tiers.

DAMON provides various tuning knobs, but we found that proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation shows that it reduces the slowdown relative to DRAM-only
execution from 11% under the default memory policy to 3~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration is outside the scope of this patch
series, but a rough idea is shared here to explain the evaluation
result.

DAMON provides many knobs for fine tuning, but the configuration file
used here is generated by HMSDK[3]. It includes a gen_config.py script
that generates a json file with the full configuration of DAMON knobs,
and it creates multiple kdamonds for each NUMA node when DAMON is
enabled so that hot/cold based migration can run for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5]. We measured two workloads, with zipfian and latest
distributions, but their configs are slightly modified to make memory
usage higher and execution time longer for better evaluation.

The evaluation of these migrate_{hot,cold} actions targets system-wide
memory management rather than partitioning hot/cold pages of a single
workload. The default memory allocation policy places pages on the
fast tier DRAM node first, then places newly created pages on the slow
tier CXL node when the DRAM node has insufficient free space. Once
allocated, those pages never move between NUMA nodes. That is not true
when NUMA balancing is used, but NUMA balancing is outside the scope of
this DAMON based tiered memory management support.

If the working set of redis fits fully into the DRAM node, then redis
accesses the fast DRAM only. Since DRAM-only execution is faster than
partially accessing CXL memory in slow tiers, such an environment is not
useful for evaluating this patch series.

To distribute the pages of redis across the fast DRAM node and the slow
CXL node and evaluate our migrate_{hot,cold} actions, we pre-allocate
some cold memory externally using mmap and memset before launching
redis-server. We assume that there is a sufficient amount of cold
memory in datacenters, as the TMO[6] and TPP[7] papers mention.
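
The pre-allocation described above can be sketched in userspace C as
follows. This is only a minimal sketch, not the tool used in the
evaluation: the function names and the fill byte are hypothetical, and
how the block is bound to node0 is not stated in this cover letter.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map `size` bytes of anonymous memory and touch every page with memset()
 * so the kernel actually backs it with physical pages. Returns NULL on
 * failure. The function name and the fill byte are illustrative choices.
 */
static void *alloc_cold_memory(size_t size)
{
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* Writing a byte pattern faults in every page of the mapping. */
	memset(buf, 0x5a, size);
	return buf;
}

/* Unmap a block previously returned by alloc_cold_memory(). */
static int free_cold_memory(void *buf, size_t size)
{
	return munmap(buf, size);
}
```

A driver program would call alloc_cold_memory() once with the desired
cold memory size and then sleep, so that the pages stay resident but are
never accessed again.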

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_MIGRATE_COLD action for the DRAM node
   and the DAMOS_MIGRATE_HOT action for the CXL node. It demotes cold
   pages on the DRAM node and promotes hot pages on the CXL node at a
   regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset on
   the fast tier DRAM node, then make the process sleep so that the
   fast tier has insufficient space for redis-server.
3. Launch redis-server and load the prebaked snapshot image, dump.rdb.
   The redis-server consumes 52GB of anon pages and 33GB of file pages,
   but due to the cold memory allocated at 2, it fails to allocate its
   entire memory on the fast tier DRAM node, so the remainder is
   allocated on the slow tier CXL node. The ratio of DRAM:CXL depends
   on the size of the pre-allocated cold memory.
4. Run YCSB to generate a zipfian or latest distribution of memory
   accesses to redis-server, then measure its execution time when it
   completes.
5. Repeat 4 over 50 times to measure the average execution time for
   each run.
6. Increase the cold memory size, then go back to 2.

Each test at 4 took about a minute, so repeating it 50 times took
almost 1 hour for each cold memory size, which ranged from 440GB to
500GB in 10GB increments. So it took more than 10 hours to get the
entire evaluation results for both zipfian and latest workloads.
Repeating the same test set multiple times doesn't show much
difference, so I think this is enough to make the result reliable.

Evaluation Results
==================

All the result values are normalized to DRAM-only execution time because
the workload cannot be faster than DRAM-only unless it hits the peak
bandwidth, and our redis test doesn't go beyond the bandwidth limit.

So the DRAM-only execution time is the ideal result, unaffected by the
performance gap between DRAM and CXL. The NUMA node environment is as
follows.

node0 - local DRAM, 512GB with a CPU socket (fast tier)
node1 - disabled
node2 - CXL DRAM, 96GB, no CPU attached (slow tier)

The following is the result of generating a zipfian distribution to
redis-server; the numbers are averaged over 50 runs.

1. YCSB zipfian distribution read only workload
memory pressure with cold memory on node0 with 512GB of local DRAM.
====================+================================================+=========
                    |    cold memory occupied by mmap and memset     |
                    |   0G  440G  450G  460G  470G  480G  490G  500G |
====================+================================================+=========
Execution time normalized to DRAM-only values                        | GEOMEAN
--------------------+------------------------------------------------+---------
DRAM-only           | 1.00     -     -     -     -     -     -     - | 1.00
CXL-only            | 1.19     -     -     -     -     -     -     - | 1.19
default             |    -  1.00  1.05  1.08  1.12  1.14  1.18  1.18 | 1.11
DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07 *1.05 | 1.04
DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06 *1.06 | 1.05
====================+================================================+=========
CXL usage of redis-server in GB                                      | AVERAGE
--------------------+------------------------------------------------+---------
DRAM-only           |  0.0     -     -     -     -     -     -     - |  0.0
CXL-only            | 51.4     -     -     -     -     -     -     - | 51.4
default             |    -   0.6  10.6  20.5  30.5  40.5  47.6  50.4 | 28.7
DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
====================+================================================+=========

Each test result is based on the following execution environment.

  DRAM-only:    redis-server uses only local DRAM memory.
  CXL-only:     redis-server uses only CXL memory.
  default:      default memory policy (MPOL_DEFAULT),
                with NUMA balancing disabled.
  DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM
                nodes and DAMOS_MIGRATE_HOT for CXL nodes.
  DAMON lazy:   same as DAMON tiered, but DAMON is turned on just
                before YCSB starts issuing memory access requests.

The above result shows that the "default" execution time goes up as the
size of cold memory increases from 440G to 500G, because the more cold
memory is used, the more CXL memory the target redis workload uses,
which increases the execution time.

However, "DAMON tiered" and the other DAMON results show less slowdown
because the DAMOS_MIGRATE_COLD action at the DRAM node proactively
demotes pre-allocated cold memory to the CXL node, and the space freed
on DRAM increases the chance of allocating hot or warm pages of
redis-server on the fast DRAM node. Moreover, the DAMOS_MIGRATE_HOT
action at the CXL node actively promotes hot pages of redis-server to
the DRAM node.

As a result, more memory of redis-server stays on the DRAM node compared
to the "default" memory policy, which improves performance.
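
The GEOMEAN and AVERAGE columns of the table above can be reproduced
from the per-size values. The following is a minimal sketch, not the
script used for the evaluation. It assumes the summary columns cover
only the seven 440G to 500G cells, and the helper names are
hypothetical. The n-th root is found by bisection only so that the
sketch needs no math library.

```c
#include <stddef.h>

/* Arithmetic mean of n values, as used for the AVERAGE column. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/*
 * Geometric mean of n positive values, as used for the GEOMEAN column.
 * The n-th root of the product is found by bisection.
 */
static double geomean(const double *v, size_t n)
{
	double prod = 1.0, lo = v[0], hi = v[0];

	for (size_t i = 0; i < n; i++) {
		prod *= v[i];
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}

	/* The geometric mean always lies between the min and the max. */
	for (int iter = 0; iter < 100; iter++) {
		double mid = (lo + hi) / 2.0, p = 1.0;

		for (size_t i = 0; i < n; i++)
			p *= mid;
		if (p < prod)
			lo = mid;
		else
			hi = mid;
	}
	return (lo + hi) / 2.0;
}
```

Feeding the seven "default" execution times into geomean() gives about
1.105, which rounds to the 1.11 in the table and corresponds to the 11%
slowdown quoted in this cover letter.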

Please note that the result numbers of "DAMON tiered" and "DAMON lazy"
at 500G are marked with * stars, which means their test results were
replaced with reproduced tests that didn't hit the OOM issue.

That was needed because the test processes sometimes get OOM-killed when
DRAM has insufficient space. DAMOS_MIGRATE_HOT doesn't kick reclaim but
just gives up migration when there is not enough space on the DRAM side.
The problem happens when normal allocation competes with migration: if
the migration is done before the normal allocation, the completely
unrelated normal allocation can trigger reclaim, which incurs OOM.

Because of this issue, I have also tested more cases with the
"demotion_enabled" flag enabled so that such reclaim doesn't trigger
OOM but just demotes the reclaimed pages. The following test results
show more tests with "kswapd" marked.

2. YCSB zipfian distribution read only workload (with demotion_enabled true)
memory pressure with cold memory on node0 with 512GB of local DRAM.
====================+================================================+=========
                    |    cold memory occupied by mmap and memset     |
                    |   0G  440G  450G  460G  470G  480G  490G  500G |
====================+================================================+=========
Execution time normalized to DRAM-only values                        | GEOMEAN
--------------------+------------------------------------------------+---------
DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07  1.05 | 1.04
DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06  1.06 | 1.05
DAMON tiered kswapd |    -  1.03  1.03  1.03  1.03  1.02  1.02  1.03 | 1.03
DAMON lazy kswapd   |    -  1.04  1.04  1.04  1.03  1.05  1.04  1.05 | 1.04
====================+================================================+=========
CXL usage of redis-server in GB                                      | AVERAGE
--------------------+------------------------------------------------+---------
DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
DAMON tiered kswapd |    -   0.0   0.0   0.4   0.5   0.1   0.8   1.0 |  0.4
DAMON lazy kswapd   |    -   4.2   4.6   5.3   1.7   6.8   8.1   5.8 |  5.2
====================+================================================+=========

Each test result is based on the following execution environment.

  DAMON tiered:        same as before
  DAMON lazy:          same as before
  DAMON tiered kswapd: same as DAMON tiered, but with
                       /sys/kernel/mm/numa/demotion_enabled turned on
                       so that kswapd or direct reclaim does demotion.
  DAMON lazy kswapd:   same as DAMON lazy, but with
                       /sys/kernel/mm/numa/demotion_enabled turned on
                       so that kswapd or direct reclaim does demotion.

The "DAMON tiered kswapd" and "DAMON lazy kswapd" tests didn't trigger
OOM at all, unlike the other tests, because kswapd and direct reclaim
from the DRAM node can demote reclaimed pages to the CXL node
independently of DAMON actions, and their results are slightly better
than those without "demotion_enabled".

In summary, the evaluation results show that DAMON memory management
with DAMOS_MIGRATE_{HOT,COLD} actions reduces the slowdown relative to
DRAM-only execution from 11% under the "default" memory policy to 3~5%
when the system runs with high memory pressure on its fast tier DRAM
nodes.

Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
tiered memory systems run more efficiently under high memory pressure.

Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Rakie Kim <[email protected]>
Signed-off-by: Yunjeong Mun <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>

[1] https://lore.kernel.org/damon/[email protected]
[2] https://lore.kernel.org/damon/[email protected]
[3] https://github.com/skhynix/hmsdk
[4] https://github.com/redis/redis/tree/7.0.0
[5] https://github.com/brianfrankcooper/YCSB/tree/0.17.0
[6] https://dl.acm.org/doi/10.1145/3503222.3507731
[7] https://dl.acm.org/doi/10.1145/3582016.3582063

Honggyu Kim (5):
mm: make alloc_demote_folio externally invokable for migration
mm: rename alloc_demote_folio to alloc_migrate_folio
mm/migrate: add MR_DAMON to migrate_reason
mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
Docs/damon: document damos_migrate_{hot,cold}

Hyeongtak Ji (2):
mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

Documentation/admin-guide/mm/damon/usage.rst | 4 +
Documentation/mm/damon/design.rst | 4 +
include/linux/damon.h | 15 +-
include/linux/migrate_mode.h | 1 +
include/trace/events/migrate.h | 3 +-
mm/damon/core.c | 5 +-
mm/damon/dbgfs.c | 2 +-
mm/damon/lru_sort.c | 3 +-
mm/damon/paddr.c | 157 +++++++++++++++++++
mm/damon/reclaim.c | 3 +-
mm/damon/sysfs-schemes.c | 35 ++++-
mm/internal.h | 1 +
mm/vmscan.c | 5 +-
13 files changed, 228 insertions(+), 10 deletions(-)

--
2.34.1

2024-06-14 03:00:50

by Honggyu Kim


Subject: [PATCH v6 2/7] mm: rename alloc_demote_folio to alloc_migrate_folio

The alloc_demote_folio can also be used for general migration including
both demotion and promotion so it'd be better to rename it from
alloc_demote_folio to alloc_migrate_folio.

Signed-off-by: Honggyu Kim <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
---
mm/internal.h | 2 +-
mm/vmscan.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3ca996a4efc..9f967842f636 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,7 +1052,7 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long);

extern void set_pageblock_order(void);
-struct folio *alloc_demote_folio(struct folio *src, unsigned long private);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
unsigned long reclaim_pages(struct list_head *folio_list);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2f4406872f43..f5414b101909 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,7 +916,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
}

-struct folio *alloc_demote_folio(struct folio *src, unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
{
struct folio *dst;
nodemask_t *allowed_mask;
@@ -979,7 +979,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);

/* Demotion ignores all cpuset and mempolicy settings */
- migrate_pages(demote_folios, alloc_demote_folio, NULL,
+ migrate_pages(demote_folios, alloc_migrate_folio, NULL,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
&nr_succeeded);

--
2.34.1

2024-06-14 03:00:56

by Honggyu Kim


Subject: [PATCH v6 4/7] mm/migrate: add MR_DAMON to migrate_reason

The current patch series introduces DAMON based migration across NUMA
nodes so it'd be better to have a new migrate_reason in trace events.
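
The enum-plus-name pairing that this patch extends can be illustrated
with a small userspace sketch. It is not the kernel's tracing code: the
macro and helper names are hypothetical, only the reasons visible in
this patch are listed, and the enum values therefore differ from the
kernel's.

```c
#include <stddef.h>

/*
 * Abbreviated, userspace-only list of migrate reasons. Only the entries
 * visible in this patch are included; the real kernel enum has more.
 */
#define MIGRATE_REASONS(X)			\
	X(MR_NUMA_MISPLACED, "numa_misplaced")	\
	X(MR_CONTIG_RANGE,   "contig_range")	\
	X(MR_LONGTERM_PIN,   "longterm_pin")	\
	X(MR_DEMOTION,       "demotion")	\
	X(MR_DAMON,          "damon")

/* Expand the list once into enum constants... */
#define AS_ENUM(id, name) id,
enum migrate_reason { MIGRATE_REASONS(AS_ENUM) MR_TYPES };

/* ...and once more into the matching name table. */
#define AS_NAME(id, name) [id] = name,
static const char *const migrate_reason_names[MR_TYPES] = {
	MIGRATE_REASONS(AS_NAME)
};

/* Return the trace-style name of a reason, or NULL if out of range. */
static const char *migrate_reason_name(int reason)
{
	if (reason < 0 || reason >= MR_TYPES)
		return NULL;
	return migrate_reason_names[reason];
}
```

Keeping the identifier and its string in one list means a new reason
such as MR_DAMON only has to be added in one place in this sketch,
whereas the patch has to touch both the enum and the trace header.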

Signed-off-by: Honggyu Kim <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
---
include/linux/migrate_mode.h | 1 +
include/trace/events/migrate.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..cec36b7e7ced 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+ MR_DAMON,
MR_TYPES
};

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..cd01dd7b3640 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
EM( MR_CONTIG_RANGE, "contig_range") \
EM( MR_LONGTERM_PIN, "longterm_pin") \
- EMe(MR_DEMOTION, "demotion")
+ EM( MR_DEMOTION, "demotion") \
+ EMe(MR_DAMON, "damon")

/*
* First define the enums in the above macros to be exported to userspace
--
2.34.1

2024-06-14 03:00:58

by Honggyu Kim


Subject: [PATCH v6 3/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes

From: Hyeongtak Ji <[email protected]>

This patch adds target_nid under
/sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.
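
From userspace, setting the knob amounts to writing a decimal node ID
into the file. The following is a minimal sketch under that assumption.
The helper name is hypothetical, and the caller has to supply the real
path with the <N> placeholders filled in.

```c
#include <stdio.h>

/*
 * Write a node ID as a decimal string to the file at `path`.
 * Returns 0 on success and -1 on failure. The helper name is hypothetical.
 */
static int write_target_nid(const char *path, int nid)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", nid) < 0) {
		fclose(f);
		return -1;
	}
	/* Buffered write errors only show up when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Since the attribute is created with mode 0600, such a helper would need
to run as root against the real sysfs file.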

Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
---
include/linux/damon.h | 11 ++++++++++-
mm/damon/core.c | 5 ++++-
mm/damon/dbgfs.c | 2 +-
mm/damon/lru_sort.c | 3 ++-
mm/damon/reclaim.c | 3 ++-
mm/damon/sysfs-schemes.c | 33 ++++++++++++++++++++++++++++++++-
6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index f7da65e1ac04..21d6b69a015c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -374,6 +374,7 @@ struct damos_access_pattern {
* @apply_interval_us: The time between applying the @action.
* @quota: Control the aggressiveness of this scheme.
* @wmarks: Watermarks for automated (in)activation of this scheme.
+ * @target_nid: Destination node if @action is "migrate_{hot,cold}".
* @filters: Additional set of &struct damos_filter for &action.
* @stat: Statistics of this scheme.
* @list: List head for siblings.
@@ -389,6 +390,10 @@ struct damos_access_pattern {
* monitoring context are inactive, DAMON stops monitoring either, and just
* repeatedly checks the watermarks.
*
+ * @target_nid is used to set the migration target node for migrate_hot or
+ * migrate_cold actions, which means it's only meaningful when @action is either
+ * "migrate_hot" or "migrate_cold".
+ *
* Before applying the &action to a memory region, &struct damon_operations
* implementation could check pages of the region and skip &action to respect
* &filters
@@ -410,6 +415,9 @@ struct damos {
/* public: */
struct damos_quota quota;
struct damos_watermarks wmarks;
+ union {
+ int target_nid;
+ };
struct list_head filters;
struct damos_stat stat;
struct list_head list;
@@ -726,7 +734,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
- struct damos_watermarks *wmarks);
+ struct damos_watermarks *wmarks,
+ int target_nid);
void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
void damon_destroy_scheme(struct damos *s);

diff --git a/mm/damon/core.c b/mm/damon/core.c
index 6392f1cc97a3..c0ec5be4f56e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
- struct damos_watermarks *wmarks)
+ struct damos_watermarks *wmarks,
+ int target_nid)
{
struct damos *scheme;

@@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
scheme->wmarks = *wmarks;
scheme->wmarks.activated = true;

+ scheme->target_nid = target_nid;
+
return scheme;
}

diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 2461cfe2e968..51a6f1cac385 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,

pos += parsed;
scheme = damon_new_scheme(&pattern, action, 0, &quota,
- &wmarks);
+ &wmarks, NUMA_NO_NODE);
if (!scheme)
goto fail;

diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 3de2916a65c3..3775f0f2743d 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
/* under the quota. */
&quota,
/* (De)activate this according to the watermarks. */
- &damon_lru_sort_wmarks);
+ &damon_lru_sort_wmarks,
+ NUMA_NO_NODE);
}

/* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 9bd341d62b4c..a05ccb41749b 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -177,7 +177,8 @@ static struct damos *damon_reclaim_new_scheme(void)
/* under the quota. */
&damon_reclaim_quota,
/* (De)activate this according to the watermarks. */
- &damon_reclaim_wmarks);
+ &damon_reclaim_wmarks,
+ NUMA_NO_NODE);
}

static void damon_reclaim_copy_quota_status(struct damos_quota *dst,
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index bea5bc52846a..0632d28b67f8 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -6,6 +6,7 @@
*/

#include <linux/slab.h>
+#include <linux/numa.h>

#include "sysfs-common.h"

@@ -1445,6 +1446,7 @@ struct damon_sysfs_scheme {
struct damon_sysfs_scheme_filters *filters;
struct damon_sysfs_stats *stats;
struct damon_sysfs_scheme_regions *tried_regions;
+ int target_nid;
};

/* This should match with enum damos_action */
@@ -1470,6 +1472,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
scheme->kobj = (struct kobject){};
scheme->action = action;
scheme->apply_interval_us = apply_interval_us;
+ scheme->target_nid = NUMA_NO_NODE;
return scheme;
}

@@ -1692,6 +1695,28 @@ static ssize_t apply_interval_us_store(struct kobject *kobj,
return err ? err : count;
}

+static ssize_t target_nid_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct damon_sysfs_scheme *scheme = container_of(kobj,
+ struct damon_sysfs_scheme, kobj);
+
+ return sysfs_emit(buf, "%d\n", scheme->target_nid);
+}
+
+static ssize_t target_nid_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ struct damon_sysfs_scheme *scheme = container_of(kobj,
+ struct damon_sysfs_scheme, kobj);
+ int err = 0;
+
+ /* TODO: error handling for target_nid range. */
+ err = kstrtoint(buf, 0, &scheme->target_nid);
+
+ return err ? err : count;
+}
+
static void damon_sysfs_scheme_release(struct kobject *kobj)
{
kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
@@ -1703,9 +1728,13 @@ static struct kobj_attribute damon_sysfs_scheme_action_attr =
static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr =
__ATTR_RW_MODE(apply_interval_us, 0600);

+static struct kobj_attribute damon_sysfs_scheme_target_nid_attr =
+ __ATTR_RW_MODE(target_nid, 0600);
+
static struct attribute *damon_sysfs_scheme_attrs[] = {
&damon_sysfs_scheme_action_attr.attr,
&damon_sysfs_scheme_apply_interval_us_attr.attr,
+ &damon_sysfs_scheme_target_nid_attr.attr,
NULL,
};
ATTRIBUTE_GROUPS(damon_sysfs_scheme);
@@ -2031,7 +2060,8 @@ static struct damos *damon_sysfs_mk_scheme(
};

scheme = damon_new_scheme(&pattern, sysfs_scheme->action,
- sysfs_scheme->apply_interval_us, &quota, &wmarks);
+ sysfs_scheme->apply_interval_us, &quota, &wmarks,
+ sysfs_scheme->target_nid);
if (!scheme)
return NULL;

@@ -2068,6 +2098,7 @@ static void damon_sysfs_update_scheme(struct damos *scheme,

scheme->action = sysfs_scheme->action;
scheme->apply_interval_us = sysfs_scheme->apply_interval_us;
+ scheme->target_nid = sysfs_scheme->target_nid;

scheme->quota.ms = sysfs_quotas->ms;
scheme->quota.sz = sysfs_quotas->sz;
--
2.34.1

2024-06-14 03:08:36

by Honggyu Kim


Subject: [PATCH v6 7/7] Docs/damon: document damos_migrate_{hot,cold}

This patch adds DAMON descriptions for the "migrate_hot" and
"migrate_cold" actions to both the usage and design documents, as well
as for the new "target_nid" knob that sets the migration target node.

Signed-off-by: Honggyu Kim <[email protected]>
---
Documentation/admin-guide/mm/damon/usage.rst | 4 ++++
Documentation/mm/damon/design.rst | 4 ++++
2 files changed, 8 insertions(+)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index e58ceb89ea2a..98804e34448b 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -300,6 +300,10 @@ from the file and their meaning are same to those of the list on
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.

+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
+
.. _sysfs_access_pattern:

schemes/<N>/access_pattern/
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 3df387249937..3f12c884eb3a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -325,6 +325,10 @@ that supports each action are as below.
Supported by ``paddr`` operations set.
- ``lru_deprio``: Deprioritize the region on its LRU lists.
Supported by ``paddr`` operations set.
+ - ``migrate_hot``: Migrate the regions prioritizing warmer regions.
+ Supported by ``paddr`` operations set.
+ - ``migrate_cold``: Migrate the regions prioritizing colder regions.
+ Supported by ``paddr`` operations set.
- ``stat``: Do nothing but count the statistics.
Supported by all operations sets.

--
2.34.1

2024-06-14 03:11:57

by Honggyu Kim


Subject: [PATCH v6 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

From: Hyeongtak Ji <[email protected]>

This patch introduces the DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD, but prioritizes hot pages.

It migrates pages inside the given region to the NUMA node set by
'target_nid' in the sysfs.

Here is an example usage of this 'migrate_hot' action.

$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_hot
$ echo 0 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 2 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold

Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101

Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
---
include/linux/damon.h | 2 ++
mm/damon/paddr.c | 3 +++
mm/damon/sysfs-schemes.c | 1 +
3 files changed, 6 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 56714b6eb0d7..3d62d98d6359 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
* @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
* @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
* @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_HOT: Migrate the regions prioritizing warmer regions.
* @DAMOS_MIGRATE_COLD: Migrate the regions prioritizing colder regions.
* @DAMOS_STAT: Do nothing but count the stat.
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+ DAMOS_MIGRATE_HOT,
DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 882ae54af829..af6aac388a43 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -486,6 +486,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_pa_deactivate_pages(r, scheme);
+ case DAMOS_MIGRATE_HOT:
case DAMOS_MIGRATE_COLD:
return damon_pa_migrate(r, scheme);
case DAMOS_STAT:
@@ -508,6 +509,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
return damon_hot_score(context, r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_cold_score(context, r, scheme);
+ case DAMOS_MIGRATE_HOT:
+ return damon_hot_score(context, r, scheme);
case DAMOS_MIGRATE_COLD:
return damon_cold_score(context, r, scheme);
default:
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 880015d5b5ea..66fccfa776d7 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
"nohugepage",
"lru_prio",
"lru_deprio",
+ "migrate_hot",
"migrate_cold",
"stat",
};
--
2.34.1

2024-06-14 03:12:02

by Honggyu Kim


Subject: [PATCH v6 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

This patch introduces the DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT, but migrates folios to the given 'target_nid' in the
sysfs instead of swapping them out.

The 'target_nid' sysfs knob specifies the migration target node ID.

Here is an example usage of this 'migrate_cold' action.

$ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
$ cat contexts/<N>/schemes/<N>/action
migrate_cold
$ echo 2 > contexts/<N>/schemes/<N>/target_nid
$ echo commit > state
$ numactl -p 0 ./hot_cold 500M 600M &
$ numastat -c -p hot_cold

Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Total
-------------- ------ ------ ------ -----
701 (hot_cold) 501 0 601 1101

Since pageout and migrate cold share some common routines, many of
their functions have similar logic.

damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list().
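
As a rough illustration of how damon_pa_migrate_pages() in the diff below
groups the isolated folios, here is a simplified userspace sketch. It is not
the kernel code: 'struct fake_folio', migrate_batch() and migrate_by_node()
are hypothetical names, an array stands in for the folio list, and every
folio in a batch is assumed to migrate successfully. It only shows the idea
of cutting the sequence into runs of folios on the same node and skipping
batches whose source node already is the target (or when the target is
NUMA_NO_NODE).

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */

/* Hypothetical stand-in for a folio: only its node ID matters here. */
struct fake_folio {
	int nid;
};

/*
 * Hypothetical stand-in for the per-node migration step. Nothing is moved
 * when the batch already sits on the target node or no target is set;
 * otherwise every folio in the batch is assumed to migrate.
 */
static size_t migrate_batch(const struct fake_folio *batch, size_t len,
			    int src_nid, int target_nid)
{
	(void)batch;
	if (src_nid == target_nid || target_nid == NUMA_NO_NODE)
		return 0;
	return len;
}

/*
 * Walk the folios in order, cut the sequence into runs that share a node,
 * and hand each run to migrate_batch(). Returns the number of folios
 * reported as migrated and stores the number of batches in *nr_batches.
 */
static size_t migrate_by_node(const struct fake_folio *folios, size_t n,
			      int target_nid, size_t *nr_batches)
{
	size_t start = 0, migrated = 0, i;

	assert(nr_batches);
	*nr_batches = 0;
	for (i = 1; i <= n; i++) {
		/* Still inside the current same-node run: keep collecting. */
		if (i < n && folios[i].nid == folios[start].nid)
			continue;
		migrated += migrate_batch(&folios[start], i - start,
					  folios[start].nid, target_nid);
		(*nr_batches)++;
		start = i;
	}
	return migrated;
}
```

Batching per source node matters because the migration step takes the
source node's pglist_data and compares its node ID against 'target_nid'.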

Signed-off-by: Honggyu Kim <[email protected]>
Signed-off-by: Hyeongtak Ji <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
---
include/linux/damon.h | 2 +
mm/damon/paddr.c | 154 +++++++++++++++++++++++++++++++++++++++
mm/damon/sysfs-schemes.c | 1 +
3 files changed, 157 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 21d6b69a015c..56714b6eb0d7 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
* @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
* @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists.
* @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_COLD: Migrate the regions prioritizing colder regions.
* @DAMOS_STAT: Do nothing but count the stat.
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions
*
@@ -122,6 +123,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+ DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
};
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 18797c1b419b..882ae54af829 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
#include <linux/pagemap.h>
#include <linux/rmap.h>
#include <linux/swap.h>
+#include <linux/memory-tiers.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>

#include "../internal.h"
#include "ops-common.h"
@@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
return damon_pa_mark_accessed_or_deactivate(r, s, false);
}

+static unsigned int __damon_pa_migrate_folio_list(
+ struct list_head *migrate_folios, struct pglist_data *pgdat,
+ int target_nid)
+{
+ unsigned int nr_succeeded;
+ nodemask_t allowed_mask = NODE_MASK_NONE;
+ struct migration_target_control mtc = {
+ /*
+ * Allocate from 'node', or fail quickly and quietly.
+ * When this happens, 'page' will likely just be discarded
+ * instead of migrated.
+ */
+ .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
+ __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
+ .nid = target_nid,
+ .nmask = &allowed_mask
+ };
+
+ if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+ return 0;
+
+ if (list_empty(migrate_folios))
+ return 0;
+
+ /* Migration ignores all cpuset and mempolicy settings */
+ migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+ (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
+ &nr_succeeded);
+
+ return nr_succeeded;
+}
+
+static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
+ struct pglist_data *pgdat,
+ int target_nid)
+{
+ unsigned int nr_migrated = 0;
+ struct folio *folio;
+ LIST_HEAD(ret_folios);
+ LIST_HEAD(migrate_folios);
+
+ while (!list_empty(folio_list)) {
+ struct folio *folio;
+
+ cond_resched();
+
+ folio = lru_to_folio(folio_list);
+ list_del(&folio->lru);
+
+ if (!folio_trylock(folio))
+ goto keep;
+
+ /* Relocate its contents to another node. */
+ list_add(&folio->lru, &migrate_folios);
+ folio_unlock(folio);
+ continue;
+keep:
+ list_add(&folio->lru, &ret_folios);
+ }
+ /* 'folio_list' is always empty here */
+
+ /* Migrate folios selected for migration */
+ nr_migrated += __damon_pa_migrate_folio_list(
+ &migrate_folios, pgdat, target_nid);
+ /*
+ * Folios that could not be migrated are still in @migrate_folios. Add
+ * those back on @folio_list
+ */
+ if (!list_empty(&migrate_folios))
+ list_splice_init(&migrate_folios, folio_list);
+
+ try_to_unmap_flush();
+
+ list_splice(&ret_folios, folio_list);
+
+ while (!list_empty(folio_list)) {
+ folio = lru_to_folio(folio_list);
+ list_del(&folio->lru);
+ folio_putback_lru(folio);
+ }
+
+ return nr_migrated;
+}
+
+static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
+ int target_nid)
+{
+ int nid;
+ unsigned long nr_migrated = 0;
+ LIST_HEAD(node_folio_list);
+ unsigned int noreclaim_flag;
+
+ if (list_empty(folio_list))
+ return nr_migrated;
+
+ noreclaim_flag = memalloc_noreclaim_save();
+
+ nid = folio_nid(lru_to_folio(folio_list));
+ do {
+ struct folio *folio = lru_to_folio(folio_list);
+
+ if (nid == folio_nid(folio)) {
+ list_move(&folio->lru, &node_folio_list);
+ continue;
+ }
+
+ nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
+ NODE_DATA(nid),
+ target_nid);
+ nid = folio_nid(lru_to_folio(folio_list));
+ } while (!list_empty(folio_list));
+
+ nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
+ NODE_DATA(nid),
+ target_nid);
+
+ memalloc_noreclaim_restore(noreclaim_flag);
+
+ return nr_migrated;
+}
+
+static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s)
+{
+ unsigned long addr, applied;
+ LIST_HEAD(folio_list);
+
+ for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
+ struct folio *folio = damon_get_folio(PHYS_PFN(addr));
+
+ if (!folio)
+ continue;
+
+ if (damos_pa_filter_out(s, folio))
+ goto put_folio;
+
+ if (!folio_isolate_lru(folio))
+ goto put_folio;
+ list_add(&folio->lru, &folio_list);
+put_folio:
+ folio_put(folio);
+ }
+ applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
+ cond_resched();
+ return applied * PAGE_SIZE;
+}
+
+
static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
struct damon_target *t, struct damon_region *r,
struct damos *scheme)
@@ -336,6 +486,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_pa_deactivate_pages(r, scheme);
+ case DAMOS_MIGRATE_COLD:
+ return damon_pa_migrate(r, scheme);
case DAMOS_STAT:
break;
default:
@@ -356,6 +508,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
return damon_hot_score(context, r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_cold_score(context, r, scheme);
+ case DAMOS_MIGRATE_COLD:
+ return damon_cold_score(context, r, scheme);
default:
break;
}
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 0632d28b67f8..880015d5b5ea 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
"nohugepage",
"lru_prio",
"lru_deprio",
+ "migrate_cold",
"stat",
};

--
2.34.1

2024-06-14 16:38:26

by SeongJae Park

Subject: Re: [PATCH v6 7/7] Docs/damon: document damos_migrate_{hot,cold}

On Fri, 14 Jun 2024 12:00:09 +0900 Honggyu Kim <[email protected]> wrote:

> This patch adds damon description for "migrate_hot" and "migrate_cold"
> actions for both usage and design documents as long as a new
> "target_nid" knob to set the migration target node.
>
> Signed-off-by: Honggyu Kim <[email protected]>

Reviewed-by: SeongJae Park <[email protected]>

Thanks,
SJ

[...]

2024-06-14 17:08:42

by SeongJae Park

Subject: Re: [PATCH v6 0/7] DAMON based tiered memory management for CXL memory

On Fri, 14 Jun 2024 12:00:02 +0900 Honggyu Kim <[email protected]> wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> It says there is no implementation of the demote/promote DAMOS action
> are made. This patch series is about its implementation for physical
> address space so that this scheme can be applied in system wide level.
>
> Changes from v5:
> https://lore.kernel.org/[email protected]
> 1. Remove new actions in usage document as its for debugfs

Thank you, I confirmed this and gave you my Reviewed-by: tag.

> 2. Apply minor fixes on cover letter

But...

[...]
> 2. YCSB zipfian distribution read only workload (with demotion_enabled true)
> memory pressure with cold memory on node0 with 512GB of local DRAM.
> ====================+================================================+=========
>                     |    cold memory occupied by mmap and memset     |
>                     |   0G  440G  450G  460G  470G  480G  490G  500G |
> ====================+================================================+=========
> Execution time normalized to DRAM-only values                        | GEOMEAN
> --------------------+------------------------------------------------+---------
> DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07  1.05 | 1.04
> DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06  1.06 | 1.05
> DAMON tiered kswapd |    -  1.03  1.03  1.03  1.03  1.02  1.02  1.03 | 1.03
> DAMON lazy kswapd   |    -  1.04  1.04  1.04  1.03  1.05  1.04  1.05 | 1.04
> ====================+================================================+=========
> CXL usage of redis-server in GB                                      | AVERAGE
> --------------------+------------------------------------------------+---------
> DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 | 2.2
> DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 | 5.5
> DAMON tiered kswapd |    -   0.0   0.0   0.4   0.5   0.1   0.8   1.0 | 0.4
> DAMON lazy kswapd   |    -   4.2   4.6   5.3   1.7   6.8   8.1   5.8 | 5.2
> ====================+================================================+=========
>
> Each test result is based on the exeuction environment as follows.

Seems the typo is not fixed?

I don't want to delay this work for such a trivial thing, though. For the patch
series,

Acked-by: SeongJae Park <[email protected]>

Thanks,
SJ

[...]