2024-06-14 22:16:18

by Shivank Garg

Subject: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA

This series enhances the page migration code by batching the "folio move"
operations and enabling them to be offloaded to DMA hardware
accelerators.

Page migration involves three key steps:
1. Unmap: Allocate dst folios and replace the src folio PTEs with
migration PTEs.
2. TLB Flush: Flush the TLB for all unmapped folios.
3. Move: Copy the page mappings, flags and contents from src to dst.
Update metadata, lists and refcounts, and restore working PTEs.

While the first two steps (setting TLB flush pending for unmapped folios
and TLB batch flush) have already been optimized with batching, this
series focuses on optimizing the folio move step.

In the current design, the folio move operation is performed sequentially
for each folio:
for_each_folio() {
    Copy folio metadata like flags and mappings
    Copy the folio content from src to dst
    Update PTEs with new mappings
}

In the proposed design, we batch the folio copy operations to leverage DMA
offloading. The updated design is as follows:
for_each_folio() {
    Copy folio metadata like flags and mappings
}
Batch copy the page content from src to dst by offloading to DMA engine
for_each_folio() {
    Update PTEs with new mappings
}
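
The three phases above can be sketched in user-space C. This is only an illustration of the loop split, not the kernel implementation: `struct sketch_folio`, `sketch_batch_move()` and the `sketch_batch_copy_t` callback are hypothetical names, and a plain memcpy() loop stands in for the DMA engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a folio: some metadata plus one page of data. */
struct sketch_folio {
	unsigned long flags;
	int mapped;	/* 1 while the "PTEs" point at this folio */
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical copy backend; a DMA engine would be plugged in here. */
typedef void (*sketch_batch_copy_t)(struct sketch_folio *dst,
				    const struct sketch_folio *src, size_t n);

/* CPU fallback: copy every page of the batch with memcpy(). */
static void sketch_cpu_batch_copy(struct sketch_folio *dst,
				  const struct sketch_folio *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		memcpy(dst[i].data, src[i].data, SKETCH_PAGE_SIZE);
}

static void sketch_batch_move(struct sketch_folio *dst,
			      struct sketch_folio *src, size_t n,
			      sketch_batch_copy_t copy)
{
	assert(dst && src && copy);

	/* Phase 1: copy only the metadata, folio by folio. */
	for (size_t i = 0; i < n; i++)
		dst[i].flags = src[i].flags;

	/* Phase 2: a single batched content copy for the whole list. */
	copy(dst, src, n);

	/* Phase 3: switch the mappings over to the new folios. */
	for (size_t i = 0; i < n; i++) {
		src[i].mapped = 0;
		dst[i].mapped = 1;
	}
}
```

Calling `sketch_batch_move(dst, src, n, sketch_cpu_batch_copy)` gives the CPU-only behaviour; passing a different callback is where an offload engine would hook in under this sketch.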

Motivation:
Copying data across NUMA nodes during page migration incurs significant
overhead. For instance, folio copy can take up to 26.6% of the total
migration cost for migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying. Utilizing these hardware accelerators will become essential for
large-scale tiered-memory systems with CXL nodes where lots of page
promotion and demotion can happen.
Following the trend of batching operations in the memory migration core
path (like batch migration and batch TLB flush), batch copying folio data
is a logical progression in this direction.

We conducted experiments to measure folio copy overheads for page
migration from a remote node to a local NUMA node, modeling page
promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).

Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
Enabled), 1 NUMA node connected to each socket.
Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
THP, compaction and numa_balancing are disabled to reduce interference.

migrate_pages() { <- t1
    ..
    <- t2
    folio_copy()
    <- t3
    ..
} <- t4

Overhead fraction: F = (t3 - t2) / (t4 - t1)
Measurements are reported as mean ± SD, in CPU cycles per page.
Generic kernel:
4KB:: migrate_pages:17799.00±4278.25 folio_copy:794±232.87 F:0.0478±0.0199
2MB:: migrate_pages:3478.42±94.93 folio_copy:493.84±28.21 F:0.1418±0.0050
256MB:: migrate_pages:3668.56±158.47 folio_copy:815.40±171.76 F:0.2206±0.0371
1GB:: migrate_pages:3769.98±55.79 folio_copy:804.68±60.07 F:0.2132±0.0134

Results with patched kernel:
1. Offload disabled - folios batch-move using CPU
4KB:: migrate_pages:14941.60±2556.53 folio_copy:799.60±211.66 F:0.0554±0.0190
2MB:: migrate_pages:3448.44±83.74 folio_copy:533.34±37.81 F:0.1545±0.0085
256MB:: migrate_pages:3723.56±132.93 folio_copy:907.64±132.63 F:0.2427±0.0270
1GB:: migrate_pages:3788.20±46.65 folio_copy:888.46±49.50 F:0.2344±0.0107

2. Offload enabled - folios batch-move using DMAengine
4KB:: migrate_pages:46739.80±4827.15 folio_copy:32222.40±3543.42 F:0.6904±0.0423
2MB:: migrate_pages:13798.10±205.33 folio_copy:10971.60±202.50 F:0.7951±0.0033
256MB:: migrate_pages:13217.20±163.99 folio_copy:10431.20±167.25 F:0.7891±0.0029
1GB:: migrate_pages:13309.70±113.93 folio_copy:10410.00±117.77 F:0.7821±0.0023

Discussion:
The DMAEngine achieved a net throughput of 768MB/s. Additional
optimizations are needed to make DMA offloading beneficial compared to
CPU-based migration. These could include parallelism, specialized DMA
hardware, and asynchronous or speculative data migration.
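
As a rough cross-check of how the cycles/page figures relate to throughput, the conversion can be sketched as below. It assumes a 4096-byte page and that the measured cycles tick at the 2 GHz reported by cpuinfo_cur_freq; both are assumptions made here for illustration, and `sketch_copy_throughput()` is a hypothetical helper, not part of the series.

```c
#include <assert.h>

/*
 * Convert a per-page copy cost in CPU cycles into bytes per second.
 * page_bytes and cpu_hz are supplied by the caller; the values used in
 * the note below (4096 bytes, 2 GHz) are assumptions.
 */
static double sketch_copy_throughput(double cycles_per_page,
				     double page_bytes, double cpu_hz)
{
	assert(cycles_per_page > 0.0);
	/* seconds per page = cycles / Hz, so bytes/s = bytes * Hz / cycles */
	return page_bytes * cpu_hz / cycles_per_page;
}
```

Under these assumptions, the 1GB DMA folio_copy cost of 10410 cycles/page works out to roughly 787 MB/s, the same range as the reported figure, while the generic kernel's 804.68 cycles/page corresponds to roughly 10.2 GB/s for the CPU copy.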

Status:
Current patchset is functional, except for non-LRU folios.

Dependencies:
1. This series is based on Linux-v6.8.
2. Patches 1, 2 and 3 contain the preparatory work and implementation for
batching the folio move. Patch 4 adds support for DMA offload.
3. DMA hardware and driver support are required to enable DMA offload.
Without suitable support, the CPU is used for batch migration.
Requirements are described in Patch 4.
4. Patch 5 adds a DMA driver using DMAengine APIs for end-to-end
testing and validation.

Testing:
The patch series has been tested with migrate_pages(2) and move_pages(2)
using anonymous memory and memory-mapped files.

Byungchul Park (1):
mm: separate move/undo doing on folio list from migrate_pages_batch()

Mike Day (1):
mm: add support for DMA folio Migration

Shivank Garg (3):
mm: add folios_copy() for copying pages in batch during migration
mm: add migrate_folios_batch_move to batch the folio move operations
dcbm: add dma core batch migrator for batch page offloading

drivers/dma/Kconfig | 2 +
drivers/dma/Makefile | 1 +
drivers/dma/dcbm/Kconfig | 7 +
drivers/dma/dcbm/Makefile | 1 +
drivers/dma/dcbm/dcbm.c | 229 +++++++++++++++++++++
include/linux/migrate_dma.h | 36 ++++
include/linux/mm.h | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/migrate.c | 385 +++++++++++++++++++++++++++++++-----
mm/migrate_dma.c | 51 +++++
mm/util.c | 22 +++
12 files changed, 692 insertions(+), 52 deletions(-)
create mode 100644 drivers/dma/dcbm/Kconfig
create mode 100644 drivers/dma/dcbm/Makefile
create mode 100644 drivers/dma/dcbm/dcbm.c
create mode 100644 include/linux/migrate_dma.h
create mode 100644 mm/migrate_dma.c

--
2.34.1



2024-06-14 22:17:07

by Shivank Garg

Subject: [RFC PATCH 3/5] mm: add migrate_folios_batch_move to batch the folio move operations

This is a preparatory patch that enables batch copying for folios
undergoing migration. By copying the folio contents in a batch, we can
efficiently utilize the capabilities of DMA hardware.

Currently, the folio move operation is performed individually for each
folio in a sequential manner:
for_each_folio() {
    Copy folio metadata like flags and mappings
    Copy the folio bytes from src to dst
    Update PTEs with new mappings
}

With this patch, we transition to a batch processing approach as shown
below:
for_each_folio() {
    Copy folio metadata like flags and mappings
}
Batch copy all pages from src to dst
for_each_folio() {
    Update PTEs with new mappings
}
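
Splitting the loop means per-folio state (the old page state bits and the anon_vma pointer) has to be carried from the first loop to the last one. The patch does this by packing both into one unsigned long, relying on the low bits of the pointer being zero. A minimal user-space sketch of that packing follows; the `sketch_*` names are hypothetical and the flag values merely mirror the idea of PAGE_WAS_MAPPED, PAGE_WAS_MLOCKED and PAGE_OLD_STATES.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state bits kept in the two low bits of the pointer. */
enum {
	SKETCH_WAS_MAPPED  = 1 << 0,
	SKETCH_WAS_MLOCKED = 1 << 1,
	SKETCH_OLD_STATES  = SKETCH_WAS_MAPPED | SKETCH_WAS_MLOCKED,
};

/* Stand-in for struct anon_vma; its alignment keeps the low bits free. */
struct sketch_anon_vma {
	long refcount;
};

/* Store the pointer and the state bits in a single word. */
static uintptr_t sketch_pack(struct sketch_anon_vma *av, int old_state)
{
	uintptr_t p = (uintptr_t)av;

	assert((p & SKETCH_OLD_STATES) == 0);		/* pointer is aligned */
	assert((old_state & ~SKETCH_OLD_STATES) == 0);	/* only known bits */
	return p | (uintptr_t)old_state;
}

/* Recover both values, as the second loop needs to do. */
static void sketch_unpack(uintptr_t priv, int *old_state,
			  struct sketch_anon_vma **avp)
{
	*avp = (struct sketch_anon_vma *)(priv & ~(uintptr_t)SKETCH_OLD_STATES);
	*old_state = (int)(priv & SKETCH_OLD_STATES);
}
```

The scheme only works while the number of state bits stays below the pointer's alignment, which is why the sketch asserts on both inputs.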

Signed-off-by: Shivank Garg <[email protected]>
---
mm/migrate.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 215 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c36c6e0a360..fce69a494742 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,11 @@

#include "internal.h"

+struct migrate_folio_info {
+ unsigned long private;
+ struct list_head list;
+};
+
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct folio *folio = folio_get_nontail_page(page);
@@ -1055,6 +1060,14 @@ static void __migrate_folio_extract(struct folio *dst,
dst->private = NULL;
}

+static void __migrate_folio_extract_private(unsigned long private,
+ int *old_page_state,
+ struct anon_vma **anon_vmap)
+{
+ *anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
+ *old_page_state = private & PAGE_OLD_STATES;
+}
+
/* Restore the source folio to the original state upon failure */
static void migrate_folio_undo_src(struct folio *src,
int page_was_mapped,
@@ -1658,6 +1671,201 @@ static void migrate_folios_move(struct list_head *src_folios,
}
}

+static void migrate_folios_batch_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ int rc, nr_pages = 0, nr_mig_folios = 0;
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+ bool is_lru;
+ int is_thp = 0;
+ struct migrate_folio_info *mig_info, *mig_info2;
+ LIST_HEAD(temp_src_folios);
+ LIST_HEAD(temp_dst_folios);
+ LIST_HEAD(mig_info_list);
+
+ if (mode != MIGRATE_ASYNC) {
+ *retry += 1;
+ return;
+ }
+
+ /*
+ * Iterate over the list of locked src/dst folios to copy the metadata
+ */
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ mig_info = kmalloc(sizeof(*mig_info), GFP_KERNEL);
+ if (!mig_info)
+ break;
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+ is_lru = !__folio_test_movable(folio);
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_locked(dst), dst);
+
+ /*
+ * Use MIGRATE_SYNC_NO_COPY mode in migrate_folio family functions
+ * to copy the flags, mapping and some other ancillary information.
+ * This does everything except the page copy. The actual page copy
+ * is handled later in a batch manner.
+ */
+ if (likely(is_lru)) {
+ struct address_space *mapping = folio_mapping(folio);
+
+ if (!mapping)
+ rc = migrate_folio(mapping, dst, folio, MIGRATE_SYNC_NO_COPY);
+ else if (mapping_unmovable(mapping))
+ rc = -EOPNOTSUPP;
+ else if (mapping->a_ops->migrate_folio)
+ rc = mapping->a_ops->migrate_folio(mapping, dst, folio,
+ MIGRATE_SYNC_NO_COPY);
+ else
+ rc = fallback_migrate_folio(mapping, dst, folio,
+ MIGRATE_SYNC_NO_COPY);
+ } else {
+ /*
+ * Let CPU handle the non-LRU pages for initial review.
+ * TODO: implement
+ * Can we move non-MOVABLE LRU case and mapping_unmovable case
+ * in unmap_and_move_huge_page and migrate_folio_unmap?
+ */
+ rc = -EAGAIN;
+ }
+ /*
+ * Turning back after successful migrate_folio may create
+ * side-effects as dst mapping/index and xarray are updated.
+ */
+
+ /*
+ * -EAGAIN: Move src/dst folios to tmp lists for retry
+ * Other Errno: Put src folio on ret_folios list, remove the dst folio
+ * Success: Copy the folio bytes, restoring working pte, unlock and
+ * decrement refcounter
+ */
+ if (rc == -EAGAIN) {
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+
+ kfree(mig_info);
+ list_move_tail(&folio->lru, &temp_src_folios);
+ list_move_tail(&dst->lru, &temp_dst_folios);
+ __migrate_folio_record(dst, old_page_state, anon_vma);
+ } else if (rc != MIGRATEPAGE_SUCCESS) {
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+
+ kfree(mig_info);
+ list_del(&dst->lru);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ } else { /* MIGRATEPAGE_SUCCESS */
+ nr_mig_folios++;
+ mig_info->private = (unsigned long)((void *)anon_vma + old_page_state);
+ list_add_tail(&mig_info->list, &mig_info_list);
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+
+ /* Exit if folio list for batch migration is empty */
+ if (!nr_mig_folios)
+ goto out;
+
+ /* Batch copy the folios */
+ folios_copy(dst_folios, src_folios);
+
+ /*
+ * Iterate the folio lists to remove migration pte and restore them
+ * as working pte. Unlock the folios, add/remove them to LRU lists (if
+ * applicable) and release the src folios.
+ */
+ mig_info = list_first_entry(&mig_info_list, struct migrate_folio_info, list);
+ mig_info2 = list_next_entry(mig_info, list);
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+ __migrate_folio_extract_private(mig_info->private, &old_page_state, &anon_vma);
+ list_del(&dst->lru);
+ if (__folio_test_movable(folio)) {
+ VM_BUG_ON_FOLIO(!folio_test_isolated(folio), folio);
+ /*
+ * We clear PG_movable under page_lock so any compactor
+ * cannot try to migrate this page.
+ */
+ folio_clear_isolated(folio);
+ }
+
+ /*
+ * Anonymous and movable src->mapping will be cleared by
+ * free_pages_prepare so don't reset it here for keeping
+ * the type to work PageAnon, for example.
+ */
+ if (!folio_mapping_flags(folio))
+ folio->mapping = NULL;
+
+ if (likely(!folio_is_zone_device(dst)))
+ flush_dcache_folio(dst);
+
+ /*
+ * Below few steps are only applicable for lru pages which is
+ * ensured as we have removed the non-lru pages from our list.
+ */
+ folio_add_lru(dst);
+ if (old_page_state & PAGE_WAS_MLOCKED)
+ lru_add_drain(); // can this step be optimized for batch?
+ if (old_page_state & PAGE_WAS_MAPPED)
+ remove_migration_ptes(folio, dst, false);
+
+ folio_unlock(dst);
+ set_page_owner_migrate_reason(&dst->page, reason);
+
+ /*
+ * Decrease refcount of dst. It will not free the page because
+ * new page owner increased refcounter.
+ */
+ folio_put(dst);
+ /* Remove the source folio from the list */
+ list_del(&folio->lru);
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+ folio_unlock(folio);
+ migrate_folio_done(folio, reason);
+
+ /* Page migration successful, increase stat counter */
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+
+ list_del(&mig_info->list);
+ kfree(mig_info);
+ mig_info = mig_info2;
+ mig_info2 = list_next_entry(mig_info, list);
+
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+out:
+ /* Add tmp folios back to the list to let CPU re-attempt migration. */
+ list_splice(&temp_src_folios, src_folios);
+ list_splice(&temp_dst_folios, dst_folios);
+}
+
static void migrate_folios_undo(struct list_head *src_folios,
struct list_head *dst_folios,
free_folio_t put_new_folio, unsigned long private,
@@ -1833,13 +2041,18 @@ static int migrate_pages_batch(struct list_head *from,
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();

- retry = 1;
+ retry = 0;
+ /* Batch move the unmapped folios */
+ migrate_folios_batch_move(&unmap_folios, &dst_folios, put_new_folio,
+ private, mode, reason, ret_folios, stats, &retry,
+ &thp_retry, &nr_failed, &nr_retry_pages);
+
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
thp_retry = 0;
nr_retry_pages = 0;

- /* Move the unmapped folios */
+ /* Move the remaining unmapped folios */
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
--
2.34.1


2024-06-14 22:18:02

by Shivank Garg

Subject: [RFC PATCH 4/5] mm: add support for DMA folio Migration

From: Mike Day <[email protected]>

A DMA driver should implement the following functions to enable folio
migration offloading:
migrate_dma() - Takes the lists of src and dst folios undergoing
migration and is responsible for transferring the page contents from
the src folios to the dst folios.
can_migrate_dma() - Checks whether DMA migration is supported for the
given src and dst folios.

The DMA driver should also provide a mechanism that calls
start_offloading() and stop_offloading() to enable and disable migration
offload, respectively.
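
The shape of this interface can be illustrated with a small user-space sketch: a table of two callbacks, a default CPU entry, and start/stop helpers that swap the active entry. Everything prefixed `sketch_` is hypothetical, int buffers stand in for folio lists, and the static calls, SRCU and module reference counting used by the patch are deliberately left out. Falling back to the default inside `sketch_copy()` is a simplification of the patch, which instead returns -EAGAIN so the folio is retried on the regular path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DMA_MAX 4	/* hypothetical limit of the fake engine */

struct sketch_migrator {
	const char *name;
	void (*migrate)(int *dst, const int *src, size_t n);
	bool (*can_migrate)(size_t n);
};

static void sketch_cpu_copy(int *dst, const int *src, size_t n)
{
	memcpy(dst, src, n * sizeof(*dst));
}

static bool sketch_cpu_can(size_t n)
{
	(void)n;
	return true;
}

/* Fake "DMA" backend: same copy, but it refuses large requests. */
static void sketch_dma_copy(int *dst, const int *src, size_t n)
{
	memcpy(dst, src, n * sizeof(*dst));
}

static bool sketch_dma_can(size_t n)
{
	return n <= SKETCH_DMA_MAX;
}

static const struct sketch_migrator sketch_default = {
	"kernel", sketch_cpu_copy, sketch_cpu_can
};
static const struct sketch_migrator sketch_dma = {
	"dma", sketch_dma_copy, sketch_dma_can
};
static const struct sketch_migrator *sketch_active = &sketch_default;

static void sketch_start_offloading(const struct sketch_migrator *m)
{
	sketch_active = m ? m : &sketch_default;
}

static void sketch_stop_offloading(void)
{
	sketch_active = &sketch_default;
}

/* Copy through the active migrator; returns the name of the one used. */
static const char *sketch_copy(int *dst, const int *src, size_t n)
{
	const struct sketch_migrator *m = sketch_active;

	assert(dst && src);
	if (!m->can_migrate(n))
		m = &sketch_default;	/* backend declined: use the CPU */
	m->migrate(dst, src, n);
	return m->name;
}
```

A driver in this sketch would call `sketch_start_offloading(&sketch_dma)` when enabled and `sketch_stop_offloading()` when disabled, mirroring the roles of start_offloading() and stop_offloading().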

Signed-off-by: Mike Day <[email protected]>
Signed-off-by: Shivank Garg <[email protected]>
---
include/linux/migrate_dma.h | 36 ++++++++++++++++++++++++++
mm/Kconfig | 8 ++++++
mm/Makefile | 1 +
mm/migrate.c | 40 +++++++++++++++++++++++++++--
mm/migrate_dma.c | 51 +++++++++++++++++++++++++++++++++++++
5 files changed, 134 insertions(+), 2 deletions(-)
create mode 100644 include/linux/migrate_dma.h
create mode 100644 mm/migrate_dma.c

diff --git a/include/linux/migrate_dma.h b/include/linux/migrate_dma.h
new file mode 100644
index 000000000000..307b234450c3
--- /dev/null
+++ b/include/linux/migrate_dma.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _MIGRATE_DMA_H
+#define _MIGRATE_DMA_H
+#include <linux/migrate_mode.h>
+
+#define MIGRATOR_NAME_LEN 32
+struct migrator {
+ char name[MIGRATOR_NAME_LEN];
+ void (*migrate_dma)(struct list_head *dst_list, struct list_head *src_list);
+ bool (*can_migrate_dma)(struct folio *dst, struct folio *src);
+ struct rcu_head srcu_head;
+ struct module *owner;
+};
+
+extern struct migrator migrator;
+extern struct mutex migrator_mut;
+extern struct srcu_struct mig_srcu;
+
+#ifdef CONFIG_DMA_MIGRATION
+void srcu_mig_cb(struct rcu_head *head);
+void dma_update_migrator(struct migrator *mig);
+unsigned char *get_active_migrator_name(void);
+bool can_dma_migrate(struct folio *dst, struct folio *src);
+void start_offloading(struct migrator *migrator);
+void stop_offloading(void);
+#else
+static inline void srcu_mig_cb(struct rcu_head *head) { };
+static inline void dma_update_migrator(struct migrator *mig) { };
+static inline unsigned char *get_active_migrator_name(void) { return NULL; };
+static inline bool can_dma_migrate(struct folio *dst, struct folio *src) {return true; };
+static inline void start_offloading(struct migrator *migrator) { };
+static inline void stop_offloading(void) { };
+#endif /* CONFIG_DMA_MIGRATION */
+
+#endif /* _MIGRATE_DMA_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..e3ff6583fedb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -662,6 +662,14 @@ config MIGRATION
config DEVICE_MIGRATION
def_bool MIGRATION && ZONE_DEVICE

+config DMA_MIGRATION
+ bool "Migrate Pages offloading copy to DMA"
+ def_bool n
+ depends on MIGRATION
+ help
+ An interface allowing external modules or driver to offload
+ page copying in page migration.
+
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool

diff --git a/mm/Makefile b/mm/Makefile
index e4b5b75aaec9..1e31fb79d700 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_DMA_MIGRATION) += migrate_dma.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/migrate.c b/mm/migrate.c
index fce69a494742..db826e3862a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
#include <linux/random.h>
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
+#include <linux/migrate_dma.h>

#include <asm/tlbflush.h>

@@ -656,6 +657,37 @@ void folio_migrate_copy(struct folio *newfolio, struct folio *folio)
}
EXPORT_SYMBOL(folio_migrate_copy);

+DEFINE_STATIC_CALL(_folios_copy, folios_copy);
+DEFINE_STATIC_CALL(_can_dma_migrate, can_dma_migrate);
+
+#ifdef CONFIG_DMA_MIGRATION
+void srcu_mig_cb(struct rcu_head *head)
+{
+ static_call_query(_folios_copy);
+}
+
+void dma_update_migrator(struct migrator *mig)
+{
+ int index;
+
+ mutex_lock(&migrator_mut);
+ index = srcu_read_lock(&mig_srcu);
+ strscpy(migrator.name, mig ? mig->name : "kernel", MIGRATOR_NAME_LEN);
+ static_call_update(_folios_copy, mig ? mig->migrate_dma : folios_copy);
+ static_call_update(_can_dma_migrate, mig ? mig->can_migrate_dma : can_dma_migrate);
+ if (READ_ONCE(migrator.owner))
+ module_put(migrator.owner);
+ xchg(&migrator.owner, mig ? mig->owner : NULL);
+ if (READ_ONCE(migrator.owner))
+ try_module_get(migrator.owner);
+ srcu_read_unlock(&mig_srcu, index);
+ mutex_unlock(&migrator_mut);
+ call_srcu(&mig_srcu, &migrator.srcu_head, srcu_mig_cb);
+ srcu_barrier(&mig_srcu);
+}
+
+#endif /* CONFIG_DMA_MIGRATION */
+
/************************************************************
* Migration functions
***********************************************************/
@@ -1686,6 +1718,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
struct anon_vma *anon_vma = NULL;
bool is_lru;
int is_thp = 0;
+ bool can_migrate = true;
struct migrate_folio_info *mig_info, *mig_info2;
LIST_HEAD(temp_src_folios);
LIST_HEAD(temp_dst_folios);
@@ -1720,7 +1753,10 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
* This does everything except the page copy. The actual page copy
* is handled later in a batch manner.
*/
- if (likely(is_lru)) {
+ can_migrate = static_call(_can_dma_migrate)(dst, folio);
+ if (unlikely(!can_migrate))
+ rc = -EAGAIN;
+ else if (likely(is_lru)) {
struct address_space *mapping = folio_mapping(folio);

if (!mapping)
@@ -1786,7 +1822,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
goto out;

/* Batch copy the folios */
- folios_copy(dst_folios, src_folios);
+ static_call(_folios_copy)(dst_folios, src_folios);

/*
* Iterate the folio lists to remove migration pte and restore them
diff --git a/mm/migrate_dma.c b/mm/migrate_dma.c
new file mode 100644
index 000000000000..c8b078fdff17
--- /dev/null
+++ b/mm/migrate_dma.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/migrate.h>
+#include <linux/migrate_dma.h>
+#include <linux/rculist.h>
+#include <linux/static_call.h>
+
+atomic_t dispatch_to_dma = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(dispatch_to_dma);
+
+DEFINE_MUTEX(migrator_mut);
+DEFINE_SRCU(mig_srcu);
+
+struct migrator migrator = {
+ .name = "kernel",
+ .migrate_dma = folios_copy,
+ .can_migrate_dma = can_dma_migrate,
+ .srcu_head.func = srcu_mig_cb,
+ .owner = NULL,
+};
+
+bool can_dma_migrate(struct folio *dst, struct folio *src)
+{
+ return true;
+}
+EXPORT_SYMBOL_GPL(can_dma_migrate);
+
+void start_offloading(struct migrator *m)
+{
+ int offloading = 0;
+
+ pr_info("starting migration offload by %s\n", m->name);
+ dma_update_migrator(m);
+ atomic_try_cmpxchg(&dispatch_to_dma, &offloading, 1);
+}
+EXPORT_SYMBOL_GPL(start_offloading);
+
+void stop_offloading(void)
+{
+ int offloading = 1;
+
+ pr_info("stopping migration offload by %s\n", migrator.name);
+ dma_update_migrator(NULL);
+ atomic_try_cmpxchg(&dispatch_to_dma, &offloading, 0);
+}
+EXPORT_SYMBOL_GPL(stop_offloading);
+
+unsigned char *get_active_migrator_name(void)
+{
+ return migrator.name;
+}
+EXPORT_SYMBOL_GPL(get_active_migrator_name);
--
2.34.1


2024-06-14 22:18:13

by Shivank Garg

Subject: [RFC PATCH 5/5] dcbm: add dma core batch migrator for batch page offloading

This commit provides example code showing how to use mm's migration
offload support to offload batch page migration. The dcbm (DMA core
batch migrator) provides a generic interface built on DMAEngine, which
allows end-to-end testing and validation of the batch page migration
offload feature.

Enable DCBM offload: echo 1 > /sys/kernel/dcbm/offloading
Disable DCBM offload: echo 0 > /sys/kernel/dcbm/offloading
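
For test programs that want to toggle the knob without a shell, the same write can be done from C. This is a minimal sketch: `sketch_set_offloading()` is a hypothetical helper, and the path is taken as a parameter so that it is not tied to the sysfs location above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" or "0" to the given control file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int sketch_set_offloading(const char *path, int enable)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	/* fclose() flushes the buffer, so a failed write can surface here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

With the driver loaded, `sketch_set_offloading("/sys/kernel/dcbm/offloading", 1)` would be the equivalent of the first echo command; writing there typically requires root.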

Signed-off-by: Shivank Garg <[email protected]>
---
drivers/dma/Kconfig | 2 +
drivers/dma/Makefile | 1 +
drivers/dma/dcbm/Kconfig | 7 ++
drivers/dma/dcbm/Makefile | 1 +
drivers/dma/dcbm/dcbm.c | 229 ++++++++++++++++++++++++++++++++++++++
5 files changed, 240 insertions(+)
create mode 100644 drivers/dma/dcbm/Kconfig
create mode 100644 drivers/dma/dcbm/Makefile
create mode 100644 drivers/dma/dcbm/dcbm.c

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index e928f2ca0f1e..376bd13d46f8 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -750,6 +750,8 @@ config XILINX_ZYNQMP_DPDMA
# driver files
source "drivers/dma/bestcomm/Kconfig"

+source "drivers/dma/dcbm/Kconfig"
+
source "drivers/dma/mediatek/Kconfig"

source "drivers/dma/ptdma/Kconfig"
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index dfd40d14e408..7d67fc29bce2 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_AT_HDMAC) += at_hdmac.o
obj-$(CONFIG_AT_XDMAC) += at_xdmac.o
obj-$(CONFIG_AXI_DMAC) += dma-axi-dmac.o
obj-$(CONFIG_BCM_SBA_RAID) += bcm-sba-raid.o
+obj-$(CONFIG_DCBM_DMA) += dcbm/
obj-$(CONFIG_DMA_BCM2835) += bcm2835-dma.o
obj-$(CONFIG_DMA_JZ4780) += dma-jz4780.o
obj-$(CONFIG_DMA_SA11X0) += sa11x0-dma.o
diff --git a/drivers/dma/dcbm/Kconfig b/drivers/dma/dcbm/Kconfig
new file mode 100644
index 000000000000..e58eca03fb52
--- /dev/null
+++ b/drivers/dma/dcbm/Kconfig
@@ -0,0 +1,7 @@
+config DCBM_DMA
+ bool "DMA Core Batch Migrator"
+ depends on DMA_ENGINE
+ default n
+ help
+ Interface driver for batch page migration offloading. Say Y
+ if you want to try offloading with DMAEngine APIs.
diff --git a/drivers/dma/dcbm/Makefile b/drivers/dma/dcbm/Makefile
new file mode 100644
index 000000000000..56ba47cce0f1
--- /dev/null
+++ b/drivers/dma/dcbm/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_DCBM_DMA) += dcbm.o
diff --git a/drivers/dma/dcbm/dcbm.c b/drivers/dma/dcbm/dcbm.c
new file mode 100644
index 000000000000..dac87fa55327
--- /dev/null
+++ b/drivers/dma/dcbm/dcbm.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * DMA batch-offloading interface driver
+ *
+ * Copyright (C) 2024 Advanced Micro Devices, Inc.
+ */
+
+/*
+ * This code exemplifies how to leverage mm layer's migration offload support
+ * for batch page offloading using DMA Engine APIs.
+ * Developers can use this template to write interface for custom hardware
+ * accelerators with specialized capabilities for batch page migration.
+ * This interface driver is end-to-end working and can be used for testing the
+ * patch series without special hardware given DMAEngine support is available.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+#include <linux/migrate.h>
+#include <linux/migrate_dma.h>
+#include <linux/printk.h>
+#include <linux/sysfs.h>
+
+static struct dma_chan *chan;
+static int is_dispatching;
+
+static void folios_copy_dma(struct list_head *dst_list, struct list_head *src_list);
+static bool can_migrate_dma(struct folio *dst, struct folio *src);
+
+static DEFINE_MUTEX(migratecfg_mutex);
+
+/* DMA Core Batch Migrator */
+struct migrator dmigrator = {
+ .name = "DCBM\0",
+ .migrate_dma = folios_copy_dma,
+ .can_migrate_dma = can_migrate_dma,
+ .owner = THIS_MODULE,
+};
+
+static ssize_t offloading_set(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int ccode;
+ int action;
+ dma_cap_mask_t mask;
+
+ ccode = kstrtoint(buf, 0, &action);
+ if (ccode) {
+ pr_debug("(%s:) error parsing input %s\n", __func__, buf);
+ return ccode;
+ }
+
+ /*
+ * action is 0: User wants to disable DMA offloading.
+ * action is 1: User wants to enable DMA offloading.
+ */
+ switch (action) {
+ case 0:
+ mutex_lock(&migratecfg_mutex);
+ if (is_dispatching == 1) {
+ stop_offloading();
+ dma_release_channel(chan);
+ is_dispatching = 0;
+ } else
+ pr_debug("migration offloading is already OFF\n");
+ mutex_unlock(&migratecfg_mutex);
+ break;
+ case 1:
+ mutex_lock(&migratecfg_mutex);
+ if (is_dispatching == 0) {
+ dma_cap_zero(mask);
+ dma_cap_set(DMA_MEMCPY, mask);
+ chan = dma_request_channel(mask, NULL, NULL);
+ if (!chan) {
+ chan = ERR_PTR(-ENODEV);
+ pr_err("Error requesting DMA channel\n");
+ mutex_unlock(&migratecfg_mutex);
+ return -ENODEV;
+ }
+ start_offloading(&dmigrator);
+ is_dispatching = 1;
+ } else
+ pr_debug("migration offloading is already ON\n");
+ mutex_unlock(&migratecfg_mutex);
+ break;
+ default:
+ pr_debug("input should be zero or one, parsed as %d\n", action);
+ }
+ return count;
+}
+
+static ssize_t offloading_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", is_dispatching);
+}
+
+static bool can_migrate_dma(struct folio *dst, struct folio *src)
+{
+ if (folio_test_hugetlb(src) || folio_test_hugetlb(dst) ||
+ folio_has_private(src) || folio_has_private(dst) ||
+ (folio_nr_pages(src) != folio_nr_pages(dst)) ||
+ folio_nr_pages(src) != 1)
+ return false;
+ return true;
+}
+
+static void folios_copy_dma(struct list_head *dst_list,
+ struct list_head *src_list)
+{
+ int ret = 0;
+ struct folio *src, *dst;
+ struct dma_device *dev;
+ struct device *dma_dev;
+ static dma_cookie_t cookie;
+ struct dma_async_tx_descriptor *tx;
+ enum dma_status status;
+ enum dma_ctrl_flags flags = DMA_CTRL_ACK;
+ dma_addr_t srcdma_handle;
+ dma_addr_t dstdma_handle;
+
+
+ if (!chan) {
+ pr_err("error chan uninitialized\n");
+ goto fail;
+ }
+ dev = chan->device;
+ if (!dev) {
+ pr_err("error dev is NULL\n");
+ goto fail;
+ }
+ dma_dev = dmaengine_get_dma_device(chan);
+ if (!dma_dev) {
+ pr_err("error dma_dev is NULL\n");
+ goto fail;
+ }
+ dst = list_first_entry(dst_list, struct folio, lru);
+ list_for_each_entry(src, src_list, lru) {
+ srcdma_handle = dma_map_page(dma_dev, &src->page, 0, 4096, DMA_BIDIRECTIONAL);
+ ret = dma_mapping_error(dma_dev, srcdma_handle);
+ if (ret) {
+ pr_err("src mapping error\n");
+ goto fail1;
+ }
+ dstdma_handle = dma_map_page(dma_dev, &dst->page, 0, 4096, DMA_BIDIRECTIONAL);
+ ret = dma_mapping_error(dma_dev, dstdma_handle);
+ if (ret) {
+ pr_err("dst mapping error\n");
+ goto fail2;
+ }
+ tx = dev->device_prep_dma_memcpy(chan, dstdma_handle, srcdma_handle, 4096, flags);
+ if (!tx) {
+ ret = -EBUSY;
+ pr_err("prep_dma_error\n");
+ goto fail3;
+ }
+ cookie = tx->tx_submit(tx);
+ if (dma_submit_error(cookie)) {
+ ret = -EINVAL;
+ pr_err("dma_submit_error\n");
+ goto fail3;
+ }
+ status = dma_sync_wait(chan, cookie);
+ dmaengine_terminate_sync(chan);
+ if (status != DMA_COMPLETE) {
+ ret = -EINVAL;
+ pr_err("error while dma wait\n");
+ goto fail3;
+ }
+fail3:
+ dma_unmap_page(dma_dev, dstdma_handle, 4096, DMA_BIDIRECTIONAL);
+fail2:
+ dma_unmap_page(dma_dev, srcdma_handle, 4096, DMA_BIDIRECTIONAL);
+fail1:
+ if (ret)
+ folio_copy(dst, src);
+
+ dst = list_next_entry(dst, lru);
+ }
+fail:
+ folios_copy(dst_list, src_list);
+}
+
+static struct kobject *kobj_ref;
+static struct kobj_attribute offloading_attribute = __ATTR(offloading, 0664,
+ offloading_show, offloading_set);
+
+static int __init dma_module_init(void)
+{
+ int ret = 0;
+
+ kobj_ref = kobject_create_and_add("dcbm", kernel_kobj);
+ if (!kobj_ref)
+ return -ENOMEM;
+
+ ret = sysfs_create_file(kobj_ref, &offloading_attribute.attr);
+ if (ret)
+ goto out;
+
+ is_dispatching = 0;
+
+ return 0;
+out:
+ kobject_put(kobj_ref);
+ return ret;
+}
+
+static void __exit dma_module_exit(void)
+{
+ /* Stop the DMA offloading to unload the module */
+
+ //sysfs_remove_file(kobj, &offloading_show.attr);
+ kobject_put(kobj_ref);
+}
+
+module_init(dma_module_init);
+module_exit(dma_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Shivank Garg");
+MODULE_DESCRIPTION("DCBM"); /* DMA Core Batch Migrator */
--
2.34.1


2024-06-14 22:18:44

by Shivank Garg

Subject: [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch()

From: Byungchul Park <[email protected]>

No functional change. This is a preparatory patch picked from the luf
(lazy unmap flush) patch series. It improves code organization and
readability for the steps involving migrate_folio_move().

It refactors migrate_pages_batch() by moving the move and undo parts
that operate on the folio lists into separate functions.

Signed-off-by: Byungchul Park <[email protected]>
Signed-off-by: Shivank Garg <[email protected]>
---
mm/migrate.c | 134 +++++++++++++++++++++++++++++++--------------------
1 file changed, 83 insertions(+), 51 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c27b1f8097d4..6c36c6e0a360 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1606,6 +1606,81 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
return nr_failed;
}

+static void migrate_folios_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ bool is_thp;
+ int nr_pages;
+ int rc;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_folio, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch (rc) {
+ case -EAGAIN:
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
+static void migrate_folios_undo(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ struct list_head *ret_folios)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
/*
* migrate_pages_batch() first unmaps folios in the from list as many as
* possible, then move the unmapped folios.
@@ -1628,7 +1703,7 @@ static int migrate_pages_batch(struct list_head *from,
int pass = 0;
bool is_thp = false;
bool is_large = false;
- struct folio *folio, *folio2, *dst = NULL, *dst2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
@@ -1764,42 +1839,11 @@ static int migrate_pages_batch(struct list_head *from,
thp_retry = 0;
nr_retry_pages = 0;

- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
- nr_pages = folio_nr_pages(folio);
-
- cond_resched();
-
- rc = migrate_folio_move(put_new_folio, private,
- folio, dst, mode,
- reason, ret_folios);
- /*
- * The rules are:
- * Success: folio will be freed
- * -EAGAIN: stay on the unmap_folios list
- * Other errno: put on ret_folios list
- */
- switch(rc) {
- case -EAGAIN:
- retry++;
- thp_retry += is_thp;
- nr_retry_pages += nr_pages;
- break;
- case MIGRATEPAGE_SUCCESS:
- stats->nr_succeeded += nr_pages;
- stats->nr_thp_succeeded += is_thp;
- break;
- default:
- nr_failed++;
- stats->nr_thp_failed += is_thp;
- stats->nr_failed_pages += nr_pages;
- break;
- }
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ /* Move the unmapped folios */
+ migrate_folios_move(&unmap_folios, &dst_folios,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1808,20 +1852,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
/* Cleanup remaining folios */
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
- anon_vma, true, ret_folios);
- list_del(&dst->lru);
- migrate_folio_undo_dst(dst, true, put_new_folio, private);
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ migrate_folios_undo(&unmap_folios, &dst_folios,
+ put_new_folio, private, ret_folios);

return rc;
}
--
2.34.1
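
The refactoring above moves a loop body into a helper, so counters that used
to be locals of migrate_pages_batch() (retry, thp_retry, nr_failed,
nr_retry_pages) are now passed by pointer. A minimal userspace sketch of that
accounting pattern follows. Everything in it is a hypothetical stand-in:
struct item, struct move_stats and account_moves() are not kernel
definitions, and the per-item result is stored in the item rather than
returned by a real move step.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct item {
	struct item *next;
	int nr_pages;     /* pages covered by this item */
	int is_thp;       /* 1 if the item models a THP, else 0 */
	int move_result;  /* assumed outcome of the move step: 0, -EAGAIN, or other errno */
};

/* Hypothetical subset of the statistics updated by the move loop. */
struct move_stats {
	int nr_succeeded;
	int nr_thp_succeeded;
	int nr_thp_failed;
	int nr_failed_pages;
};

/*
 * Walk the list once and classify each result. The caller owns the retry
 * and failure counters, so they are updated through pointers, mirroring
 * the "*retry += 1" style used in the extracted helper.
 */
static void account_moves(const struct item *src, struct move_stats *stats,
			  int *retry, int *thp_retry, int *nr_failed,
			  int *nr_retry_pages)
{
	const struct item *it;

	for (it = src; it != NULL; it = it->next) {
		switch (it->move_result) {
		case -EAGAIN:
			/* Retry later: only the caller's counters change. */
			*retry += 1;
			*thp_retry += it->is_thp;
			*nr_retry_pages += it->nr_pages;
			break;
		case 0:
			/* Success: counted in pages. */
			stats->nr_succeeded += it->nr_pages;
			stats->nr_thp_succeeded += it->is_thp;
			break;
		default:
			/* Any other errno is a permanent failure. */
			*nr_failed += 1;
			stats->nr_thp_failed += it->is_thp;
			stats->nr_failed_pages += it->nr_pages;
			break;
		}
	}
}
```

To try the sketch, build a short list with one item per outcome, zero the
counters, call account_moves() and inspect them. Because the retry counters
accumulate through pointers, the caller has to reset them before each pass,
as migrate_pages_batch() does for thp_retry and nr_retry_pages.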


2024-06-14 22:19:00

by Shivank Garg

[permalink] [raw]
Subject: [RFC PATCH 2/5] mm: add folios_copy() for copying pages in batch during migration

This patch introduces the folios_copy() function to copy the folio content
from the list of src folios to the list of dst folios. This is a preparatory
patch for batch page migration offloading.

Signed-off-by: Shivank Garg <[email protected]>
---
include/linux/mm.h | 1 +
mm/util.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..cd5f37ec72f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1300,6 +1300,7 @@ void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
void folio_copy(struct folio *dst, struct folio *src);
+void folios_copy(struct list_head *dst_list, struct list_head *src_list);

unsigned long nr_free_buffer_pages(void);

diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..3a278db28429 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -811,6 +811,28 @@ void folio_copy(struct folio *dst, struct folio *src)
}
EXPORT_SYMBOL(folio_copy);

+/**
+ * folios_copy - Copy the contents of a list of folios.
+ * @dst_list: Folios to copy to.
+ * @src_list: Folios to copy from.
+ *
+ * The folio contents are copied from @src_list to @dst_list.
+ * Assume the caller has validated that the lists are not empty and that both
+ * lists have an equal number of folios. This may sleep.
+ */
+void folios_copy(struct list_head *dst_list,
+ struct list_head *src_list)
+{
+ struct folio *src, *dst;
+
+ dst = list_first_entry(dst_list, struct folio, lru);
+ list_for_each_entry(src, src_list, lru) {
+ cond_resched();
+ folio_copy(dst, src);
+ dst = list_next_entry(dst, lru);
+ }
+}
+
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
int sysctl_overcommit_ratio __read_mostly = 50;
unsigned long sysctl_overcommit_kbytes __read_mostly;
--
2.34.1
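
folios_copy() above walks the src and dst lists in lockstep and relies on the
caller to guarantee that both have the same length. A minimal userspace sketch
of that traversal follows, assuming a plain singly linked list instead of the
kernel's list_head. struct buf_node, BUF_SIZE and bufs_copy() are hypothetical
names used only for illustration, and memcpy() stands in for folio_copy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64	/* hypothetical payload size; a real folio is one or more pages */

/* Hypothetical stand-in for a folio linked on a list. */
struct buf_node {
	struct buf_node *next;
	unsigned char data[BUF_SIZE];
};

/*
 * Copy every src buffer into the dst buffer at the same list position.
 * As with folios_copy(), equal list lengths are the caller's contract;
 * the assert only catches a dst list that is too short.
 * Returns the number of buffers copied.
 */
static size_t bufs_copy(struct buf_node *dst, const struct buf_node *src)
{
	size_t copied = 0;

	while (src != NULL) {
		assert(dst != NULL);
		memcpy(dst->data, src->data, BUF_SIZE);
		dst = dst->next;	/* advance both cursors together */
		src = src->next;
		copied++;
	}
	return copied;
}
```

Calling bufs_copy(&d1, &s1) on two two-element lists copies both payloads and
returns 2. The sketch leaves out cond_resched(), which only has meaning in
kernel context.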


2024-06-15 07:07:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA

On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> We conducted experiments to measure folio copy overheads for page
> migration from a remote node to a local NUMA node, modeling page
> promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
>
> Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
> Enabled), 1 NUMA node connected to each socket.
> Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
> THP, compaction, numa_balancing are disabled to reduce interference.
>
> migrate_pages() { <- t1
> ..
> <- t2
> folio_copy()
> <- t3
> ..
> } <- t4
>
> Overhead fraction, F = (t3-t2)/(t4-t1)
> Measurement: Mean ± SD is measured in cpu_cycles/page
> Generic Kernel
> 4KB:: migrate_pages:17799.00±4278.25 folio_copy:794±232.87 F:0.0478±0.0199
> 2MB:: migrate_pages:3478.42±94.93 folio_copy:493.84±28.21 F:0.1418±0.0050
> 256MB:: migrate_pages:3668.56±158.47 folio_copy:815.40±171.76 F:0.2206±0.0371
> 1GB:: migrate_pages:3769.98±55.79 folio_copy:804.68±60.07 F:0.2132±0.0134
>
> Results with patched kernel:
> 1. Offload disabled - folios batch-move using CPU
> 4KB:: migrate_pages:14941.60±2556.53 folio_copy:799.60±211.66 F:0.0554±0.0190
> 2MB:: migrate_pages:3448.44±83.74 folio_copy:533.34±37.81 F:0.1545±0.0085
> 256MB:: migrate_pages:3723.56±132.93 folio_copy:907.64±132.63 F:0.2427±0.0270
> 1GB:: migrate_pages:3788.20±46.65 folio_copy:888.46±49.50 F:0.2344±0.0107
>
> 2. Offload enabled - folios batch-move using DMAengine
> 4KB:: migrate_pages:46739.80±4827.15 folio_copy:32222.40±3543.42 F:0.6904±0.0423
> 2MB:: migrate_pages:13798.10±205.33 folio_copy:10971.60±202.50 F:0.7951±0.0033
> 256MB:: migrate_pages:13217.20±163.99 folio_copy:10431.20±167.25 F:0.7891±0.0029
> 1GB:: migrate_pages:13309.70±113.93 folio_copy:10410.00±117.77 F:0.7821±0.0023

You haven't measured the important thing though -- what's the cost _to
userspace_? When the CPU does the copy, the data is now cache-hot in
that CPU's cache. When the DMA engine does the copy, it's not cache-hot
in any CPU.

Now, this may not be a big problem. I don't think we do anything to
ensure that the CPU that is going to access the folio in userspace is
the one which does the copy.

But your methodology is wrong.