v2:
new:
- ktest for move_phys_pages
- draft man page
- comments about do_pages_move refactor/arguments (mm, task_nodes)
for the phys_pages usage path.
- compat ptr fixes for do_pages_move
(will pull this out into a separate patch request as well)
fixes:
- capable(CAP_SYS_ADMIN) requirements
- additional error checking in do_pages_move
- fixed a bad refactor in add_page_for_migration that caused a null-deref
  due to not checking the result of follow_page correctly
- fixed get/put folio ordering issue in add_phys_page_for_migration
- fixed non-memcg build issue in phys_page_migratable
- added additional physical address source information to RFC text
- fixed some formatting and inconsistencies between definitions
=== RFC
This patch set is a proposal for a syscall, analogous to move_pages,
that migrates pages between NUMA nodes using physical addressing.
The intent is to better enable user-land system-wide memory tiering
as CXL devices begin to provide memory resources on the PCIe bus.
This patch set is broken into 6 patches:
1) compat ptr bug fix for do_pages_move (reported by Arnd)
2) refactor of do_move_pages_to_node to drop the unused mm argument
3) refactor of add_page_for_migration for code re-use
4) the sys_move_phys_pages system call
5) ktest of the syscall
6) draft man page
The sys_move_phys_pages system call validates that a page may be
migrated by checking the migratable-status of each vma mapping the page,
and the intersection of the cpuset policies of each vma's task.
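For illustration, below is a minimal user-space sketch of the intended
usage. This is a hypothetical example, not part of the series: it
assumes syscall number 454 (as wired up in these patches), invokes the
call via syscall(2) since no libc wrapper exists yet, and uses made-up
physical addresses standing in for values obtained from one of the
sources listed under Background.

/*
 * Sketch only: move two physically-addressed pages to node 1.
 * Requires CAP_SYS_ADMIN.  The addresses here are placeholders;
 * real ones would come from pagemap, IBS/PEBS, etc. (see below).
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_move_phys_pages
#define __NR_move_phys_pages 454
#endif
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* normally from <numaif.h> */
#endif

int main(void)
{
	uint64_t phys[2] = { 0x100000000ULL, 0x100001000ULL }; /* placeholders */
	void *pages[2] = { (void *)(uintptr_t)phys[0],
			   (void *)(uintptr_t)phys[1] };
	int nodes[2] = { 1, 1 };	/* desired target node for each page */
	int status[2] = { -1, -1 };	/* preinitialized to spot partial failure */
	long ret = syscall(__NR_move_phys_pages, 2UL, pages, nodes, status,
			   MPOL_MF_MOVE);

	if (ret < 0)
		perror("move_phys_pages");
	for (int i = 0; i < 2; i++)
		printf("page %d: status %d\n", i, status[i]);
	return ret < 0;
}

As with move_pages, passing NULL for nodes would instead report the node
where each page currently resides via the status array.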
Background:
Userspace job schedulers, memory managers, and tiering software
solutions depend on page migration syscalls to reallocate resources
across NUMA nodes. Currently, these calls enable movement of memory
associated with a specific PID. Moves can be requested in coarse,
process-sized strokes (as with migrate_pages), or on specific virtual
pages (via move_pages).
However, a number of profiling mechanisms provide system-wide information
that would benefit from a physical-addressing version of move_pages.
There are presently at least 4 ways userland can acquire physical
address information for use with this interface, and a 5th is being
proposed in the near future.
1) /proc/pid/pagemap: can be used to do page table translations.
   This is only really useful for testing, and the ktest was
   written using this functionality (a translation sketch follows
   this list).
2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical
and/or virtual address information.
3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones
4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle.
So long as the page size is known, this can be used to identify
system-wide idle pages that could be migrated to lower tiers.
https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
hot/cold information about its memory. If a heatmap is provided,
for example, it would have to be a device address (0-based) or a
physical address (some base defined by the kernel and programmed
into device decoders such that it can convert them to 0-based).
This is presently being proposed but has not been agreed upon yet.
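As a sketch of source (1) above: each pagemap entry is one 64-bit word
per virtual page, with bits 0-54 holding the PFN and bit 63 indicating
page-present. The helper below is illustrative only and requires
CAP_SYS_ADMIN (the kernel reports a zero PFN to unprivileged readers).

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address in the current process to a physical
 * address via /proc/self/pagemap; returns 0 on any failure. */
static uint64_t virt_to_phys(const void *virt)
{
	long psize = sysconf(_SC_PAGESIZE);
	uint64_t vpn = (uint64_t)(uintptr_t)virt / psize;
	uint64_t entry = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), vpn * sizeof(entry)) !=
	    sizeof(entry))
		entry = 0;
	close(fd);
	if (!(entry & (1ULL << 63)))	/* bit 63: page present */
		return 0;
	/* bits 0-54: page frame number */
	return (entry & ((1ULL << 55) - 1)) * psize +
	       (uint64_t)(uintptr_t)virt % psize;
}

The ktest in this series uses the same translation to feed
move_phys_pages with physical addresses for its own mappings.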
Information from these sources facilitates system-wide resource management,
but because migrate_pages and move_pages apply only to individual tasks,
their outputs must be converted back to virtual addresses and
re-associated with specific PIDs.
Doing this reverse-translation outside of the kernel requires considerable
space and compute, and the translation work is then repeated by the
existing system calls. Much of this work can be avoided if the pages
can be migrated directly with physical memory addressing.
Gregory Price (5):
mm/migrate: fix do_pages_move for compat pointers
mm/migrate: remove unused mm argument from do_move_pages_to_node
mm/migrate: refactor add_page_for_migration for code re-use
mm/migrate: Create move_phys_pages syscall
ktest: sys_move_phys_pages ktest
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 5 +
include/uapi/asm-generic/unistd.h | 8 +-
kernel/sys_ni.c | 1 +
mm/migrate.c | 319 ++++++++++++++++++++----
tools/include/uapi/asm-generic/unistd.h | 8 +-
tools/testing/selftests/mm/migration.c | 101 ++++++++
8 files changed, 396 insertions(+), 48 deletions(-)
--
2.39.1
Draft man page for the move_phys_pages syscall proposed in the RFC:
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Gregory Price <[email protected]>
---
man2/move_phys_pages.2 | 180 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 180 insertions(+)
create mode 100644 man2/move_phys_pages.2
diff --git a/man2/move_phys_pages.2 b/man2/move_phys_pages.2
new file mode 100644
index 000000000..4f4b68915
--- /dev/null
+++ b/man2/move_phys_pages.2
@@ -0,0 +1,180 @@
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft-2-para
+.\"
+.\" This manpage is Copyright (C) 2006 Silicon Graphics, Inc.
+.\" Christoph Lameter
+.\" This manpage is Copyright (C) 2023 MemVerge, Inc.
+.\" Gregory Price
+.\"
+.\"
+.TH move_phys_pages 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+move_phys_pages \- move individual physically-addressed pages to another node
+.SH LIBRARY
+NUMA (Non-Uniform Memory Access) policy library
+.RI ( libnuma ", " \-lnuma )
+.SH SYNOPSIS
+.nf
+.B #include <numaif.h>
+.PP
+.BI "long move_phys_pages(unsigned long " count ", \
+uint64_t *" pages [. count ],
+.BI " const int " nodes [. count "], int " status [. count "], \
+int " flags );
+.fi
+.SH DESCRIPTION
+.BR move_phys_pages ()
+moves the specified
+.I physical pages
+to the memory nodes specified by
+.IR nodes .
+The result of the move is reflected in
+.IR status .
+The
+.I flags
+indicate constraints on the pages to be moved.
+.PP
+This interface requires
+.RB ( CAP_SYS_ADMIN ) .
+.PP
+.I count
+is the number of pages to move.
+It defines the size of the three arrays
+.IR pages ,
+.IR nodes ,
+and
+.IR status .
+.PP
+.I pages
+is an array of physical addresses of the pages that should be moved.
+The addresses should be aligned to page boundaries.
+.PP
+.I nodes
+is an array of integers that specify the desired location for each page.
+Each element in the array is a node number.
+.I nodes
+can also be NULL, in which case
+.BR move_phys_pages ()
+does not move any pages but instead will return the node
+where each page currently resides, in the
+.I status
+array.
+Obtaining the status of each page may be necessary to determine
+pages that need to be moved.
+.PP
+.I status
+is an array of integers that return the status of each page.
+The array contains valid values only if
+.BR move_phys_pages ()
+did not return an error.
+Preinitialization of the array to a value
+that cannot represent a real NUMA node or a valid error value
+can help to identify pages that have been migrated if a
+partial failure occurs.
+.PP
+.I flags
+specify what types of pages to move.
+.B MPOL_MF_MOVE
+means that only pages that are in exclusive use by a process
+are to be moved.
+.B MPOL_MF_MOVE_ALL
+means that pages shared between multiple processes can also be moved.
+.SS Page states in the status array
+The following values can be returned in each element of the
+.I status
+array.
+.TP
+.B 0..MAX_NUMNODES
+Identifies the node on which the page resides.
+.TP
+.B \-EACCES
+The target node for the page is not in the intersection of the
+allowed node sets of all tasks mapping the page. At least one task
+mapping the page does not allow memory on the target node.
+.TP
+.B \-EBUSY
+The page is currently busy and cannot be moved.
+Try again later.
+This occurs if a page is undergoing I/O or another kernel subsystem
+is holding a reference to the page.
+.TP
+.B \-EFAULT
+This is a zero page, or the memory area mapping the page
+is not migratable.
+.TP
+.B \-EIO
+Unable to write back a page.
+The page has to be written back
+in order to move it since the page is dirty and the filesystem
+does not provide a migration function that would allow the move
+of dirty pages.
+.TP
+.B \-EINVAL
+A dirty page cannot be moved.
+The filesystem does not
+provide a migration function and has no ability to write back pages.
+
+.TP
+.B \-ENOENT
+The physical page is not online or the page is not present in any VMA.
+.TP
+.B \-ENOMEM
+Unable to allocate memory on target node.
+.SH RETURN VALUE
+On success
+.BR move_phys_pages ()
+returns zero.
+.\" FIXME . Is the following quite true: does the wrapper in numactl
+.\" do the right thing?
+On error, it returns \-1, and sets
+.I errno
+to indicate the error.
+If a positive value is returned, it is the number of
+nonmigrated pages.
+.SH ERRORS
+.TP
+.B Positive value
+The number of nonmigrated pages, if they failed to migrate for
+nonfatal reasons.
+.TP
+.B EFAULT
+Parameter array could not be accessed.
+.TP
+.B EINVAL
+Flags other than MPOL_MF_MOVE and MPOL_MF_MOVE_ALL were specified,
+or an attempt was made to migrate pages of a kernel thread.
+.TP
+.B ENODEV
+One of the target nodes is not online.
+.TP
+.B EPERM
+The caller has insufficient privileges
+.RB ( CAP_SYS_ADMIN ).
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux X.Y.Z
+.SH NOTES
+For information on library support, see
+.BR numa (7).
+.PP
+Use of this function may result in pages whose location
+(node) violates the memory policy established for the
+specified addresses (See
+.BR mbind (2))
+and/or the specified process (See
+.BR set_mempolicy (2)).
+That is, memory policy does not constrain the destination
+nodes used by
+.BR move_phys_pages ().
+.PP
+The
+.I <numaif.h>
+header is not included with glibc, but requires installing
+.I libnuma\-devel
+or a similar package.
+.SH SEE ALSO
+.BR mbind (2),
+.BR numa (3),
+.BR numa_maps (5),
+.BR cpuset (7),
+.BR numa (7),
+.BR migratepages (8),
+.BR numastat (8)
--
2.34.1
Preparatory work to re-use do_move_pages_to_node with a physical
address instead of a virtual address. This function does not actively
use the mm_struct, so the argument can be removed.
Signed-off-by: Gregory Price <[email protected]>
---
mm/migrate.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index a0b0c5a7f8a5..dbe436163d65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2026,8 +2026,7 @@ static int store_status(int __user *status, int start, int value, int nr)
return 0;
}
-static int do_move_pages_to_node(struct mm_struct *mm,
- struct list_head *pagelist, int node)
+static int do_move_pages_to_node(struct list_head *pagelist, int node)
{
int err;
struct migration_target_control mtc = {
@@ -2123,7 +2122,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
return err;
}
-static int move_pages_and_store_status(struct mm_struct *mm, int node,
+static int move_pages_and_store_status(int node,
struct list_head *pagelist, int __user *status,
int start, int i, unsigned long nr_pages)
{
@@ -2132,7 +2131,7 @@ static int move_pages_and_store_status(struct mm_struct *mm, int node,
if (list_empty(pagelist))
return 0;
- err = do_move_pages_to_node(mm, pagelist, node);
+ err = do_move_pages_to_node(pagelist, node);
if (err) {
/*
* Positive err means the number of failed
@@ -2200,7 +2199,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
current_node = node;
start = i;
} else if (node != current_node) {
- err = move_pages_and_store_status(mm, current_node,
+ err = move_pages_and_store_status(current_node,
&pagelist, status, start, i, nr_pages);
if (err)
goto out;
@@ -2235,7 +2234,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (err)
goto out_flush;
- err = move_pages_and_store_status(mm, current_node, &pagelist,
+ err = move_pages_and_store_status(current_node, &pagelist,
status, start, i, nr_pages);
if (err) {
/* We have accounted for page i */
@@ -2247,7 +2246,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
}
out_flush:
/* Make sure we do not overwrite the existing error */
- err1 = move_pages_and_store_status(mm, current_node, &pagelist,
+ err1 = move_pages_and_store_status(current_node, &pagelist,
status, start, i, nr_pages);
if (err >= 0)
err = err1;
--
2.39.1
add_page_for_migration presently does two actions:
1) validates the page is present and migratable
2) isolates the page from LRU and puts it into the migration list
Break add_page_for_migration into 2 functions:
add_page_for_migration - isolate the page from the LRU and add it to the list
add_virt_page_for_migration - validate the page and call the above
add_page_for_migration no longer requires the mm_struct and so can be
re-used for a physical addressing version of move_pages.
Signed-off-by: Gregory Price <[email protected]>
---
mm/migrate.c | 83 +++++++++++++++++++++++++++++++---------------------
1 file changed, 50 insertions(+), 33 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index dbe436163d65..1123d841a7f1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2042,52 +2042,33 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node)
}
/*
- * Resolves the given address to a struct page, isolates it from the LRU and
- * puts it to the given pagelist.
+ * Isolates the page from the LRU and puts it into the given pagelist
* Returns:
* errno - if the page cannot be found/isolated
* 0 - when it doesn't have to be migrated because it is already on the
* target node
* 1 - when it has been queued
*/
-static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
- int node, struct list_head *pagelist, bool migrate_all)
+static int add_page_for_migration(struct page *page, int node,
+ struct list_head *pagelist, bool migrate_all)
{
- struct vm_area_struct *vma;
- unsigned long addr;
- struct page *page;
int err;
bool isolated;
- mmap_read_lock(mm);
- addr = (unsigned long)untagged_addr_remote(mm, p);
-
- err = -EFAULT;
- vma = vma_lookup(mm, addr);
- if (!vma || !vma_migratable(vma))
- goto out;
-
- /* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
-
- err = PTR_ERR(page);
- if (IS_ERR(page))
- goto out;
-
err = -ENOENT;
if (!page)
goto out;
if (is_zone_device_page(page))
- goto out_putpage;
+ goto out;
err = 0;
if (page_to_nid(page) == node)
- goto out_putpage;
+ goto out;
err = -EACCES;
if (page_mapcount(page) > 1 && !migrate_all)
- goto out_putpage;
+ goto out;
if (PageHuge(page)) {
if (PageHead(page)) {
@@ -2101,7 +2082,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
isolated = isolate_lru_page(head);
if (!isolated) {
err = -EBUSY;
- goto out_putpage;
+ goto out;
}
err = 1;
@@ -2110,12 +2091,48 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
NR_ISOLATED_ANON + page_is_file_lru(head),
thp_nr_pages(head));
}
-out_putpage:
- /*
- * Either remove the duplicate refcount from
- * isolate_lru_page() or drop the page ref if it was
- * not isolated.
- */
+out:
+ return err;
+}
+
+/*
+ * Resolves the given address to a struct page, isolates it from the LRU and
+ * puts it to the given pagelist.
+ * Returns:
+ * errno - if the page cannot be found/isolated
+ * 0 - when it doesn't have to be migrated because it is already on the
+ * target node
+ * 1 - when it has been queued
+ */
+static int add_virt_page_for_migration(struct mm_struct *mm,
+ const void __user *p, int node, struct list_head *pagelist,
+ bool migrate_all)
+{
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ struct page *page;
+ int err = -EFAULT;
+
+ mmap_read_lock(mm);
+ addr = (unsigned long)untagged_addr_remote(mm, p);
+
+ vma = vma_lookup(mm, addr);
+ if (!vma || !vma_migratable(vma))
+ goto out;
+
+ /* FOLL_DUMP to ignore special (like zero) pages */
+ page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;
+
+ err = -ENOENT;
+ if (!page)
+ goto out;
+
+ err = add_page_for_migration(page, node, pagelist, migrate_all);
+
put_page(page);
out:
mmap_read_unlock(mm);
@@ -2211,7 +2228,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
* Errors in the page lookup or isolation are not fatal and we simply
* report them via status
*/
- err = add_page_for_migration(mm, p, current_node, &pagelist,
+ err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
flags & MPOL_MF_MOVE_ALL);
if (err > 0) {
--
2.39.1
Similar to the move_pages system call, instead of taking a pid and
list of virtual addresses, this system call takes a list of physical
addresses.
Because there is no task to validate the memory policy against, each
page needs to be interrogated to determine whether the migration is
valid, and all tasks that map it need to be interrogated.
This is accomplished via an rmap_walk on the folio containing
the page, and an interrogation of all tasks that map the page (by
way of each task's vma).
Each page must be interrogated individually, which should be
considered when using this to migrate shared regions.
The remaining logic is the same as the move_pages syscall. Some
minor changes to do_pages_move are made (to check whether an
mm_struct is passed) in order to re-use the existing migration code.
Signed-off-by: Gregory Price <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 5 +
include/uapi/asm-generic/unistd.h | 8 +-
kernel/sys_ni.c | 1 +
mm/migrate.c | 211 +++++++++++++++++++++++-
tools/include/uapi/asm-generic/unistd.h | 8 +-
7 files changed, 228 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..25db6d71af0c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
451 i386 cachestat sys_cachestat
452 i386 fchmodat2 sys_fchmodat2
+454 i386 move_phys_pages sys_move_phys_pages
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..9676f2e7698c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,7 @@
451 common cachestat sys_cachestat
452 common fchmodat2 sys_fchmodat2
453 64 map_shadow_stack sys_map_shadow_stack
+454 common move_phys_pages sys_move_phys_pages
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..6860675a942f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
const int __user *nodes,
int __user *status,
int flags);
+asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
+ const void __user * __user *pages,
+ const int __user *nodes,
+ int __user *status,
+ int flags);
asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
siginfo_t __user *uinfo);
asmlinkage long sys_perf_event_open(
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..8838fcfaf261 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,14 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
#define __NR_fchmodat2 452
__SYSCALL(__NR_fchmodat2, sys_fchmodat2)
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 455
/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e137c1385c56..07441b10f92a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -192,6 +192,7 @@ COND_SYSCALL(migrate_pages);
COND_SYSCALL(move_pages);
COND_SYSCALL(set_mempolicy_home_node);
COND_SYSCALL(cachestat);
+COND_SYSCALL(move_phys_pages);
COND_SYSCALL(perf_event_open);
COND_SYSCALL(accept4);
diff --git a/mm/migrate.c b/mm/migrate.c
index 1123d841a7f1..2d06557c0b80 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2165,9 +2165,118 @@ static int move_pages_and_store_status(int node,
return store_status(status, start, node, i - start);
}
+struct rmap_page_ctxt {
+ bool found;
+ bool migratable;
+ bool node_allowed;
+ int node;
+};
+
+/*
+ * Walks each vma mapping a given page and determines if those
+ * vma's are both migratable, and that the target node is within
+ * the allowed cpuset of the owning task.
+ */
+static bool phys_page_migratable(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ void *arg)
+{
+ struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
+ struct task_struct *owner = vma->vm_mm->owner;
+ /* On non-memcg systems, the allowed set is the possible set */
+#ifdef CONFIG_MEMCG
+ nodemask_t task_nodes = cpuset_mems_allowed(owner);
+#else
+ nodemask_t task_nodes = node_possible_map;
+#endif
+
+ ctxt->found |= true;
+ ctxt->migratable &= vma_migratable(vma);
+ ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);
+
+ return ctxt->migratable && ctxt->node_allowed;
+}
+
+static struct folio *phys_migrate_get_folio(struct page *page)
+{
+ struct folio *folio;
+
+ folio = page_folio(page);
+ if (!folio_test_lru(folio) || !folio_try_get(folio))
+ return NULL;
+ if (unlikely(page_folio(page) != folio || !folio_test_lru(folio))) {
+ folio_put(folio);
+ folio = NULL;
+ }
+ return folio;
+}
+
+/*
+ * Validates the physical address is online and migratable. Walks the folio
+ * containing the page to validate the vma is migratable and the cpuset node
+ * restrictions. Then calls add_page_for_migration to isolate it from the
+ * LRU and place it into the given pagelist.
+ * Returns:
+ * errno - if the page is not online, migratable, or can't be isolated
+ * 0 - when it doesn't have to be migrated because it is already on the
+ * target node
+ * 1 - when it has been queued
+ */
+static int add_phys_page_for_migration(const void __user *p, int node,
+ struct list_head *pagelist,
+ bool migrate_all)
+{
+ unsigned long pfn;
+ struct page *page;
+ struct folio *folio;
+ int err;
+ struct rmap_page_ctxt rmctxt = {
+ .found = false,
+ .migratable = true,
+ .node_allowed = true,
+ .node = node
+ };
+ struct rmap_walk_control rwc = {
+ .rmap_one = phys_page_migratable,
+ .arg = &rmctxt
+ };
+
+ pfn = ((unsigned long)p) >> PAGE_SHIFT;
+ page = pfn_to_online_page(pfn);
+ if (!page || PageTail(page))
+ return -ENOENT;
+
+ folio = phys_migrate_get_folio(page);
+ if (folio)
+ rmap_walk(folio, &rwc);
+
+ if (!rmctxt.found)
+ err = -ENOENT;
+ else if (!rmctxt.migratable)
+ err = -EFAULT;
+ else if (!rmctxt.node_allowed)
+ err = -EACCES;
+ else
+ err = add_page_for_migration(page, node, pagelist, migrate_all);
+
+ if (folio)
+ folio_put(folio);
+
+ return err;
+}
+
/*
* Migrate an array of page address onto an array of nodes and fill
* the corresponding array of status.
+ *
+ * When the mm argument is not NULL, task_nodes is expected to be the
+ * cpuset nodemask for the task which owns the mm_struct, and the
+ * values located in (*pages) are expected to be virtual addresses.
+ *
+ * When the mm argument is NULL, the values located at (*pages) are
+ * expected to be physical addresses, and task_nodes is expected to
+ * be empty.
*/
static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
unsigned long nr_pages,
@@ -2181,6 +2290,10 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
int start, i;
int err = 0, err1;
+ /* This should never occur in regular operation */
+ if (!mm && nodes_weight(task_nodes) > 0)
+ return -EINVAL;
+
lru_cache_disable();
for (i = start = 0; i < nr_pages; i++) {
@@ -2209,7 +2322,14 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
goto out_flush;
err = -EACCES;
- if (!node_isset(node, task_nodes))
+ /*
+ * if mm is NULL, then the pages are addressed via physical
+ * address and the task_nodes structure is empty. Validation
+ * of migratability is deferred to add_phys_page_for_migration
+ * where vma's that map the address will have their node_mask
+ * checked to ensure the requested node bit is set.
+ */
+ if (mm && !node_isset(node, task_nodes))
goto out_flush;
if (current_node == NUMA_NO_NODE) {
@@ -2226,10 +2346,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
/*
* Errors in the page lookup or isolation are not fatal and we simply
- * report them via status
+ * report them via status.
+ *
+ * If mm is NULL, then p is treated as a physical address.
*/
- err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
- flags & MPOL_MF_MOVE_ALL);
+ if (mm)
+ err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
+ flags & MPOL_MF_MOVE_ALL);
+ else
+ err = add_phys_page_for_migration(p, current_node, &pagelist,
+ flags & MPOL_MF_MOVE_ALL);
+
if (err > 0) {
/* The page is successfully queued for migration */
@@ -2317,6 +2444,37 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
mmap_read_unlock(mm);
}
+/*
+ * Determine the nodes pages pointed to by the physical addresses in the
+ * pages array, and store those node values in the status array
+ */
+static void do_phys_pages_stat_array(unsigned long nr_pages,
+ const void __user **pages, int *status)
+{
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long pfn = (unsigned long)(*pages) >> PAGE_SHIFT;
+ struct page *page = pfn_to_online_page(pfn);
+ int err = -ENOENT;
+
+ if (!page)
+ goto set_status;
+
+ get_page(page);
+
+ if (!is_zone_device_page(page))
+ err = page_to_nid(page);
+
+ put_page(page);
+set_status:
+ *status = err;
+
+ pages++;
+ status++;
+ }
+}
+
static int get_compat_pages_array(const void __user *chunk_pages[],
const void __user * __user *pages,
unsigned long chunk_nr)
@@ -2359,7 +2517,10 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
break;
}
- do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+ if (mm)
+ do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
+ else
+ do_phys_pages_stat_array(chunk_nr, chunk_pages, chunk_status);
if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
break;
@@ -2460,6 +2621,46 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
}
+/*
+ * Move a list of physically-addressed pages to the list of target nodes
+ */
+static int kernel_move_phys_pages(unsigned long nr_pages,
+ const void __user * __user *pages,
+ const int __user *nodes,
+ int __user *status, int flags)
+{
+ int err;
+ nodemask_t dummy_nodes;
+
+ if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+ return -EINVAL;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /*
+ * When the mm argument to do_pages_move is null, the task_nodes
+ * argument is ignored, so pass in an empty nodemask as a dummy.
+ */
+ nodes_clear(dummy_nodes);
+ if (nodes)
+ err = do_pages_move(NULL, dummy_nodes, nr_pages, pages,
+ nodes, status, flags);
+ else
+ err = do_pages_stat(NULL, nr_pages, pages, status);
+
+ return err;
+}
+
+SYSCALL_DEFINE5(move_phys_pages, unsigned long, nr_pages,
+ const void __user * __user *, pages,
+ const int __user *, nodes,
+ int __user *, status, int, flags)
+{
+ return kernel_move_phys_pages(nr_pages, pages, nodes, status, flags);
+}
+
+
#ifdef CONFIG_NUMA_BALANCING
/*
* Returns true if this is a safe migration target node for misplaced NUMA
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index fd6c1cb585db..b140ad444946 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -820,8 +820,14 @@ __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
#define __NR_cachestat 451
__SYSCALL(__NR_cachestat, sys_cachestat)
+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_move_phys_pages 454
+__SYSCALL(__NR_move_phys_pages, sys_move_phys_pages)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 455
/*
* 32 bit systems traditionally used different
--
2.39.1
do_pages_move does not handle compat pointers for the page list
correctly. Add an in_compat_syscall() check and the appropriate
get_user() fetch when iterating the page list.
Signed-off-by: Gregory Price <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Co-developed-by: Arnd Bergmann <[email protected]>
---
mm/migrate.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index b7fa020003f3..a0b0c5a7f8a5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2159,6 +2159,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
const int __user *nodes,
int __user *status, int flags)
{
+ compat_uptr_t __user *compat_pages = (void __user *)pages;
int current_node = NUMA_NO_NODE;
LIST_HEAD(pagelist);
int start, i;
@@ -2171,8 +2172,17 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
int node;
err = -EFAULT;
- if (get_user(p, pages + i))
- goto out_flush;
+ if (in_compat_syscall()) {
+ compat_uptr_t cp;
+
+ if (get_user(cp, compat_pages + i))
+ goto out_flush;
+
+ p = compat_ptr(cp);
+ } else {
+ if (get_user(p, pages + i))
+ goto out_flush;
+ }
if (get_user(node, nodes + i))
goto out_flush;
--
2.39.1
On Mon, Oct 02, 2023 at 03:07:46PM +0100, Jonathan Cameron wrote:
> On Tue, 19 Sep 2023 19:09:07 -0400
> Gregory Price <[email protected]> wrote:
>
> > @@ -2181,6 +2290,10 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> > int start, i;
> > int err = 0, err1;
> >
> > + /* This should never occur in regular operation */
> > + if (!mm && nodes_weight(task_nodes) > 0)
>
> You document below that task_nodes isn't used if !mm but it is used
> to indicate this condition...
>
Good point, and in fact the code below does ignore the contents of
task_nodes if mm is null, so I can just drop this check.
> > +set_status:
> > + *status = err;
>
> Given you update it unconditionally this seems to wipe out earlier
> potential errors with the nid.
>
> > +
> > + pages++;
> > + status++;
> > + }
> > +}
>
status is an array; status++ advances the output pointer each
iteration, so each page writes its status into its own element and
earlier values are not wiped out.
> > + /*
> > + * When the mm argument to do_pages_move is null, the task_nodes
> > + * argument is ignored, so pass in an empty nodemask as a dummy.
>
> If ignored, why bother clearing it? Looks like you are using it as
> as signal rather than ignoring it. So not sure this comment is correct.
>
> > + */
> > + nodes_clear(dummy_nodes);
I'm a bit of a stickler about uninitialized memory; it seemed better to
ensure the nodemask was empty, so that an empty nodelist would force
errors if future updates to do_pages_move were to remove the ignoring
mechanism.
That said, I updated the function such that it does actually ignore the
argument, but I'll leave the nodes_clear() here since there's no harm in
clearing the nodemask and it leads to more predictable behavior.
Unless there are particularly strong feelings about this, I'm fine with
removing it.